ForestDB: Directory Tree Database

2010/07/28 15:48
mikio

I've released Kyoto Cabinet 1.2.0, which supports B+ tree on the directory database.

What is this?

The directory database (DirDB), which appeared in the last version, stores records as respective files in a directory. Because it inherits the abstract database interface, the same methods as the file hash database (HashDB) are provided.

By the way, the file tree database (TreeDB) is a B+ tree implementation on HashDB. Thus, I hit on the idea that it is possible to implement B+ tree on DirDB instead of HashDB. I name this implementation ForestDB. To be distinguished, DirDB is described as the directory hash database from now on.

Implementation

The whole implementation of TreeDB was written in the file kctreedb.h. In such case, I can get the template class from the existing implementation by replacing embedded class names "HashDB" into the symbols "BASEDB" and putting "template <class BASEDB>" in front of the class definition. Actually, I have to complement the inner class expressions with "typename". I found C++ is very effective to make templates easily.

I named the template PlantDB. Now, the definitions of TreeDB and ForestDB are very simple as the following.

typedef PlantDB<HashDB> TreeDB;
typedef PlantDB<DirDB> ForestDB;

This means great improvement. Now, it is easy to evolve any data structure like an associative array into B+ tree based on the original structure. Although I have to implement not only set/get/remove methods but also open/close and so on, they can be dummy if without no meanings.

Performance

When I built ForestDB and run the test cases, surprisingly, it passed at the first attempt. Moreover, the performance of ForestDB exceeds that of TreeDB. I can't understand why the performance improves although the number of files increases. I did performance tests to handle one million records. The key of each record is 8 bytes and its value is also 8 bytes. The file system is EXT3 (writeback mode).

write get remove
HashDB 0.747 0.765 0.675
DirDB 149.592 207.701 215.489
TreeDB 0.727 0.909 0.828
ForestDB 0.709 0.846 0.735

DirDB is 200 times slower than HashDB. However, ForestDB based on DirDB is a little faster than TreeDB based on HashDB. It is wondering.

I did another test for TreeDB and ForestDB, with one hundred million records. I set the page size of B+ tree 65536.

write get remove
TreeDB 82.625 98.071 107.854
ForestDB 105.450 100.412 112.348

In this test, TreeDB was better than ForestDB. Although I can't account for the reason, I felt happy to see that TreeDB surpassed ForestDB in the test with a large amount of records. Because HashDB and TreeDB is the core competency of Kyoto Cabinet.

Current Features of KC

Now that DirDB and ForestDB have been added, KC's database features are very various. I hope that they can cover a lot of popular usages.

name complexity algorithm lock persistence usage
ProtoHashDB O(1) hash table whole(rwlock) volatile none (just for test)
ProtoTreeDB O(log N) red black tree whole (rwlock) volatile ordered records on memory
CacheDB O(1) hash table record (mutex) volatile cache data on memory
HashDB O(1) hash table record (rwlock) persistent small but numerous meta data
TreeDB O(log N) B+ tree page (rwlock) persistent small but numerous meta data, ordered
DirDB undefined undefined record (rwlock) persistent large but relatively a few data
ForestDB O(log N) B+ tree page (rwlock) persistent relatively larage and many data, ordered

I think the most interesting usage of ForestDB is for inverted index of full-text search system. Inverted index compels records to grow larger and larger. In such case, fragmentation occurres like hell. Altough HashDB is pitted against fragmentation by means of free block pool and auto defragmentation mechanism, a certain extent of fragmentation is inevitable. However, such file systems as EXT3 avoids fragmentation efficiently by scattering each files in the whole partition. So, I'll improve HashDB imitating EXT3 in the near future.

Conclusion

B+ tree on the the directory database is useful and efficient while it was easy to implement. The performance is comparable with the file tree database. And, it is good at storing large records because data location is operated by the file system directory.


Added Directory Database

2010/07/23 11:12
mikio

I've released Kyoto Cabinet 1.1.0, which features the directory database. It is powered by the directory mechanism of the file system and stores records as respective files in a directory.

What is the Directory Database?

Performance of the directory database is strongly based on the file system implementation and its tuning. Some file systems such as EXT2 are not good at storing a lot of files in a directory. But, other file systems are relatively efficient in that situation. In general, file systems featuring B tree or its variants are more suitable than those featuring linear search algorithms. If you have doubt still, please install the latest version of Kyoto Cabinet and try the following command on a modern file system.

kcdirtest order -set casket 100000

It stores records as 100,000 files in one directory and takes only 2.8 seconds in my environment (Linux 2.6.32, EXT3 file system). It indicates that the storing performance is 35,323 QPS until 100,000 records.

How about one million records? It causes heavy disk access and takes 122 seconds (8196 QPS). I should admit that the performance of the directory database is inferior to the file hash database. However, 8196 QPS seems good enough for a lot of casual usage.

Why you use the Directory Database?

Most DBM implementations including Kyoto Cabinet and Tokyo Cabinet are optimized to store small records. As it is, if you want to handle very large records, using the file system directly is better solution than using DBM. If you store and retrieve large records, the processing time by the `write' and `read' system calls is dominant rather than the `open' and `lseek' system calls. Although typical DBMs reduce the workload to locate the position of each record, they increase the workload to read/write the data of each record.

That is, if you want to handle large records but don't want to use the file system directly, use the directory database. Although the directory database is a mere wrapper of the directory mechanism of the file system, it provides a good abstraction to improve portability and maintainability. I also dislike to write "HANDLE fh = CreateFile(path, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_EXISTING, NULL); ..." in application code.

Performance Comparison

I did some tests to store 100,000 records into various databases and then retrieve/remove the records. Each key is an 8 byte numeric string and the value is the same string. My machine has Core i7 CPU 2.0GHz and 4GB RAM.

set get remove
HashDB (EXT2) 0.113 0.094 0.088
TreeDB (EXT2) 0.071 0.085 0.072
DirDB(EXT2) 261.185 0.417 0.953
DirDB(EXT3) 2.917 0.438 8.356
DirDB(EXT4) 4.821 1.646 3.047
DirDB(XFS) 653.015 2.012 507.505
DirDB(ReiserFS) 5.561 0.795 3.817

As the above, the file hash database and the file tree database are much faster than the directory database. And, the performance of the directory database is blazingly swayed among underlying file systems. I think that using EXT3 and ReiserFS are reasonable solutions.

Especially, ReiserFS seems promising. Like DBM, it is optimized to handle relatively small files. Even when storing one million records, it does not cause heavy disk access and takes only 65 seconds.

Additional Features

As with the other database types of Kyoto Cabinet, the directory database supports visitor interface and trancaction. You can convert each record (each file) atomically by a call back function of visitor interface. You can commit a series of operations atomically or rollback it by transaction.

As with the other database types of Kyoto Cabinet, the mechanism for durability is implemented. While transaction (both of explicit transaction and auto transaction), the original data of every updated records is escaped as another file and replace them atomically by the `rename' system call.

The directory database is available from the polymorphic database and all the script bindings. Open the database with a name whose suffix is ".kcd" or specify the type option as "#type=kcd".

Conclusion

Although the directory database is not very fast, it is very easy to use and has reasonable performance. When I was asked how to handle very large data by users, I answered them to use file systems directly. However, from now, I answer them to use the directory database.


Exception in Python-C extension

2010/06/07 10:45
mikio

I'm improving the Python binding of Kyoto Cabinet. I hit the wall as to its implementation, so I ask a favor of Python/C hackers.

Background

By default, each method of the Python binding does not report its error by raising exception. So, you should check the error code every time.

db = DB()
if db.open("casket", DB.OWRITER):
    if not db.put("japan", "tokyo"):
        print("put error:" + str(db.error()))
    if not db.close():
        print("close error:" + str(db.error()))
else:
    print("open error:" + str(db.error()))

The default behavior has a purpose to comply with the common language interface. However, I know that many users prefer the style using exception mechanism for error reporting. So, I want to add an option mode for exception reporting errors of the Python binding. Like this.

db = DB(DB.EXCEPTIONAL)   # enable the exceptional mode
try:
    db.open("casket", DB.OWRITER):
    db.put("japan", "tokyo")
except Error as e:
    print("error:" + str(e))
finally:
    try:
        db.close()
    except Error as e:
        print("error:" + str(e))

For that purpose, I think I should define the "Error" class as a child of the built-in "Exception" or "RuntimeError" or something. I wrote a prototype implementation to do it (see the C code and the package).

Problem

I found there's a memory leak problem in the code. The following test displays that the memory usage (resident size shown by "top") is growing unboundedly.

for i in range(1, 10000000000):
    try:
        print("try: " + str(i))
        sys.stdout.flush()
        raise Error("1: myerror")
    except RuntimeError as e:
        print("catch: " + str(e))
        sys.stdout.flush()

I think that I made some mistakes as to the way of inheritance to the built-in exception class. The current definition of the Error class is implemented as the following.

static bool define_err() {
  static PyTypeObject type_err = { PyVarObject_HEAD_INIT(NULL, 0) };
  size_t zoff = offsetof(PyTypeObject, tp_name);
  std::memset((char*)&type_err + zoff, 0, sizeof(type_err) - zoff);
  type_err.tp_name = "kyotocabinet.Error";
  size_t psiz = ((PyTypeObject*)PyExc_RuntimeError)->tp_basicsize;
  size_t rem = psiz % sizeof(void*);
  if (rem > 0) psiz += sizeof(void*) - rem;
  type_err.tp_basicsize = psiz + sizeof(Error_data);
  type_err.tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE;
  type_err.tp_doc = "Error data.";
  type_err.tp_new = err_new;
  type_err.tp_dealloc = (destructor)err_dealloc;
  type_err.tp_init = (initproc)err_init;
  type_err.tp_repr = (unaryfunc)err_repr;
  type_err.tp_str = (unaryfunc)err_str;
  type_err.tp_richcompare = (richcmpfunc)err_richcmp;
  static PyMethodDef err_methods[] = {
    ...
    { NULL, NULL, 0, NULL }
  };
  type_err.tp_methods = err_methods;
  type_err.tp_base = (PyTypeObject*)PyExc_RuntimeError;
  if (PyType_Ready(&type_err) != 0) return false;
  cls_err = (PyObject*)&type_err;
  ...
  Py_INCREF(cls_err);
  if (PyModule_AddObject(mod_kc, "Error", cls_err) != 0) return false;
  return true;
}

static PyObject* err_new(PyTypeObject* pytype, PyObject* pyargs, PyObject* pykwds) {
  Error_data* data = (Error_data*)pytype->tp_alloc(pytype, 0);
  if (!data) return NULL;
  Py_INCREF(Py_None);
  data->pycode = Py_None;
  Py_INCREF(Py_None);
  data->pymessage = Py_None;
  return (PyObject*)data;
}

static void err_dealloc(Error_data* data) {
  Py_DECREF(data->pymessage);
  Py_DECREF(data->pycode);
  Py_TYPE(data)->tp_free((PyObject*)data);
}

I'm not assured about several points. First, I wondered the assignment to the "tp_basicsize" member is correct or not. Now it is calculated by adding the size of the parent to the size of the self object.

  size_t psiz = ((PyTypeObject*)PyExc_RuntimeError)->tp_basicsize;
  size_t rem = psiz % sizeof(void*);
  if (rem > 0) psiz += sizeof(void*) - rem;
  type_err.tp_basicsize = psiz + sizeof(Error_data);

Second, I wondered the assignment of "PyExc_RuntimeError" to the "tp_base" member is a decent way to declare inheritance.

  type_err.tp_base = (PyTypeObject*)PyExc_RuntimeError;

Third, now the "err_init" function assigned to the "tp_init" member does not explicitly call the initializer function of the parent class. I don't know how to do it and whether to do so.

Help

If you find some flows in my implementation or my design, please tell me as a comment or via e-mail to hirarin@gmail.com. Any suggestion is appreciated. Thanks.


Kyoto Cabinet 1.0.0 was released!

2010/05/25 03:03
mikio

I released the initial stable version of Kyoto Cabinet on 25th May 2010. This entry describes introduction to Kyoto Cabinet.

image:1:1274915061-kclayers.png

Features

Now that all of my plans have been achieved, Kyoto Cabinet has the following features. Especially, Windows support is remarkable.

  • time efficiency: Throughput of updating is more than 100 millions query-per-second.
  • space efficiency: Footprint for each record is 8-16 bytes in the hash DB, 2-4 bytes in the tree DB.
  • concurrency: The hash DB uses read-write lock for each record. The tree DB uses read-write lock for each page.
  • usability: Generic operations of database by interface like the "Visitor" pattern are provided.
  • robustness: Manual transaction, auto transaction, and auto recovery are provided.
  • portability: UNIX-like systems (Linux, FreeBSD, Solaris, Mac OS X) and Windows (VC++) are supported.
  • language bindings: C++, C, Java, Python, Ruby, and Perl are supported.

Compared with Tokyo Cabinet, KC is superior in concurrency, usability, and portability. Although time efficiency for single-thread is better in TC, I recommend KC from now on because multi-core/many-core CPU has been popular. However, I will keep on maintaining TC and fix bugs if they are found.

Language Bindings

I recommend Java first. Typical use cases of KC which I expected mainly are in the backend of large Web services. I think that Java is the most popular one except for C++ in such use cases. Because concurrency is a major sales point of KC, concurrent runtime environments are preferred.

Python and Ruby come second. Although each has the GIL (global interpreter lock) to guard its API from race condition, KC provides "concurrent mode" which uses the API to unlock the GIL temporarily while native functions are called.

Perl is also supported. However, the Perl binding is not thread-safe for ithread (the thread mechanism of Perl). Because ithread is "green thread", which implemented in the user land, such native threading primitives as locking and thread local storage do not work.

Getting Started

Download the latest version from the homepage. Then read the installation section and the tutorial section. Choose your favorite language and write some sample codes.

Basic Example

Although all of the following sample codes are described in the documents for each language, I place them here to compare them with each other. All bindings conform to the common interface defined by IDL.

C++

#include <kcpolydb.h>

using namespace std;
using namespace kyotocabinet;

// main routine
int main(int argc, char** argv) {

  // create the database object
  PolyDB db;

  // open the database
  if (!db.open("casket.kch", PolyDB::OWRITER | PolyDB::OCREATE)) {
    cerr << "open error: " << db.error().name() << endl;
  }

  // store records
  if (!db.set("foo", "hop") ||
      !db.set("bar", "step") ||
      !db.set("baz", "jump")) {
    cerr << "set error: " << db.error().name() << endl;
  }

  // retrieve a record
  string* value = db.get("foo");
  if (value) {
    cout << *value << endl;
    delete value;
  } else {
    cerr << "get error: " << db.error().name() << endl;
  }

  // traverse records
  DB::Cursor* cur = db.cursor();
  cur->jump();
  pair<string, string>* rec;
  while ((rec = cur->get_pair(true)) != NULL) {
    cout << rec->first << ":" << rec->second << endl;
    delete rec;
  }
  delete cur;

  // close the database
  if (!db.close()) {
    cerr << "close error: " << db.error().name() << endl;
  }

  return 0;
}

Java

import kyotocabinet.*;

public class KCDBEX1 {
  public static void main(String[] args) {

    // create the object
    DB db = new DB();

    // open the database
    if (!db.open("casket.kch", DB.OWRITER | DB.OCREATE)){
      System.err.println("open error: " + db.error());
    }

    // store records
    if (!db.set("foo", "hop") ||
        !db.set("bar", "step") ||
        !db.set("baz", "jump")){
      System.err.println("set error: " + db.error());
    }

    // retrieve records
    String value = db.get("foo");
    if (value != null){
      System.out.println(value);
    } else {
      System.err.println("set error: " + db.error());
    }

    // traverse records
    Cursor cur = db.cursor();
    cur.jump();
    String[] rec;
    while ((rec = cur.get_str(true)) != null) {
      System.out.println(rec[0] + ":" + rec[1]);
    }
    cur.disable();

    // close the database
    if(!db.close()){
      System.err.println("close error: " + db.error());
    }

  }
}

Python

from kyotocabinet import *
import sys

# create the database object
db = DB()

# open the database
if not db.open("casket.kch", DB.OWRITER | DB.OCREATE):
    print("open error: " + str(db.error()), file=sys.stderr)

# store records
if not db.set("foo", "hop") or \
        not db.set("bar", "step") or \
        not db.set("baz", "jump"):
    print("set error: " + str(db.error()), file=sys.stderr)

# retrieve records
value = db.get_str("foo")
if value:
    print(value)
else:
    print("get error: " + str(db.error()), file=sys.stderr)

# traverse records
cur = db.cursor()
cur.jump()
while True:
    rec = cur.get_str(True)
    if not rec: break
    print(rec[0] + ":" + rec[1])
cur.disable()

# close the database
if not db.close():
    print("close error: " + str(db.error()), file=sys.stderr)

Ruby

require 'kyotocabinet'
include KyotoCabinet

# create the database object
db = DB::new

# open the database
unless db.open('casket.kch', DB::OWRITER | DB::OCREATE)
  STDERR.printf("open error: %s\n", db.error)
end

# store records
unless db.set('foo', 'hop') and
    db.set('bar', 'step') and
    db.set('baz', 'jump')
  STDERR.printf("set error: %s\n", db.error)
end

# retrieve records
value = db.get('foo')
if value
  printf("%s\n", value)
else
  STDERR.printf("get error: %s\n", db.error)
end

# traverse records
cur = db.cursor
cur.jump
while rec = cur.get(true)
  printf("%s:%s\n", rec[0], rec[1])
end
cur.disable

# close the database
unless db.close
  STDERR.printf("close error: %s\n", db.error)
end

Perl

use KyotoCabinet;

# create the database object
my $db = new KyotoCabinet::DB;

# open the database
if (!$db->open('casket.kch', $db->OWRITER | $db->OCREATE)) {
    printf STDERR ("open error: %s\n", $db->error);
}

# store records
if (!$db->set('foo', 'hop') ||
    !$db->set('bar', 'step') ||
    !$db->set('baz', 'jump')) {
    printf STDERR ("set error: %s\n", $db->error);
}

# retrieve records
my $value = $db->get('foo');
if (defined($value)) {
    printf("%s\n", $value);
} else {
    printf STDERR ("get error: %s\n", $db->error);
}

# traverse records
my $cur = $db->cursor;
$cur->jump;
while (my ($key, $value) = $cur->get(1)) {
    printf("%s:%s\n", $key, $value);
}
$cur->disable;

# close the database
if (!$db->close) {
    printf STDERR ("close error: %s\n", $db->error);
}

Visitor Pattern

All database classes of KC have methods to operate records like associative array. "set", "remove", and "get" are typical. More complex methods such as "increment" and "cas" are also provided by default. However, you may want to use your own operations. The "visitor" pattern is preferable for that purpose. Define a visitor object and pass it to the "accept" method so that the call back method defined by the visitor is executed with a record data atomically.

Atomicity is a key feature in multi-thread environment. While one thread are incrementing a record value by several method call such as "get" and "set", another thread may update the same record at the same time. In that case, former operation get whitewashed. KC solves the problem by the "accept" method which is guarded by record locking.

The following are examples in Java and Ruby. Other language bindings also support the visitor pattern.

Java

import kyotocabinet.*;

public class KCDBEX2 {
  public static void main(String[] args) {

    // create the object
    DB db = new DB();

    // open the database
    if (!db.open("casket.kch", DB.OREADER)) {
      System.err.println("open error: " + db.error());
    }

    // define the visitor
    class VisitorImpl implements Visitor {
      public byte[] visit_full(byte[] key, byte[] value) {
        System.out.println(new String(key) + ":" + new String(value));
        return NOP;
      }
      public byte[] visit_empty(byte[] key) {
        System.err.println(new String(key) + " is missing");
        return NOP;
      }
    }
    Visitor visitor = new VisitorImpl();

    // retrieve a record with visitor
    if (!db.accept("foo".getBytes(), visitor, false) ||
        !db.accept("dummy".getBytes(), visitor, false)) {
      System.err.println("accept error: " + db.error());
    }

    // traverse records with visitor
    if (!db.iterate(visitor, false)) {
      System.err.println("iterate error: " + db.error());
    }

    // close the database
    if(!db.close()){
      System.err.println("close error: " + db.error());
    }

  }
}

Ruby

require 'kyotocabinet'
include KyotoCabinet

# create the database object
db = DB::new

# open the database
unless db.open('casket.kch', DB::OREADER)
  STDERR.printf("open error: %s\n", db.error)
end

# define the visitor
class VisitorImpl < Visitor
  # call back function for an existing record
  def visit_full(key, value)
    printf("%s:%s\n", key, value)
    return NOP
  end
  # call back function for an empty record space
  def visit_empty(key)
    STDERR.printf("%s is missing\n", key)
    return NOP
  end
end
visitor = VisitorImpl::new

# retrieve a record with visitor
unless db.accept("foo", visitor, false) and
    db.accept("dummy", visitor, false)
  STDERR.printf("accept error: %s\n", db.error)
end

# traverse records with visitor
unless db.iterate(visitor, false)
  STDERR.printf("iterate error: %s\n", db.error)
end

# close the database
unless db.close
  STDERR.printf("close error: %s\n", db.error)
end

Popular scripting languages provide "closure" mechanisms and you can use it as a visitor instead of a derived object under the class inheritance mechanism. The following is an example in Python.

from kyotocabinet import *
import sys

# define the functor
def dbproc(db):

  # store records
  db[b'foo'] = b'step';  # bytes is fundamental
  db['bar'] = 'hop';     # string is also ok
  db[3] = 'jump';        # number is also ok

  # retrieve a record value
  print("{}".format(db['foo'].decode()))

  # update records in transaction
  def tranproc():
      db['foo'] = 2.71828
      return True
  db.transaction(tranproc)

  # multiply a record value
  def mulproc(key, value):
      return float(value) * 2
  db.accept('foo', mulproc)

  # traverse records by iterator
  for key in db:
      print("{}:{}".format(key.decode(), db[key].decode()))

  # upcase values by iterator
  def upproc(key, value):
      return value.upper()
  db.iterate(upproc)

  # traverse records by cursor
  def curproc(cur):
      cur.jump()
      def printproc(key, value):
          print("{}:{}".format(key.decode(), value.decode()))
          return Visitor.NOP
      while cur.accept(printproc):
          cur.step()
  db.cursor_process(curproc)

# process the database by the functor
DB.process(dbproc, 'casket.kch')

Conclusion

Kyoto Cabinet is a powerful tool to operate persistent associative array or key-value storage. Time and space efficiency is great. KC is easy to use in most popular languages and provides extreme flexibility by the visitor pattern. Please try it and probably you will take to it.


Is it Really Durable?

2009/11/11 11:45
mikio

I read an article about Tokyo Tyrant and found it intersting. As it says, calling sync operation periodically is useful to improve durability though it sacrifices performance.

However, at least in the typical usecase of TT, that strategy is not suitable. I work at mixi.jp and we operate large databases where the writing throughput is more than 10,000 QPS. So, we use TT in some situations instead of MySQL. We know that the tricks like calling sync to improve durability in each local machine is meaningless.

In mixi.jp, the most frequent cause of the database crash is by hardware breakdown, especially in hard disks and their RAID controller. So, even if we called sync every update operation, no data would be recovered at each local machine. They sometimes can't be booted up any longer.

Yes, we use replication. Every database server has at least one slave database server. That is, all records are replicated among at least two servers. So, if one server crashes, we drop the server out of the service and add a new alternative server. Records of the new server are repaired with the backup database file and difference between the backup time and the current time is solved with the update log of the surviving server by the replication mechanism.

Insistently I say, please use replication, don't believe in your hardwares. Calling sync is just a placebo in actual usecases.