does an open source, versioning, indexed, scalable datastore exist?
April 20, 2010 | by rob @ 8:01 am | comments (6) | filed under [databases]

As a software developer, I’ve spent a large portion of my career dealing with data storage and retrievel. As many will attest, the relational model leaves a lot to be desired. It feels like an outdated format, certainly with the advent of the web, commodity computing, and dynamic languages. We end up using caches and search indexes to extract subsets of important data. Sometimes the structure must be denormalized. Or even worse, we have to facilitate proprietary stored procedures and triggers.

Also, the translation from row to object is cumbersome and just doesn’t seem to completely fit. Object-relational mappers do a fairly good job, but still require a lot of work and tuning. Have you ever tried to update an object containing a child collection using Hibernate? It is not a straight forward operation.

what would be better?

Alternatives have cropped up over the years to fill this void. XML and object databases in the early part of the decade. “NoSQL” solutions more recently. I’ll describe what I would like to see in a “full featured” data store in today’s world and what systems come close to meeting those needs.

find-grained caching

Firstly, fine-grained caching or the ability to manage cache content; for example, keeping thousands of user logins cached but only the first fifteen blog posts. Ehcache in conjunction with Hibernate is a good example. In relatively complex applications this is important, you don’t want to fill memory with items that just aren’t going to get accessed that often.

searching

Next is searching for data. Searching gives us the ability to add facets and complete user generated lookups. Lucene is a great example of a full fledged searching product. I don’t think a data store needs to be quite as feature rich, but it seems like a natural way to find data.

flexible schema

Thirdly, a less rigid data definition system would also make sense. In other words, I’d like to be able to add or remove a column of data without restarting the system. Most relational systems can handle this, but are not well suited for adding columns on-the-fly at the application level.

versioning / transactions

Also, I’d also like to see an “immutable” or “versioning” approach to follow the current trend of non-locking in the multicore and functional programming worlds. Basically this means a new copy of the “tuple” (or at least the changed data) with each save, with “rollbacks” occurring during contention situations. This would allow the data to be stored and extracted against a timeline, much like version control software works today. This technique would also create a simple transaction model.

scaling

Another important consideration is scaling and redundancy are important. Servers are now commodities. Virtualization and the cloud permit massive scaling. Any new data storage solution has to take this mentality into account.

security

Finally, a more robust security model would be beneficial. Most relational databases will allow security at the database, table, and even column. Configuring such security with the application is much like adding or removing columns, it just isn’t feasible. Maybe creating separate “stores” with their own set of security credentials (i.e. a store for each user’s payment data) might reduce the potential damage an attacker could wield?

what else is out there?

Many things. There are a ton of systems, so I will attempt to categorize the most recent “NoSQL” solutions and how they fit (or don’t fit) into the basic needs described above.

types key-value store document database column database
examples Tokyo Cabinet, Redis, Project Voldemort, Amazon S3 MongoDB, CouchDB Cassandra
fine-grained caching These systems basically are caches. Typically you can set expiration on an individual key. This really isn’t a “datastore” mentality though, as the data should still persist. Both seem to have their own caching mechanisms internally, but very little user control over what stays in memory. CouchDB does support views, which index and presumably speed data retrieval. No. Though it appears to be a concern.
querying Tokyo Cabinet seems to have simple querying, but it is not (nor should it be) a focus. Applications and data must be designed with a key/value “tree” structure in mind. MongoDB has a query mechanism that feels like an ORM style. CouchDB uses javascript but seems to require views and indexes for searches. Apparently there is a lucene integration plugin. Although the query language is robust.
schema modification There is no schema to modify, just place your data and get it back. Seems like it. Although I wonder about view regeneration in products like CouchDB. Adding columns requires a restart.
versioning / transactions No versioning, but some of the solutions seem to support transactions. Tokyo Cabinet locks the database on transactions. MongoDB and CouchDB do support basic transactions. Neither support versioning (see here). No.
scaling Many of these options support sharding in the client. But not real time increases in capacity. Yes. Yes.
security Some options allow authentication, but not on a resource level. CouchDB has a very fine grained security model. MongoDB seems a little more limited, thought it was hard to find any information beyond authentication. Cassandra doesn’t seem to support authorization, yet.

the proposal

Not being able to find a perfect fit to the requirements listed above, I propose a data system that stores and indexes maps of data. A structure like this:

name: Mr. Data
title: Data Master
description: I am a block of data, please don't lose me or my changes

Would be found in this search:

description: bl*

or this search:

title: "Data Master"

and would produce a language specific construct like a java.util.Map or python dictionary. These maps would be packed into lists.

query system

Queries would be handled in a functional style with chained manipulations of data. For example:

query("occupation: student").sort().limit(5)

would query for records by occupation, sort the results, and limit the size of the results to five objects. This “chaining” would allow for customized extensions.

saving data

“Tuples” might be directories with metadata files. The directory name would be a unique identifier. So a record with four saves would look like this:

afm2acaa-asd2/
afm2acaa-asd2/metadata
afm2acaa-asd2/4
afm2acaa-asd2/2
afm2acaa-asd2/3
afm2acaa-asd2/1

The metadata file might contain the pointer to the latest save and a data version number:

latest: 4
version: 0.1

If two threads attempted to save new versions concurrently, they would produce temporary files:

afm2acaa-asd2/thread1
afm2acaa-asd2/thread2

The only synchronized code would be updating the “metadata” file. If “thread2″ finished after “thread1″, the “commit” operation would fail because it referenced an older version. The “thread1″ changes would be moved into place as:

afm2acaa-asd2/5

and the metadata file could be updated accordingly. This separate file based structure would allow for grepping and easier backups at the file system level. Unfortunately, there would be a large number of files and directories, which could dratamtically impact performance with big datasets.

caching

Caching could be optimized by additions to the query parameters. For example:


cache("five.students").query("occupation: student").sort().limit(5)

would cache the results of the query under the name “five.students”.

relationships

In theory, the results could be combined to produce “relationships”:

cache("five.students").combine(
"class", query("occupation: student").limit(5),
"name", query("type: class")
)

would inject the “class” map into the “student” where the “class” name matches the “student” class attribute. There would be no foreign keys, so data consistency would be up to the application.

replication

Replication is another thorny problem. I think a “shard manager” might be appropriate solution. It would need to run queries against all the database instances and combine the results. Or maybe be configured with some kind of “lookup” algorithm. This might also be a good location to handle caching. Of course, this approach would have to be transparent to the client and easily configured. And managers would need to be able to accept new stores on the fly. And we would want to be able to run multiple managers to avoid a single point of failure.

security and “stores”

Keeping identifying information for “tuples” could be handled with subdirectory “stores”. So our earlier example would look like this:

b2asco21-rjx2/
b2asco21-rjx2/metadata
b2asco21-rjx2/afm2acaa-asd2/metadata
b2asco21-rjx2/afm2acaa-asd2/4
b2asco21-rjx2/afm2acaa-asd2/2
b2asco21-rjx2/afm2acaa-asd2/3
b2asco21-rjx2/afm2acaa-asd2/1

The store metadata could contain a user, password combination:

user: test@test.com
password: 1234

or even an private key:

certificate: ...

A client would have to provide credentials before accessing the store data or caches.

what do you think?

Is this a boneheaded idea? Is there something already out there? Maybe this approach lacks features that would complete the “picture”? If this is actually interesting to you, please let me know. Maybe we can form a new project around it.

mysql manual replication
January 5, 2010 | by rob @ 6:49 am | comments (0) | filed under [databases]

Welcome to my blog! Obviously this is a tool to promote my consulting services, but hopefully readers will be able to find some use from the tools and tips I will continue to post. For a first effort, here is a python script to manually push changes from one MySQL database to another via the binary logs.

You might ask, “if MySQL already has master-slave replication, why do I need this?” Well, if you’d like to run replication as a manual process or if you have multiple databases in the same installation, this will allow you to replicate on your terms.

Warning: this is very much alpha quality software, use at your own risk:

http://github.com/greyrl/generaltools/blob/master/mysqlreplicate.py

This will only run on *NIX type operating systems. Basically it utilizes the “mysqlbinlog” command to extract recent statements from a local binary log and sends the results to another server. You will need the “python-mysqldb” package installed if you are using Ubuntu or will have to run the command “easy_install MySQL-python” for other operating systems. If you find problems or bugs, please leave comments and I will correct them in the original repository.

Also, please read the configuration information at the top. I know this should probably access the MySQL libraries directly with C but this is a first effort and it seems to work. Hope it’s useful for you.