As a software developer, I’ve spent a large portion of my career dealing with data storage and retrievel. As many will attest, the relational model leaves a lot to be desired. It feels like an outdated format, certainly with the advent of the web, commodity computing, and dynamic languages. We end up using caches and search indexes to extract subsets of important data. Sometimes the structure must be denormalized. Or even worse, we have to facilitate proprietary stored procedures and triggers.
Also, the translation from row to object is cumbersome and just doesn’t seem to completely fit. Object-relational mappers do a fairly good job, but still require a lot of work and tuning. Have you ever tried to update an object containing a child collection using Hibernate? It is not a straight forward operation.
what would be better?
Alternatives have cropped up over the years to fill this void. XML and object databases in the early part of the decade. “NoSQL” solutions more recently. I’ll describe what I would like to see in a “full featured” data store in today’s world and what systems come close to meeting those needs.
find-grained caching
Firstly, fine-grained caching or the ability to manage cache content; for example, keeping thousands of user logins cached but only the first fifteen blog posts. Ehcache in conjunction with Hibernate is a good example. In relatively complex applications this is important, you don’t want to fill memory with items that just aren’t going to get accessed that often.
searching
Next is searching for data. Searching gives us the ability to add facets and complete user generated lookups. Lucene is a great example of a full fledged searching product. I don’t think a data store needs to be quite as feature rich, but it seems like a natural way to find data.
flexible schema
Thirdly, a less rigid data definition system would also make sense. In other words, I’d like to be able to add or remove a column of data without restarting the system. Most relational systems can handle this, but are not well suited for adding columns on-the-fly at the application level.
versioning / transactions
Also, I’d also like to see an “immutable” or “versioning” approach to follow the current trend of non-locking in the multicore and functional programming worlds. Basically this means a new copy of the “tuple” (or at least the changed data) with each save, with “rollbacks” occurring during contention situations. This would allow the data to be stored and extracted against a timeline, much like version control software works today. This technique would also create a simple transaction model.
scaling
Another important consideration is scaling and redundancy are important. Servers are now commodities. Virtualization and the cloud permit massive scaling. Any new data storage solution has to take this mentality into account.
security
Finally, a more robust security model would be beneficial. Most relational databases will allow security at the database, table, and even column. Configuring such security with the application is much like adding or removing columns, it just isn’t feasible. Maybe creating separate “stores” with their own set of security credentials (i.e. a store for each user’s payment data) might reduce the potential damage an attacker could wield?
what else is out there?
Many things. There are a ton of systems, so I will attempt to categorize the most recent “NoSQL” solutions and how they fit (or don’t fit) into the basic needs described above.
| types | key-value store | document database | column database |
|---|---|---|---|
| examples | Tokyo Cabinet, Redis, Project Voldemort, Amazon S3 | MongoDB, CouchDB | Cassandra |
| fine-grained caching | These systems basically are caches. Typically you can set expiration on an individual key. This really isn’t a “datastore” mentality though, as the data should still persist. | Both seem to have their own caching mechanisms internally, but very little user control over what stays in memory. CouchDB does support views, which index and presumably speed data retrieval. | No. Though it appears to be a concern. |
| querying | Tokyo Cabinet seems to have simple querying, but it is not (nor should it be) a focus. Applications and data must be designed with a key/value “tree” structure in mind. | MongoDB has a query mechanism that feels like an ORM style. CouchDB uses javascript but seems to require views and indexes for searches. | Apparently there is a lucene integration plugin. Although the query language is robust. |
| schema modification | There is no schema to modify, just place your data and get it back. | Seems like it. Although I wonder about view regeneration in products like CouchDB. | Adding columns requires a restart. |
| versioning / transactions | No versioning, but some of the solutions seem to support transactions. Tokyo Cabinet locks the database on transactions. | MongoDB and CouchDB do support basic transactions. Neither support versioning (see here). | No. |
| scaling | Many of these options support sharding in the client. But not real time increases in capacity. | Yes. | Yes. |
| security | Some options allow authentication, but not on a resource level. | CouchDB has a very fine grained security model. MongoDB seems a little more limited, thought it was hard to find any information beyond authentication. | Cassandra doesn’t seem to support authorization, yet. |
the proposal
Not being able to find a perfect fit to the requirements listed above, I propose a data system that stores and indexes maps of data. A structure like this:
name: Mr. Data
title: Data Master
description: I am a block of data, please don't lose me or my changes
Would be found in this search:
description: bl*
or this search:
title: "Data Master"
and would produce a language specific construct like a java.util.Map or python dictionary. These maps would be packed into lists.
query system
Queries would be handled in a functional style with chained manipulations of data. For example:
query("occupation: student").sort().limit(5)
would query for records by occupation, sort the results, and limit the size of the results to five objects. This “chaining” would allow for customized extensions.
saving data
“Tuples” might be directories with metadata files. The directory name would be a unique identifier. So a record with four saves would look like this:
afm2acaa-asd2/
afm2acaa-asd2/metadata
afm2acaa-asd2/4
afm2acaa-asd2/2
afm2acaa-asd2/3
afm2acaa-asd2/1
The metadata file might contain the pointer to the latest save and a data version number:
latest: 4
version: 0.1
If two threads attempted to save new versions concurrently, they would produce temporary files:
afm2acaa-asd2/thread1
afm2acaa-asd2/thread2
The only synchronized code would be updating the “metadata” file. If “thread2″ finished after “thread1″, the “commit” operation would fail because it referenced an older version. The “thread1″ changes would be moved into place as:
afm2acaa-asd2/5
and the metadata file could be updated accordingly. This separate file based structure would allow for grepping and easier backups at the file system level. Unfortunately, there would be a large number of files and directories, which could dratamtically impact performance with big datasets.
caching
Caching could be optimized by additions to the query parameters. For example:
cache("five.students").query("occupation: student").sort().limit(5)
would cache the results of the query under the name “five.students”.
relationships
In theory, the results could be combined to produce “relationships”:
cache("five.students").combine(
"class", query("occupation: student").limit(5),
"name", query("type: class")
)
would inject the “class” map into the “student” where the “class” name matches the “student” class attribute. There would be no foreign keys, so data consistency would be up to the application.
replication
Replication is another thorny problem. I think a “shard manager” might be appropriate solution. It would need to run queries against all the database instances and combine the results. Or maybe be configured with some kind of “lookup” algorithm. This might also be a good location to handle caching. Of course, this approach would have to be transparent to the client and easily configured. And managers would need to be able to accept new stores on the fly. And we would want to be able to run multiple managers to avoid a single point of failure.
security and “stores”
Keeping identifying information for “tuples” could be handled with subdirectory “stores”. So our earlier example would look like this:
b2asco21-rjx2/
b2asco21-rjx2/metadata
b2asco21-rjx2/afm2acaa-asd2/metadata
b2asco21-rjx2/afm2acaa-asd2/4
b2asco21-rjx2/afm2acaa-asd2/2
b2asco21-rjx2/afm2acaa-asd2/3
b2asco21-rjx2/afm2acaa-asd2/1
The store metadata could contain a user, password combination:
user: test@test.com
password: 1234
or even an private key:
certificate: ...
A client would have to provide credentials before accessing the store data or caches.
what do you think?
Is this a boneheaded idea? Is there something already out there? Maybe this approach lacks features that would complete the “picture”? If this is actually interesting to you, please let me know. Maybe we can form a new project around it.
