NoSQL: a non-SQL RDBMS
Let me begin this section by saying that big tables should be avoided in the first place. By cleverly organizing your data you may be able to keep your tables within a manageable size even if the overall database is large. Please have a look at section 2.9 of the paper The UNIX Shell As a Fourth Generation Language. Remember that NoSQL works with UNIX, not in addition to it. The underlying UNIX file system, which more conventional databases tend to disregard, can provide, when used creatively, an extremely powerful way to pre-organize your tables (relations) and keep them small (where small means roughly within a few hundred kilobytes each).
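As one way of picturing this, here is a minimal sketch that uses only standard tools to split a single table into small per-initial tables kept in a directory. The file and directory names (`people`, `by_initial`) and the choice of the surname's first letter as the partitioning key are assumptions made for illustration, and the sample header omits the special header prefix that real NoSQL tables carry.

```shell
#!/bin/sh
# Sketch: let the file system pre-organize a table into small pieces.
# All file and directory names are hypothetical.
set -eu

workdir=$(mktemp -d)

# Sample tab-separated table: one header line, then data rows.
printf 'SURNAME\tNAME\n' > "$workdir/people"
printf 'Adams\tAnn\nSmith\tJohn\nStone\tMary\n' >> "$workdir/people"

mkdir -p "$workdir/by_initial"

# Route each row to by_initial/<first letter of SURNAME>, writing the
# header once at the top of every file that gets created.
awk -F '\t' -v outdir="$workdir/by_initial" '
    NR == 1 { header = $0; next }
    {
        out = outdir "/" toupper(substr($1, 1, 1))
        if (!(out in seen)) { print header > out; seen[out] = 1 }
        print > out
    }
' "$workdir/people"

ls "$workdir/by_initial"
```

A query for a given surname then only has to open the one small file that matches its initial, and each file is still a complete table with its own header.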
If you prefer, or need, to work with large, monolithic data-sets, then you can still do it efficiently by applying your changes to a separate file rather than to the actual table. The changes can then be merged back into the big table (and any indices can be re-built) only every now and again, with a batch job that may be run in the background, overnight or when the system activity is low. The following example will try to explain this better.
Suppose we have the large indexed table
This may seem complicated, but it isn't much, really. Say
printf '\001SURNAME\nSmith\n' | searchtable --index bigtable._x.SURNAME | updtable --no-insert bigtable.updates
As you can see, the trick is:
The 'searchtable' operator is very powerful, but being written in Perl it may take a while to load. If the following conditions are met, with respect to the previous example:
then you only need to perform one single lookup of
If the above is the case -- a common situation -- then here is a faster alternative based on the look(1) shell utility (which is about twice as fast as 'searchtable'):
(head -1 bigtable; look Smith bigtable) | updtable --no-insert bigtable.updates
For convenience, NoSQL provides the 'keysearch' operator, which is a header-aware front-end to the look(1) utility, so that the above can be rewritten as:
keysearch Smith bigtable | updtable --no-insert bigtable.updates
Newer versions of 'keysearch' support secondary indices, so expensive calls to the more full-featured 'searchtable' can almost always be avoided, and the following becomes possible:
keysearch --index bigtable._x.SURNAME Smith bigtable | updtable --no-insert bigtable.updates
For 'keysearch' to use secondary index files built on multiple columns, such as
keysearch --index bigtable._x.SURNAME Smith bigtable | updtable bigtable.updates | getrow 'SURNAME=="Smith"'
This last example shows a general NoSQL tenet: the input stream can be narrowed to a more manageable size by using record pre-selection techniques, such as indexes or other non-sequential structures, but the rest of the processing is done in a linear, i.e. sequential, fashion, in line with the underlying Operator-Stream paradigm.
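The same shape can be sketched with standard tools only. The following is not how the NoSQL operators are implemented; it is an illustration under stated assumptions: the key is the first column, an update row replaces a base row with the same key, update rows with new keys are appended, and a plain awk filter stands in for the indexed lookup. File names and sample data are hypothetical.

```shell
#!/bin/sh
# Sketch: pre-select a few rows, then do everything else sequentially.
# The overlay rule below is an assumption made for illustration.
set -eu

dir=$(mktemp -d)

# Base table and pending updates: header line, key in column 1.
printf 'ID\tSURNAME\n1\tSmith\n2\tJones\n3\tSmith\n' > "$dir/bigtable"
printf 'ID\tSURNAME\n3\tSmythe\n4\tSmith\n5\tBrown\n' > "$dir/bigtable.updates"

# Step 1: pre-selection (stand-in for an indexed lookup): keep the
# header plus the rows whose SURNAME is exactly "Smith".
awk -F '\t' 'NR == 1 || $2 == "Smith"' "$dir/bigtable" > "$dir/selected"

# Step 2: sequential overlay of the updates onto the selected rows.
# Step 3: sequential re-check, because an update may have changed the
# SURNAME of a selected row or added rows with other surnames.
result=$(awk -F '\t' '
    function emit(row,   f) {
        split(row, f, "\t")
        if (f[2] == "Smith") print row
    }
    FNR == 1  { if (NR != 1) print; next }   # one header, from "selected"
    NR == FNR { upd[$1] = $0; next }         # first file: load updates
    {
        if ($1 in upd) { emit(upd[$1]); delete upd[$1] }
        else           { emit($0) }
    }
    END { for (k in upd) emit(upd[k]) }      # rows that exist only as updates
' "$dir/bigtable.updates" "$dir/selected")

printf '%s\n' "$result"
```

Row 3 drops out because its pending update renames it, and row 4 appears although it is not yet in the base table. This is why a final sequential filter is still worthwhile after the indexed pre-selection.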
Sometimes there may be multiple update tables associated with
keysearch --index bigtable._x.SURNAME Smith bigtable > result.tmp
uniontable bigtable.updates-* | updtable --stdin result.tmp | getrow 'SURNAME=="Smith"'
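To make the role of the combined update stream concrete, here is a small sketch that joins several update files sharing the same header into one table using plain awk. It is a stand-in written for illustration, not a description of how 'uniontable' works internally; the file names and sample rows are hypothetical.

```shell
#!/bin/sh
# Sketch: concatenate several update tables, keeping a single header.
# File names and contents are hypothetical.
set -eu

dir=$(mktemp -d)

printf 'ID\tSURNAME\n3\tSmythe\n' > "$dir/bigtable.updates-a"
printf 'ID\tSURNAME\n4\tSmith\n'  > "$dir/bigtable.updates-b"

# Print every line except the header of the second and later files.
merged=$(awk 'FNR == 1 && NR != 1 { next } { print }' "$dir"/bigtable.updates-*)

printf '%s\n' "$merged"
```

The result is a single well-formed table that can be fed to the next stage of a pipeline on standard input.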
Note that indexed access methods, like those provided by 'searchtable' and 'keysearch', may be worthwhile only if the table being searched is really big, say over a few thousand kilobytes. With a 300 KB table a linear search with 'grep' is often still faster, even on an old PII-233 box. Actually, with 'grep' a lot depends on the complexity of the pattern to be matched, whether we use an extended regular expression, and so forth. Non-sequential access methods always add complexity (indexed files, sorted tables) and should not be used without prior thought.
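A rough sketch of that decision follows: it builds a throwaway table of about 300 KB, checks its size, and falls back to a plain fixed-string 'grep' when the table is below a cut-off. The 2000 KB threshold, the variable names and the generated data are all assumptions for illustration; the right cut-off on a given machine is best found by timing both approaches on real data.

```shell
#!/bin/sh
# Sketch: pick a linear search for small tables.
# threshold_kb is a hypothetical cut-off, not a measured value.
set -eu

table=$(mktemp)

# Header plus 15000 rows of 21 bytes each, roughly 300 KB in total.
{
    printf 'ID\tSURNAME\n'
    awk 'BEGIN { for (i = 1; i <= 15000; i++) printf "%06d\tSurname%06d\n", i, i }'
} > "$table"

size_kb=$(( $(wc -c < "$table") / 1024 ))
threshold_kb=2000

if [ "$size_kb" -lt "$threshold_kb" ]; then
    # Small table: a fixed-string linear scan needs no index and no sorting.
    strategy=linear
    match=$(grep -F 'Surname012345' "$table")
else
    # Large table: this is where an indexed lookup would start to pay off.
    strategy=indexed
    match=''
fi

echo "size: ${size_kb} KB, strategy: ${strategy}"
printf '%s\n' "$match"
```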