CSA makes frequent use of flat-file ASCII tables, as this is the most straightforward way of organizing and retrieving information. Because of this, and although CSA does not depend on any particular Database Management System (DBMS), I tend to use it mostly with NoSQL, a simple Relational Database System that I have developed over the years and that works with ASCII tables. Of course you may prefer one of the many real SQL databases that are available, and they too can be used with CSA, provided they can also be queried through a shell-level command.
Record-oriented flat files lend themselves well to manipulation with
standard UNIX utilities, like grep, sed and countless
others. Unfortunately, linear scanning of large datasets may negatively
impact system performance. To mitigate this problem, a flat-file table
can be made much more manageable by turning it into a binary tree of files,
that is, by distributing the record key space
across separate files, or Key-Clusters, and letting the file system
do the work instead of the CPU. With CSA, a record key field is always
the first (leftmost) field in a TAB-separated table. The path to the
single cluster containing a given key will then be a hash
function of the relevant key value. For instance, in a two-level
clustered structure the relative path to the file containing the
keys goofy, goose and goblin could be ./g/o.data. The ".data" suffix
on the file name is just customary; apart from that it does not serve
any special purpose.
I have called this way of splitting larger datasets into subfiles
Path-Based Clustering (PBC).
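
The idea can be illustrated with a minimal sketch in Python. It is not CSA's actual code: the text above does not say which hash function is used, so the sketch assumes the simplest one consistent with the ./g/o.data example, namely one path component per leading character of the key. The names cluster_path, split_table and lookup are hypothetical, as is the convention of replacing missing or non-alphanumeric characters with "_". The input table is assumed to be TAB-separated, with the key in the first field and no header row.

```python
import os


def cluster_path(key, levels=2, suffix=".data"):
    """Map a record key to the relative path of its cluster file.

    Assumption: the "hash" is just the first `levels` characters of the
    key, one per path component. Missing or non-alphanumeric characters
    become "_" (a hypothetical convention, not taken from CSA).
    """
    if not key:
        raise ValueError("empty key")
    prefix = key[:levels].ljust(levels, "_")
    parts = [c if c.isalnum() else "_" for c in prefix]
    return os.path.join(*parts) + suffix


def split_table(table_path, root, levels=2):
    """Distribute the rows of a TAB-separated table into cluster files."""
    with open(table_path, encoding="utf-8") as table:
        for line in table:
            # The key is always the first (leftmost) field.
            key = line.split("\t", 1)[0].rstrip("\n")
            if not key:
                continue  # skip blank lines
            target = os.path.join(root, cluster_path(key, levels))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "a", encoding="utf-8") as cluster:
                cluster.write(line if line.endswith("\n") else line + "\n")


def lookup(root, key, levels=2):
    """Return the fields of the record with the given key, or None.

    Only the single cluster file that can contain the key is scanned.
    """
    target = os.path.join(root, cluster_path(key, levels))
    try:
        with open(target, encoding="utf-8") as cluster:
            for line in cluster:
                fields = line.rstrip("\n").split("\t")
                if fields[0] == key:
                    return fields
    except FileNotFoundError:
        pass  # no cluster exists for this key prefix
    return None
```

Under these assumptions, cluster_path("goofy"), cluster_path("goose") and cluster_path("goblin") all yield g/o.data on a UNIX system, so a lookup for any of them reads one small file instead of scanning the whole table. Raising the number of levels makes each cluster smaller at the cost of more directories.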