
4. Data tables

4.1 Path-Based Clustering.

CSA makes frequent use of flat-file ASCII tables, as they are the most straightforward way of organizing and retrieving information. For this reason, although CSA does not depend on any particular Database Management System (DBMS), I tend to use it mostly with NoSQL, a simple relational database system that I have developed over the years and that works with ASCII tables. Of course you may prefer one of the many full SQL databases that are available; they too can be used with CSA, provided they can also be queried through a shell-level command.

Record-oriented flat files lend themselves well to manipulation with standard UNIX utilities such as grep, sed and countless others. Unfortunately, linearly scanning a large dataset can hurt system performance. To mitigate this problem, a flat-file table can be made much more manageable by turning it into a tree of files, that is, by distributing the record key space over separate files, or Key-Clusters, and letting the file system do the work instead of the CPU.

With CSA, the record key field is always the first (leftmost) field of a TAB-separated table. The path to the single cluster containing a given key is then a hash function of that key's value. For instance, in a two-level clustered structure the relative path to the file containing the keys goofy, goose and goblin could be ./g/o.data. The ".data" suffix on the file name is customary and serves no special purpose. I call this way of splitting larger datasets into subfiles Path-Based Clustering (PBC).
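The idea can be illustrated with a minimal Python sketch. It is not CSA's actual code: the function names (cluster_path, split_table, lookup) are hypothetical, and the "hash" is assumed to be nothing more than the leading characters of the key, one per directory level, which is consistent with the ./g/o.data example above. Padding short keys with "_" is also an assumption, and keys containing characters that are unsafe in file names are not handled.

```python
import os
import tempfile


def cluster_path(key, levels=2, suffix=".data"):
    """Map a key to the relative path of its cluster file.

    Assumption: the hash is just the first `levels` characters of the key.
    Keys shorter than `levels` are padded with '_' (also an assumption).
    """
    prefix = key.ljust(levels, "_")[:levels]
    dirs = list(prefix[:-1])          # all but the last character become directories
    leaf = prefix[-1] + suffix        # the last character names the cluster file
    return "/".join(["."] + dirs + [leaf])


def split_table(lines, root):
    """Distribute TAB-separated records into cluster files under `root`."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                  # skip blank lines
        key = line.split("\t", 1)[0]  # the key is always the leftmost field
        path = os.path.normpath(os.path.join(root, cluster_path(key)))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "a") as cluster:
            cluster.write(line + "\n")


def lookup(key, root):
    """Return the fields of the record for `key`, or None if it is absent.

    Only the one cluster file that can hold the key is scanned.
    """
    path = os.path.normpath(os.path.join(root, cluster_path(key)))
    try:
        with open(path) as cluster:
            for line in cluster:
                fields = line.rstrip("\n").split("\t")
                if fields[0] == key:
                    return fields
    except FileNotFoundError:
        pass                          # no cluster file means no such key
    return None


if __name__ == "__main__":
    table = ["goofy\tdog", "goose\tbird", "goblin\tmonster", "mickey\tmouse"]
    with tempfile.TemporaryDirectory() as root:
        split_table(table, root)
        print(cluster_path("goofy"))  # ./g/o.data
        print(lookup("goose", root))
```

In this sketch a lookup still scans linearly, but only within one small cluster file; locating that file is left to the file system, which is the point of the technique.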

