Table structure

A table (or relation) is an ordinary ASCII file, with some additional rules that make it possible to use it as a database table. The file has records (rows) and fields (columns). The relation, or table structure, is achieved by separating the columns with ASCII TAB characters and terminating the rows with ASCII NEWLINE characters. That is, each row of data in a file contains the data values (each in a data field) separated by TAB characters and terminated with a NEWLINE character. Therefore a fundamental rule is that data values must NOT contain TAB characters.

The first section of the file, called the header, contains the file structure information used by the operators. The rest of the file, called the body, contains the actual data values. A file of data, so structured, is said to be a "table".

The header consists of exactly one line that contains the structure information: the column name row. Each column name begins with an octal "\001" character, i.e. ASCII start of header ("SOH") character, and is separated from the next column name by an ASCII TAB. The SOH character must uniquely mark the beginning of column names, and especially the beginning of the table header, and therefore it may not appear anywhere else as the first character in a row. The fields in the column name row contain the names of each column, and are separated from each other by a single TAB character. In the rest of this document, the SOH characters usually will be omitted from the explanations, but it is implied that they must always be present at the beginning of each column name.

The SOH character was chosen for a reason: it is the character that sorts at the top in lexicographical order, so that the table header does not usually need to be treated especially when sorting or performing most other table manipulations. It has been added to the beginning of each column name, not just the first one, to make it survive "table slicing" and column reordering. See the Table header page for a dissertation on this problematic topic.

Trailing blank (i.e. 'space') characters in data columns are NOT considered significant, and NoSQL operators may freely decide to drop them at will. Note that while it is OK to have leading blanks in the data rows of a table, not all NoSQL programs will accept them in the table header.

The column names are case sensitive, i.e. "COUNT" is different from "Count." The guideline for characters that may be used in column names is that alphabetic, numeric, and the underscore (_) are good choices. Numeric-only column names are not allowed.

The TAB character must never be used in column names, nor should spaces or UNIX I/O redirection characters (<,>,|) be used. To be on the safe side, column names should always start with a letter and contain only upper and lower case letters, numbers, and the underscore (_). The following names are reserved to the AWK programming language, and should not be used to indicate column names:

BEGIN, END, break, continue, else, exit, exp, for, getline, gsub, if, in, index, int, length, log, next, print, printf, split, sprintf, sqrt, sub, substr, while, and possibly others. Refer to your AWK man page. Furthermore, the "_" character should never be used at the beginning of column names, as that is sometimes used by NoSQL for internal purposes. Finally, appending the "_" character at the end of column names, although perfectly acceptable, should be reserved to special needs of the application program, like inserting dummy columns in the STDOUT of a query, and to other circumstances where the application needs to use column names that are not supposed to appear in a table.

For instance, if you have a table that maps names to nicknames, then the table's two columns could be called Name and Nickname. Some NoSQL operators create new columns that have the same name as pre-existing table columns with lower-case letters prepended to them. This is why you really should stick to these rules.

Not abiding by these naming rules may still work, but there may be unexpected results.

A sample table (named sample.table) that will be used in later examples is shown in Table 1. The picture in Table 1 is for illustrative purposes; what the file would actually look like is shown in Table 2, where a TAB character is represented by "<T>" a NEWLINE is represented by "<N>" and a SOH by "<S>."

Table 1: sample.table
     NAME    COUNT   TYP     AMT
     Bush       44     A     133
     Hansen     44     A     23
     Jones      77     X     77
     Perry      77     B     244
     Hart       77     D     1111
     Holmes     65     D     1111
Table 2: sample.table (actual content)
     <S>NAME<T><S>COUNT<T><S>TYP<T><S>AMT<N>
     Bush<T>44<T>A<T>133<N>
     Hansen<T>44<T>A<T>23<N>
     Jones<T>77<T>X<T>77<N>
     Perry<T>77<T>B<T>244<N>
     Hart<T>77<T>D<T>1111<N>
     Holmes<T>65<T>D<T>1111<N>

It is important to note that only actual data is stored in the data fields, with no leading or trailing space characters. This fact can (and usually does) have a major effect on the size of the resulting data files (tables) compared to data stored in "fixed field width" systems. The data files in NoSQL are almost always smaller, sometimes dramatically smaller.

A table can also be represented in a different format, called "list format". The list format of the above sample.table is:

       NAME  Bush
       COUNT 44
       TYP   A
       AMT   133
 
       NAME  Hansen
       COUNT 44
       TYP   A
       AMT   23
 
       NAME  Jones
       COUNT 77
       TYP   X
       AMT   77
 
       NAME  Perry
       COUNT 77
       TYP   B
       AMT   244
 
       NAME  Hart
       COUNT 77
       TYP   D
       AMT   1111
 
       NAME  Holmes
       COUNT 65
       TYP   D
       AMT   1111
The actual contents of a table in 'list' format, showing newlines and TABs is:

       <N>
       NAME<T>Bush<N>
       COUNT<T>44<N>
       TYP<T>A<N>
       AMT<T>133<N>
       <N>
       NAME<T>Hansen<N>
       COUNT<T>44<N>
       TYP<T>A<N>
       AMT<T>23<N>
       <N>
       NAME<T>Jones<N>
       COUNT<T>77<N>
       TYP<T>X<N>
       AMT<T>77<N>
       <N>
       NAME<T>Perry<N>
       COUNT<T>77<N>
       TYP<T>B<N>
       AMT<T>244<N>
       <N>
       NAME<T>Hart<N>
       COUNT<T>77<N>
       TYP<T>D<N>
       AMT<T>1111<N>
       <N>
       NAME<T>Holmes<N>
       COUNT<T>65<N>
       TYP<T>D<N>
       AMT<T>1111<N>
       <N>

Long lines, i.e. lines that are too long to fit in the width of the screen, may be folded over multiple rows in the 'list' format, provided that each continuation row starts with one TAB character. Field (column) names need to be separated by the associated data by exactly one TAB characters. The data part may contain physical TABs and newlines, which will be turned into '\t' and '\n' escapes respectively by the 'listtotable' operator when the list is turned into a table.

       COMMENTS  This is a very looong comment, that I want to fold over
         multiple lines.

and the actual content is :

       <N>
       COMMENTS<T>This is a very looong comment, that I want to fold over<N>
       <T>multiple lines.<N>
       <N>

As we will see, there are NoSQL operators that convert back and forth between "table" and "list" formats. The list leading empty row is called the list header, while the trailing empty row is called the list footer.

Yet another data input format supported by NoSQL is the so-called "multiplexed format" (or "muxed" format, for short). The muxed format is somewhat similar to the "list" format, but with less constraints on its structure. One possible muxed format of the above sample.table is:

       NAME  Bush
       TYP   A

       NAME  Hansen
       COUNT 44

       NAME  Jones
       COUNT 77
       TYP   X
       AMT   77

       NAME  Bush
       COUNT 44
       AMT   133

       NAME  Hansen
       TYP   A
       AMT   23

       NAME  Perry
       COUNT 77

       NAME  Hart
       COUNT 77

       NAME  Perry
       TYP   B
       AMT   244

       NAME  Hart
       TYP   D
       AMT   1111

       NAME  Holmes
       COUNT 65

       NAME  Holmes
       TYP   D
       AMT   1111

In the example the rows are interspersed with empty lines for readability but, unlike the more prescriptive 'list' format, a muxed sequence needs not be spaced, and the above is still perfectly valid if written like this:

       NAME  Bush
       TYP   A
       NAME  Hansen
       COUNT 44
       NAME  Jones
       COUNT 77
       TYP   X
       AMT   77
       ... etc ...

For such list to be turned into a valid NoSQL table, the 'muxtotable' operator can be used:

muxtotable --key-columns NAME < sample.mux

The resulting stream on STDOUT will be like this:

       NAME    TYP  COUNT   AMT
       Bush    A               
       Hansen          44      
       Jones   X       77    77
       Bush            44   133
       Hansen  A             23
       Perry           77      
       Hart            77      
       Perry   B            244
       Hart    D           1111
       Holmes          65      
       Holmes  D           1111

The --key-columns argument of 'muxtotable' is optional, and if it is not specified then the name of the first name/value line in 'sample.mux' will be considered by 'muxtotable' as the output table's primary key field (that is 'NAME' in this case). See muxtotable --help for more information regarding how the operator deals with table keys.

The muxed data format can come especially handy when an application program needs to update data in a NoSQL table. Instead of having to deal with tabular structures the application can simply emit the changes on STDOUT in the more natural muxed format and feed them directly into updtable --stdin. For this to work, the application program only needs to know the name of the table's primary key column and tell muxtotable about it through the --key-columns option of the latter operator. If the target table primary key spans over multiple columns then the --key-columns option will have to specify them all in a comma-separated list. Whenever any of the specified key columns occur in the muxed stream, muxtotable will make sure that it is followed also by updated values for each and every other key column before any further non-key name/value pairs can be accepted. If an update for a non-key column is found in the muxed stream before at least one value for each table key column is found, then muxtotable will throw an exception. Again, if muxtotable isn't explicitly told about the tables key column(s) with the --key-fields switch, the operator will assume that the very first entry occurring in the muxed stream is the table primary key.

The muxtotable operator can readily be used to create the bigtable.updates file referred to in section Big Tables.


Trackbacks (2) | New trackback | Comments (0) | Print