NoSQL Relational Database Management System: NoSQL Brokering Language

The need for an intermediate query language

NoSQL's most prominent feature is that its native query language is the UNIX shell. This is a major advantage over the vast majority of other database systems, but it can also be a problem. The shell is an extremely powerful interface to the underlying operating system, sometimes too powerful for the naive database user. Furthermore, not everyone may be granted direct shell access to a NoSQL system, especially on potentially insecure public networks, such as the Internet. For both ease of use and better security, most database-enabled applications that are accessible with an HTML Web browser only offer pre-built query forms, and free-form queries against the database are generally not allowed.

Sometimes, though, free-form queries may be useful, for instance when the database can be queried by a remote programmatic client, like what happens in traditional SQL/ODBC setups. Occasionally one may want to set up a Web form where a user can enter free-form queries, usually in the form of SQL statements. Of course, accepting native shell statements coming from either a remote program or a Web user would pose serious security issues, as that would be equivalent to granting the remote client entity direct shell access to the server system.

In other words, I have started to feel the need for an intermediate query language, something that can be used to de-couple a NoSQL query from the underlying shell interpreter. A language that is much less powerful than the shell, but simpler to learn and still powerful enough to allow arbitrary queries against a NoSQL database. And a language that is still close enough to the original Operator-Stream Paradigm^?. I have dubbed such language NoSQL Brokering Language (NBL), and although it is still in its early stage of development and may undergo extensive modifications in the future, it can already be used for simple real-world applications. In its present form, NBL can only be used for querying the data and cannot be used for editing or changing it in any way.

NBL requires that a Database Schema table is provided by the database administrator, containing certain pieces of meta-data describing the database that is being queried. Not all tables need to be listed in the schema, but only those ones that are to be made available for querying through NBL. NoSQL schemas are currently used only with NBL, but they may be extended in the future to contain additional meta-data to be used also with other NoSQL functions. To date NBL only supports table primary indexes, while hashed secondary-indexes are not yet supported (see Big tables).

To show how NBL works I have prepared a sample database, with an associated Database Schema and an example NBL query. The example database has been freely adapted from the book " UNIX Relational Database Management", and is attached here for you to download and play with. The schema file contained in the tarball is the aforementioned Database Schema table, and here's its content as displayed by the justify operator:

  Table           Path              Edit
  --------------  ----------------  ----
  course          ./course
  course_student  ./course.student
  student         ./student

Before explaining the details, to give you a taste of how NBL looks like, below is a batch of NBL statements that can be used to query the aforementioned sample database, based on the above schema. In the example tarball these statements are contained in file query.nbl:

  use             course course_student student
  read            course
  expression      $Course == "chem-1"
  join            - course_student
  columns         Student Course
  order-by        -
  join            - student
  columns         Course First Last Phone
  format

Such NBL batch, when run through the NoSQL nblparser operator, produces the following shell program:

  #!/bin/sh
  set -e
  lockfile -r6 -l40 -s8 course.lock course_student.lock student.lock
  trap 'rm -f course.lock course_student.lock student.lock' 0
  trap 'exit 2' 1 2 3 15
  cat ./course |
  awktable -H -- "\$Course == \"chem-1\"{print}" |
  jointable - ./course.student |
  column Student Course |
  sorttable |
  jointable - ./student |
  column Course Last First Phone |
  justify

and when such script is piped into sh(1) the relevant NoSQL query is executed. In other words, by typing nblparser schema < query.nbl | sh at the command-line and pressing ENTER, we get the following result on the screen:

  Course  Last        First     Phone
  ------  ----------  ------  -------
  chem-1  Jefferson   Thomas     9876
  chem-1  Washington  George    87654
  chem-1  Reagan      Ronald  7654210

NBL syntax

Let's see in greater detail the NBL statements used in the query.nbl file of the above example.

use table [ table ...]: Declare that we are going to perform a query over the specified tables, and the latter should always be listed in alphabetic order. Declaring beforehand the tables that are going to be queried is optional, but it should always be done when there are chances that any of those tables can be changed by other programs while our query is being run. A multi-table query is always non-atomical, and the use statement will set a lock over the involved tables to prevent others from changing them while they are being queried. The use clause, if used, must appare only once and it must occur before the first read clause (see below). Note that use should be used also when we are querying a single table, as it may be that such table has an edit buffer associated with it (see Big tables), and thus the query will involve multiple files, that is the main table and the edit buffer, and it will therefore be non-atomical by definition.
read tablename [[@][ keyvalue]]: The named table, where tablename is a Relation value in the Database Schema, becomes the current input table (i.e. the active stream). Every new read statement closes the previous stream and activates a new one based on the specified table. Most other NBL statements require that a stream has been activated by a previous read statement, and will complain otherwise. If keyvalue is specified, then only those records which leftmost field matches keyvalue are taken into consideration, otherwise the whole table will be used. If keyvalue begins with a @ it is considered to be a partial match on the beginning of the table key field, otherwise it defaults to an exact match. It is recommended that keyvalue be specified whenever possible, to greatly improve performance, especially with large input tables. The @ character is called NBL modifier and is used also by other NBL statements, as explained below. The precise meaning of the NBL modifier is context-dependent.
expr[ession] expression: Runs the awktable NoSQL operator on the active stream. Unless nblparser is called with the --unsafe option, expression is scanned for insecure AWK instructions before it is passed to awktable.
[@]join [ table1] table2 [ column[ ,column,...]]: Joins table1 and table2 on a common field. Either table1 or table2 (but not both) can be specified as -, meaning the active stream. That is, join - othertable uses the active stream as the primary join table, while join othertable - uses it as the secondary join table. The othertable argument must correspond to a valid Database Schema entry. If column is not specified (or if it is specified as -), join will try to infer the key column name(s) from the Database Schema Path value; if that fails, the join column will default to the leftmost field in both tables. If column is specified then the join is done on that field(s), which must be present in both tables although possibly in different column positions. Both the active stream and othertable are always expected to be already sorted on the join column(s). Remember that if either table is not correctly sorted on the join column(s) the result of the join operation will be unpredictable. The optional @ modifier at the beginning of the join command itself tells NBL to perform an outer join rather than the default inner join. If only one table is specified, then it must be different from - and it will be used as the secondary table in the join. If the second table is not specified, not even as -, then the join column(s) may not be specified either.
[@]columns column [ column ...]: Picks the specified columns from the active stream and drops the remaining ones. If one or more columns are not present in the active stream then they will be added in the respective positions. The optional @ modifier makes the operator revert its behaviour, i.e. it will pick from the active stream all columns except those listed. See the NoSQL getcolumn operator for more.
[@]order-by column[: flags] [ column[: flags] ...]: Sorts the active stream on the specified column(s), with optional flags fields. If column is not specified, or if it is specified as -, then the active stream is sorted on whole records. The optional @ modifier tells NBL to sort in reverse order. See the NoSQL sorttable operator for more.
[@]format: Formats (i.e. justifies) the active stream so that it becomes more comfortable to look at on the screen. The resulting output is no longer a valid NoSQL table, so no other NBL statements normally follow this one. The optional @ modifier formats the result in list format as opposed to the default table format. See the NoSQL operators prtable and tabletolist for more. Note that NBL programs other than those that are run at the command-line will not normally rely on format to produce a formatted output, but they will rather apply their own filtering to convert the resulting table to HTML or to other formats, depending on their specific needs. The format statement causes the active stream to be closed, so if the NBL program has to continue a new read will be necessary to set a new active stream.

Note that with all NBL commands that take a table name as an argument, the latter can be specified as $table. The leading $ sign means that the associated name is a variable name. That is, table will not be sought for in the Database Schema but will be expected to be a file-path that was set previously by the NBL program itself (see the remember-as NBL command below for more).

NBL provides also other commands beside the ones contained in the sample query.nbl file that has been explained above. Such other commands are listed below.

setvar name [value]: Inserts a name=value statement in the resulting shell script. Both name and value are very restrictive regarding the characters that they may contain. The name part must abide by the shell variable naming rules, while the value part must be a simple string in the set -_.,=:+A-Za-z0-9_, with some additional restrictions to avoid directory traversal exposures. Any forbidden characters will cause nblparser to exit with an error. The main purpose of the setvar statement is to resolove incomplete Path entries in the Database Schema. That is a schema table may contain entries such as:

  Table           Path            Edit
  --------------  --------------  ----
  tablename       ./$[year]/file

That is, the actual path to a table in the file-system may depend on a variable token, year in this case. Through the setvar command, NBL provides a way for the NBL user to supply such variable token, for instance with setvar year 2003. If value is not provided then name will be unset.

system commands: Passes commands to the underlying shell unchanged, with the active stream on STDIN. For obviouos security reasons system is disabled by default, unless nblparser is called with the --unsafe option.
end: Stops the current NBL execution, ignoring any other commands that may follow this one. Useful mainly for debugging purposes.
[@]totals column [ column ...]: Runs the total NoSQL operator on the specified columns of the active stream. The optional @ modifier at the beginning of the totals command tells NBL to assume that the specified columns contain currency values.
labels newlabel [ newlabel ...]: Replace the output column labels (i.e. names) to the specified new labels. If not enough labels are provided then the excess input column names will be left unchanged, while if too many labels are specified then the extra ones will be ignored. Note that the labels directive will be applied only at the end of the query, regardless of where it appares in the sequence of NBL statements. Furthermore, it is applied only once, i.e. if multiple labels statements are present, only the last one will be effective. This is done to avoid the risk of accidentally exposing hidden columns on output.
ldap suffix expression: Turns an LDAP search expression into NBL. This is mostly meant to use NoSQL as the shell backend of slapd(8), the OpenLDAP database server. Please consider that nblparser only understands a very limited subset of the LDAP query syntax, although such subset has proven to be enough for many practical situations.
remember-as name: Assigns the symbolic name name to the active stream and closes it. A new read against a (possibly) different table will then be necessary if the NBL program is to continue. From this point on, it will be possible for subsequent NBL commands to refer to name as $name, as explained further up in this page. For example:

  # set "course" as the  active stream.
  read            course
  expression      $Course == "chem-1"
  remember-as     goofy

  # set $goofy as the active stream.
  read            $goofy
  format

Of course this particular example may not look very useful, and in fact it isn't, but there are cases where the possibility of saving the query that has been done up to a certain point, and re-using it at a later stage in the program, may indeed come handy.

[@]unique-by column[: flags] [ column[: flags] ...]: Same as order-by but the output is made unique on the specified column(s). See order-by for more.

Trackbacks (3) | New trackback | Comments (0) | Print