The need for an intermediate query language
NoSQL's most prominent feature is that its native query language is the UNIX shell. This is a major advantage over the vast majority of other database systems, but it can also be a problem. The shell is an extremely powerful interface to the underlying operating system, sometimes too powerful for the naive database user. Furthermore, not everyone may be granted direct shell access to a NoSQL system, especially on potentially insecure public networks, such as the Internet. For both ease of use and better security, most database-enabled applications that are accessible with an HTML Web browser only offer pre-built query forms, and free-form queries agains the database are generally not allowed.
Sometimes, though, free-form queries may be useful, for instance when the database can be queried by a remote programmatic client, like what happens in traditional SQL/ODBC setups. Occasionally one may want to set up a Web form where a user can enter free-form queries, usually in the form of SQL statements. Of course, accepting native shell statements coming from either a remote program or a Web user would pose serious security issues, as that would be equivalent to granting the remote client entity direct shell access to the server system.
In other words, I have started to feel the need for an
intermediate query language, something that can be used to
de-couple a NoSQL query from the underlying shell interpreter. A language that is much less powerful than the shell, but simpler to learn and still powerful enough to allow arbitrary queries against a NoSQL database. And a language that is still close enough to the original
Operator-Stream Paradigm?. I have dubbed such language
NoSQL Brokering Language (NBL), and although it is still in its early stage of development and may undergo extensive modifications in the future, it can already be used for simple real-world applications. In its present form, NBL can only be used for querying the data and cannot be used for editing or changing it in any way.
NBL requires that a
Database Schema table is provided by the database administrator, containing certain pieces of
meta-data describing the database that is being queried. Not all tables need to be listed in the schema, but only those ones that are to be made available for querying through NBL. NoSQL schemas are currently used only with NBL, but they may be extended in the future to contain additional meta-data to be used also with other NoSQL functions. To date NBL only supports table primary indexes, while
hashed secondary-indexes are not yet supported (see
Big tables).
To show how NBL works I have prepared a sample database, with an associated
Database Schema and an example NBL query. The example database has been freely adapted from the book "
UNIX Relational Database Management", and is
attached here for you to download and play with. The
schema file contained in the tarball is the aforementioned
Database Schema table, and here's its content as displayed by the
justify operator:
Table Path Edit
-------------- ---------------- ----
course ./course
course_student ./course.student
student ./student
Before explaining the details, to give you a taste of how NBL looks like, below is a batch of NBL statements that can be used to query the aforementioned sample database, based on the above schema. In the example tarball these statements are contained in file
query.nbl:
use course course_student student
read course
expression $Course == "chem-1"
join - course_student
columns Student Course
order-by -
join - student
columns Course First Last Phone
format
Such NBL batch, when run through the NoSQL
nblparser operator, produces the following shell program:
#!/bin/sh
set -e
lockfile -r6 -l40 -s8 course.lock course_student.lock student.lock
trap 'rm -f course.lock course_student.lock student.lock' 0
trap 'exit 2' 1 2 3 15
cat ./course |
awktable -H -- "\$Course == \"chem-1\"{print}" |
jointable - ./course.student |
column Student Course |
sorttable |
jointable - ./student |
column Course Last First Phone |
justify
and when such script is piped into
sh(1) the relevant NoSQL query is executed. In other words, by typing
nblparser schema < query.nbl | sh at the command-line and pressing ENTER, we get the following result on the screen:
Course Last First Phone
------ ---------- ------ -------
chem-1 Jefferson Thomas 9876
chem-1 Washington George 87654
chem-1 Reagan Ronald 7654210
NBL syntax
Let's see in greater detail the NBL statements used in the
query.nbl file of the above example.
- use
table [
table ...]
- Declare that we are going to perform a query over the specified tables, and the latter should always be listed in alphabetic order. Declaring beforehand the tables that are going to be queried is optional, but it should always be done when there are chances that any of those tables can be changed by other programs while our query is being run. A multi-table query is always
non-atomical, and the
use statement will set a lock over the involved tables to prevent others from changing them while they are being queried. The
use clause, if used, must appare only once and it must occur before the first
read clause (see below). Note that
use should be used also when we are querying a single table, as it may be that such table has an edit buffer associated with it (see
Big tables), and thus the query will involve multiple files, that is the main table and the edit buffer, and it will therefore be non-atomical by definition.
- read
tablename [[@][
keyvalue]]
- The named table, where
tablename is a
Relation value in the
Database Schema, becomes the current input table (i.e. the
active stream). Every new
read statement closes the previous stream and activates a new one based on the specified table. Most other NBL statements require that a stream has been activated by a previous
read statement, and will complain otherwise. If
keyvalue is specified, then only those records which leftmost field matches
keyvalue are taken into consideration, otherwise the whole table will be used. If
keyvalue begins with a
@ it is considered to be a partial match on the beginning of the table key field, otherwise it defaults to an exact match. It is recommended that
keyvalue be specified whenever possible, to greatly improve performance, especially with large input tables. The
@ character is called
NBL modifier and is used also by other NBL statements, as explained below. The precise meaning of the NBL modifier is context-dependent.
- expr[ession]
expression
- Runs the
awktable NoSQL operator on the active stream. Unless
nblparser is called with the
--unsafe option,
expression is scanned for insecure AWK instructions before it is passed to
awktable.
- [@]join [
table1]
table2 [
column[
,column,...]]
- Joins
table1 and
table2 on a common field. Either
table1 or
table2 (but not both) can be specified as
-, meaning the active stream. That is,
join - othertable uses the active stream as the primary join table, while
join othertable - uses it as the secondary join table. The
othertable argument must correspond to a valid
Database Schema entry. If
column is not specified (or if it is specified as
-),
join will try to infer the key column name(s) from the
Database Schema
Path value; if that fails, the join column will default to the leftmost field in both tables. If
column is specified then the join is done on that field(s), which must be present in both tables although possibly in different column positions. Both the active stream and
othertable are always expected to be already sorted on the join column(s). Remember that if either table is not correctly sorted on the join column(s) the result of the join operation will be unpredictable. The optional
@ modifier at the beginning of the
join command itself tells NBL to perform an
outer join rather than the default
inner join. If only one table is specified, then it must be different from
- and it will be used as the secondary table in the join. If the second table is not specified, not even as
-, then the join column(s) may not be specified either.
- [@]columns
column [
column ...]
- Picks the specified columns from the active stream and drops the remaining ones. If one or more columns are not present in the active stream then they will be added in the respective positions. The optional
@ modifier makes the operator revert its behaviour, i.e. it will pick from the active stream all columns
except those listed. See the NoSQL
getcolumn operator for more.
- [@]order-by
column[
-
flags] [
column[:
flags] ...] : Sorts the active stream on the specified column(s), with optional
flags fields. If
column is not specified, or if it is specified as
-, then the active stream is sorted on whole records. The optional
@ modifier tells NBL to sort in reverse order. See the NoSQL
sorttable operator for more.
- [@]format
- Formats (i.e.
justifies) the active stream so that it becomes more comfortable to look at on the screen. The resulting output is no longer a valid NoSQL table, so no other NBL statements normally follow this one. The optional
@ modifier formats the result in
list format as opposed to the default
table format. See the NoSQL operators
prtable and
tabletolist for more. Note that NBL programs other than those that are run at the command-line will not normally rely on
format to produce a formatted output, but they will rather apply their own filtering to convert the resulting table to HTML or to other formats, depending on their specific needs. The
format statement causes the active stream to be closed, so if the NBL program has to continue a new
read will be necessary to set a new active stream.
Note that with all NBL commands that take a table name as an argument, the latter can be specified as
$table. The leading
$ sign means that the associated name is a
variable name. That is,
table will not be sought for in the
Database Schema but will be expected to be a file-path that was set previously by the NBL program itself (see the
remember-as NBL command below for more).
NBL provides also other commands beside the ones contained in the sample
query.nbl file that has been explained above. Such other commands are listed below.
- setvar
name [
value]
- Inserts a
name=value statement in the resulting shell script. Both
name and
value are very restrictive regarding the characters that they may contain. The
name part must abide by the shell variable naming rules, while the
value part must be a simple string in the set
-_.,=:+A-Za-z0-9_, with some additional restrictions to avoid directory traversal exposures. Any forbidden characters will cause
nblparser to exit with an error. The main purpose of the
setvar statement is to resolove
incomplete
Path entries in the
Database Schema. That is a schema table may contain entries such as:
Table Path Edit
-------------- -------------- ----
tablename ./$[year]/file
That is, the actual path to a table in the file-system may depend on a variable token,
year in this case. Through the
setvar command, NBL provides a way for the NBL user to supply such variable token, for instance with
setvar year 2003. If
value is not provided then
name will be unset.
- system
commands
- Passes
commands to the underlying shell unchanged, with the active stream on STDIN. For obviouos security reasons
system is disabled by default, unless
nblparser is called with the
--unsafe option.
- end
- Stops the current NBL execution, ignoring any other commands that may follow this one. Useful mainly for debugging purposes.
- [@]totals
column [
column ...]
- Runs the
total NoSQL operator on the specified columns of the active stream. The optional
@ modifier at the beginning of the
totals command tells NBL to assume that the specified columns contain currency values.
- labels
newlabel [
newlabel ...]
- Replace the output column labels (i.e.
names) to the specified new labels. If not enough labels are provided then the excess input column names will be left unchanged, while if too many labels are specified then the extra ones will be ignored. Note that the
labels directive will be applied only at the end of the query, regardless of where it appares in the sequence of NBL statements. Furthermore, it is applied only once, i.e. if multiple
labels statements are present, only the last one will be effective. This is done to avoid the risk of accidentally exposing hidden columns on output.
- ldap
suffix
expression
- Turns an LDAP search expression into NBL. This is mostly meant to use NoSQL as the shell backend of
slapd(8), the
OpenLDAP database server. Please consider that
nblparser only understands a very limited subset of the LDAP query syntax, although such subset has proven to be enough for many practical situations.
- remember-as
name
- Assigns the symbolic name
name to the active stream and closes it. A new
read against a (possibly) different table will then be necessary if the NBL program is to continue. From this point on, it will be possible for subsequent NBL commands to refer to
name as
$name, as explained further up in this page. For example:
# set "course" as the active stream.
read course
expression $Course == "chem-1"
remember-as goofy
# set $goofy as the active stream.
read $goofy
format
Of course this particular example may not look very useful, and in fact it isn't, but there are cases where the possibility of saving the query that has been done up to a certain point, and re-using it at a later stage in the program, may indeed come handy.
- [@]unique-by
column[
-
flags] [
column[:
flags] ...] : Same as
order-by but the output is made unique on the specified column(s). See
order-by for more.
Trackbacks (2) |
New trackback |
Comments (0) |
Print
|