Scientific applications are still poorly served by contemporary relational database systems. At best, the system provides a bridge towards an external library using user-defined functions, explicit import/export facilities or linked-in Java/C# interpreters. Time has come to rectify this with SciQL, an SQL-based query language for science applications with arrays as first class citizens. A peek preview of the direction taken is given below. Detailed description of the language and its use cases (in, e.g., remote sensing, seismology and astronomy) can be found in the SciQL publications:
- Y. Zhang, L.H.A. Scheers, M.L. Kersten, M. Ivanova, N. Nes. “Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data”. In Proceedings of the 21st Astronomical Data Analysis Software and Systems (ADASS XXI), Series: ASP Conf. Ser., 2012. (use cases in astronomy)
- M. Koubarakis, et al. “TELEIOS: A Database-Powered Virtual Earth Observatory”. In Proceedings of the 38th International Conference on Very Large Data Bases (VLDB), Istanbul, Turkey, August 2012. (use cases in remote sensing)
- Y. Zhang, M.L. Kersten, M. Ivanova, N. Nes. “SciQL, Bridging the Gap between Science and Relational DBMS”. In Proceedings of the 15th International Database Engineering & Applications Symposium, Lisbon, Portugal, September 2011. (language specification refined and use cases in seismology)
- M.L. Kersten, Y. Zhang, M. Ivanova, N. Nes. “SciQL, A Query Language for Science Applications”. In Proceedings of the first International Array Databases Workshop, Uppsala, Sweden, March 2011. (language specification and use cases in image processing)
Key to success is a query language that achieves a true symbiosis of TABLE semantics with ARRAY semantics in the context of external software libraries. It led to the design of SciQL, where arrays are made first class citizens by enhancing the SQL framework along the innovative lines:
- Seamless integration of array-, set-, and sequence- semantics.
- Named dimensions with constraints as a declarative means for indexed access to array cells.
- Structural grouping to access selective groups of cells based on positional relationships.
- Adaptive storage, where the physical array storage is handled by an adaptive runtime system.
Arrays in SciQL are identified by explicitly named index attributes using DIMENSION constraints. Unlike a TABLE, every index combination denotes an array cell whose non-index value is either explicitly stored or derived from the attribute(s) DEFAULT clause. The array size is fixed if the DIMENSION clause limits it explicitly. The size of unbounded arrays are derived from the actual low/high index values in their representation. The index type can be any of the basic scalar types. The index attribute value NULL denotes the absence of a cell. At the logical level this flavor is indistinguishable from null valued attributes, but their underlying implementation may differ greatly.
Arrays may appear wherever a table expression is allowed in a SQL expression, producing an array if the target list contains index attributes. The SQL iterator semantics associated with TABLEs carry over to ARRAYs, but an iteration is confined to cells whose attributes are not null.
An important operation is to carve out an array slab for further processing. In SciQL we take it a step beyond the windowing features of SQL:2003 by identifying cell groups based on their index relationships. Such cell groups form a pattern, called a tile, which can be subsequently used in a GROUP BY clause to derive all possible tile ocv for statistical aggregation.
CREATE ARRAY stations {
sid integer DIMENSION[0:1024],
name string,
latitude integer,
longitue integer);
CREATE ARRAY seismic (
);
event timestamp DIMENSION['1988':*],
station integer DIMENSION REFERENCES stations.sid,
value integer
-- Array indexing and slicing
SELECT value FROM seismic['1989':'1989'] WHERE station='BEH';
-- Interpolation producing a new array
SELECT [(A.event + next(A.event))/2], (A[event].value + A[next(event)].value) / 2
FROM seismic AS A;
-- 2 x 2 tile based aggregation producing an array
SELECT [x], [y], avg(v) FROM matrix GROUP BY DISTINCT matrix[x:x+2][y:y+2];