From the earthquake glossary: "A seismograph, or seismometer, is an instrument used to detect and record earthquakes. Generally, it consists of a mass attached to a fixed base. During an earthquake, the base moves and the mass does not. The motion of the base with respect to the mass is commonly transformed into an electrical voltage. The electrical voltage is recorded on paper, magnetic tape, or another recording medium. This record is proportional to the motion of the seismometer mass relative to the earth, but it can be mathematically converted to a record of the absolute motion of the ground. Seismograph generally refers to the seismometer and its recording device as a single unit."
Our frame of reference for the data is ORFEUS (Observatories and Research Facilities for European Seismology), which aims to coordinate and promote digital broadband (BB) seismology in the European-Mediterranean area. ORFEUS offers both data (in SEED, GSE and SAC formats) and metadata (instrument responses) through several channels (FTP, WWW, e-mail). Waveform data recorded since 2002 is offered both as continuous streams and per event. Waveform data from the period 1988-2002 is offered per event. This data can be downloaded directly over FTP.
The database schema underlying the waveform event readings is a rather straightforward star schema. The fact table consists of the seismograph readings, stored as partial time series attached to seismometers in the field. The source data is stored as more than 3.2M MSEED files with a highly compressed storage footprint of about 0.3 TB.
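As a minimal sketch of such a star schema, the snippet below builds one dimension table and one fact table and runs a time-window query against them. It uses Python's built-in sqlite3 module purely as a stand-in for MonetDB, and the table names, column names and sample values (station, reading, 'XX', 'STA1') are hypothetical, chosen for illustration only.

```python
import sqlite3

def build_star_schema(conn):
    """Create a toy star schema: one dimension table, one fact table."""
    conn.executescript("""
        CREATE TABLE station (              -- dimension: one row per seismometer channel
            station_id INTEGER PRIMARY KEY,
            network    TEXT,
            code       TEXT,
            channel    TEXT
        );
        CREATE TABLE reading (              -- fact: one row per sample
            station_id   INTEGER REFERENCES station(station_id),
            sample_time  REAL,              -- seconds since an arbitrary origin
            sample_value INTEGER
        );
    """)

conn = sqlite3.connect(":memory:")
build_star_schema(conn)

# Hypothetical station and samples, for illustration only.
conn.execute("INSERT INTO station VALUES (1, 'XX', 'STA1', 'BHZ')")
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [(1, 0.0, 12), (1, 0.025, 15), (1, 0.05, 11)],
)

# Join the fact table to its dimension and restrict to a time window.
rows = conn.execute("""
    SELECT s.code, COUNT(*), MAX(r.sample_value)
    FROM reading r JOIN station s USING (station_id)
    WHERE r.sample_time BETWEEN 0.0 AND 0.03
    GROUP BY s.code
""").fetchall()
```

At the scale described here the fact table would hold billions of rows, which is why the storage and ingestion strategy matters far more than the schema itself.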
The database challenges:
- Database ingestion: a naive ingestion of the MSEED files into a database using ordinary SQL INSERT or COPY INTO statements would lead to a job that can take anywhere from a few weeks to multiple years on a single machine. There are over 150B events to be ingested.
- Database queries: the predominant access pattern follows the file repository structure, taking a time-space slab for offline processing. The user tool set should be analysed to identify the primitive operations that belong in the query language's repertoire of user-defined functions. A counterpart to Jim Gray's 20-queries set is still missing.
- Database response: post-processing of the data is handled by tools that read MSEED files. Such tools are often memory-limited, and the data selected from a database may translate into a large number of MSEED records with just a few events each.
- Database quality: the data quality requirements are quite stringent. Database support would be a great help, e.g. to detect gaps in the time series, handle outliers, and determine time shifts.
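To make the quality item concrete, gap detection over a regularly sampled series can be sketched as follows. This is a minimal illustration, not part of the ORFEUS tool set: the function name find_gaps, the millisecond timestamps and the tolerance parameter are all assumptions.

```python
def find_gaps(timestamps_ms, interval_ms, tolerance_ms=0):
    """Return (last_before_gap, first_after_gap) pairs for a sorted series.

    A gap is reported whenever two consecutive timestamps are further apart
    than the nominal sampling interval plus the allowed tolerance.
    """
    gaps = []
    for prev, curr in zip(timestamps_ms, timestamps_ms[1:]):
        if curr - prev > interval_ms + tolerance_ms:
            gaps.append((prev, curr))
    return gaps

# Hypothetical series sampled every 25 ms with samples missing after 50 ms.
gaps = find_gaps([0, 25, 50, 150, 175], interval_ms=25)
```

Inside a database the same check would more naturally be a query comparing each timestamp with its predecessor; the loop above only shows the logic.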
The ORFEUS demo
The MonetDB ORFEUS demo was started to study techniques for transparently encapsulating the MSEED file repositories kept at ORFEUS. The MSEED files are compact representations of the recordings. Each file consists of a series of records, where each record captures several thousand readings at regular intervals. An adaptive delta-encoding, developed by Steim, reduces the storage space requirement significantly. Decompressing the database would lead to a 10-fold storage footprint.
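The idea behind delta-encoding can be sketched in a few lines. The code below is plain first-difference encoding, not Steim's actual format: the adaptive part, which packs the differences into fields of varying bit width, is left out, and the function names are hypothetical.

```python
def delta_encode(samples):
    """Keep the first sample as-is, then store each sample as the
    difference from its predecessor."""
    if not samples:
        return []
    deltas = [samples[0]]
    for prev, curr in zip(samples, samples[1:]):
        deltas.append(curr - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    total = 0
    for d in deltas:
        total += d
        samples.append(total)
    return samples

# Hypothetical readings: large absolute values, small sample-to-sample changes.
readings = [1000, 1003, 1001, 1006]
encoded = delta_encode(readings)
decoded = delta_decode(encoded)
```

Because consecutive readings tend to differ by small amounts, the encoded values need far fewer bits than the raw samples, which is where the storage saving comes from.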
Instead, a portion of the MSEED files is cached locally for processing in a MonetDB vault. A catalog of MSEED record headers is extracted once. It contains the meta-information and simple statistics needed to support the predominant query. Subsequently, an MSEED-specific optimizer module takes the queries and either decompresses the files into temporary tables, or calls upon routines available in the MSEED library to access the data directly.
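The role of the header catalog can be illustrated with a small sketch: a query for a station and time window is first answered against the catalog, so that only the matching records need to be decompressed. The RecordHeader fields, the select_records function and the file names are hypothetical and do not reflect the actual MonetDB catalog layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecordHeader:
    """One catalog entry per MSEED record (hypothetical layout)."""
    file_name: str
    station: str
    start_ms: int      # time of first sample in the record
    end_ms: int        # time of last sample in the record
    sample_count: int

def select_records(catalog, station, t0_ms, t1_ms):
    """Return the headers whose time range overlaps [t0_ms, t1_ms]
    for the given station; only these records need decompressing."""
    return [
        h for h in catalog
        if h.station == station and h.start_ms <= t1_ms and h.end_ms >= t0_ms
    ]

# Hypothetical catalog entries, for illustration only.
catalog = [
    RecordHeader("a.mseed", "STA1", 0, 999, 40),
    RecordHeader("a.mseed", "STA1", 1000, 1999, 40),
    RecordHeader("b.mseed", "STA2", 0, 999, 40),
]
hits = select_records(catalog, "STA1", 900, 1100)
```

Here the query touches two records of one file and skips the other file entirely, so the decompression cost scales with the size of the requested slab rather than with the size of the repository.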