Astronomy is one of the oldest scientific disciplines. It studies celestial objects and phenomena that originate outside the Earth's atmosphere. Methodical observations of the night sky were performed by many early civilizations (Maya, Greeks, Chinese, ...).

Modern astronomy surveys collect tremendous amounts of data. Technological developments make it possible to build large, high-resolution digital cameras that produce billion-pixel images. Examples are:

  • Pan-STARRS -- the Panoramic Survey Telescope & Rapid Response System -- is an innovative design for a wide-field imaging facility developed at the University of Hawaii's Institute for Astronomy. Very large digital cameras (1.4 billion pixels) make it possible to observe the entire available sky several times each month. A major goal of Pan-STARRS is to discover and characterize Earth-approaching objects, both asteroids and comets, that might pose a danger to our planet. It is also intended for use in other areas of astronomy, particularly those that involve an aspect of time variability.
  • LSST: wide-field observations; can detect faint objects; can detect and track potentially hazardous asteroids. The data collected will support new types of studies, e.g. of dark matter.

The goal of many recent surveys is to create a detailed map of large areas of the sky. As technology advances, new studies emerge that involve time variability, such as discovering asteroids and comets (Pan-STARRS, LSST) and detecting transients (LOFAR). Such studies require regular imaging of the same area of the sky. Hence, they collect even larger amounts of data, need efficient (near real-time) processing of new data, and depend even more on the availability of efficient data management tools.

Our frame of reference is the Sloan Digital Sky Survey (SDSS) and its associated online SkyServer application. SDSS is an astronomy survey with the ambition to map one-quarter of the entire sky in detail and to perform a redshift survey of galaxies, quasars and stars. SDSS Data Release 7 (2010) provides terabytes of data products in the form of images, imaging and spectral catalogs, and redshifts. The aggregated volume of the different imaging and spectral catalogs, stored in a relational database management system, reached 18TB. The database of the best objects catalog is 4TB in size.

The database schema consists of almost 97 tables, 51 views, and functionality encoded in hundreds of persistent stored modules. The schema is organized in several sections, among which Photo and Spectro contain the most important photometric and spectroscopic factual data from the survey.

The database challenges:

  1. Scalability: astronomy surveys collect huge amounts of catalog data that must be managed and queried efficiently. For example, the major photometric fact table in SkyServer DR7 contains more than 450 columns and more than 585 million rows, which already stresses the capabilities of many DBMSs. Furthermore, the system needs to accommodate high volumes of new, periodically added data efficiently.

  2. Data integration issues: different surveys cover various aspects of sky objects in limited, and typically overlapping, areas of the sky. Astronomers need to cross-match multiple catalogues in an efficient and transparent way.

  3. Efficient data exploration tools that enable finding a 'needle in a haystack', that is, efficiently extracting small amounts of data of interest from the data deluge.

  4. Near real-time processing: several modern surveys take repeated images of the sky in search of time-varying phenomena (e.g. transient detection). Fast processing is needed to compare and correlate the new data with the existing maps in order to detect movements. Telescopes with photographic plates take repeated images a few times per month, while LOFAR receives a continuous signal and produces radio images as often as its parameters determine, e.g. every second. These cases appear to impose different performance requirements.
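
As an illustration of the comparison step described in items 2 and 4, the following minimal Python sketch matches the detections of a new image against a reference catalog and reports the detections that have no counterpart. It is not the SkyServer or LOFAR pipeline: the function names, the tuple layout (id, ra, dec in degrees) and the 1 arcsecond tolerance are assumptions made for this example, and the brute-force loop would be replaced by a spatial index in any real system.

```python
import math

def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation of two sky positions (degrees in, arcseconds out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def find_unmatched(reference, detections, tolerance_arcsec=1.0):
    """Return ids of detections with no reference source within the tolerance.

    Both inputs are lists of (id, ra_deg, dec_deg) tuples (hypothetical layout).
    The cost is O(len(reference) * len(detections)): fine for a demo only.
    """
    unmatched = []
    for det_id, ra, dec in detections:
        matched = any(
            separation_arcsec(ra, dec, ref_ra, ref_dec) <= tolerance_arcsec
            for _, ref_ra, ref_dec in reference
        )
        if not matched:
            unmatched.append(det_id)
    return unmatched

reference = [("ref1", 150.0, 2.0), ("ref2", 150.01, 2.0)]
detections = [("a", 150.0, 2.0001), ("b", 150.02, 2.0)]
candidates = find_unmatched(reference, detections)
```

Here detection "a" lies well within one arcsecond of "ref1", while "b" is tens of arcseconds from the nearest reference source and is therefore reported as a candidate for a new or moved object.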

SkyServer Demo
The MonetDB/SkyServer project started with the purpose of providing an experimentation platform to develop new techniques addressing the challenges posed by scientific data management. Our intent was to examine and demonstrate the maturity of column-store technology by providing the functionality required by this real-world astronomy application. The project shows the advantages of vertical storage architectures for scientific applications in a broader perspective. It goes well beyond the micro-benchmarks and simulations typically used to examine individual algorithms and techniques. MonetDB/SkyServer allows testing the performance of the entire software stack.

MonetDB/SkyServer supports the BESTDR7 database, which is approximately 4TB in size. It contains the catalog data for the selected best observations of sky objects. This is the only known re-implementation of the application on a platform other than MS SQL Server. Some significant differences from the original application are:

  • Spatial functions implementation: The spatial search functionality in the original application is mainly based on the Hierarchical Triangular Mesh (HTM) algorithms, implemented as an external library. We chose instead an SQL implementation based on the Zones algorithm. The reasons for this choice are the positive performance results reported by the SkyServer team and the fact that the SQL optimizer can be applied to it. Note that this means that direct calls to the external HTM functions are not supported.
  • Column stores: The column-wise storage architecture of MonetDB minimizes the data flow from disk through memory into CPU caches, since only columns relevant for processing are read from disk. This provides efficient data access for disk-bound queries, for which indexes do not exist or cannot be used.
  • Recycling: A recycler optimizer was introduced to cache and re-use subquery results. This resulted in a factor-of-four throughput improvement on the original query log.
  • Reduced data size: The index support of MonetDB is limited to primary and foreign keys. The indices are generated on-the-fly when the columns are touched for the first time. Consequently, the database footprint in MonetDB is 2.6TB; in other words, the storage needs are reduced by 35%.
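
To give a feel for the zone-based spatial search mentioned in the first item, here is a minimal in-memory Python sketch of the general idea: the sky is cut into horizontal declination stripes (zones), objects are kept sorted by right ascension within each zone, and a radius search only inspects the zones and the right-ascension window that can contain matches before applying an exact distance test. This is not the SQL implementation used in MonetDB/SkyServer. The zone height of 0.5 degrees, the function names and the data layout are assumptions for illustration, and the sketch ignores right-ascension wrap-around at 0/360 degrees.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.5  # assumed zone height; in practice a tuning parameter

def zone_id(dec_deg, height=ZONE_HEIGHT_DEG):
    """Index of the declination stripe (zone) that contains dec_deg."""
    return int(math.floor((dec_deg + 90.0) / height))

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def build_zone_index(objects):
    """Group (obj_id, ra_deg, dec_deg) rows by zone, sorted by ra within a zone."""
    index = defaultdict(list)
    for obj_id, ra, dec in objects:
        index[zone_id(dec)].append((ra, dec, obj_id))
    for rows in index.values():
        rows.sort()
    return index

def cone_search(index, ra, dec, radius_deg):
    """Return sorted ids of objects within radius_deg of (ra, dec)."""
    lo_zone = zone_id(max(dec - radius_deg, -90.0))
    hi_zone = zone_id(min(dec + radius_deg, 90.0))
    # Coarse filter: the ra window widens away from the equator. The value is
    # deliberately conservative; the exact test below removes false positives.
    max_abs_dec = min(abs(dec) + radius_deg, 89.9)
    half_width = radius_deg / math.cos(math.radians(max_abs_dec))
    hits = []
    for zone in range(lo_zone, hi_zone + 1):
        rows = index.get(zone, [])
        start = bisect_left(rows, (ra - half_width,))
        stop = bisect_right(rows, (ra + half_width, float("inf")))
        for row_ra, row_dec, obj_id in rows[start:stop]:
            if angular_distance_deg(ra, dec, row_ra, row_dec) <= radius_deg:
                hits.append(obj_id)
    return sorted(hits)

catalog = [(1, 10.0, 20.0), (2, 10.1, 20.1), (3, 50.0, 20.0), (4, 10.0, 25.0)]
zone_index = build_zone_index(catalog)
nearby = cone_search(zone_index, ra=10.0, dec=20.0, radius_deg=0.5)
```

In this example, objects 1 and 2 fall inside the search radius; object 3 shares a zone with them but is excluded by the right-ascension window, and object 4 lies in a zone that is never visited. Because the coarse filter consists only of range predicates on the zone number and right ascension, the same logic can be expressed as ordinary SQL joins and selections, which is what makes it accessible to a query optimizer.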