By: Arthur Ritchie - Chairman and CEO at SAND
In recent years, data warehouses have begun to increase radically in size. To maintain acceptable performance in the face of this "data explosion", several techniques have been introduced. These include pre-building aggregates and Key Performance Indicators (KPIs) from large amounts of detailed transaction data, and indexing as many columns as possible in order to speed up query processing.
As data warehouses continue to grow, however, the time required for all this preprocessing increases to the point where the tasks can no longer be completed in the available "batch windows" when the warehouse is not being accessed by users. So trade-offs need to be made. Doing less preprocessing work reduces the required time, but it also means that queries that depend on aggregates, KPIs or additional indexes may take an inordinately long time to run, and may severely degrade performance for other users as the system attempts to do the processing "on the fly". This impasse leads to two possible choices: either stop providing the analytic functionality, making the system less valuable and users more frustrated, or "put the database on a diet" by moving some of the data it contains to another location.
Putting the Database "on a Diet"
Both nearline and archiving solutions can help trim down an over-expanded database: by implementing an Information Lifecycle Management (ILM) approach, unused or infrequently used detailed transactional data can be removed from the online database and stored elsewhere, making the database much smaller. A smaller database performs better and can support a wider variety of user needs. Aggregates and KPIs will be built from a much smaller amount of detailed transaction data. Additionally, column indexing will be more practicable, as there will be fewer rows per column to be indexed.
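The ILM idea can be sketched in a few lines; the row layout, field names and cutoff date below are all hypothetical, chosen only to illustrate partitioning detail rows between an online and a nearline store and rebuilding an aggregate from the smaller online set:

```python
from datetime import date

# Hypothetical detailed transaction rows in the online database.
online = [
    {"day": date(2006, 1, 15), "amount": 100},
    {"day": date(2007, 6, 1),  "amount": 250},
    {"day": date(2008, 3, 10), "amount": 300},
]
nearline = []

cutoff = date(2008, 1, 1)  # retention boundary set by ILM policy

# Move infrequently used detail rows out of the online database.
nearline.extend(r for r in online if r["day"] < cutoff)
online = [r for r in online if r["day"] >= cutoff]

# Aggregates/KPIs are now rebuilt from far fewer detail rows.
online_total = sum(r["amount"] for r in online)

print(len(online), len(nearline), online_total)  # 1 2 300
```

The point of the sketch is only the partition: the online store shrinks to the "hot" rows, while the relocated rows remain stored rather than deleted.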
The Key Differences between Archiving and Nearlining in a Data Warehouse
It is important to stress the differences between archiving warehouse data (using products from Open Text, Princeton Softech and so on) and storing it nearline (using SAND/DNA). Since both types of product hold data that has been moved out of the main "online" system, it is not always clear why one would need to be implemented if the other is in place. To clarify matters, and to make it easier to discuss why one or the other type of system (or both) might be required in a given situation, the major differences between nearline data and archived data are outlined below.
Normally, the concept of electronic archiving focuses on the preservation of documents or data in a form that has some sort of certifiable integrity (for example, conformity to legal requirements), is immune to unauthorized access and tampering, and is easily subject to certain record management operations within a defined process – for example, automatic deletion after a certain period, or retrieval when requested by an auditor. The archive is in fact a kind of operational system for processing documents/data that are no longer in active use.
The notion of archiving has traditionally focused on unstructured data in the form of documents, but similar concepts can be applied to structured data in the warehouse. An archive for SAP BI, for example, would preserve warehouse data that is no longer needed for analytical use but which needs to be kept around because it may be required by auditors, as would be the case if SAP BI data were used as the basis for financial statements. The archive data does not need to be directly accessible to the user community, just locatable and retrievable in case it is required for inspection or verification – not for analysis in the usual sense. In fact, because much of the data that needs to be preserved in the archive is fairly sensitive (for example, detailed financial data), the ability to access it may need to be strictly regulated.
While many vendors of archiving solutions stress the performance benefits of reducing the amount of data in the online database, accessing the archived data is a complicated and relatively slow process, since it will need to be located and then restored into the online database. For this reason, it is unrealistic to expect archived data to be usable for analysis/reporting purposes.
In the Information Lifecycle Management approach, the nearline repository holds data that is used less frequently than the "hottest" most current data but is still potentially useful for analysis or for constructing new or revised analytic objects for the warehouse.
While the exact proportion of nearline to online data will vary, the amount of "less frequently used" data that needs to be kept available is normally quite large. Moving this out of the main database greatly reduces the pressure on the online database and enables continued performance of standard database operations within available time windows, even in the face of the explosive data growth that many organizations are currently facing.
Thus, the archiving requirements described above do not apply to a nearline product such as SAND/DNA, which is designed to reduce the size of the online warehouse database while keeping the data more or less transparently accessible to end users who may need it for analysis, for rebuilding KPIs and so on.
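That transparency can be pictured with a minimal sketch (the stores and row format are hypothetical): a query helper scans the online store and then falls through to the nearline store, so older rows remain reachable for reporting without any restore step:

```python
from itertools import chain

def query(predicate, online, nearline):
    # Scan the online store first, then fall through to the nearline
    # store, so older rows stay reachable for analysis/reporting.
    return [row for row in chain(online, nearline) if predicate(row)]

online = [{"year": 2008, "amount": 300}]
nearline = [{"year": 2006, "amount": 100}, {"year": 2007, "amount": 250}]

# A report spanning both stores, with no restore into the online database.
since_2007 = query(lambda r: r["year"] >= 2007, online, nearline)
print(len(since_2007))  # 2
```

This contrasts with the archive case above, where data must first be located and restored into the online database before it can be queried.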
Why a Nearline Product is not an Archive
Nearline products do:
- Make older data easily accessible to end users for enhanced analysis/reporting
- Offer very good performance in delivering data to end users - typically not more than 1.x times slower than online, with little or no impact on online users
- Allow greater amounts of relatively recent data to be moved out of the online system
Nearline products do not:
- Offer methods for ensuring the compliance of data with regulations
- Feature any special built-in security regime beyond the read-only status of the data
- Take care of operational processes on data, such as enforcement of retention periods, automatic deletion and so on
Archiving products do:
- Provide controlled storage of older data that will probably not be accessed except in special circumstances
- Enforce organizational policies with regard to data retention
- Ensure compliance
- Limit access to sensitive data
Archiving products do not:
- Make data easily accessible to users for analysis or reporting
- Offer fast performance in restoring data
- Store relatively recent data that may be required for analytics/reporting