The back room area of the data warehouse has frequently been called the staging area. Staging in this context means writing to disk and, at a minimum, I recommend staging data at the four major checkpoints of the ETL data flow. But, the main cuestion in this note is: When i need to design stage area in my data warehouse project?
To Stage or Not to Stage
The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect. The ability to develop efficient ETL processes is partly dependent on being able to determine the right balance between physical input and output (I/O) and in-memory processing.
The challenge of achieving this delicate balance between writing data to staging tables and keeping it in memory during the ETL process is a task that must be reckoned with in order to create optimal processes. The issue with determining whether to stage your data or not depends on two conflicting objectives:
- Getting the data from the originating source to the ultimate target as
fast as possible
- Having the ability to recover from failure without restarting from the beginning of the process
To Stage or Not to Stage
The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect. The ability to develop efficient ETL processes is partly dependent on being able to determine the right balance between physical input and output (I/O) and in-memory processing.
The challenge of achieving this delicate balance between writing data to staging tables and keeping it in memory during the ETL process is a task that must be reckoned with in order to create optimal processes. The issue with determining whether to stage your data or not depends on two conflicting objectives:
- Getting the data from the originating source to the ultimate target as
fast as possible
- Having the ability to recover from failure without restarting from the beginning of the process
The decision to stage data varies depending on your environment and business requirements. If you plan to do all of your ETL data processing in memory, keep in mind that every data warehouse, regardless of its architecture or environment, includes a staging area in some form or another.Consider the following reasons for staging data before it is loaded into the data warehouse:
- Recoverability. In most enterprise environments, it’s a good practice to stage the data as soon as it has been extracted fromthe source system and then again immediately after each of the major transformation steps, assuming that for a particular table the transformation steps are significant. These staging tables (in a database or file system) serve as recovery points. By implementing these tables, the process won’t have to intrude on the source system again if the transformations fail. Also, the process won’t have to transform the data again if the load process fails. When staging data purely for recovery purposes, the data should be stored in a sequential file on the file system rather than in a database. Staging for recoverability is especially important when extracting from operational systems that overwrite their own data.
- Backup. Quite often, massive volume prevents the data warehouse from being reliably backed up at the database level.We’ve witnessed catastrophes that might have been avoided if only the load files were saved, compressed, and archived. If your staging tables are on the file system, they can easily be compressed into a very small footprint and saved on your network. Then if you ever need to reload the data warehouse, you can simply uncompress the load files and reload them.
- Auditing. Many times the data lineage between the source and target is lost in the ETL code. When it comes time to audit the ETL process, having staged data makes auditing between different portions of the ETL processes much more straightforward because auditors (or programmers) can simply compare the original input file with the logical transformation rules against the output file. This staged data is especially useful when the source system overwrites its history. When questions about the integrity of the information in the data warehouse surface days or even weeks after an event has occurred, revealing the staged extract data from the period of time in question can restore the trustworthiness of the data warehouse.
Once you’ve decided to stage at least some of the data, you must settle on the appropriate architecture of your staging area. As is the case with any other database, if the data-staging area is not planned carefully, it will fail. Designing the data-staging area properly is more important than designing the usual applications because of the sheer volume the data-staging area accumulates (sometimes larger than the data warehouse itself).