Friday, August 27, 2010

To Stage or Not to Stage in the Data Warehouse


The back room area of the data warehouse has frequently been called the staging area. Staging in this context means writing to disk, and, at a minimum, I recommend staging data at the four major checkpoints of the ETL data flow. But the main question in this note is: when do I need to design a staging area in my data warehouse project?


To Stage or Not to Stage

The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect. The ability to develop efficient ETL processes is partly dependent on being able to determine the right balance between physical input and output (I/O) and in-memory processing.

Striking this delicate balance between writing data to staging tables and keeping it in memory during the ETL process is essential to building optimal processes. The decision to stage your data or not comes down to two conflicting objectives (a minimal sketch of the trade-off follows the list):


- Getting the data from the originating source to the ultimate target as fast as possible

- Having the ability to recover from failure without restarting from the beginning of the process
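
As a rough illustration of the trade-off (Python, with hypothetical names such as `checkpoint`, `STAGE_TO_DISK`, and the stub extract/transform steps), a single switch can decide whether each ETL step hands its rows to the next step purely in memory, or first writes them to a sequential file that can later serve as a restart point:

```python
import csv
import os
from datetime import datetime

STAGE_TO_DISK = True     # True trades speed for recoverability
STAGING_DIR = "staging"  # hypothetical file-system staging area

def extract_from_source():
    # stand-in for a real source-system extract
    return [("1001", "ACME", "2010-08-27"), ("1002", "Initech", "2010-08-27")]

def transform(rows):
    # stand-in for a real transformation step (here: upper-casing the name column)
    return [(rid, name.upper(), day) for rid, name, day in rows]

def checkpoint(rows, step_name):
    """Optionally persist the rows produced by an ETL step.

    When STAGE_TO_DISK is on, the rows are written to a sequential file so a
    failed downstream step can restart from here instead of going back to the
    source system.
    """
    if not STAGE_TO_DISK:
        return rows  # pure in-memory hand-off: the fastest path
    os.makedirs(STAGING_DIR, exist_ok=True)
    path = os.path.join(STAGING_DIR, f"{step_name}_{datetime.now():%Y%m%d_%H%M%S}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows

# Stage after extraction and again after the major transformation step.
extracted = checkpoint(extract_from_source(), "extract")
transformed = checkpoint(transform(extracted), "transform")
# load_to_warehouse(transformed) would follow here.
```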


The decision to stage data varies depending on your environment and business requirements. If you plan to do all of your ETL data processing in memory, keep in mind that every data warehouse, regardless of its architecture or environment, includes a staging area in some form or another. Consider the following reasons for staging data before it is loaded into the data warehouse:

- Recoverability. In most enterprise environments, it’s a good practice to stage the data as soon as it has been extracted from the source system and then again immediately after each of the major transformation steps, assuming that for a particular table the transformation steps are significant. These staging tables (in a database or file system) serve as recovery points. By implementing these tables, the process won’t have to intrude on the source system again if the transformations fail. Also, the process won’t have to transform the data again if the load process fails. When staging data purely for recovery purposes, the data should be stored in a sequential file on the file system rather than in a database. Staging for recoverability is especially important when extracting from operational systems that overwrite their own data.

- Backup. Quite often, massive volume prevents the data warehouse from being reliably backed up at the database level. We’ve witnessed catastrophes that might have been avoided if only the load files were saved, compressed, and archived. If your staging tables are on the file system, they can easily be compressed into a very small footprint and saved on your network (a small sketch follows this list). Then if you ever need to reload the data warehouse, you can simply uncompress the load files and reload them.

- Auditing. Many times the data lineage between the source and target is lost in the ETL code. When it comes time to audit the ETL process, having staged data makes auditing between different portions of the ETL processes much more straightforward because auditors (or programmers) can simply compare the original input file with the logical transformation rules against the output file. This staged data is especially useful when the source system overwrites its history. When questions about the integrity of the information in the data warehouse surface days or even weeks after an event has occurred, revealing the staged extract data from the period of time in question can restore the trustworthiness of the data warehouse.
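
A minimal sketch of the backup point above (the `archive_staged_extract` helper and the file names are hypothetical): compressing each staged extract into an archive directory takes only a few lines of Python, and the gzipped copies are what you would pull back for a reload or an audit.

```python
import gzip
import shutil
from pathlib import Path

def archive_staged_extract(extract_path, archive_dir="archive"):
    """Compress a staged extract file and keep it as a reload/audit copy."""
    src = Path(extract_path)
    Path(archive_dir).mkdir(exist_ok=True)
    dest = Path(archive_dir) / (src.name + ".gz")
    with src.open("rb") as f_in, gzip.open(dest, "wb") as f_out:
        # stream-copy so large extracts are never fully loaded into memory
        shutil.copyfileobj(f_in, f_out)
    return dest

# e.g. archive_staged_extract("staging/extract_20100827_060000.csv")
```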


Once you’ve decided to stage at least some of the data, you must settle on the appropriate architecture for your staging area. As is the case with any other database, if the data-staging area is not planned carefully, it will fail. Designing the data-staging area properly is even more important than designing a typical application database because of the sheer volume the staging area accumulates (sometimes larger than the data warehouse itself).

Thursday, August 19, 2010

What's Essential -- And What's Not -- In Big Data Analytics (Columnar Database?)


Far from arguing over the benefits (or drawbacks) of a column-based architecture, shops would be better advised to focus on other, potentially more important issues. Row- or column-based engines marketed by Aster Data, Dataupia, Greenplum Software Inc. (now an EMC Corp. property), Hewlett-Packard Co. (HP), InfoBright, Kognitio, Netezza, ParAccel, Sybase Inc. (now an SAP AG property), Teradata, Vertica, and other vendors (to say nothing of the specialty warehouse configurations marketed by IBM, Microsoft, and Oracle) are by definition architected for Big Analytics.

Analytic database vendors today compete on the basis of several options -- capabilities such as in-database analytics, support for non-traditional (typically non-SQL) query types, sophisticated workload management, and connectivity flexibility.

Every vendor has an option-laden sales pitch, of course -- but few (if any) stories are exactly the same. In-database analytics is particularly hot, according to Eckerson. All analytic database vendors say they support it (to a degree), but some -- such as Aster Data, Greenplum, and (more recently) Netezza, Teradata, and Vertica -- seem to support it "more" flexibly than others.

"With in-database analytics, scoring can execute automatically as new records enter the database rather than in a clumsy two-step process that involves exporting new records to another server and importing and inserting the scores into the appropriate records," he explains.

The twist comes by virtue of (growing) support for non-SQL analytic queries, chiefly in the form of the (increasingly ubiquitous) MapReduce algorithm. Aster Data and Greenplum have supported in-database MapReduce for two years; more recently, both Netezza and Teradata, along with IBM, have announced MapReduce moves. Last month, open source software (OSS) data integration (DI) player Talend announced support for Hadoop (an OSS implementation of MapReduce) in its enterprise DI product. Talend's MapReduce implementation can theoretically support in-database crunching in conjunction with Hadoop-compliant databases.
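
To make the pattern concrete, here is a toy, single-process illustration of map and reduce in plain Python; it is not Hadoop or any vendor's in-database implementation, and the (customer_id, order_amount) records are invented for the example.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs; on a real cluster this runs in parallel, one mapper per input split.
    for customer_id, amount in records:
        yield customer_id, amount

def reduce_phase(pairs):
    # Group by key and aggregate; on a real cluster each reducer receives one partition of keys.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

orders = [("c1", 120.0), ("c2", 75.5), ("c1", 30.0), ("c3", 12.0), ("c2", 4.5)]
print(reduce_phase(map_phase(orders)))  # {'c1': 150.0, 'c2': 80.0, 'c3': 12.0}
```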

"[T]echniques like MapReduce make it possible for business analysts, rather than IT professionals, to custom-code database functions that run in a parallel environment," he writes. As implemented by Aster Data and Greenplum, for example, in-database MapReduce permits analysts or developers to write reusable functions in many languages (including the Big Five of Python, Java, C, C++, and Perl) and invoke them by means of SQL calls.

Such flexibility is a harbinger of things to come, according to Eckerson. "[A]s analytical tasks increase in complexity, developers will need to apply the appropriate tool for each task," he notes. "No longer will SQL be the only hammer in a developer's arsenal. With embedded functions, new analytical databases will accelerate the development and deployment of complex analytics against big data."

Wednesday, August 4, 2010

An Agile BI Program

One of the biggest misconceptions about agile is that it is about getting more done faster. This is simply false. It is about delivering the right things of value, with a high degree of quality and in small iterations. The word "more" should be dropped. It is about avoiding waste, or "muda," when creating value. Have you ever delivered something that took a long time to build, only to have it never be used? Ask yourself why that was the case. This is the waste that we seek to avoid.

One of the biggest difficulties is getting business users and technical teams to wrap their heads around thinking iteratively. Remember that delivering in small bits with communication built into the process is a foreign concept to many. Most people are equipped to deal with big-bang delivery and are unsure how to engage with a process that requires constant communication and participation. Others are simply afraid that once you deliver something you will never be seen again, so they ask for everything at requirements-gathering sessions.

Delivering small has other benefits. We can avoid bottlenecks in the process by completing smaller chunks of work and by keeping all points in a process continuously busy as opposed to having too many wait states. This makes it easier to test and demonstrate. The biggest benefit is that it gets "something" of value into production sooner, versus keeping valuable assets on the shelf in development. If value can be derived, get it into production as soon as your cycles allow. I purposefully refer to the outputs of BI development as assets, and they should be managed as such.

One of the biggest benefits is that when you fail, you fail fast. This is a good thing in that you demonstrate progress periodically and can ask your business users: "Is this what you wanted?" If it is not, you have wasted less development time than you would have under other methodologies (such as waterfall). However, there are perception issues with this, as failing is generally considered "bad." That is true only if you never learn from failures. If you incorporate continuous-improvement ceremonies (such as regular start, stop, and continue sessions), this misperception can be mitigated.

Agile is well suited to data warehouse development because requirements are often difficult to gather for BI applications. This is the nature of BI, coupled with the fact that BI teams are often not properly staffed with dedicated business analysts. Such inherent challenges make adhering to the agile process beneficial. Prototyping, demonstrating, and communicating all help shape requirements over time by showing working models that can be used to elicit feedback.

A word of caution: No process will fully make up for poor requirements gathering.

"Architect big and deliver small" must be an overarching principle. It is one thing to deliver in small iterations, but you should have some idea of what your end state should look like at the program level. This obviously is the "architect big" part of the principle. The "deliver small" part comes from the agile cycles; data models are part of these cycles.

Be Prepared for a Journey

Agile is a process, and like any process, it can meet resistance. It takes time to hit your stride, so be patient. Agile takes time to implement. Having someone on your team who has been part of a successful agile development process will certainly help. If you do not have any experience, look for a coach who can help guide you through the process, or at least get the team trained.

Either way, it is an amazing journey, and it is rewarding to watch a process mature and improve.