Friday, August 27, 2010

Stage or Not to Stage in Data Warehouse


The back room area of the data warehouse has frequently been called the staging area. Staging in this context means writing to disk and, at a minimum, I recommend staging data at the four major checkpoints of the ETL data flow. But the main question in this note is: when do I need to design a staging area in my data warehouse project?


To Stage or Not to Stage

The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect. The ability to develop efficient ETL processes is partly dependent on being able to determine the right balance between physical input and output (I/O) and in-memory processing.

The challenge of achieving this delicate balance between writing data to staging tables and keeping it in memory during the ETL process is a task that must be reckoned with in order to create optimal processes. The issue with determining whether to stage your data or not depends on two conflicting objectives:


- Getting the data from the originating source to the ultimate target as fast as possible

- Having the ability to recover from failure without restarting from the beginning of the process


The decision to stage data varies depending on your environment and business requirements. If you plan to do all of your ETL data processing in memory, keep in mind that every data warehouse, regardless of its architecture or environment, includes a staging area in some form or another. Consider the following reasons for staging data before it is loaded into the data warehouse:

- Recoverability. In most enterprise environments, it’s a good practice to stage the data as soon as it has been extracted from the source system and then again immediately after each of the major transformation steps, assuming that for a particular table the transformation steps are significant. These staging tables (in a database or file system) serve as recovery points. By implementing these tables, the process won’t have to intrude on the source system again if the transformations fail. Also, the process won’t have to transform the data again if the load process fails. When staging data purely for recovery purposes, the data should be stored in a sequential file on the file system rather than in a database. Staging for recoverability is especially important when extracting from operational systems that overwrite their own data.

- Backup. Quite often, massive volume prevents the data warehouse from being reliably backed up at the database level. We’ve witnessed catastrophes that might have been avoided if only the load files were saved, compressed, and archived. If your staging tables are on the file system, they can easily be compressed into a very small footprint and saved on your network. Then if you ever need to reload the data warehouse, you can simply uncompress the load files and reload them.

- Auditing. Many times the data lineage between the source and target is lost in the ETL code. When it comes time to audit the ETL process, having staged data makes auditing between different portions of the ETL processes much more straightforward because auditors (or programmers) can simply compare the original input file with the logical transformation rules against the output file. This staged data is especially useful when the source system overwrites its history. When questions about the integrity of the information in the data warehouse surface days or even weeks after an event has occurred, revealing the staged extract data from the period of time in question can restore the trustworthiness of the data warehouse.
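The recoverability point above can be sketched in a few lines of Python. This is only a minimal illustration, assuming small, non-empty row sets that fit in memory and CSV files as the sequential staging format; the names `stage`, `unstage`, `run_etl`, `extracted.csv`, and `transformed.csv` are hypothetical and do not come from any particular ETL tool.

```python
import csv
from pathlib import Path


def stage(rows, path):
    """Write rows (a non-empty list of dicts) to a sequential CSV file used as a recovery point."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    tmp.replace(path)  # rename only after a complete write, so a partial file is never reused


def unstage(path):
    """Read a staged file back into a list of dicts (all values come back as strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def run_etl(extract, transform, load, stage_dir):
    """Run extract -> transform -> load, reusing staged files left behind by a failed run."""
    stage_dir = Path(stage_dir)
    extracted = stage_dir / "extracted.csv"      # hypothetical file names
    transformed = stage_dir / "transformed.csv"

    if not extracted.exists():
        stage(extract(), extracted)              # touch the source system only once
    if not transformed.exists():
        stage(transform(unstage(extracted)), transformed)
    load(unstage(transformed))                   # a failed load can be retried from the staged file
```

If `load` raises, calling `run_etl` again with the same `stage_dir` skips the extract and transform steps and retries only the load. A real process would also clear or archive the staged files after a successful load so that the next cycle extracts fresh data.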


Once you’ve decided to stage at least some of the data, you must settle on the appropriate architecture of your staging area. As is the case with any other database, if the data-staging area is not planned carefully, it will fail. Designing the data-staging area properly is more important than designing the usual applications because of the sheer volume the data-staging area accumulates (sometimes larger than the data warehouse itself).

Thursday, August 19, 2010

What's Essential -- And What's Not -- In Big Data Analytics (Columnar Data Base?)


Far from arguing over the benefits (or drawbacks) of a column-based architecture, shops would be better advised to focus on other, potentially more important issues. Row- or column-based engines marketed by Aster Data, Dataupia, Greenplum Software Inc. (now an EMC Corp. property), Hewlett-Packard Co. (HP), InfoBright, Kognitio, Netezza, ParAccel, Sybase Inc. (now an SAP AG property), Teradata, Vertica, and other vendors (to say nothing of the specialty warehouse configurations marketed by IBM, Microsoft, and Oracle) are by definition architected for Big Analytics.

Analytic database vendors today compete on the basis of several options -- capabilities such as in-database analytics, support for non-traditional (typically non-SQL) query types, sophisticated workload management, and connectivity flexibility.

Every vendor has an option-laden sales pitch, of course -- but few (if any) stories are exactly the same. In-database analytics is particularly hot, according to Eckerson. All analytic database vendors say they support it (to a degree), but some -- such as Aster Data, Greenplum, and (more recently) Netezza, Teradata, and Vertica -- seem to support it "more" flexibly than others.

"With in-database analytics, scoring can execute automatically as new records enter the database rather than in a clumsy two-step process that involves exporting new records to another server and importing and inserting the scores into the appropriate records," he explains.

The twist comes by virtue of (growing) support for non-SQL analytic queries, chiefly in the form of the (increasingly ubiquitous) MapReduce algorithm. Aster Data and Greenplum have supported in-database MapReduce for two years; more recently, both Netezza and Teradata, along with IBM, have announced MapReduce moves. Last month, open source software (OSS) data integration (DI) player Talend announced support for Hadoop (an OSS implementation of MapReduce) in its enterprise DI product. Talend's MapReduce implementation can theoretically support in-database crunching in conjunction with Hadoop-compliant databases.
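For readers who have not seen MapReduce, the idea can be illustrated with a toy, single-process sketch in Python. It is not how Hadoop or any of the vendors named above implement it (they distribute the map and reduce phases across many nodes); the function names and the sample sales records are assumptions made up for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass over records, all in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    # reduce: collapse each key's list of values into one result
    return {key: reducer(key, values) for key, values in groups.items()}


def sales_mapper(record):
    """Emit one (region, amount) pair per sales record."""
    yield record["region"], record["amount"]


def total_reducer(key, values):
    """Sum all amounts seen for one region."""
    return sum(values)


if __name__ == "__main__":
    sales = [
        {"region": "east", "amount": 10},
        {"region": "west", "amount": 5},
        {"region": "east", "amount": 20},
    ]
    print(map_reduce(sales, sales_mapper, total_reducer))
```

Because the mapper and reducer see only one record or one key at a time, the same two functions could in principle be run in parallel on separate partitions of the data, which is what makes the model attractive for large volumes.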

"[T]echniques like MapReduce make it possible for business analysts, rather than IT professionals, to custom-code database functions that run in a parallel environment," he writes. As implemented by Aster Data and Greenplum, for example, in-database MapReduce permits analysts or developers to write reusable functions in many languages (including the Big Five of Python, Java, C, C++, and Perl) and invoke them by means of SQL calls.

Such flexibility is a harbinger of things to come, according to Eckerson. "[A]s analytical tasks increase in complexity, developers will need to apply the appropriate tool for each task," he notes. "No longer will SQL be the only hammer in a developer's arsenal. With embedded functions, new analytical databases will accelerate the development and deployment of complex analytics against big data."
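The idea of writing a function in a general-purpose language and invoking it through SQL can be tried on a very small scale with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in; this is not the API of Aster Data, Greenplum, or any other product mentioned above, and the `churn_score` rule, table, and column names are hypothetical.

```python
import sqlite3


def churn_score(days_inactive, support_tickets):
    """Hypothetical scoring rule: an integer from 0 to 100, higher means more at risk."""
    return min(100, days_inactive + 10 * support_tickets)


def score_customers(rows):
    """Load (id, days_inactive, support_tickets) rows and score them inside the database."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL statements can call it by name.
    conn.create_function("churn_score", 2, churn_score)
    conn.execute(
        "CREATE TABLE customers (id INTEGER, days_inactive INTEGER, support_tickets INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT id, churn_score(days_inactive, support_tickets) FROM customers ORDER BY id"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(score_customers([(1, 10, 0), (2, 50, 8)]))
```

The scoring logic runs where the rows live, so there is no export to a second server and no re-import of the scores, which is the two-step process the quote above calls clumsy.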

Wednesday, August 4, 2010

An Agile BI Program

One of the biggest misconceptions about agile is that it is about getting more done faster. This is simply false. It is about delivering the right things of value, with a high degree of quality and in small iterations. The word "more" should be dropped. It is about avoiding waste or "muda" when creating value. Have you ever delivered something that took a long time to build, only to have it never be used? Ask yourself why that was the case. This is the waste that we seek to avoid.

One of the biggest difficulties is to get the heads of business users and technical teams wrapped around thinking iteratively. Remember that delivering in small bits with communication built into the process is a foreign concept to many. Most people are equipped to deal with big bang and are unsure how to engage with a process that requires constant communication and participation. Others are simply afraid that once you deliver something you will never be seen again, so they ask for everything at requirements-gathering sessions.

Delivering small has other benefits. We can avoid bottlenecks in the process by completing smaller chunks of work and by keeping all points in a process continuously busy as opposed to having too many wait states. This makes it easier to test and demonstrate. The biggest benefit is that it gets "something" of value into production quicker, versus keeping valuable assets on the shelf in development. If value can be derived, get it into production as soon as your cycles allow. I purposefully refer to the outputs of BI development as assets, and they should be managed as such.

One of the biggest benefits is that when you fail, you fail fast. This is a good thing in that you demonstrate progress periodically and can ask your business users: "Is this what you wanted?" If it is not, you have wasted less development time than you would have under other methodologies (such as waterfall). However, there are perception issues with this, as failing is generally considered "bad." This is true only if you never learn from failures. If you incorporate continuous improvement ceremonies (such as regular start, stop, and continue sessions), this misperception can be mitigated.

Agile is well suited to data warehouse development because requirements are often difficult to gather for BI applications. This is the nature of BI, coupled with the fact that BI teams are often not properly staffed with dedicated business analysts. Such inherent challenges make adhering to the agile process beneficial. Prototyping, demonstrating, and communicating all help shape requirements over time by showing working models that can be used to elicit feedback.

A word of caution: No process will fully make up for poor requirements gathering.

"Architect big and deliver small" must be an overarching principle. It is one thing to deliver in small iterations, but you should have some idea of what your end state should look like at the program level. This is the "architect big" part of the principle. The "deliver small" part comes from the agile cycles, and data models are part of these cycles.

Be Prepared for a Journey

Agile is a process and, like any process, it can have resisters. It takes time to hit your stride, so be patient. Agile takes time to implement. Having someone on your team who has been part of a successful agile development process will certainly help. If you do not have any experience, look for a coach who can help guide you through the process or at least get the team trained.

Either way, it is an amazing journey, and it is rewarding to watch a process mature and improve.

Tuesday, July 20, 2010

Mobile Business Intelligence Reporting

From a sociological perspective, users are becoming more comfortable with their phone’s ergonomics and multitude of features, and are using them as full-functioning mobile computers. Phones and laptops are becoming interchangeable. Initial evidence of this convergence is the large volume of e-mails sent from BlackBerrys and other mobile Windows-enabled smartphones, as well as the proliferation of CRM mobile applications. Also, phones have an advantage over laptops because they can be carried anywhere and used anytime – 24 hours a day, seven days a week. They don’t require mobile hot spots or other Internet connections, and with Bluetooth they can be easily connected to printers and other peripherals, making almost the entire office portable.






Mobile browsers now provide the same functionality as desktop Web browsers, so users get a consistent experience regardless of device. More people are searching the Web, reading news, watching streamed TV, accessing Web applications, and making transactions on their phones. As this trend continues, business is driven to evolve. Google, for example, recognized the increased use of mobile devices as a medium for Web browsing and made its search tool and productivity applications (Google Apps) available on mobile phones, setting the benchmark for usability.

Smartphones are also forcing a shift in the paradigm of how information technology (IT) groups work. There are currently 1.5 billion phones in use around the world. By 2011 half of the world’s population will have mobile phones – 50 percent of which will be smartphones. This change clearly indicates that enterprises have to embrace smartphones as a primary form of communication. IT groups – for the first time in their history – have to adapt to consumer requirements instead of dictating their own agenda. If consumers can now access their Gmail on phones, why not access corporate apps too?


Improvements in Productivity

Economic gains from enabling mobile reporting are irrefutable. Currently one out of seven e-mail users is also a mobile e-mail user, having a BlackBerry or another smartphone. Early adopters, mainly executives, have seen measurable increases in productivity by being able to:

- Work during times otherwise wasted, such as while waiting at airports and before meetings
- Respond immediately to urgent messages
- Be available to and connected with other key decision-makers 24/7


Gains in productivity outweigh the expense of mobile devices and applications – an estimated fixed cost of $2,500 per mobile user. A low-cost mobile BI solution that does not require additional infrastructural investments drives up the per-user return on investment (ROI). Furthermore, as mobile computing spreads through the ranks to all employees, the ROI increases exponentially.
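As a rough illustration of the per-user ROI arithmetic, the sketch below uses the $2,500 fixed cost quoted above. The productivity gain is not given anywhere in this article, so the figure passed in is purely a hypothetical input, and `mobile_bi_roi` is a name invented for the example.

```python
def mobile_bi_roi(gain_per_user, cost_per_user=2500.0):
    """Return per-user ROI as a fraction: (gain - cost) / cost."""
    if cost_per_user <= 0:
        raise ValueError("cost_per_user must be positive")
    return (gain_per_user - cost_per_user) / cost_per_user


if __name__ == "__main__":
    # Hypothetical: each mobile user recovers $5,000 worth of productive time.
    print(f"ROI: {mobile_bi_roi(5000.0):.0%}")
```

The formula also shows why avoiding extra infrastructure matters: anything added to the cost per user raises the denominator and lowers the return for the same gain.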

According to Gartner analyst Steve Kleynhans, “Most IT organizations are ill prepared to deal with this new environment in which users drive technology.” IT groups are often (and in many cases justifiably) leery of new technologies. Knowing the difficulties inherent in implementing unproven solutions, many would prefer to wait for other companies to provide successful case studies with clear user benefits. Yet, waiting until this technology becomes mainstream means missing out on years of productivity gains.


Dashboards for Everyone



The sheer volume of information available, however, means users risk information overload. Dashboards have emerged as a concise way to visualize information. Instead of analyzing multiple reports and the relationships between them, a dashboard offers an analytical perspective. All relationships and associated measures are presented in a single, prepackaged view. The key obstacle to mass use of mobile dashboards is the small screen on the device as well as the requirement to be connected to the dashboard infrastructure. Two trends are changing this:

- Better, larger screens with higher resolution are becoming popular, as on the iPhone, HP hybrid devices, and Nokia business phones.
- Better browsers with advanced zoom functions, touch screen navigation, and interaction enhancers – such as zoom drop boxes for easier selection – display content in a useful way similar to dashboard displays.





Active Dashboards can be distributed to anyone – on any device – either via e-mail, via the My Mobile Favorites launch page, or by posting them on the Web, and users can interact with them online or offline.

Monday, June 28, 2010

Agile BI (Business Intelligence) Basics

What is Agile BI?

Cindi Howson: The Agile Manifesto was first published in 2001 by a group of software engineers (see agilemanifesto.org) trying to improve the software development process and customer satisfaction. There are 12 principles, but the six that most apply to BI are:

• Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
• Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage.
• Business people and developers must work together daily throughout the project.
• The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
• Simplicity -- the art of maximizing the amount of work not done -- is essential.
• The best architectures, requirements, and designs emerge from self-organizing teams.


Who is using agile development and how important is it?

I do get the sense that more innovative companies are using agile development, but I have also seen it in established manufacturing companies. It is less well suited to companies that have outsourced BI because it makes it harder to build things to a specification. Then again, I’m not a supporter of outsourcing for BI.

Agile BI emerged as a common theme among successful BI case studies when I began researching my book Successful Business Intelligence in 2007, so last year, we included this data point in the survey. Overall, agile development was identified as being not that important.



Agile sounds like the Wild West of BI with no requirements, no documentation.

If you are used to having everything highly documented with requirements precisely defined, then less formal requirements definition can seem like the Wild West. The difference is in the how and the degree. Requirements are still gathered, but perhaps collaboratively through rapid prototyping, rather than by having the business write out its specifications before it can look at any results.

Tuesday, June 1, 2010

Business Intelligence First Steps

Many small and mid-sized enterprises (SMEs) are looking for the best business intelligence (BI) solution to address their specific business problems. Whether these business pains are putting out regular fires, managing a sales force, increasing customer satisfaction, or gaining more visibility into the business and data, business intelligence is becoming the buzzword used to identify the solution used to address these problems.

Unfortunately, business intelligence on its own is not the answer to solving an organization’s business problems. The ability to effectively solve issues and develop a successful BI infrastructure depends upon the combination of the people involved and the business processes put in place. Although there are no surefire ways to ensure project success, there are things that SMEs can do and take into account when looking at starting their BI initiatives.

This article explores the first steps that SMEs should take in order to work toward BI success. The key factors that organizations must consider when looking to use BI to solve business problems and gain visibility into their business are:


1. Defining the right scope
2. Identifying, using, and managing the right data
3. Engaging the right people
4. Integrating proper project planning and management practices

Defining BI Project Success

As mentioned, the four areas listed above do not guarantee project success. However, careful consideration of these items gives companies a way to start any BI project on the right foot and put the processes in place that are required to grow and maintain a strong BI infrastructure and front-end analytics and reporting solution. Because there is so much to consider when looking at any hardware- and software-related project that deeply affects how people do business on a daily basis, taking a step back and identifying individual aspects helps simplify initiatives that require the collection of many complicated and diverse business and technical requirements.

1. Defining the Right Scope – Answering the Right Questions
The first step in any BI project is to identify the business problem. In some cases, organizations want business intelligence to solve all of their problems at once. Obviously, one of BI’s advantages is the ability to consolidate large amounts of disparate data to help companies gain a broader view of what is happening within the company. However, when looking at solving business issues and aligning strategic goals with business performance, doing it well outweighs doing it fast. Therefore, companies should identify their main business pain and start building their solution around that issue to identify general goals and metrics associated with performance management. By developing a targeted scope that addresses key business issues and starting small, organizations can work toward building a solution that meets the needs of many departments within the organization based on incremental success.

2. Identifying, Using and Managing the Right Data – Turning Data into Information
Once the scope is defined, businesses can look at what information is required. This means looking at where data resides, who accesses that data (both operationally and analytically), how often it is updated, how often it is required for reporting and analytics, the types of business rules that exist, what hardware and software it runs on, and what gaps currently exist in relation to analytics or general visibility. Although it is important to start small, organizations should still identify all of the information required for the data warehouse, because identifying all required data sources up front lessens the time spent on integration activities over time.

The type of data and systems currently in use will affect the overall solution choice. Depending upon integration requirements, some solutions integrate specific types of data or information from source systems more easily, while others offer robust linking, matching and data profiling that can help with complicated data reconciliation efforts or merging various business practices into a single data warehouse. Although not always seen as important to business users, the ability to maintain data integrity on a continual basis will help ensure accurate data visibility and better decision making over time.

3. Engaging the Right People – Enter the Stakeholders
Without proper input from the people who own the data and interact with the data, there is the potential to miss key requirements when looking at developing a BI solution. Every person interacts with information differently depending upon his or her role within the company. Consequently, the requirements gathered can make the difference between project success and a solution that no one uses. To make sure that general buy-in occurs, it is important to include the relevant stakeholders in the process. Stakeholder involvement will be different in each company as many different business functions may interact with financial or sales data, or have input related to employee performance.

4. Integrating Proper Project Planning and Management Practices – Back to the Basics with Project Management
Even though not all companies use formal project management tools to manage software selection initiatives, managing projects requires some sort of formalized approach. Tracking stages and managing dependencies throughout the project life cycle helps identify whether everything is on track, how delays will affect future activities, and if the project will be completed within the proposed time frame and budget. The success of a project should not be measured by only identifying whether a project finishes on time and within budget, but implementing BI for the first time within defined parameters helps ensure support for future expansions. Within a BI environment, there are constant projects to enhance and expand solution use because of the benefits seen by companies as they begin to interact with their reporting and analytics environment.
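To make the point about dependencies and delays concrete, here is a small Python sketch that computes the earliest finish day of each stage and shows how a slip in one stage does or does not move the end date. The stage names and durations are hypothetical, and the sketch assumes Python 3.9 or later (for `graphlib`) and that every prerequisite also appears in `durations`.

```python
from graphlib import TopologicalSorter


def finish_days(durations, depends_on):
    """Return the earliest finish day of each task, given durations in days and prerequisites."""
    graph = {task: depends_on.get(task, ()) for task in durations}
    finish = {}
    # static_order() yields every task after all of its prerequisites.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[dep] for dep in depends_on.get(task, ())), default=0)
        finish[task] = start + durations[task]
    return finish


# Hypothetical BI project plan.
DURATIONS = {"requirements": 10, "data_model": 15, "etl": 20, "reports": 10, "rollout": 5}
DEPENDS_ON = {
    "data_model": ["requirements"],
    "etl": ["data_model"],
    "reports": ["data_model"],
    "rollout": ["etl", "reports"],
}

if __name__ == "__main__":
    print(finish_days(DURATIONS, DEPENDS_ON)["rollout"])
    delayed = dict(DURATIONS, requirements=15)  # requirements slip by five days
    print(finish_days(delayed, DEPENDS_ON)["rollout"])
```

In this made-up plan, a five-day slip in requirements pushes the rollout out by five days, while the same slip in reports is absorbed because the ETL stage is the longer path. Formal project management tools perform the same kind of calculation at larger scale.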

Building BI Step by Step
These four aspects provide guidelines for organizations at the beginning of a BI project and can help lead to a greater chance of project success. Overall, organizations should realize that implementing BI requires business, technical, people, and process considerations and that any gap in one of those areas will create a hole in the overall project. Even if the first implementation breeds success, the continual use and expansion of business intelligence depends upon the cohesion of these four areas.

The ability to define and limit an initial project scope, include stakeholders within the requirements-gathering phase, and manage the project using a defined framework all fall into the areas of business, people, and processes. BI infrastructure and identifying data and how it will interrelate usually provide the bulk of what goes into preparing a BI initiative for the first time. And even though technology requirements are very important within any BI project (especially when looking at data warehousing for the first time), it is also essential not to overlook the business, people and process areas as they become a greater influence as business intelligence use starts to expand within the organization.

Wednesday, May 26, 2010

Open Source BI Solutions: a Low TCO Prospect

Business intelligence is a vital component for successful business management. It introduces capabilities for effective decision-making, resulting in higher income and increased growth for the organization. BI programs must keep strategic goals and organizational missions in mind, while reducing the cost and time of implementing solutions.

Open source BI may be evaluated against the parameters of total cost of ownership, performance, scalability and user requirements.

Why is Business Intelligence important to an Organization?

Organizations can make intelligent decisions when timely information is consistently made visible to decision-makers at all levels, as this endows them with the ability to monitor important drivers of organizational performance. A well-designed BI system collects the organization’s operational data from different sources, presenting it to decision makers and stakeholders simply and meaningfully through use of a user-friendly tool. A good BI solution helps organizations gain better insight into their businesses, improve decision-making and optimize enterprise performance.


An Open Source BI Overview

Open source BI has come a long way relative to commercial BI products, and is becoming widely recognized as an important component for enterprise-level applications. Open source BI projects such as Pentaho and Jasper have evolved from community-driven tools to viable technology with professional support for enterprise-wide adoption, and they are witnessing growing demand. Organizations can use open source BI software to replace custom-coded applications. Open source BI tools can also be considered for BI components that complement an existing proprietary solution to reduce license cost. Because organizations are not locked into proprietary vendors’ platforms, open source increases organizational flexibility.

TCO: Critical Factor in Implementing BI Solutions

While few will deny the importance of BI, the most important factor to be considered for BI is the total cost of owning the application. The TCO concept measures costs related to the acquisition of a BI solution, its deployment and ongoing use. Though TCO estimation methods vary for different BI implementations based on requirements and business needs, certain proportions may be assumed to calculate the TCO in most projects. For example, typically staffing costs account for 50 percent while the hardware costs account for 8 percent of the project value. Based on market trend reports published by leading industry analyst organizations, the cost breakdown in Figure 1 may be assumed as the TCO breakdown for most BI implementations.
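The two proportions quoted above are enough for a back-of-the-envelope split of a project budget. The sketch below is only an illustration: the project value is a hypothetical input, and because the remaining categories from Figure 1 are not reproduced here, everything else is lumped into a single "other" bucket.

```python
def tco_breakdown(project_value, staffing_pct=50, hardware_pct=8):
    """Split a project value into staffing, hardware, and everything else."""
    staffing = project_value * staffing_pct / 100
    hardware = project_value * hardware_pct / 100
    return {
        "staffing": staffing,
        "hardware": hardware,
        "other": project_value - staffing - hardware,  # categories not itemized here
    }


if __name__ == "__main__":
    # Hypothetical project value of 1,000,000 (any currency).
    for item, cost in tco_breakdown(1_000_000).items():
        print(f"{item:>8}: {cost:,.0f}")
```

Passing `staffing_pct=40` gives the low end of the 40 to 50 percent staffing range that is sometimes quoted for BI applications.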

Open source helps reduce TCO on all the parameters in Figure 1. Open source BI helps in reducing costs and risks for prospective BI users. Though this does not suggest that open source BI is the right choice for every organization in every BI deployment, it can be used as an alternative for reducing BI costs if it satisfies user requirements.





Major TCO Components

Hardware: This covers the cost incurred in procuring the hardware throughout the organization, including all client machines, servers, storage solutions and networking devices attached to servers. As most software licenses are based on the number of CPUs, it directly impacts the cost of hardware. Using a scale-out approach, low-cost servers can be used to deliver open source BI solutions.

Software: The cost related to the software is one of the significant factors in the overall TCO. Open source BI is available at a fraction of the cost of commercial products. Open source BI customers have the flexibility to choose the components and their support level according to the requirements of the end users.

Staffing: Staffing constitutes 40 percent to 50 percent of the BI application’s cost, including the cost of resources during the analysis, development and maintenance phases. The ability of the vendor to provide documentation and expert technical resources has a big impact on the TCO. Open source is based on public standards and public domain technologies.


Selection Criteria for Open Source BI Solution
Though TCO is the commonly accepted financial measure for evaluating the BI solution, factors such as user requirements, complexity of development and scalability of the solution have to be analyzed to perform the TCO calculation. The five factors that affect a BI solution are:


• BI product selection and user requirements,
• Complexity of development,
• BI project timelines,
• Product support and third party support and
• Performance and scalability.

These points can be used to compare an open source BI solution with proprietary vendors’ solutions.

BI product selection and user requirements. The objective of collecting BI user requirements is to establish the outcome of the BI solution and other aspects of the projects relating to time, cost and resources. In the case of open source BI solutions, organizations can verify the requirements without contacting the product company, because organizations can initiate a proof of concept and refine the requirements without buying the BI tools.

Complexity of development. Developing a BI solution is not only dependent on the user requirements but also on the product features and technology. As compared to proprietary vendors, open source BI products are based on the technologies available in the public domain. The resources for developing and maintaining applications are easily available. Most open source BI solutions allow a design approach in which a prototype can be done rapidly with regular testing and feedback from the BI users.

BI project timelines. Any BI solution requires orchestrated efforts by the team to complete the solution on time. The choice between proprietary and open source technology affects the human cost. When considering an open source tool, organizations must account for the time that developers and supporting staff, such as database administrators and testers, need to understand and learn the technology. Open source BI products have simplified the use of tools and added features that can reduce the development timelines.

Product support and third party support. All open source companies provide support for the products at very low subscription prices compared to proprietary vendors. A systems integration partner is usually brought in to support the solution.

Performance and scalability. The performance of the BI solution is dependent on factors such as data source performance, server hardware, content complexity and user requests. Most open source BI solutions support scale-up and scale-out architectures and can scale linearly.

Integration with existing infrastructure. Open source BI solutions provide a comprehensive integration interface through which customizations can integrate with the existing infrastructure. Also, the solution can be embedded on compliant servers. Information such as cubes and reports can be integrated through XML, HTML or JSR-168 portlets. Open source solutions are compatible with multiple operating systems.

End users and supporting personnel training. End users are business people who understand business terms. Open source solutions have the capability to put up a semantic layer that hides the complexity of the data and allows end users to exploit information using business metadata. It removes the necessity for end users to learn the coding language or syntax related to products. System integrators or open source BI companies can provide training to the support team when the BI solution moves into production.
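A semantic layer of the kind described above can be pictured as a lookup from business terms to physical columns. The Python sketch below is a deliberately tiny illustration, not the metadata model of Pentaho, Jasper, or any other product; the table names, column names, and business terms are all hypothetical.

```python
# Business term -> (SQL expression, is_measure). All names are hypothetical.
SEMANTIC_LAYER = {
    "Customer Name": ("c.cust_nm", False),
    "Region": ("c.rgn_cd", False),
    "Total Revenue": ("SUM(o.ord_amt)", True),
}
FROM_CLAUSE = "FROM customers c JOIN orders o ON o.cust_id = c.cust_id"


def build_query(business_terms):
    """Translate a list of business terms into SQL, grouping by the non-measure columns."""
    unknown = [term for term in business_terms if term not in SEMANTIC_LAYER]
    if unknown:
        raise KeyError(f"unknown business terms: {unknown}")
    select = [f'{SEMANTIC_LAYER[term][0]} AS "{term}"' for term in business_terms]
    dimensions = [SEMANTIC_LAYER[t][0] for t in business_terms if not SEMANTIC_LAYER[t][1]]
    sql = "SELECT " + ", ".join(select) + " " + FROM_CLAUSE
    # GROUP BY is needed only when measures and dimensions are mixed.
    if dimensions and len(dimensions) < len(business_terms):
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql


if __name__ == "__main__":
    print(build_query(["Region", "Total Revenue"]))
```

The end user picks "Region" and "Total Revenue" and never sees `rgn_cd`, `ord_amt`, or the join condition, which is the complexity a semantic layer is meant to hide.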