Endorse Maximize your ROI, minimize your project risk
Consulting
How to Build a Data Warehouse

Building a data warehouse has become fairly common in today’s world but with varying degrees of success. The reason for the
mixed results is due to the complexity of building the data warehouse in addition to the level of competency of the participants
involved.  

You may be wondering where the complexity is since many data warehouses have already been built. Although data warehouses
have become common place, similar to finger prints, no 2 data warehouses are exactly the same. If this were not true, we would
have shrink wrapped data warehouses by now. Yes, there are vendors who claim through metadata they can build you a data
warehouse overnight but I have not bought into that paradigm. My guess is going the shrink wrapped route would require some
significant overhead but that is a discussion you can have with your salesman when the time comes.

Data Warehouse Definition
A good starting point to building a data warehouse is to define what your interpretation of data warehouse is. When data
warehouses first began to appear in the technical landscape, it was nothing more than a large data store in the hundreds of
megabyte range. At the time, this was huge! The data warehouse was more of an operational data store (ODS) and hadn’t found
its purpose in life.

As the data warehouses grew into the gigabyte range and OLAP tools began to make better use of the data, data warehouses
began to show promise as a valuable asset to a company. Mix in a few statisticians and now you have a really interesting
opportunity at your door step. Moving into the terabyte range and creating the monolithic data warehouse is where we are today
in most cases. With these advances come a slew of new problems however. The biggest challenge as you may have already
guessed is size. This is a key area why many data warehouses have failed to live up to expectations.

To not get crushed by the sheer size of your data warehouse, avoid building the monolithic data warehouse. There are no first
prizes or blue ribbons awarded that I know of for building the single largest data store. View your data warehouse as a virtual
data warehouse environment where you can partition your data across several servers. In Figure 1 below, the data warehouse
consists of 6 physical servers but 1 logical data warehouse. A virtual data warehouse will enhance your overall performance,
disaster recovery, and scalability. And besides, the real integration of your data takes place in the data marts where your front
end tools can manipulate the data as needed.

Additionally, when defining your data warehouse, look beyond the data warehouse itself. Take into consideration the extended
data warehouse environment and what the requirements for those systems are. The data warehouse can not stand on its own
and must have a provider for data and a consumer of data for the data warehouse to be of value.

Data Warehouse Requirements
Your first step to building your data warehouse is to define the requirements. Keep in mind as with any application, poor
requirements lead to a disastrous result. You want to put as much emphasis and due diligence into your requirements gathering
as possible. Why? Because unlike other applications, the data warehouse will become a multi-purpose resource for your
company’s IT needs.

Imagine that you plan to build your data warehouse for sales history. Someone in marketing at some point will ask about
Customer Relationship Marketing (CRM) for Loyalty programs or targeted marketing. Where do you suppose data will come
from to support the CRM system? Next your Customer Care organization wants data for decision making on how to handle the
irate customers on the phone. Your data warehouse has now become a major artery in your company’s bloodstream.

This may or may not be apparent in the initial requirements because data warehousing is new to your organization or you have a
single sponsor. What you must do if no one else does is, ask some of the tough questions. For example:  

1.        What is our objective for building this data warehouse?
2.        What do we hope to achieve from our data warehouse in the next 12 months?
3.        What do we expect to see from our data warehouse over the next five years?

You want to know the unwritten expectations because this must be factored into you data warehouse architecture and data
warehouse design.

Data Warehouse Architecture
The data warehouse architecture is crucial for an efficient and smooth running data warehouse operation. What do we mean by
“efficient” and “smooth running”?

Efficient implies that volumes of data can be processed on a regular basis without encroaching into the next processing period.
Data goes through the Extraction, Transformation and Loading (ETL) as quickly as possible using the least amount of resources,
rejecting bad data, and logging all event status until all sources of data to be processed are completed.

The data warehouse architecture is the enabler for the ETL process and must have the following properties at a minimum:

•        The ETL staging area must be designed to accommodate concurrent active jobs where conflicts in resources will not arise.
•        The staging area must also be flexible enough to handle data of the same type in different formats.
•        Error handling should be a circular process so corrected data can be fed back into the ETL process seamlessly.
•        An automated mechanism to archive input source data in the event the data needs to be recalled.

Smooth Running comes to mind when you think of a well greased machine. Your ETL processes should run flawlessly as in a
“lights out” environment. If you begin to get data errors or job failures, do not waste a moment to trouble shoot the issue. It is
very easy to become overrun with production problems from a data warehouse architecture that is not sufficiently robust for
peak period processing or ETL that was poorly designed.

In Figure 1 you will note the basic components of a data warehouse architecture. In pink are the key components mentioned
thus far. The two other components that need to be called out are the Job Workflow Scheduler and the Data Warehouse itself.

The Job Workflow Scheduler in simple terms schedules and runs jobs. However, there is much more to a job scheduler than just
running jobs. The robustness of the job scheduler is what stands between you and your bed. After all, this is the brains that will
be running your data warehouse while your home sleeping.

In gold color (rightfully so), is the data warehouse. This is your company’s Fort Knox that contains your customer privacy data,
data for business intelligence and dashboards, competitor data and so on. In the ETL staging area focus was given to cleansing
the data and getting it into the data warehouse. In the data warehouse our focus shifts to managing the data and making it
available to the consumers of data (data marts).

Figure 1 – Data Warehouse Architecture






























Data Warehouse Design
Did you ever wonder what would happen if you could not get the data out of the data warehouse? Poor data warehouse design
will make it near impossible to get at your data in the way you want unless detailed planning is built into the design.

The data warehouse design is indirectly governed by the business requirements or at least it should be. Knowing your data and
how it will be used is an imperative for your data warehouse design. A fundamental question can be asked as far as choice of
schema design; should it be a Star Schema or a Normalized Relational model? The answer to this question depends on whether
you are processing state based data or transactional. Mass updates or just a few.

Another question that will surface is whether reporting will come directly from the data warehouse or will data marts be used?
You spend a lot of effort trying to get the data into the data warehouse and equal attention should be applied towards the usage
of the data. This too will influence your choice of schema design.

Whichever schema design you choose, there are a few important best practices you want to make sure are fundamental to your
design.

•        
Surrogate Keys usage should be maximized in your data warehouse design. Surrogate keys will shield your data      
warehouse from upstream source system key value changes as well as conflicts of duplicates key from multiple sources. If your
requirements specify to keep historical values of your data, surrogate keys will help you here as well.
•        Date Time stamp every record. Not doing this will come back to haunt you anytime there is a choice of which record to
choose where the same values exit. The date time stamp will also be crucial in your management of historical data (slowly
changing data).

•        Maintain a
Batch ID or Process ID for each set of records applied to the data warehouse. If you are using a Star schema,
it will be simple to back out data based on this ID. If you are using a Normalized schema, at a minimum you will be able to easily
identify the effected records.

•        Ensure your
primary keys are numeric. Do not use alpha characters or any type of GUID for this value.

•        Always think
generic when designing subject areas in your schema. For example, if you had customer name and address
data from a single source where they only kept the 5 digit zip code and a second source came into play that had the Zip +4, you
would have a maintenance issue. A little foresight will take you a long way!

•        In almost all cases, you want to avoid adding domain values
dynamically.

•        
Think global. Ensure your data warehouse is double-byte compliant from the very beginning.

•        Avoid the data warehouse from becoming bloated with
useless data. Have a plan for archiving infrequently used data.

The single most critical feature of your data warehouse is data integrity. Implementing best practices like the ones previously
mentioned will help with data integrity, but this is an objective of the data warehouse that you must constantly be vigilant of. If
confidence is ever lost by your users in the data you have, the data warehouse is unlikely to survive. View the data warehouse as
something organic. If you nurture and care for it, it will continue to grow and flourish.

Data Warehouse Consultant
Now that you have a general idea of what is involved in building a data warehouse, you can choose to build it yourself or have
someone do it for you. Either way, to increase the odds of getting it right the first time, you may want to invest in a data
warehouse consultant to endorse your design. The potential benefit could result in a significant increase in your ROI by having a
third party review the requirements and proposed design. Having another set of eyes to help you identify potential shortcomings
and suggest alternatives is priceless.

There are many ways to skin a cat just as there are many ways to build a data warehouse. Coming up with the design that best
suites your needs that is “Efficient and Smooth Running” should be your end goal.

Ken Pohl is the president and founder of Endorse Consulting which specializes in minimizing the risk to a project’s ROI by
mitigating potential problems before they arise. His many years of developing and implementing data warehousing,  business
intelligence and CRM systems has provided a wealth of knowledge as proven techniques.  These techniques translate into real
world not from a book “best practices”.