Endorse Maximize your ROI, minimize your project risk
Consulting
Data Warehouse Best Practices You Can Learn From A Spider

Although I am not a fan of spiders, I do admire them for their resourcefulness and efficiency. The spider’s cleverness if we take
a closer look at is really a set of best practices that they use over and over again with tremendous success. You may be thinking
that sounds fantastic if you are a spider, but we are not in the bug business. Or are we?

Let’s take a look at some of the characteristics of a spider and see if we can apply these best practices to your data warehouse
development methodology.

Flexibility
A spider’s web can catch a gnat, a moth, a mosquito and about anything that moves within reason using the same web. There is
no doubt about a spiders web being flexible, but what about your system?

Years ago when smoke stack vertical systems were built, they were designed for a very specific function and nothing more.
Data for an inventory system never crossed paths with account receivables. In today’s world it is quite the contrary.

Any data warehousing system built today must be designed with flexibility and the future in mind or its shelf life will be limited.
It is safe to say that besides the certainty of death and taxes, change is also inevitable.

So what are the characteristics of a flexible system? Here are a few ideas to consider:


Design
It’s amazing how such a small insect can be so meticulous in creating a free form web with the precision of an engineer without
any design documents. I’ve never met anyone that has created a perfect system off the top of their head. Although it’s possible,
it is very unlikely. For us mortals, we need to stick with the old fashion method of creating a design that will be reviewed and
continually refined until we are satisfied.

But before jumping into design, let us not forget about what we are designing for. The gospel for a system is the requirements.
The requirements consist of the business and technical version of what is being requested versus what is being delivered.

Of utmost importance is the review and understanding of what the requirements are. Clarify any ambiguities; prioritize the
requirements by “must have” versus “nice to have”. This will be helpful when it comes to developing a timeline if you need to
deliver in phases. What you want to ensure takes places before you signoff on the requirements is to perform a Requirements
Traceability. This is basically a mapping of  business requirements to technical requirements down to the lowest level of detail
possible.

What this does for you is guarantees that when your application is developed, missing functionality does not surface towards the
end of your development cycle. Because this is important, I’ve inserted the following illustration as an example. Notice the red
circle with a question mark. This indicates a misaligned set of requirements; specifically a Business Requirement with no
corresponding Technical Requirement. Be on the look out for situations like this.













Reusability
A good way to keep increasing your ROI is to get better use of existing software components. Every time a piece of software
can be reused for another source of data, your ROI goes up.

To what extent is your system components reusable? If you had 5 source inputs of the same type (i.e. customer name and
address), how would you handle this? Somewhere in the processing of this data there must be common logic and that should be
your target for reusability. So when your sixth or hundredth source input comes in, your only concern should be to get the data
to a common format that you can manage.

You can improve you reusability even more in the above example if you can insist that a common input format be used by all
sources.  

Efficiency
Can your application process the prior day’s data and make it available for consumption before the start of the new day? This
becomes an even bigger challenge if you are dealing with real-time data availability.

Systems with low volumes of data are able to manage their routine processing even with great inefficiencies present. This is
simply because the volumes of data are small enough to where inefficiencies can be tolerated.

But what if you were the more typical application where data is growing on a regular basis? My guess is that you could benefit
from some efficiency improvements if they were found. A few places you may want to dig into are your job workflow and input
processing of data.

Processes that run your application most probably have dependencies. Look to see where you can safely remove these
dependencies so data can be processed in parallel as much as possible. If this is not possible, you may want to rethink your
design so data streams can be more independent.

A good portion of your application resources are spent on some form of Extraction, Transformation and Loading (ETL).
Examine what is taking place inside each of these ETL processes. Inefficiencies to look for are how validation of data is
performed. If your application is making several passes of the same data, one for each edit check, you may want to rethink the
logic. Are indexes created on temporary tables for quicker scans or joins? Scrutinize the heck out of your ETL since this will
most likely be your main bottleneck.

Robustness
My spider friends really excel here. Their web can take a real pounding from some pretty large insects in comparison. Even the
smallest of spiders can put together a web than can catch about anything that crawls or flies into their trap. Other than a bird,
not much can pass through a spider’s web and not get tangled.

How robust is your system? Does a little bad data cause your application to fall to its knees? Or does your application just toss
the bad data aside? The worst case of course, is that your application processes the garbage and doesn’t even know it.

An application’s robustness is its ability to handle whatever is thrown at it. This does not mean every situation has special logic.
It does mean that anything that touches the system will be handled according to a set of rules. Valid data is handled as it should
be and anything else is handled by some reject logic. If the rules state to discard questionable data, this is fine if that is what the
business desires. The important point here is that there is no gray area. Data is either valid or invalid and handled accordingly. If
a shade of gray data comes into play, the application should know how to handle it.

Maintainability
This subject is actually a super group of all the other characteristics combined. If any part of your application breaks or needs to
be refined, you never want to be in the position where a tweak turns out to be a major initiative. Your goal is to avoid hard
coding values, modularize wherever possible, maximize reusability, and comment the heck out of your code.

If you follow the preceding best practice objectives, maintainability will come easily. This is analogous to exercising and eating
right will lead to good health. If you stick with the program, just like in health, you will see the results and be in a position to reap
the benefits from all your hard work

Disclaimer
For the record, I am by no means an insect expert so please don’t hold me to any facts on spiders. I used spiders as an example
in the above article due to the spider’s work ethics and their similarities to data warehouse best practices.


Ken Pohl is the president and founder of Endorse Consulting which specializes in minimizing the risk to a project’s ROI by
mitigating potential problems before they arise. His many years of developing and implementing data warehousing,  business
intelligence and CRM systems has provided a wealth of knowledge as proven techniques.  These techniques translate into real
world not from a book “best practices”.