Sunday, December 2, 2012

*Un*expected traps in an ETL Project

I am writing this post as a supplement to this post by Bjørn Eilertsen. Once we get comfortable with SSIS, we tend to assume we can handle any ETL project with ease, especially when we have already developed quite a few reusable modules that save us a significant amount of time. So we go lower on our estimates in order to win the contract and impress the client. It is true that good experience with SSIS lowers the development time of an ETL project, but some factors that have nothing to do with SSIS are the main reasons for delays and for development hours shooting off the chart. Besides listing these issues, I will also point out a few ways to minimize, if not avoid, the delay.

Data/Environment access: As consultants, when we come in to a client to work on a project, their data (or sometimes their client's data) can be sensitive and may not be available to us until we have completed a long list of tasks to gain access to it. There is no guarantee of how long this will take, so start early! This is the first of the many traps in an ETL project, so the earlier it is taken care of, the more time there is to deal with the other potential issues.

Data Schema: When the sample data is not immediately available, the client may try to appease us by giving us the data schema and claiming that the sample data will strictly follow it. It would be nice if it does, but having a plan B in case the sample data deviates from the received schema will definitely save a lot of wasted time.

Data Quality: In a perfect world, all data is clean, just as in a perfect society everyone lives happily and there is no crime. We all know that is impossible. Be prepared to spend some quality time assuring the quality of the data. If the SSIS packages implement data control and spit out the rows that contain unexpected data, you are halfway done.
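To make the "plan B" and the "data control" ideas a bit more concrete, here is a minimal sketch of the approach. It is written in plain Python rather than SSIS, and the column names and formats in EXPECTED_SCHEMA are made up for illustration. Every incoming row is checked against the schema the client promised, and rows that deviate are set aside together with a reason, which is loosely what redirecting rows to an error output does in an SSIS data flow.

```python
from datetime import datetime

# Hypothetical schema (an assumption for this sketch): column name -> parser.
# Each parser raises ValueError or TypeError when the value does not fit.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "order_date": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
    "amount": float,
}


def validate_row(row):
    """Return (parsed_row, None) if the row fits the schema, else (None, reason)."""
    missing = [col for col in EXPECTED_SCHEMA if col not in row]
    if missing:
        return None, "missing columns: " + ", ".join(missing)
    parsed = {}
    for col, parse in EXPECTED_SCHEMA.items():
        try:
            parsed[col] = parse(row[col])
        except (TypeError, ValueError):
            return None, f"bad value in {col}: {row[col]!r}"
    return parsed, None


def split_rows(rows):
    """Separate rows into clean rows and rejects annotated with a reason."""
    accepted, rejected = [], []
    for row in rows:
        parsed, reason = validate_row(row)
        if reason is None:
            accepted.append(parsed)
        else:
            rejected.append({"row": row, "reason": reason})
    return accepted, rejected
```

Calling split_rows on the first sample file gives a list of rows that are safe to load and a list of rejects, each with its reason. The reject list is what you would take to the person responsible for the source data.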
The rest may take anything from a few emails with the person responsible for the source data to a month of sitting down to figure out how the data entities are connected. Connections between data entities are tricky, especially when the entities come from different sources, and they should not be taken lightly.

Data Model: The destination data model may seem perfectly under the control of the SSIS developer, and it is definitely possible to define it before the project starts. However, should there be any changes to it, there is a risk of facing all the challenges listed above again. This risk is high because the data model usually serves as the base for data analysis/mining or for an application that uses the data, so a glitch in the specification process will likely cause a change in the data model.

Most of these potential traps are difficult to avoid, but there is no point in feeling discouraged. I believe it is these traps that make the existence of an SSIS developer worthwhile. With careful planning, an accurate estimate and a good plan B, the traps can be overcome with relatively little or no pain.
