The dark side of generated code. Is there a new hope?

28 October, 2015 10:43:14 AM · Martin

Stopping by the booths at Strata-Hadoop, I was struck by how many tools out there generate code. The idea is that non-technical users can create data flows with visual tools, and the nasty complexity of code is hidden from them. In theory, this sounds great. You save time and, in the short term at least, development resources.

However, there are serious limitations to this approach that may not be obvious at first. For example, what if a complex query returns a totally unexpected result? Zero orders from New York City in 2013? 5,000 customers under the age of 21 in Nebraska? These could be important indicators, or they could be caused either by outliers in the data or by some anomaly in the query. How do you figure out the difference?
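As a toy illustration of a "query anomaly" (the schema, data, and bug here are hypothetical, not taken from any particular BI tool), a generated filter that compares a full timestamp column against a bare year string matches nothing, so the query reports zero orders even though the data is fine:

```python
import sqlite3

# Hypothetical schema: orders stored with full ISO timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, city TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "New York", "2013-03-14T09:30:00"),
     (2, "New York", "2013-07-02T16:05:00"),
     (3, "Boston",   "2013-07-02T11:00:00")],
)

# A generated query that compares the timestamp column to a bare year
# matches nothing -- the data is fine; the query is the anomaly.
broken = conn.execute(
    "SELECT COUNT(*) FROM orders"
    " WHERE city = 'New York' AND order_date = '2013'"
).fetchone()[0]

# Comparing only the year portion of the timestamp restores the real count.
fixed = conn.execute(
    "SELECT COUNT(*) FROM orders"
    " WHERE city = 'New York' AND substr(order_date, 1, 4) = '2013'"
).fetchone()[0]

print(broken, fixed)  # 0 2
```

Without visibility into the generated SQL, the non-technical user only sees the zero and has no way to tell this bug apart from a genuine gap in the data.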

Also, a user might find that a query simply stops working as expected, returning only errors. Data sources may be moving around in the back office or data warehouse, resulting in queries to nowhere.

The obvious response is that somebody needs to analyze the data sources and the ‘data lineage’. Data lineage is the path that the data follows through a series of operations that select, combine, sort, and derive results. Often, you can only understand your business intelligence if you understand your data’s path through life.
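One way to picture data lineage is as a dependency graph: each derived dataset points back at the sources it reads, and tracing lineage means walking that graph upstream. Here is a minimal sketch (the dataset names and the lineage map are invented for illustration):

```python
# Hypothetical lineage map: each derived dataset -> the sources it reads.
lineage = {
    "ny_orders_2013": ["orders_clean"],
    "orders_clean":   ["raw_orders", "customer_dim"],
    "customer_dim":   ["crm_export"],
}

def upstream(dataset, graph):
    """Collect every source a dataset ultimately depends on."""
    seen = []
    stack = [dataset]
    while stack:
        for src in graph.get(stack.pop(), []):
            if src not in seen:
                seen.append(src)
                stack.append(src)
    return seen

print(upstream("ny_orders_2013", lineage))
# -> ['orders_clean', 'raw_orders', 'customer_dim', 'crm_export']
```

A real lineage tool builds a map like this automatically by parsing the SQL itself, which is exactly the part that gets hard when the SQL is machine-generated.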

There are a couple of approaches to doing this today.

1. You can use the ‘platform native’ data lineage tools provided by one of the many ETL solutions available. But they might not provide a great visualization, they tend to be really expensive, and their UX tends to be less than delightful. Also, these tools require developers to adhere to strict development standards. If they deviate, then chunks of your data lineage disappear into black holes.

2. Sometimes you can rely on a metadata manager to understand the lineage of your data. But metadata managers are usually populated manually by developers, so their contents may not reflect what is actually happening. What you really need is a ‘source of truth’--you need to open the hood on the code that is performing the complex query.

3. You don’t want to attempt that manually, unless you have infinite time and infinite resources. Instead, grab SQLdep. SQLdep will analyze your generated code instantly, showing you and your non-technical users what is going on in that ‘black box’ generated by your BI tool. This gives you the ability to maximize your ETL investment and to rely even more heavily on generated code, since with SQLdep you gain the ability to peer into the inner workings of the code and fix things where necessary.

So the choice of an automated data lineage tool does not boil down to a question of whether you hand craft your SQL, or whether you use a tool that generates it for you. In both cases, you can experience time and cost savings, and far fewer headaches, when interacting with your non-coder BI colleagues by using SQLdep.

Read a case study

SQLdep saved GE Money twelve months of work. GE Money was searching for ways to make the development and administration of its 20TB data warehouse faster and more efficient.