SQLdep joins Collibra to build the future of Data Lineage Learn more

Determining data lineage: SQL scripts vs documentation

10 September, 2015 10:35:15 AM · Martin

So, here’s the issue: You’re a data analyst in a large company, working with a datawarehouse and you are part of the team responsible for reporting. You have a fairly complex set of SQL queries that set-up regular reporting, compute various metrics like profitability, prepare numbers according to account regulations like US GAAP, and much much more.

Although there are plenty of tools to help you write SQL code efficiently (like Toad, PL/SQL Developer or Oracle's SQL developer), when you have to maintain a large SQL set over time, things get messy. Different reports are executed with different frequencies - daily/weekly/monthly or perhaps quarterly/annually for some financial reports. In those batches you most likely have have queries that go over 1000 lines of code with plenty of logic inside and some heavy usage of conditioning like CASE - WHEN conditions[remove?]. Over time, your system becomes more complex, so when you want to make changes to your SQL, you run into problems. In the worst case, you risk making changes that have carry-on unintended effects.

A question for you, the reader: What Best Practices help you handle this personally, or in your organization?


Data lineage to the rescue

One source of accurate information on the source of your data, is in the SQL code itself. While code makes for great documentation, it’s very time-consuming to manually follow data lineage and perform impact analysis every time you need to make a change. Since understanding the data lineage is highly recommended - let’s focus on ways that you can get the information you need, as quickly and accurately as possible.

Columns view in SQLdep visualizer.


1. Manually maintain your documentation

Simple, right? Every time you change SQL - you update the doc!

Challenges: I have never seen this work (but I’m a sample of one - what’s your experience with this?). Over time, mistakes are compounded until no one trusts it. From what I’ve seen in nearly every organization: coding trumps documentation nearly every time.


2. Hey IT! Build me a datamart!

One reasonable way to get rid of your over complex queries might be to ask IT to build a datamart. When teams of data analysts compute the same data over and over while applying similar logic, the automation that IT provides can be a nice solution. Let’s use a simple example -- your company has different products (eg: personal bank accounts & loans), and profitability for these products is computed from different columns. You need to consider fees, overdues, interest rates and other factors, and it’s likely that each product stores this data in separate columns. Instead of over burdening data analysts’ SQL queries with this complex and yet routine logic, IT can handle it and simplify your SQL queries in the process.

Challenges: You might end up filing a formal IT request for every minor change, and more complex logic may increase the likelihood of needing to make changes in the future.


3. ETL Tools (Extract, Transform, Load Tools)

Incorporate an ETL tool like Pentaho or the fairly expensive tools from Informatica, IBM, and other vendors. If you do it right, everything will be documented, it will be reasonable to maintain, and you can monitor how your scripts run.

Challenges: Deploying ETL tools can take a fair amount of time, with the potential for a large impact on your departments’ budget. Even after IT approval it may take weeks/months of work to migrate your scripts into such a tool. What has your experience been with ETL tools?


4. Generate Up-to-date documentation based on your SQL queries

Often overlooked solution is to use a tool which reverse-engineers your SQL statements and generates documentation from them. This gives you access to data-lineage and therefore allows you to understand what is happening in your scripts. This is easy to implement and helps you to maintain all the queries you have.

Challenges: Typically, such a tool doesn't help you to monitor how your queries run and perform, but you gain instant help and you will have the complete data-lineage quickly.

Queries View in SQLdep visualizer.



We’ve described 4 methods of discovering data lineage. Which (if any) make sense in your situation?


Questions from above:

  • Overall, what best practices help you to understand your data lineage?
  • What are your experiences with datamarts and ETL tools?
  • Have you created documentation by parsing, analyzing, and visualizing your SQL?
  • Is missing data lineage a hidden burden that you have grown accustomed to?


Read a case study

SQLdep saved GE Money twelve months of work. GE Money was searching for ways to make the development and administration of its 20TB data warehouse faster and more efficient.