Part of the joy of working as an ETL developer is that you get to spend a lot of time testing your code by validating large amounts of data. We have a testing tool (which I maintain) with several built-in tests, but we periodically discover a new scenario that needs to be added to our toolbox. The fastest way I’ve found of mocking up these tests is with dynamic SQL using the INFORMATION_SCHEMA views.
Firstly, what are the INFORMATION_SCHEMA views? These are an ISO-standard way of querying database metadata that stays relatively static throughout the life of the DBMS. There can be improvements and changes in the underlying system tables, but INFORMATION_SCHEMA should remain relatively unchanged. I like them because they bring together information from various system tables and views such as sys.indexes or sys.tables in a way that doesn’t require me to write the join logic as often. For mocking up a test query, they’re fantastic.
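To get a feel for what they return, here are two bare-bones queries against the views; the column names come straight from the standard, and the same information pulled from the sys.* catalog views would take a join or two:

```sql
-- One row per table or view in the current database.
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES;

-- One row per column, with the table it belongs to already attached.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS;
```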
Let’s say that you have a column in most of the tables in your database called ETL_PROCESSED_TIMESTAMP that does what it says on the tin, and an ACTIVE_ROW_FLAG that tells you which row is the most recent record for that entity. You’ve discovered that Randy the unlucky intern accidentally deleted a row in one of your ETL control tables on the development server, and some of your test loads may not have any active rows loaded for that cycle. Randy doesn’t remember touching that table, and you tested fifteen table load packages over the past week. How do you go about determining which tables are affected?
You could write fifteen queries to see if you have any active rows that were touched during the latest load, but that’s a pain with even three tables, and it only gets worse as the count scales up. Also, it took you a week to discover this problem, and it could happen again in the future. A better solution is to generate the query you need automatically, which makes it easy to stick in a testing tool – no custom code needed, just plug in some T-SQL.
That’s where the INFORMATION_SCHEMA views come in. We can make use of INFORMATION_SCHEMA.COLUMNS to find tables that have both an ETL_PROCESSED_TIMESTAMP and an ACTIVE_ROW_FLAG, then build a query from that and INFORMATION_SCHEMA.TABLES to tell us when there are no new active rows in a table. The first place to start is finding tables with the interesting columns:
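A minimal version of that query, assuming we only care about base tables (the join to INFORMATION_SCHEMA.TABLES is there to filter out views):

```sql
-- Find base tables that have BOTH audit columns.
SELECT c.TABLE_SCHEMA,
       c.TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS AS c
    INNER JOIN INFORMATION_SCHEMA.TABLES AS t
        ON t.TABLE_SCHEMA = c.TABLE_SCHEMA
       AND t.TABLE_NAME = c.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
  AND c.COLUMN_NAME IN ('ETL_PROCESSED_TIMESTAMP', 'ACTIVE_ROW_FLAG')
GROUP BY c.TABLE_SCHEMA, c.TABLE_NAME
HAVING COUNT(DISTINCT c.COLUMN_NAME) = 2;  -- require both columns, not just one
```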
That’s a good start to any testing query. Looking at the raw output lets you determine where you need to make modifications, such as querying only a certain schema, or only tables with a certain naming scheme. For my purposes, I will assume I want to work with everything I see here, so the next step is to write some dynamic SQL that will generate my test for me:
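Here’s a sketch of what that generator can look like. Note that @CutoffDate is a stand-in for however you identify the start of the load window under test, and I’m assuming ACTIVE_ROW_FLAG = 1 marks the active row; adjust both to match your environment:

```sql
-- Build one test query per qualifying table.
DECLARE @CutoffDate datetime = '2015-06-01';  -- stand-in for the load window start

SELECT 'SELECT COUNT(*) AS ActiveRowCount FROM '
     + QUOTENAME(c.TABLE_SCHEMA) + '.' + QUOTENAME(c.TABLE_NAME)
     + ' WHERE ACTIVE_ROW_FLAG = 1'
     + ' AND ETL_PROCESSED_TIMESTAMP >= '''
     + CONVERT(varchar(23), @CutoffDate, 121) + ''';' AS TestQuery
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.COLUMN_NAME IN ('ETL_PROCESSED_TIMESTAMP', 'ACTIVE_ROW_FLAG')
GROUP BY c.TABLE_SCHEMA, c.TABLE_NAME
HAVING COUNT(DISTINCT c.COLUMN_NAME) = 2;
```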
This is a fairly good start, and gives the output as a series of SQL queries that can be copied and run individually, or run as a whole in SSMS. You could write a cursor to go through the result set and call exec() on each query, or you could get a little fancier:
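One way that fancier version can go, as a sketch with the same assumptions as before (@CutoffDate and the flag value are stand-ins), using ROW_NUMBER to know when we’ve reached the last table so we stop appending UNION ALL:

```sql
-- Build a single UNION ALL query covering every qualifying table.
DECLARE @CutoffDate datetime = '2015-06-01';  -- stand-in for the load window start
DECLARE @sql nvarchar(max);

WITH TargetTables AS
(
    SELECT c.TABLE_SCHEMA,
           c.TABLE_NAME,
           ROW_NUMBER() OVER (ORDER BY c.TABLE_SCHEMA, c.TABLE_NAME) AS rn,
           COUNT(*) OVER () AS cnt
    FROM INFORMATION_SCHEMA.COLUMNS AS c
    WHERE c.COLUMN_NAME IN ('ETL_PROCESSED_TIMESTAMP', 'ACTIVE_ROW_FLAG')
    GROUP BY c.TABLE_SCHEMA, c.TABLE_NAME
    HAVING COUNT(DISTINCT c.COLUMN_NAME) = 2
)
SELECT @sql = (
    SELECT N'SELECT ''' + t.TABLE_SCHEMA + N'.' + t.TABLE_NAME
         + N''' AS TableName, COUNT(*) AS ActiveRowCount FROM '
         + QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME)
         + N' WHERE ACTIVE_ROW_FLAG = 1 AND ETL_PROCESSED_TIMESTAMP >= @CutoffDate'
         -- the ordering lets us skip the UNION on the final row
         + CASE WHEN t.rn < t.cnt THEN N' UNION ALL ' ELSE N'' END
    FROM TargetTables AS t
    ORDER BY t.rn
    FOR XML PATH(''), TYPE
).value('.', 'nvarchar(max)');

-- One round trip: run the combined query with the cutoff passed as a parameter.
EXEC sys.sp_executesql @sql, N'@CutoffDate datetime', @CutoffDate = @CutoffDate;
```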
That guy will build you an all-in-one query that pulls from all of the tables and combines everything into one result set with the number of rows it found and the name of the table it queried. I added some logic to order all the rows so that it automatically stops adding UNIONs when it gets to the last row. The thing to keep in mind here is that if there are a lot of tables and these columns are not indexed, this query could take a while. You could add some WHERE clauses to limit it to certain schemas or even particular tables. You could also wrap the results in a CTE and show only the rows where the count equals zero, to zoom in on the troublemakers. It’s also possible to have it build in GOs for you, which means you start getting results earlier but go back to getting them one batch at a time. These are all left as exercises for the reader.
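As a head start on the CTE exercise, here’s a two-statement sketch that reuses the @sql string built above:

```sql
-- Wrap the generated UNION query in a CTE and keep only the empty tables.
SET @sql = N'WITH TestResults AS (' + @sql + N') '
         + N'SELECT TableName, ActiveRowCount FROM TestResults WHERE ActiveRowCount = 0;';

EXEC sys.sp_executesql @sql, N'@CutoffDate datetime', @CutoffDate = @CutoffDate;
```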