Wednesday, 20 February 2013

Implementation Approaches to Data Sub-setting

In one of my previous post, I described about the process of Data Subset.  In this post we will focus on the implementation approaches to data sub-setting.

There are 3 broad categories in which you can implement sub-setting.

SQL Query based approach

In this approach, we will use SQL queries to fetch the subset of the production data and load them it into the target environment.  Lets assume you have 2 tables in your production from which you need to create a small subset.  The following shows the relationship of the tables Customers and Orders where they are related through the custid field.  



The picture also shows the sample data within those tables.  So we need to subset this.  We find out a sample condition.  Lets assume we will pull out only the customer ids which are odd numbers.  A simple query will do this trick.  The following will be the query for the Customers table.



SELECT *  FROM [TDMMock].[dbo].[Customers] where custid % 2 = 1

This query will return the customer ids which are odd numbers.  So in our case, custid - 1,3,5,7,9 will be in the result set.  But the trick here is, the orders table should also contain orders only pertaining to custids 1,3,5,7,9.  So how to do that.  SQL Query again to the rescue.  The following will be the query for the Orders table.


SELECT * FROM [TDMMock].[dbo].[Orders]
WHERE order_custid in
(SELECT custid FROM [TDMMock].[dbo].[Customers] WHERE custid % 2 = 1)


What this simple query does is fetches only the orders based on the parent customer's subset conditions.  This concept applies to how much ever tables that are under consideration and it forms the basic building block of any Data Subset solution / algorithm.  The following picture depicts the queries and the corresponding result sets.  














As you can see the data in the customers and orders table are in sync only containing custid 1,3,5,7,9.  This is a valid subset.

Bottom line is 

  • We now have a valid subset of both Customers and Orders table
  • Data integrity is still intact as the foreign key is still in sync

Thus with the help of simple and in a real time project, more complex SQL Queries, we can do effective sub-setting.  However there are a few technical challenges in this approach which I would probably detail in another post.

Pros:
  1. Quick to build especially with Database knowledge
  2. Easier to understand for the DBAs
Cons:
  1. Difficult to maintain with changes.  Changes in the subset criteria in a parent will affect all the subsequent child queries also.
  2. Need to have knowledge of the data model & good working knowledge of SQL Queries, especially when needed to use in a medium to large project.
  3. More complications in case of multiple data sources.
  4. Need to optimize the queries used or else it can hamper the performance.
Custom solution approach

In this approach, a customized solution can be built according to the project needs and it is used in the project.  The technology depends on the project requirements and the expertise levels within the team.  For example, a Java or .NET based UI can be developed on top of the SQL Queries, so that the user need not worry about the inner details of how the query is going to be constructed.

Pros:
  1. Takes time to build depending on the nature of the project
  2. Can be maintained like a regular development project
  3. Can be cost effective for small to medium sized projects
Cons:
  1. Needs both DB as well as development knowledge
  2. Maintenance costs associated with maintaining the solution
  3. Need to worry about optimization and performance or else it can hamper the data load time.
  4. More complications in case of multiple data sources

Commercial Tool based approach

Maintaining SQL scripts and maintaining a custom solution has its own demerits.  Hence rather than reinventing the wheel, a commercial TDM tool (Subset feature) can be used for the subsetting needs of the project.  There are many such tools in the market.  However tools such as Informatica ILM TDM, IBM Infosphere Optim, GridTools DataMaker are the market leaders in this space.  They are optimized for high performance and are feature rich and supports multiple data sources.

Pros:
  1. Scalable, Robust & proven solutions
  2. Support for a large number of data sources and platforms
  3. Feature rich
  4. Support model available
Cons:
  1. High costs usually associated with them
  2. Learning curves can be steep for some of the tools.

As we saw there are different implementation approaches and each has its own pros and cons.  All these factors need to be taken into consideration before making a final decision on the approach.

Thanks for the read.  Hope the information in this post was useful.  Comments are welcome.  If you have any questions / alternate approaches, please feel free to add a comment.


About the Author

Rajaraman Raghuraman has nearly 8 years of experience in the Information Technology industry focusing on Product Development, R&D, Test Data Management and Automation Testing.  He has architected a TDM product from scratch and currently leads the TDM Product Development team in Cognizant.  He is passionate about Agile Methodologies and is a huge fan of Agile Development and Agile Testing.  He blogs at Test Data Management Blog & Agile Blog.  Connect with him on Google+

No comments:

Post a Comment