Friday 15 February 2013

Data Subset in TDM

In my previous post, I discussed the Challenges in Production Cloning approach.  In this post, we will focus on its solution, the Data Subset process / Data Sub-setting.

Data subset is the process of slicing a part of the Production Database and loading it into the Test Database.  For ex. instead of cloning a 50 TB production database, create a subset that is only 50 GB worth data and put it back into the Test Database.  Lets assume in a retail application, you have a Customers table having 10 million customers and Orders table having 100 million orders and 100 million other transaction tables, our subset process will try to shrink the sizes to good reasonable limits as depicted in the picture below.















Advantages of data sub-setting



  1. Reduced storage costs
  2. Reduced load time/Test data refresh time
  3. Reduced Infrastructure costs
  4. Massive savings in terms of multiple environments
The below picture depicts the advantages of data subset process:


Note: The above picture assumes that we are loading the subset of production data into 3 different regions (DEV, UAT and QA1).

Challenges in Data Subset

  • Referential Integrity
    • A real time database involves a lot of referential integrity between tables.  So lets assume, we are fetching only 100K Customers, we need to fetch orders only for those 100K customers and not all the orders from the production database.  In short, the criteria that is applied on a Parent table should get cascaded down to all the child tables.

  • Cross DB (Federated) relationships
    • In the previous point, I explained about the challenge in maintaining data integrity of the subset of data.  Whats worse, the data relationships can span across databases (For ex.  customers database can be in a Oracle database, and orders database can be in SQL Server).  This is called Federated relationships and is even more complicated to handle.
  • Data Relationships across multiple sources
    • If the previous two points were not enough, let me remind the fact that data can be from multiple sources, and so are the relationships.  For ex. a vendor might provide a data feed in a flat file format or an output from a Mainframe system might be a fixed length format, and there might be a relationship between the data in these files and the data residing in a Oracle database.  Handling all these relationships can be quite tricky in nature.

As I described in this post, the Subset process can save a lot of costs for the organization but have many challenges.  How to tackle those challenges?  What is the approach that we can follow for Data Sub-setting? I will try to throw more light in another detailed post.


About the Author

Rajaraman Raghuraman has nearly 8 years of experience in the Information Technology industry focusing on Product Development, R&D, Test Data Management and Automation Testing.  He has architected a TDM product from scratch and currently leads the TDM Product Development team in Cognizant.  He is passionate about Agile Methodologies and is a huge fan of Agile Development and Agile Testing.  He blogs at Test Data Management Blog & Agile Blog.  Connect with him on Google+

3 comments: