Wednesday, 13 February 2013

Top 3 Challenges in using Production data in Test Environments

In my previous post, "How to create Test Data", I explained the concept of creating test data directly from production data.  This post concentrates on the top 3 challenges in using production data for testing purposes.

Data Security

This is by far the most crucial challenge of using Production data in Test Environments.  Production data can contain a lot of sensitive information.  Even though the data sets in the Production database are rich, using them for testing carries a lot of risk.  For example, if you are testing an application for a bank, production data will contain real customer information such as Names, Addresses, Account Numbers, Balances, Credit Card Numbers, etc.  Using this data for testing exposes the bank to huge security risks.  So how do we overcome this?  The answer is Data Masking.

Data Masking is the process of hiding or replacing the sensitive fields in a data set so that the real values are not exposed.  Please read my future post on Data Masking and the Techniques used for Data Masking for more details.  The following figure depicts the data security challenge and the approaches.

[Figure: Data Security Challenge]
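
To make the masking idea concrete, here is a minimal sketch in Python.  It is only an illustration under stated assumptions, not a description of any particular masking product: the field names, the `SENSITIVE_FIELDS` set, the salt and the helper functions are all hypothetical.  It replaces each sensitive value with a deterministic one-way hash token, which is just one of several possible masking techniques.

```python
import hashlib

# Hypothetical set of sensitive columns; a real project would derive this
# from its own data classification rules.
SENSITIVE_FIELDS = {"name", "address", "account_number", "card_number"}


def mask_value(value: str, salt: str = "test-env-salt") -> str:
    """Replace a value with a deterministic one-way hash token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "MASKED_" + digest[:12]


def mask_record(record: dict) -> dict:
    """Return a copy of the record with the sensitive fields masked."""
    return {
        key: mask_value(str(val)) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }


# Made-up sample record, not real customer data.
customer = {
    "name": "Jane Doe",
    "address": "1 Main Street",
    "account_number": "0012345678",
    "balance": 2500.75,
}
masked = mask_record(customer)
print(masked)
```

Because the same input always produces the same token, a masked account number still matches across tables.  A fixed salt with short, guessable values is not strong protection on its own, so treat this purely as a sketch of the idea.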

Data Volumes

Another major challenge in using Production data for testing is that the data volumes involved are huge.  Taking the example of a bank, the database will contain data for a huge number of customers, along with every transaction those customers have made.  Even a very simple case of 100K customers making an average of 5 transactions per month generates about 500K transaction records per month.  Production data will contain transactions right from the inception of the bank.  Just imagine the scale of data that needs to be loaded into the Test Region if all the data is to be moved.  
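
As a rough back-of-the-envelope check, the arithmetic can be sketched in a few lines of Python.  The 100K customers and 5 transactions per month come from the example above; the 20 years of history is purely an assumed figure for illustration.

```python
def estimate_transactions(customers: int, per_customer_per_month: int, years: int) -> int:
    """Estimate the total transaction records accumulated over a number of years."""
    monthly = customers * per_customer_per_month
    return monthly * 12 * years


monthly_records = 100_000 * 5                          # figures from the example
total_records = estimate_transactions(100_000, 5, 20)  # 20 years is an assumption
print(f"{monthly_records:,} records per month")        # 500,000 records per month
print(f"{total_records:,} records over 20 years")      # 120,000,000 records over 20 years
```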

This method of moving the entire production data into the test region is called Production Cloning and has several disadvantages, such as increased load time and increased disk costs.  The post "Challenges in Production Cloning Approach" describes the challenges in detail.  So how do we overcome this challenge?  The answer is Data Subset / Data Sampling, where you load only a subset or portion of the production data into the Test database.  Please read my future post on Data Subset / Data Sampling for more details.
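
A minimal sketch of the subsetting idea in Python follows.  It assumes a simple, hypothetical layout with in-memory customer and transaction records linked by a `customer_id` field; a real subsetting tool would work against the actual database schema.  The sketch samples a fraction of the customers and keeps only the transactions that belong to them.

```python
import random


def subset_customers(customers, transactions, fraction, seed=42):
    """Sample a fraction of customers and keep only their transactions."""
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    sample_size = max(1, round(len(customers) * fraction))
    sampled = rng.sample(customers, sample_size)
    kept_ids = {c["customer_id"] for c in sampled}
    kept_transactions = [t for t in transactions if t["customer_id"] in kept_ids]
    return sampled, kept_transactions


# Made-up data: 100 customers with exactly 5 transactions each.
customers = [{"customer_id": i} for i in range(1, 101)]
transactions = [{"txn_id": n, "customer_id": (n % 100) + 1} for n in range(500)]

sub_customers, sub_transactions = subset_customers(customers, transactions, 0.10)
print(len(sub_customers), len(sub_transactions))  # 10 50
```

Selecting the parent records first and then pulling the related child records is what keeps every transaction in the subset attached to a customer.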

Data Sources

Another major challenge is the variety of data sources.  For example, in a real-world enterprise application, the data could come from multiple sources: RDBMSs like Oracle, DB2, SQL Server, Sybase, Informix, etc.; file sources such as Excel, Flat Files, Mainframe delimited files, EDI files, etc.; and other sources such as Web services.  Worse, there will be relationships between the data that flows to and from these data sources.  Hence, while loading the production data to the test region, utmost care should be taken to maintain the data relationships and data integrity.  Please read my future post on Data relationships and their effects on TDM approach.
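
One simple way to check that the relationships survived the load is to look for orphan records, that is, child rows whose parent is missing.  The Python sketch below is a hypothetical illustration: the `find_orphans` helper and the record layout are assumptions, and the rows could equally have come from a database, a flat file or a web service.

```python
def find_orphans(child_rows, parent_rows, foreign_key, parent_key):
    """Return the child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Made-up sample data loaded into the test region.
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"txn_id": 10, "customer_id": 1},
    {"txn_id": 11, "customer_id": 3},  # no matching customer, so an orphan
]

orphans = find_orphans(transactions, customers, "customer_id", "customer_id")
print(orphans)  # [{'txn_id': 11, 'customer_id': 3}]
```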

About the Author

Rajaraman Raghuraman has nearly 8 years of experience in the Information Technology industry focusing on Product Development, R&D, Test Data Management and Automation Testing.  He has architected a TDM product from scratch and currently leads the TDM Product Development team in Cognizant.  He is passionate about Agile Methodologies and is a huge fan of Agile Development and Agile Testing.  He blogs at Test Data Management Blog & Agile Blog.  Connect with him on Google+


  1. Agreed with the challenges in Data Sources. I've dealt with a lot of orphan entities in the test region raising false alarms because of corrupt data relationships, which is a complete waste of time. It is also possible that a valid defect could be ignored by citing bad data in the test region. Testers need to be very careful in those instances.

  2. Yes Logu. Actually, maintaining data integrity is one of the critical bottlenecks in TDM. Many clients claim that masking is typically an easier solution than subsetting, because subsetting needs to first source the right data for the masking engine and must do so without breaking the relationships.

    Rajaraman R
    TDM Blog || Agile Blog