Sampling Method of Data Validation
The most popular data validation method provides little coverage and carries a lot of risk.
That method is known as Sampling (the other common method being Minus Queries). The Sampling Method, also known as Stare & Compare, is well-intentioned, but it carries a serious risk: large data flows are never fully tested.
Sampling commonly uses the following process:
Review the Business Rules / Mappings
Business Rules are also known as Mappings or Source-to-Target Mappings and are typically found in a Mapping Document. The mapping tables in the document are the requirements or rules for extracting, transforming (if any transformation is needed) and loading (ETL) data from the source databases and files into the target data warehouse or Big Data store. Specifically, the mapping fields show (a sample entry follows the list below):
- Table names, field names, data types and lengths of both source and target fields
- How source tables / files should be joined in the new target data set
- Any transformation logic that will be applied
- Any business rules that will be applied
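As a hypothetical illustration (the table names, field names and rules below are invented for this example, not taken from any real mapping document), a single mapping entry might look like this:

```
Source Field             Type          Target Field                  Type          Transformation
crm.customer.cust_id     INTEGER       dw.dim_customer.customer_id   INTEGER       none (straight move)
crm.customer.first_nm    VARCHAR(50)   dw.dim_customer.first_name    VARCHAR(50)   TRIM, then UPPER
crm.customer.last_nm     VARCHAR(50)   dw.dim_customer.last_name     VARCHAR(50)   TRIM, then UPPER
```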
Create Test Cases
Each mapping will typically have its own test case, and each test case will typically have two SQL queries (or HQL queries for Hadoop). One query extracts data from the sources (flat files, databases, XML, web services, etc.) and the other extracts data from the target (the data warehouse or Big Data store).
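For example, a test case for the hypothetical customer mapping sketched above might pair two queries like the following (all table and column names are invented for illustration; a real test case would follow the project’s own Mapping Document):

```sql
-- Source-side query: re-applies the mapping's transformation rules to the raw source data
SELECT cust_id               AS customer_id,
       UPPER(TRIM(first_nm)) AS first_name,
       UPPER(TRIM(last_nm))  AS last_name
FROM   crm.customer;

-- Target-side query: reads the rows as they were actually loaded into the warehouse
SELECT customer_id,
       first_name,
       last_name
FROM   dw.dim_customer;
```

If the ETL ran correctly, the two result sets should match row for row.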
Execute Tests, Export Results
These tests are typically executed using a SQL editor such as Toad, SQuirreL or any other favorite editor.
The result sets from the two queries are saved into two Excel spreadsheets.
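As one concrete illustration of the export step (assuming a PostgreSQL database and the psql client, neither of which the process above requires; Toad, SQuirreL and other editors have their own export features), each query’s output can be written to a CSV file that Excel opens directly:

```sql
-- psql meta-commands; the file names are hypothetical
\copy (SELECT cust_id, UPPER(TRIM(first_nm)), UPPER(TRIM(last_nm)) FROM crm.customer) TO 'source_results.csv' WITH CSV HEADER
\copy (SELECT customer_id, first_name, last_name FROM dw.dim_customer) TO 'target_results.csv' WITH CSV HEADER
```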
Compare Results
Compare every result set in the source spreadsheet with its counterpart in the target spreadsheet by eye (hence the name “Stare & Compare”). This means a lot of scrolling to the right to compare dozens, if not hundreds, of columns, and a lot of scrolling down to compare tens of thousands or even millions of rows.
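For contrast, the other method mentioned earlier, Minus Queries, replaces the eyeball comparison with set arithmetic. A minimal sketch, reusing the hypothetical tables from the earlier example and assuming both sides can be reached from a single connection (for instance, after staging the source extract next to the warehouse):

```sql
-- Rows the source-side transformation produces that are missing or different in the target
SELECT cust_id, UPPER(TRIM(first_nm)), UPPER(TRIM(last_nm)) FROM crm.customer
EXCEPT  -- MINUS on Oracle
SELECT customer_id, first_name, last_name FROM dw.dim_customer;

-- Rows in the target with no matching source row (the comparison must be run both ways)
SELECT customer_id, first_name, last_name FROM dw.dim_customer
EXCEPT
SELECT cust_id, UPPER(TRIM(first_nm)), UPPER(TRIM(last_nm)) FROM crm.customer;
```

If both queries return zero rows, the two result sets match exactly, with no scrolling required.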
Primary Issue with Sampling or ‘Stare & Compare’
It is impossible to effectively compare billions of data values (hundreds of columns by millions of rows) across two separate spreadsheets by eye.
The result of this method is that usually less than 1% of the data is actually compared. Since many companies rely on Business Intelligence (BI) to make strategic decisions in the hope of gaining a competitive advantage in a tough business landscape, bad data leads to decisions that can cost those firms millions of dollars.
- $8.2 million: what the average organization loses annually because of bad data
- $20 million: 22% estimated their annual loss from bad data at this amount
- $100 million: 4% put that figure at this astounding amount
Case Study: Healthcare Company
A healthcare company approached us with the following dilemma:
- they had more than 7,000 total tests
- their tests were for very large data sets
- the largest test had 100 million rows and 200 columns = 20 billion data values
- they were utilizing the Sampling Method for data validation
- they had no practical way to manually verify more than a fraction of the data without an automation process
- they were experiencing many instances of bad data
After trialing QuerySurge and seeing how much more data validation they could perform, with much shorter testing cycles, they incorporated QuerySurge into their formal ETL process. See more details of the Case Study here»
QuerySurge - The Automated Data Validation & Testing Solution
QuerySurge is the leading solution for automating the testing of Data Warehouses and Big Data. QuerySurge ensures that the data extracted from data sources remains intact in the target data store by analyzing and quickly pinpointing any differences. QuerySurge also makes it easy for both novice and experienced team members to validate their organization’s data quickly through our Query Wizards, while still allowing power users to write custom code.
All of this comes with Deep-Dive reporting, a Data Health dashboard, Test Management integration, DevOps compatibility and the ability to connect to any data source. QuerySurge is everything you need to automate the testing and validation of your project’s data.
Learn more about QuerySurge here»
The Benefits of QuerySurge
- Easily automate your manual testing effort for repeatability
- Provide testing across different platforms: data warehouses, Hadoop and NoSQL stores, traditional databases, flat files, Excel, web services, JSON, XML and others
- Speed up testing by up to 1,000x while providing up to 100% data coverage
- Support Continuous Delivery with an out-of-the-box DevOps solution for most Build, ETL & QA management software
- Deliver shareable, automated email reports and data health dashboards
- Provide a huge Return On Investment (ROI), as much as 1,600%
But don’t just take our word for it (or our clients’). Try it for yourself.
Check out our free trials and great tutorial