Sampling Method of Data Validation
The most popular data validation method provides little coverage and carries a lot of risk.
That method is known as Sampling (the other common method being Minus Queries). The Sampling Method, also known as Stare & Compare, is well-intentioned, but it carries a serious risk: large data flows are never fully tested.
Sampling commonly uses the following process:
Review the Business Rules / Mappings
Business Rules are also known as Mappings or Source-to-Target Mappings and are typically found in a Mapping Document. The mapping tables in the document are the requirements or rules for extracting, transforming (if any transformation is needed) and loading (ETL) data from the source databases and files into the target data warehouse or Big Data store. Specifically, the mapping fields show (a sample entry follows the list below):
- Table names, field names, data types and lengths of both source and target fields
- How source tables / files should be joined in the new target data set
- Any transformation logic that will be applied
- Any business rules that will be applied
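As a hypothetical illustration (the table names, field names and rules below are invented for this example, not taken from any real mapping document), a single mapping entry might look like this:

```
Source Field             Type          Target Field                  Type          Transformation
crm.customer.cust_id     INTEGER       dw.dim_customer.customer_id   INTEGER       none (straight move)
crm.customer.first_nm    VARCHAR(50)   dw.dim_customer.first_name    VARCHAR(50)   TRIM, then UPPER
crm.customer.last_nm     VARCHAR(50)   dw.dim_customer.last_name     VARCHAR(50)   TRIM, then UPPER
```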
Create Test Cases
Each mapping will typically have its own test case, and each test case will typically have two SQL queries (or HQL queries for Hadoop). One query extracts data from the sources (flat files, databases, XML, web services, etc.) and the other extracts data from the target (the data warehouse or Big Data store).
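For example, a test case for the hypothetical customer mapping sketched above might pair two queries like the following (all table and column names are invented for illustration; a real test case would follow the project’s own Mapping Document):

```sql
-- Source-side query: re-applies the mapping's transformation rules to the raw source data
SELECT cust_id               AS customer_id,
       UPPER(TRIM(first_nm)) AS first_name,
       UPPER(TRIM(last_nm))  AS last_name
FROM   crm.customer;

-- Target-side query: reads the rows as they were actually loaded into the warehouse
SELECT customer_id,
       first_name,
       last_name
FROM   dw.dim_customer;
```

If the ETL ran correctly, the two result sets should match row for row.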
Execute Tests, Export Results
These tests are typically executed using a SQL editor such as Toad, SQuirreL or any other favorite editor.
The result sets from the two queries are saved into two Excel spreadsheets.
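As one concrete illustration of the export step (assuming a PostgreSQL database and the psql client, neither of which the process above requires; Toad, SQuirreL and other editors have their own export features), each query’s output can be written to a CSV file that Excel opens directly:

```sql
-- psql meta-commands; the file names are hypothetical
\copy (SELECT cust_id, UPPER(TRIM(first_nm)), UPPER(TRIM(last_nm)) FROM crm.customer) TO 'source_results.csv' WITH CSV HEADER
\copy (SELECT customer_id, first_name, last_name FROM dw.dim_customer) TO 'target_results.csv' WITH CSV HEADER
```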
Compare Results
Compare every result set in the source spreadsheet with its counterpart in the target spreadsheet by eye (hence the name “Stare & Compare”). This means a lot of scrolling to the right to compare dozens, if not hundreds, of columns, and a lot of scrolling down to compare tens of thousands or even millions of rows.
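For contrast, the other method mentioned earlier, Minus Queries, replaces the eyeball comparison with set arithmetic. A minimal sketch, reusing the hypothetical tables from the earlier example and assuming both sides can be reached from a single connection (for instance, after staging the source extract next to the warehouse):

```sql
-- Rows the source-side transformation produces that are missing or different in the target
SELECT cust_id, UPPER(TRIM(first_nm)), UPPER(TRIM(last_nm)) FROM crm.customer
EXCEPT  -- MINUS on Oracle
SELECT customer_id, first_name, last_name FROM dw.dim_customer;

-- Rows in the target with no matching source row (the comparison must be run both ways)
SELECT customer_id, first_name, last_name FROM dw.dim_customer
EXCEPT
SELECT cust_id, UPPER(TRIM(first_nm)), UPPER(TRIM(last_nm)) FROM crm.customer;
```

If both queries return zero rows, the two result sets match exactly, with no scrolling required.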
Primary Issue with Sampling or ‘Stare & Compare’
It is impossible to effectively compare billions of data values (hundreds of columns by millions of rows) across two separate spreadsheets by eye.
The result of this method is that usually less than 1% of the data is actually compared. Since many companies rely on Business Intelligence (BI) to make strategic decisions in the hope of gaining a competitive advantage in a tough business landscape, bad data leads to decisions that can cost those firms millions of dollars.
- $8.2 million: what the average organization loses annually because of bad data
- $20 million: 22% estimated their annual loss from bad data at this amount
- $100 million: 4% put that figure at this astounding amount
Case Study: Healthcare Company
A healthcare company approached us with the following dilemma:
- they had more than 7,000 total tests
- their tests were for very large data sets
- the largest test had 100 million rows and 200 columns = 20 billion data values
- they were utilizing the Sampling Method for data validation
- they had no practical way to manually verify more than a fraction of the data without an automation process
- they were experiencing many instances of bad data
After trialing QuerySurge and seeing how much more data validation they could perform, with much shorter testing cycles, they incorporated QuerySurge into their formal ETL process. See more details of the Case Study here»
QuerySurge - The Automated Data Validation & Testing Solution
QuerySurge is the leading solution for automating the testing of Data Warehouses and Big Data. QuerySurge ensures that the data extracted from data sources remains intact in the target data store by analyzing and quickly pinpointing any differences. QuerySurge also makes it easy for both novice and experienced team members to validate their organization’s data quickly through our Query Wizards, while still allowing power users to write custom code.
All of this comes with Deep-Dive reporting, a Data Health dashboard, Test Management integration, DevOps compatibility and the ability to connect to any data source. QuerySurge is everything you need to automate the testing and validation of your project’s data.
Learn more about QuerySurge here»
The Benefits of QuerySurge
- Easily automate your manual testing effort for repeatability
- Provide testing across different platforms: data warehouses, Hadoop and NoSQL stores, traditional databases, flat files, Excel, web services, JSON, XML and others
- Speed up testing by up to 1,000x while providing up to 100% data coverage
- Support Continuous Delivery with an out-of-the-box DevOps solution for most Build, ETL & QA management software
- Deliver shareable, automated email reports and data health dashboards
- Provide a huge Return On Investment (ROI), as much as 1,600%
But don’t just take our word for it (or our clients’). Try it for yourself.
Check out our free trials and great tutorial