Big Data Testing
The smart automated Data Testing solution for Hadoop data lakes & NoSQL data stores
Big Data
Big Data mainly describes large amounts of data typically stored in either Hadoop data lakes or NoSQL data stores. Big Data is defined by the 5 Vs:
- Volume – the amount of data from various sources
- Velocity – the speed of data coming in
- Variety – types of data: structured, semi-structured, unstructured
- Veracity – the extent to which the data is trustworthy
- Value – the extent to which insights from the data are worth more than the underlying cost
“19.2% of big data app developers say quality of data is the biggest problem they consistently face.”
Data is growing at a rapid pace. According to IBM, 90% of the world’s data has been created in the past 2 years. And with lots of data comes Bad Data (also known as Data Defects or Data Bugs).
Bad data can be described as data that is inaccurate for the business.
Types of bad data include:
- Missing Data
- Truncation of Data
- Data Type Mismatch
- Null Translation
- Wrong Translation
- Misplaced Data
- Extra Records
- Not Enough Records
- Input Errors (capitalization, formatting, spacing, punctuation, spelling, other)
- Transformation Logic Errors/Holes
- Sequence Generator Errors
- Duplicate Records
- Numeric Field Precision
This matters because C-level executives use BI & Analytics to make critical business decisions on the assumption that the underlying data is sound.
Based on our experience, we know it is not.
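Several of the bad data types listed above can be illustrated with a short comparison of source and target records. The following is a minimal sketch in plain Python, not how QuerySurge itself works; the function name `find_bad_data`, the issue labels, and the sample records are hypothetical, and the checks cover only duplicates, missing and extra records, null translation, data type mismatch, and truncation.

```python
from collections import Counter


def find_bad_data(source_rows, target_rows, key="id"):
    """Compare source and target rows (lists of dicts) and report defects.

    Hypothetical helper for illustration only.
    """
    issues = []
    source_by_key = {row[key]: row for row in source_rows}

    # Duplicate Records: the same key appears more than once in the target.
    key_counts = Counter(row[key] for row in target_rows)
    for k, count in key_counts.items():
        if count > 1:
            issues.append(("duplicate_record", k))

    # If a key is duplicated, the last occurrence is used for field checks.
    target_by_key = {row[key]: row for row in target_rows}

    # Missing Data / Not Enough Records: key in source but not in target.
    for k in source_by_key.keys() - target_by_key.keys():
        issues.append(("missing_record", k))

    # Extra Records: key in target but not in source.
    for k in target_by_key.keys() - source_by_key.keys():
        issues.append(("extra_record", k))

    # Field-level checks on records present on both sides.
    for k in source_by_key.keys() & target_by_key.keys():
        src, tgt = source_by_key[k], target_by_key[k]
        for field, src_value in src.items():
            tgt_value = tgt.get(field)
            if (src_value is None) != (tgt_value is None):
                # A value became NULL, or a NULL became a value.
                issues.append(("null_translation", k, field))
            elif type(src_value) is not type(tgt_value):
                issues.append(("data_type_mismatch", k, field))
            elif (isinstance(src_value, str) and src_value != tgt_value
                  and src_value.startswith(tgt_value)):
                # Target holds only a leading part of the source string.
                issues.append(("truncation", k, field))

    # Sort for a deterministic report regardless of set iteration order.
    return sorted(issues, key=lambda issue: tuple(map(str, issue)))


# Hypothetical sample data.
source = [
    {"id": 1, "name": "Alice Johnson", "amount": 10.5},
    {"id": 2, "name": "Bob", "amount": 20.0},
    {"id": 3, "name": "Carol", "amount": 30.0},
]
target = [
    {"id": 1, "name": "Alice Joh", "amount": 10.5},   # truncated name
    {"id": 2, "name": "Bob", "amount": "20.0"},       # number stored as text
    {"id": 2, "name": "Bob", "amount": "20.0"},       # duplicate record
    {"id": 4, "name": "Dave", "amount": None},        # extra record
]

issues = find_bad_data(source, target)
for issue in issues:
    print(issue)
```

Checks such as wrong translation or transformation logic errors need knowledge of the intended business rules, so they are left out of this sketch.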
Below is a typical data architecture, with the Big Data flow highlighted in red: data comes in from source databases, flat files, web services, mainframes and other sources into a big data lake, then moves into a data warehouse, then a data mart, and finally into business intelligence reports.
Big Data Testing Challenges
Typical testing around traditional data warehouses or databases revolves around structured data and relies on SQL to accomplish the testing. Big Data testing is completely different.
Below are some challenges of testing Big Data:
- The volume of data can be overwhelming
- Big Data testing is very complicated and requires experienced and highly skilled testers
- Testing of structured, semi-structured and unstructured data is complex
- Hadoop testing typically relies on HQL, rendering the 2 main data testing methods, Sampling (also known as “stare and compare”) and Minus Queries, unusable
- Testing is complicated when dealing with security systems like Kerberos
- Testing infrastructure requires a special test environment due to large data size and files (HDFS)
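For readers unfamiliar with the Minus Query method mentioned in the list above: it returns the rows present in one dataset but absent from the other, and is normally run in both directions. The sketch below shows the idea in a conventional SQL setting, using Python's built-in `sqlite3` module, where the operator is spelled `EXCEPT` rather than `MINUS`. The table names, sample rows, and the `minus_query` helper are hypothetical, and this illustrates the traditional technique only, not a Hadoop or QuerySurge workflow.

```python
import sqlite3

# In-memory database with hypothetical source and target tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_customers (id INTEGER, name TEXT, balance REAL);
CREATE TABLE target_customers (id INTEGER, name TEXT, balance REAL);
INSERT INTO source_customers VALUES
    (1, 'Alice', 10.5), (2, 'Bob', 20.0), (3, 'Carol', 30.0);
INSERT INTO target_customers VALUES
    (1, 'Alice', 10.5), (2, 'Bob', 25.0), (4, 'Dave', 40.0);
""")


def minus_query(connection, left_table, right_table):
    """Return rows of left_table that do not appear in right_table.

    Table names are trusted constants in this sketch; never build SQL
    from untrusted input this way.
    """
    sql = (
        f"SELECT * FROM {left_table} "
        f"EXCEPT SELECT * FROM {right_table} "
        "ORDER BY 1"
    )
    return connection.execute(sql).fetchall()


# Source minus target: rows that were lost or changed on the way.
missing_in_target = minus_query(conn, "source_customers", "target_customers")

# Target minus source: rows that were added or changed on the way.
unexpected_in_target = minus_query(conn, "target_customers", "source_customers")

print("source - target:", missing_in_target)
print("target - source:", unexpected_in_target)
```

A changed row (Bob's balance here) shows up in both directions, while a dropped or extra row shows up in only one, which is why the query is run both ways.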
Point-to-Point Testing. The QuerySurge ETL testing process mimics the ETL development process by testing data from point-to-point along the data lifecycle.
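The point-to-point idea can be sketched as a comparison of each adjacent pair of stages along the data lifecycle, so that a defect is pinned to the hop that introduced it. This is an illustrative sketch only, not QuerySurge's implementation: the `compare_hops` function, the stage names, and the sample rows are hypothetical, and it assumes rows pass through each hop unchanged. With real transformations, the expected downstream rows would first be derived from the upstream rows by applying the transformation rules.

```python
def compare_hops(stages):
    """Compare every adjacent pair of stages.

    stages: ordered list of (stage_name, rows), where rows are hashable
    tuples. Hypothetical helper for illustration only.
    """
    report = []
    for (up_name, up_rows), (down_name, down_rows) in zip(stages, stages[1:]):
        upstream, downstream = set(up_rows), set(down_rows)
        report.append({
            "hop": f"{up_name} -> {down_name}",
            # Rows that exist upstream but not downstream.
            "dropped_or_changed": sorted(upstream - downstream),
            # Rows that exist downstream but not upstream.
            "added_or_changed": sorted(downstream - upstream),
        })
    return report


# Hypothetical pipeline: one row is lost between the lake and the warehouse.
stages = [
    ("source", [(1, "Alice"), (2, "Bob"), (3, "Carol")]),
    ("data_lake", [(1, "Alice"), (2, "Bob"), (3, "Carol")]),
    ("data_warehouse", [(1, "Alice"), (2, "Bob")]),
    ("data_mart", [(1, "Alice"), (2, "Bob")]),
]

report = compare_hops(stages)
failing_hops = [
    entry["hop"]
    for entry in report
    if entry["dropped_or_changed"] or entry["added_or_changed"]
]
print(failing_hops)
```

Comparing only the first and last stages would show that a row is missing, but not where it went missing; checking hop by hop narrows the search to a single load step.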
Finding Bad Data
Using QuerySurge allows your team to implement a repeatable data validation and testing strategy that avoids the adverse impact any of these defects can have on your Big Data efforts.
Supported Data Technologies
QuerySurge supports connections to Big Data lakes and NoSQL data stores, data warehouses and databases, files and APIs, collaboration software, CRMs and ERPs, and accounting, marketing and ecommerce software. See the full list here»
QuerySurge Implementation: Where can it be installed?
QuerySurge can be installed on a bare metal server, on a virtual machine (VM), or in a private or public cloud, or it can be used through our pay-as-you-go service in Microsoft Azure.
- ...on a Bare Metal Server
- ...on a VM
- ...in a private or public Cloud
- ...with our pay-as-you-go Microsoft Azure Cloud offering
QuerySurge Implementation: How long can it be used?
QuerySurge Subscription licenses run in 12-month allotments and our Azure offering is hourly, so it can be used for as short a time frame as needed.
QuerySurge Licensing and Pricing Options
Choose the QuerySurge licensing and pricing model that best fits your company’s needs. All licensing and pricing information can be found here»
QuerySurge will help you:
- Leverage artificial intelligence to quickly & easily increase test coverage
- Continuously detect data issues in the delivery pipeline
- Utilize analytics to optimize your critical data
- Improve your data quality at speed
- Provide a huge ROI
But don’t believe us (or our clients). Try it for yourself.
Check out our free trials and great tutorial.