You might be surprised, but a lot of the data we get is not quite ready to use. It's often messy or "dirty," which means we need to clean it up before we can really work with it. Today, we are going to learn about basic data quality checks.
Task #1: Finding Duplicates
In data analytics, ensuring data integrity is crucial, and one key aspect of this is identifying and handling duplicates. If a column, such as 'Row_ID', is intended to be a unique identifier, it's essential to confirm that each value is indeed unique. This task focuses on detecting any duplicates in the 'Row_ID' column.
Objective: Identify duplicate entries in the 'Row_ID' column of the dataset.
Expected Result: The query will highlight any 'Row_ID' values that occur more than once, indicating duplicates.
Data Source: dataacademykz.superstore.sales_dirty_data
SELECT
  Row_ID,
  COUNT(Row_ID) AS number_of_occurrences
FROM
  `dataacademykz.superstore.sales_dirty_data`
GROUP BY
  Row_ID
HAVING
  number_of_occurrences > 1
ORDER BY
  Row_ID
Explanation:
Imagine you're organizing a set of files, each with a unique identification number or Row_ID. In a perfect world, each file should have a different ID. However, sometimes mistakes happen, and you might have multiple files with the same ID.
This query is like going through your files, counting how many times each ID shows up, and then making a list of IDs that appear more than once. It's like detective work to spot where the mix-ups are.
This query helps us identify Row_IDs that occur more than once, indicating duplicates. In this case, we see that Row_ID 15 is duplicated.
Technical Breakdown:
This SQL query is designed to identify duplicate records in a dataset, specifically focusing on the Row_ID column, which is expected to be unique.
Grouping: The query groups the data from the sales_dirty_data table by Row_ID using the GROUP BY clause. This operation aggregates the data so that each Row_ID value is treated as a unique group.
Counting: For each Row_ID group, the query counts the number of occurrences using COUNT(Row_ID). This count reflects how many times each Row_ID appears in the dataset.
Having Clause: The HAVING clause is used to filter the results to only include groups where the Row_ID occurs more than once (number_of_occurrences > 1). Unlike the WHERE clause, HAVING operates on grouped data.
Result: The output of the query lists each Row_ID that appears more than once, along with the count of its occurrences, effectively highlighting duplicates.
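Once the duplicated IDs are found, a natural follow-up is to pull the full duplicated rows for inspection. A minimal sketch that reuses the same grouping logic as a subquery:

SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Row_ID IN (
    SELECT
      Row_ID
    FROM
      `dataacademykz.superstore.sales_dirty_data`
    GROUP BY
      Row_ID
    HAVING
      COUNT(Row_ID) > 1
  )

Here the IN operator keeps any row whose Row_ID appears in the list of duplicated IDs produced by the subquery, so you can compare the duplicate records side by side.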
Task #2: Validating Date Consistency
In data analysis, particularly when dealing with logistics or sales data, the consistency and logical ordering of dates are crucial. Dates provide key insights into the timeline of events such as orders and shipments. It's generally expected that the order date precedes the ship date. If the reverse is true, it indicates a potential data inconsistency that needs to be addressed.
Objective: Identify any instances where the order date is later than the ship date, which would indicate an anomaly in the dataset.
Expected Result: The query should reveal records where the order date is anomalously later than the ship date.
Data Source: dataacademykz.superstore.sales_dirty_data
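The query itself is a single filter. A minimal sketch, matching the condition described in the technical breakdown below:

SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Order_Date > Ship_Date

Explanation: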
Imagine planning a shipment: you usually can't ship an item before the order comes in. This query is like checking a list of orders and shipments to make sure everything was shipped after it was ordered.
If we find any orders that were somehow "shipped" before they were even "ordered," it's like finding a glitch in our timeline – these are the mix-ups we're looking for. It helps us catch and fix these little time-travel errors in our records.
Technical Breakdown:
This SQL query is focused on ensuring the logical consistency of date-related data, particularly verifying the chronological order of Order_Date and Ship_Date in the dataset. Here's the technical breakdown:
Purpose: The primary goal is to identify any records where the Order_Date is later than the Ship_Date, which is a logical inconsistency.
Selection: The query selects all fields from the sales_dirty_data table (SELECT *), ensuring comprehensive visibility into each record that meets the criteria.
Condition: The WHERE Order_Date > Ship_Date clause filters the dataset. This condition identifies records where the order date is erroneously later than the shipping date.
Result: The output of this query lists complete records from the dataset where the date sequence is incorrect, signaling potential errors in data entry or processing.
Task #3: Checking for NULL Values in Key Columns
In any dataset, especially in a sales context, certain columns are fundamental and should always contain valid data. Columns like 'Order_ID' and 'Customer_ID' are critical as they uniquely identify orders and customers, respectively. Ensuring these columns do not have NULL values is essential for maintaining data integrity and reliability.
Objective: Verify that all records in the dataset have non-NULL values in key columns, specifically 'Order_ID' and 'Customer_ID'.
Expected Result: The query will identify any records where 'Customer_ID' or 'Order_ID' is missing (NULL), indicating areas in the dataset that require attention.
Data Source: dataacademykz.superstore.sales_dirty_data
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Customer_ID IS NULL
  OR Order_ID IS NULL
Explanation:
Think of a puzzle with missing pieces – that's what it's like when key information like customer IDs or order IDs is missing from our data. This query is like going through each record in our database with a magnifying glass to find where these pieces are missing.
If we find a record without a customer ID or an order ID, it's flagged. It's like marking a spot in a puzzle where a piece should be, but isn't.
The OR condition means we're looking for any record that's missing either one of these pieces of information – not necessarily both at the same time.
Technical Breakdown:
This SQL query is designed to identify any records in the dataset that have missing (NULL) values in critical columns, specifically Order_ID and Customer_ID. Here's the technical breakdown:
Objective: The primary aim is to ensure data completeness in essential fields of the dataset.
Selection: The query selects all fields from the sales_dirty_data table (SELECT *), enabling a full view of records that fail the completeness check.
Condition: The WHERE Customer_ID IS NULL OR Order_ID IS NULL clause is the core of the query. It filters the dataset for records where either Customer_ID or Order_ID (or both) is missing. The use of the OR operator means the query returns a record if either condition is met.
Result: The output includes all records with incomplete data in either the Customer_ID or Order_ID fields, highlighting gaps in critical information.
In this query, we checked our columns using the WHERE clause. The OR operator requires that at least one of the values be NULL (by contrast, the AND operator would require that both values be NULL).
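For comparison, a minimal sketch of the stricter AND variant, which returns only records missing both identifiers at once:

SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Customer_ID IS NULL
  AND Order_ID IS NULL

On most datasets this returns far fewer rows than the OR version, since a record must fail both completeness checks to appear.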
Task #4: Checking for Negative Values in Positive-Only Columns
In data analytics, especially when dealing with sales and inventory data, ensuring the logical consistency of numerical values is essential. Certain columns, like 'Quantity', inherently should only have positive values, as they represent things like the number of products sold or in stock. Negative values in such columns could indicate data entry errors or other inconsistencies.
Objective: Identify any instances in the dataset where the 'Quantity' column has negative values, which would be illogical and indicative of data issues. Return only Row_ID and Order_ID.
Expected Result: The query is designed to reveal records with negative quantities, which should not logically occur.
Data Source: dataacademykz.superstore.sales_dirty_data
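A sketch of the query described in the technical breakdown below; note that it also returns Quantity alongside the two IDs so the offending value is visible:

SELECT
  Row_ID,
  Order_ID,
  Quantity
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Quantity < 0

Explanation: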
Imagine you're looking through a list of sales or stock records and you come across entries showing a negative quantity of a product, which doesn't make sense because you can't sell or hold less than zero of something.
This query scans through your sales records to find any of these odd cases where the quantity is less than zero.
It lists out the specific records where this happens, showing the unique ID of each record and the order it's associated with.
Finding these records helps you spot and correct mistakes in your sales or inventory data, ensuring your records accurately reflect what's actually happening in your store.
Technical Breakdown:
This SQL query is designed to identify records in the dataset where the 'Quantity' column, which logically should only contain positive values, has negative entries. Here's the technical breakdown:
Objective: The aim is to find instances where the 'Quantity' column, which represents the number of products sold or in stock, has illogical negative values.
Selection and Condition:
The query selects Row_ID, Order_ID, and Quantity from the dataacademykz.superstore.sales_dirty_data table.
The WHERE Quantity < 0 clause filters the dataset to include only those records where the Quantity value is less than zero, which is logically inconsistent.
Result:
The output includes the Row_ID, Order_ID, and Quantity for each record where the quantity is a negative number, highlighting potential data errors or anomalies in the dataset.
Task #5: Identifying Outliers in Numerical Data (Advanced)
In data analysis, outliers can significantly impact the interpretation of results. They are extreme values that differ significantly from other observations. In a sales context, unusually high or low sales figures can be outliers, potentially skewing averages and other statistical measures. Identifying these outliers is crucial for accurate and meaningful analysis.
Objective: Detect outliers in the 'Sales' column of the dataset, which are significantly higher or lower than the typical range of sales values.
Expected Result: The query aims to isolate sales figures that deviate markedly from the average, based on a defined statistical threshold.
Data Source: dataacademykz.superstore.sales_dirty_data
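A sketch of the outlier query, reconstructed from the technical breakdown below (the threshold of five standard deviations comes from the task itself):

SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Sales > (
    SELECT
      AVG(Sales) + 5 * STDDEV(Sales)
    FROM
      `dataacademykz.superstore.sales_dirty_data`
  )
  OR Sales < (
    SELECT
      AVG(Sales) - 5 * STDDEV(Sales)
    FROM
      `dataacademykz.superstore.sales_dirty_data`
  )

Explanation: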
Imagine sales numbers as a long road. Most sales are like cars traveling close to the speed limit (the average). But then you have some cars zooming way faster or crawling much slower — these are outliers.
We're using a "speed limit" here that's not just the average speed, but also accounts for how much speeds typically vary (that's the standard deviation).
If a car (sale) is going way faster or slower than this limit — say, more than five times the usual speed variation — we flag it as unusual or an outlier.
Technical Breakdown:
Let's break it down. There are two conditions connected by an OR operator in the WHERE statement, each using a subquery:
The first condition checks whether the Sales value in each row is greater than the average Sales plus five times the standard deviation of Sales across the entire table. This is a common statistical method for identifying outliers.
The subquery (SELECT AVG(Sales) + 5 * STDDEV(Sales) FROM `dataacademykz.superstore.sales_dirty_data`) calculates the threshold for what is considered an unusually high sale.
The second condition checks whether the Sales value is less than the average Sales minus five times the standard deviation of Sales.
Its subquery calculates the lower threshold for an unusually low sale.
Overall Functionality
The query retrieves all rows from the sales_dirty_data table where the Sales values are considered statistical outliers, defined as being more than five standard deviations away from the mean (either above or below).
Hint: Removing dirty data
To get rid of the data we don't need, we can make a new, cleaner table. This means we only keep the good stuff:
CREATE OR REPLACE TABLE
  `your_project_name.your_dataset_name.sales_clean_data` AS
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Customer_ID IS NOT NULL
This way, by filtering out the records we don't want to keep, we create a new table without the problematic data points.
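The same pattern extends to the other checks from this lesson. A sketch that keeps only rows passing all of them (the exact set of conditions is an assumption; adjust it to match what your own checks found):

CREATE OR REPLACE TABLE
  `your_project_name.your_dataset_name.sales_clean_data` AS
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Customer_ID IS NOT NULL
  AND Order_ID IS NOT NULL
  AND Quantity >= 0
  AND Order_Date <= Ship_Date

Each condition here is simply the negation of one of the "dirty data" filters from Tasks #2 through #4.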
Explanation: WHERE and HAVING clauses
By now, you may have a question: what is the difference between the WHERE and HAVING clauses?
The WHERE and HAVING clauses in SQL are both used to filter data, but they serve different purposes and are used in different contexts:
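In short: WHERE filters individual rows before any grouping takes place, while HAVING filters the groups produced by GROUP BY. A sketch combining both in one query (the Region column and the 'West' filter are illustrative assumptions):

SELECT
  Customer_ID,
  COUNT(Order_ID) AS number_of_orders
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Region = 'West' -- row-level filter, applied before grouping
GROUP BY
  Customer_ID
HAVING
  number_of_orders > 10 -- group-level filter, applied after aggregation

Here WHERE throws away every non-West row first; only then are the surviving rows grouped by customer, and HAVING keeps the groups with more than ten orders.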