Basic Data Quality Checks

Grandson and Grandfather

You might be surprised, but a lot of the data we get is not quite ready to use. It's often messy or "dirty," which means we need to clean it up before we can really work with it. Today, we are going to learn about basic data quality checks.

Task #1: Finding Duplicates

In data analytics, ensuring data integrity is crucial, and one key aspect of this is identifying and handling duplicates. If a column, such as 'Row_ID', is intended to be a unique identifier, it's essential to confirm that each value is indeed unique. This task focuses on detecting any duplicates in the 'Row_ID' column.

Objective: Identify duplicate entries in the 'Row_ID' column of the dataset.

Expected Result: The query will highlight any 'Row_ID' values that occur more than once, indicating duplicates.

Data Source: dataacademykz.superstore.sales_dirty_data
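One possible approach is a sketch like the following: group by the supposed unique key and keep only groups that appear more than once. (This assumes the column is named 'Row_ID' exactly as stated in the task.)

```sql
SELECT
  Row_ID,
  COUNT(*) AS occurrences   -- how many times each Row_ID appears
FROM
  `dataacademykz.superstore.sales_dirty_data`
GROUP BY
  Row_ID
HAVING
  COUNT(*) > 1              -- keep only duplicated identifiers
```

Any rows returned by this query are duplicates; an empty result means 'Row_ID' is truly unique.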

 

Task #2: Validating Date Consistency

In data analysis, particularly when dealing with logistics or sales data, the consistency and logical ordering of dates are crucial. Dates provide key insights into the timeline of events such as orders and shipments. It's generally expected that the order date precedes the ship date. If the reverse is true, it indicates a potential data inconsistency that needs to be addressed.

Objective: Identify any instances where the order date is later than the ship date, which would indicate an anomaly in the dataset.

Expected Result: The query should reveal records where the order date is anomalously later than the ship date.

Data Source: dataacademykz.superstore.sales_dirty_data
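A minimal sketch of such a check, assuming the date columns are named 'Order_Date' and 'Ship_Date' (the task does not name them explicitly, so adjust to your schema):

```sql
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Order_Date > Ship_Date   -- an order cannot be placed after it shipped
```

Each returned row represents a shipment that supposedly happened before the order was placed, i.e. a date inconsistency.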

 

Task #3: Checking for NULL Values in Key Columns

In any dataset, especially in a sales context, certain columns are fundamental and should always contain valid data. Columns like 'Order_ID' and 'Customer_ID' are critical as they uniquely identify orders and customers, respectively. Ensuring these columns do not have NULL values is essential for maintaining data integrity and reliability.

Objective: Verify that all records in the dataset have non-NULL values in key columns, specifically 'Order_ID' and 'Customer_ID'.

Expected Result: The query will identify any records where 'Customer_ID' or 'Order_ID' is missing (NULL), indicating areas in the dataset that require attention.

Data Source: dataacademykz.superstore.sales_dirty_data
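A straightforward way to sketch this check is a row-level filter on the key columns named in the task:

```sql
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Order_ID IS NULL         -- missing order identifier
  OR Customer_ID IS NULL   -- missing customer identifier
```

Note that NULL comparisons require `IS NULL`; writing `Order_ID = NULL` would never match, because NULL is not equal to anything in SQL.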

 

Task #4: Checking for Negative Values in Positive Columns

In data analytics, especially when dealing with sales and inventory data, ensuring the logical consistency of numerical values is essential. Certain columns, like 'Quantity', inherently should only have positive values, as they represent things like the number of products sold or in stock. Negative values in such columns could indicate data entry errors or other inconsistencies.

Objective: Identify any instances in the dataset where the 'Quantity' column has negative values, which would be illogical and indicative of data issues. Return only Row_ID and Order_ID.

Expected Result: The query is designed to reveal records with negative quantities, which should not logically occur.

Data Source: dataacademykz.superstore.sales_dirty_data
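A minimal sketch, returning only the two columns the task asks for:

```sql
SELECT
  Row_ID,
  Order_ID
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Quantity < 0   -- quantities sold should never be negative
```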

 

Task #5: Identifying Outliers in Numerical Data (Advanced)

In data analysis, outliers can significantly impact the interpretation of results. They are extreme values that differ significantly from other observations. In a sales context, unusually high or low sales figures can be outliers, potentially skewing averages and other statistical measures. Identifying these outliers is crucial for accurate and meaningful analysis.

Objective: Detect outliers in the 'Sales' column of the dataset, which are significantly higher or lower than the typical range of sales values.

Expected Result: The query aims to isolate sales figures that deviate markedly from the average, based on a defined statistical threshold.

Data Source: dataacademykz.superstore.sales_dirty_data
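One common statistical approach is to flag values more than a fixed number of standard deviations from the mean. The sketch below uses three standard deviations as the threshold; that cutoff is an assumption, not part of the task, and you may choose a different one (or a different method, such as the interquartile range):

```sql
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Sales > (SELECT AVG(Sales) + 3 * STDDEV(Sales)   -- unusually high
           FROM `dataacademykz.superstore.sales_dirty_data`)
  OR Sales < (SELECT AVG(Sales) - 3 * STDDEV(Sales) -- unusually low
              FROM `dataacademykz.superstore.sales_dirty_data`)
```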

 

Hint: Removing dirty data

To get rid of the data we don't need, we can make a new, cleaner table. This means we only keep the good stuff:

CREATE OR REPLACE TABLE
  `your_project_name.your_dataset_name.sales_clean_data` AS
SELECT
  *
FROM
  `dataacademykz.superstore.sales_dirty_data`
WHERE
  Customer_ID IS NOT NULL   -- keep only rows with a known customer

This way, by filtering out the rows we don't want to keep, we create a new table without the problematic data points.

 

Explanation: WHERE and HAVING clauses

At this point, you may be wondering: what is the difference between the WHERE and HAVING clauses?

The WHERE and HAVING clauses in SQL are both used to filter data, but they serve different purposes and are used in different contexts: