10. Introduction to Sub-Queries
Introduction to Sub-Queries
In SQL, a sub-query is essentially a query nested within another query. It's used to perform operations that require multiple steps, allowing you to use the result of one query as a condition or part of another query. Here are some key points to understand about sub-queries:
1. Definition: A sub-query is a query within another SQL query. It's enclosed in parentheses and can be used in various parts of the main query, such as in the SELECT, FROM, WHERE, or HAVING clauses.
2. Types of Sub-queries:
- Single-row sub-query: Returns zero or one row. Often used with comparison operators like =, <, >, etc.
- Multiple-row sub-query: Returns one or more rows. Used with operators like IN, ANY, ALL, EXISTS.
- Correlated sub-query: A type of sub-query that refers to columns from a table in the outer query. It's evaluated repeatedly, once for each row processed by the outer query.
3. Usage Examples:
- Filtering Results: You can use a sub-query in a WHERE clause to filter results based on a condition that requires another query. For example, finding employees who earn more than the average salary in their department.
- Selecting Data: Sub-queries can be used in the SELECT clause to return calculated values. For example, selecting the total sales along with the average sales from a different query.
- From Clause: In the FROM clause, a sub-query can be used to create a temporary table to join with other tables.
Example #1. Single-Row Sub-Query
A single-row sub-query returns only one row. It is often used with comparison operators like =, !=, <, >, etc.
Example:
Suppose you have a table Employees with columns EmployeeID, Name, Salary, and DepartmentID. You want to find employees who earn more than the average salary.
SELECT * FROM `dataacademykz.hr.employees_salaries` WHERE Salary > (SELECT AVG(Salary) FROM `dataacademykz.hr.employees_salaries`)
Here, the sub-query (SELECT AVG(Salary) FROM Employees) calculates the average salary of all employees, and the main query fetches employees earning more than this average.
Example #2. Multiple-Row Sub-Query
A multiple-row sub-query can return more than one row. It is used with operators like IN, ANY, ALL, EXISTS.
Example:
Suppose you have another table Departments with DepartmentID and DepartmentName. You want to find employees working in departments with more than 20 employees.
SELECT * FROM dataacademykz.hr.employees_data WHERE DepartmentID IN (SELECT DepartmentID FROM dataacademykz.hr.employees_data GROUP BY DepartmentID HAVING COUNT(EmployeeID) > 20)
The sub-query selects DepartmentID of departments with more than 10 employees, and the main query uses these IDs to find employees.
Example #3. Correlated Sub-Query
A correlated sub-query refers to elements from the outer query, making it dependent on the outer query. It is executed repeatedly, once for each row processed by the outer query.
Example:
Find the names of employees who earn more than the average salary in their respective departments.
SELECT E.Name, E.Salary FROM Employees E WHERE E.Salary > (SELECT AVG(Salary) FROM Employees WHERE DepartmentID = E.DepartmentID)
For each employee in the outer query, the sub-query calculates the average salary of their department and checks if their salary is higher than this average.
Example #4. Creating temporary tables using WITH statement
The WITH statement in SQL, also known as Common Table Expressions (CTEs), is used to create a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs can make complex queries more readable and maintainable by breaking them down into simpler parts. Here are some examples:
Basic CTE Example
Scenario: Suppose you have a table Sales with columns SaleID, SaleDate, and Amount. You want to find all sales that occurred after the average sale date.
WITH AverageSaleDate AS ( SELECT AVG(SaleDate) AS AvgDate FROM Sales )
SELECT * FROM Sales WHERE SaleDate > (SELECT AvgDate FROM AverageSaleDate)In this example, the CTE AverageSaleDate calculates the average sale date. The main query then uses this result to find all sales that occurred after this date.
Example #5. Sequential Aggregation and Filtering
Scenario: You want to find employees who have a salary higher than the average salary of those who have been employed for more than 5 years.
WITH LongTermEmployees AS ( SELECT EmployeeID, Salary FROM Employees WHERE YearsEmployed > 5 ),
AverageSalaryOfLongTerm AS ( SELECT AVG(Salary) AS AvgSalary FROM LongTermEmployees )
SELECT * FROM Employees WHERE Salary > (SELECT AvgSalary FROM AverageSalaryOfLongTerm)Here, the first CTE LongTermEmployees selects employees who have been employed for over 5 years. The second CTE AverageSalaryOfLongTerm calculates the average salary of these long-term employees. The main query then finds employees whose salary is above this average.
Example #6. A simple example from the Superstore dataset
Let’s go back to our Superstore dataset. The task involves calculating each region’s total and average sales. After obtaining these figures, we must compare them with the dataset’s average sales. For regions where sales exceed the average, we will enter ‘Greater’ in the subsequent column. Conversely, for regions with sales below the average, ‘Less’ will be entered in the following column.
WITH SalesComparison AS( SELECT Region, SUM(Sales) AS TotalSales, ROUND(AVG(Sales), 1) AS AverageSales, ROUND(( SELECT AVG(Sales) FROM dataacademykz.superstore.sales_analytics), 1) AS OverallAverageSales FROM dataacademykz.superstore.sales_analytics GROUP BY Region ) SELECT Region, TotalSales, AverageSales, OverallAverageSales, CASE WHEN AverageSales > OverallAverageSales THEN "Greater" ELSE "Less" END AS Comparison FROM SalesComparison
Explanation:
Imagine you’re running a big chain of stores and want to see how different regions are doing compared to the average performance of the entire chain.
- First, we calculate how much each region is selling in total and on average.
- Then, we figure out the average sales for the whole chain.
- Next, we see if each region is doing better (Greater) or worse (Less) than this overall average.
It's like giving each region a report card showing how they stack up against the chain’s average performance.
Technical details:
This SQL query performs a two-step analysis to compare regional sales against the overall average sales of the entire dataset. Here's the technical breakdown:
First Step (CTE - Common Table Expression):
- The query starts with a CTE named
SalesComparison
. This temporary table calculates total and average sales for each region.SUM(Sales) AS TotalSales
andROUND(AVG(Sales), 1) AS AverageSales
compute the total and average sales for each region, respectively.- A subquery within this CTE computes the overall average sales of the dataset (
SELECT AVG(Sales) FROM dataacademykz.superstore.sales_analytics
) and is labeled asOverallAverageSales
.Second Step (Main Query):
- The main query selects the region, total sales, average sales, and overall average sales from the CTE.
- A
CASE
statement compares the average sales of each region (AverageSales
) with the overall average sales (OverallAverageSales
). It assigns 'Greater' to regions where average sales exceed the overall average, and 'Less' otherwise.Result: The final output provides a comprehensive comparison of each region's sales performance against the overall average, facilitating a clear understanding of regional market dynamics.
We recommend viewing this YouTube video to better understand sub-queries: