12. Introduction to Window Functions
Introduction to Window Functions
Window functions are specialized SQL functions used for performing calculations across a set of rows related to the current row. Unlike standard aggregate functions, window functions maintain individual row integrity, rather than collapsing the data into a single output row. They are particularly useful for running totals, moving averages, and cumulative statistics.
Before continuing reading the article, we recommend watching this YouTube video:
Key Concepts in Window Functions
- Window: The set of rows a window function operates on.
- Partitioning: Segmenting the data into smaller groups for independent processing by the window function.
- Ordering: Establishing the sequence for processing the rows within a window.
- Frame Specification: Outlining a specific subset of rows within a partition for the function's consideration (e.g., rows between the current row and the next 10 rows).
Basic Syntax of Window Functions
The general structure for a window function is as follows:
SELECT [column_name], [window_function]([column_name]) OVER ( [PARTITION BY [partition_column]] [ORDER BY [ordering_column]] [frame_clause] )
FROM [table_name];
Practical Examples:
Example #1. Running Total of Sales
Task: Compute a running total of sales for each order date.
-- CTE for calculating daily sales sum
WITH OrderDateSum AS( SELECT Order_Date, -- Date of the order SUM(Sales) AS Sales, -- Sum of sales for each day FROM `dataacademykz.superstore.sales_analytics` -- Source table GROUP BY Order_Date -- Grouping by order date ORDER BY Order_Date -- Ordering by order date )
SELECT Order_Date, -- Date of the order Sales, -- Sales amount for each day SUM(Sales) OVER (ORDER BY Order_Date) AS Running_Total_Sales -- Calculating running total of sales FROM OrderDateSum -- Referencing the CTEHow did we do that? You are already familiar with sub-queries from the previous lesson, right? We created a simple temporary table called OrderDateSum, where we calculated the sum of sales for each order date.
Then we query this temporary to table to get the final result – running total of sales using window function SUM(Sales) OVER(ORDER BY Order_Date). This is a window function that computes the sum of all sales up to the current row, in the order specified by Order_Date.
- The OVER clause specifies the window over which the SUM function is applied. In this case, it orders the rows by Order_Date without partitioning, meaning the running total is calculated over the entire result set
Our result looks like this:
Example #2. Ranking Products within Categories
Task: Assign a sales-based rank to each product within its category.
-- Selecting sales data and ranking products within each category by sales
SELECT Category, -- Category of the product Product_Name, -- Name of the product Sales, -- Sales amount RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) AS Sales_Rank, -- Ranking of sales within each category FROM dataacademykz.superstore.sales_analytics -- Source tableExplanation: This query assigns a rank to each product within its category based on its sales.
RANK() Window Function
RANK(): This function assigns a unique rank to each row within a partition of a result set. The rank of a row is one plus the number of ranks that come before the row. If two or more rows tie for a rank, they receive the same rank, with the next rank(s) skipped. So, if two products have the highest sales and tie for rank 1, the next product will have a rank of 3.
OVER Clause
PARTITION BY Category: This directive tells the RANK() function to reset the ranking when the category changes. In other words, each category will have its own set of ranks, starting from 1. This is similar to grouping but without collapsing the rows into a single output row for each group.
ORDER BY Sales DESC: Within each category, this orders the products by their sales in descending order (from highest to lowest). The RANK() function uses this order to assign ranks to each product.
If you run the query correctly, you will get the result like this one:
Example #3. Moving Average of Sales
Task: Determine a 3-day moving average of sales for each product.
-- Selecting sales data along with a moving average calculation for each product
SELECT Product_Name, -- Name of the product Order_Date, -- Date of the order Sales, -- Sales amount AVG(Sales) OVER ( PARTITION BY Product_Name -- Partitioning by Product Name ORDER BY Order_Date -- Ordering by date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS Moving_Avg_Sales, -- Moving average of sales for each product FROM dataacademykz.superstore.sales_analytics -- Source tableLet us explain what is happening in our window function:
AVG(Sales):
- This is an aggregate function that computes the average value of a set of numbers. In this case, it's being used to calculate the average sales amount.
OVER Clause:
- This defines a window over which the AVG function will operate. Instead of calculating the average over the entire dataset or a group of rows, it calculates the average over a specified range of rows related to the current row.
PARTITION BY Product_Name:
- This partitions the data into subsets, one for each product. The window function will restart its calculations for each product.
ORDER BY Order_Date:
This orders the rows within each partition by the order date. The order of rows is important because it determines which rows are considered preceding and following.
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING:
- This is a frame specification that defines the set of rows to include in the window for each row's calculation. For each row, the window will consist of:
- The current row.
- The row immediately preceding it (1 row back).
- The row immediately following it (1 row forward).
For example, if you look at Row #4, the moving average of sales calculates this way:
(Current row (25) + Preceding row (12) + Following row (5)) / 3 = 14.0