SQL’s analytic functions allow for complex calculations and deeper data insights
SQL’s analytic functions allow for complex calculations and in-depth analysis. They operate on a set of rows in a query result set that are related to the current row.
I’ve prepared a walkthrough of the most common analytic functions for you.
Understanding these functions will help you gain valuable insights into your data and make sound decisions.
Let’s use the employees table below as the example for the SQL demonstrations.
employee_id | department | employee_name | salary |
1 | HR | Alice | 50000 |
2 | HR | Bob | 52000 |
3 | HR | Carol | 48000 |
4 | IT | David | 60000 |
5 | IT | Emma | 65000 |
6 | Finance | Frank | 55000 |
7 | Finance | Grace | 58000 |
NTILE(n) divides the result set into roughly equal-sized groups or “tiles,” each with its own group number. This function can be used to calculate quartiles or percentiles.
Example:
SELECT department, employee_name, salary, NTILE(4) OVER (ORDER BY salary) AS quartile
FROM employees;
This query divides the employees into four quartiles based on the salary column. With seven rows and four tiles, the first three tiles get two rows each and the last tile gets one.
department | employee_name | salary | quartile |
HR | Carol | 48000 | 1 |
HR | Alice | 50000 | 1 |
HR | Bob | 52000 | 2 |
Finance | Frank | 55000 | 2 |
Finance | Grace | 58000 | 3 |
IT | David | 60000 | 3 |
IT | Emma | 65000 | 4 |
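If you want to try NTILE without a database server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes your Python ships with SQLite 3.25 or newer (the first release with window functions), and it loads the example rows into a throwaway in-memory table.

```python
import sqlite3

# Rows copied from the example employees table in this article.
ROWS = [
    (1, "HR", "Alice", 50000), (2, "HR", "Bob", 52000), (3, "HR", "Carol", 48000),
    (4, "IT", "David", 60000), (5, "IT", "Emma", 65000),
    (6, "Finance", "Frank", 55000), (7, "Finance", "Grace", 58000),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, department TEXT, employee_name TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", ROWS)

# NTILE(4) numbers the rows 1..4 in salary order, in near-equal groups.
quartiles = conn.execute(
    "SELECT employee_name, salary, NTILE(4) OVER (ORDER BY salary) AS quartile "
    "FROM employees ORDER BY salary"
).fetchall()

for name, salary, quartile in quartiles:
    print(name, salary, quartile)
```

Seven rows do not divide evenly by four, so the earlier tiles take the extra rows: tiles 1 to 3 hold two employees each and tile 4 holds one.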
PERCENTILE_CONT calculates the value at a specified percentile within a group of rows, interpolating between neighbouring values when the percentile falls between two rows. This is particularly helpful for finding the median or other specific percentiles.
Example:
SELECT department, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees
GROUP BY department;
This query finds the median salary for each department.
department | median_salary |
HR | 50000 |
IT | 62500 |
Finance | 56500 |
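SQLite does not offer PERCENTILE_CONT out of the box, so the sketch below re-implements the interpolation in plain Python to show where the numbers come from. The helper name percentile_cont is my own, and the position formula (fraction times n minus 1, counted from zero) follows the usual definition used by databases such as PostgreSQL and Oracle.

```python
import math
from collections import defaultdict

# (department, salary) pairs copied from the example employees table.
SALARIES = [
    ("HR", 50000), ("HR", 52000), ("HR", 48000),
    ("IT", 60000), ("IT", 65000),
    ("Finance", 55000), ("Finance", 58000),
]

def percentile_cont(values, fraction):
    """Linearly interpolated percentile (illustrative helper, not a library call)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional row index
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower                 # how far we sit between the two rows
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Emulate GROUP BY department.
by_department = defaultdict(list)
for department, salary in SALARIES:
    by_department[department].append(salary)

medians = {dept: percentile_cont(vals, 0.5) for dept, vals in by_department.items()}
print(medians)
```

For IT there are only two salaries, so the median lands halfway between 60000 and 65000.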
PERCENTILE_DISC computes the value at a specified percentile within a group of rows, but instead of an interpolated value, it returns an actual data value from the dataset: the first value, in sort order, whose cumulative distribution reaches the requested percentile. It can be used to find discrete percentiles.
Example:
SELECT department, PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY salary) AS first_quartile_salary
FROM employees
GROUP BY department;
This query finds the value at the first quartile (25th percentile) of salaries for each department.
department | first_quartile_salary |
HR | 48000 |
IT | 60000 |
Finance | 55000 |
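The discrete version can be sketched the same way in plain Python. The helper percentile_disc below is a hypothetical name of mine; it walks the sorted values and returns the first one whose cumulative share of rows reaches the requested fraction, which is how the function is commonly defined.

```python
from collections import defaultdict

# (department, salary) pairs copied from the example employees table.
SALARIES = [
    ("HR", 50000), ("HR", 52000), ("HR", 48000),
    ("IT", 60000), ("IT", 65000),
    ("Finance", 55000), ("Finance", 58000),
]

def percentile_disc(values, fraction):
    """First sorted value whose cumulative distribution is >= fraction."""
    ordered = sorted(values)
    total = len(ordered)
    for row_number, value in enumerate(ordered, start=1):
        if row_number / total >= fraction:
            return value
    return ordered[-1]  # only reached through floating-point edge cases

# Emulate GROUP BY department.
by_department = defaultdict(list)
for department, salary in SALARIES:
    by_department[department].append(salary)

first_quartiles = {d: percentile_disc(v, 0.25) for d, v in by_department.items()}
print(first_quartiles)
```

Every result is a salary that really exists in the table. In HR, for example, the lowest salary already covers one third of the rows, which is past the 25% mark, so it is the value returned.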
CUME_DIST calculates the cumulative distribution of a value within a group of rows: the fraction of rows that sort at or before the current row, which indicates the row’s relative position within the group.
Example:
SELECT department, employee_name, salary, CUME_DIST() OVER (ORDER BY salary DESC) AS cumulative_salary_dist
FROM employees;
This query displays the cumulative distribution of salaries within the employees table, ordered by salary in descending order.
department | employee_name | salary | cumulative_salary_dist |
IT | Emma | 65000 | 0.1428571429 |
IT | David | 60000 | 0.2857142857 |
Finance | Grace | 58000 | 0.4285714286 |
Finance | Frank | 55000 | 0.5714285714 |
HR | Bob | 52000 | 0.7142857143 |
HR | Alice | 50000 | 0.8571428571 |
HR | Carol | 48000 | 1 |
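Here is a small sketch that runs CUME_DIST as a window function through Python's built-in sqlite3 module, assuming SQLite 3.25 or newer. With seven distinct salaries, each row simply adds one seventh to the running fraction.

```python
import sqlite3

# Rows copied from the example employees table in this article.
ROWS = [
    (1, "HR", "Alice", 50000), (2, "HR", "Bob", 52000), (3, "HR", "Carol", 48000),
    (4, "IT", "David", 60000), (5, "IT", "Emma", 65000),
    (6, "Finance", "Frank", 55000), (7, "Finance", "Grace", 58000),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, department TEXT, employee_name TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", ROWS)

# Fraction of rows that sort at or before the current row, highest salary first.
dist = conn.execute(
    "SELECT employee_name, salary, "
    "CUME_DIST() OVER (ORDER BY salary DESC) AS cumulative_salary_dist "
    "FROM employees ORDER BY salary DESC"
).fetchall()

for name, salary, cume in dist:
    print(f"{name:6} {salary} {cume:.10f}")
```

The top earner gets 1/7 and the lowest earner gets 1, because by then every row has been counted.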
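The LAG() and LEAD() functions allow you to access values from rows preceding or following the current row in a result set. They are frequently used to calculate changes or trends between rows.
Example:
SELECT department, employee_name, salary,
LAG(salary) OVER (PARTITION BY department ORDER BY employee_id) AS prev_employee_salary,
LEAD(salary) OVER (PARTITION BY department ORDER BY employee_id) AS next_employee_salary
FROM employees;
This query retrieves each employee’s salary along with the salary of the previous and next employee in the same department, ordered by employee_id. Where no previous or next row exists, the result is NULL.
department | employee_name | salary | prev_employee_salary | next_employee_salary |
HR | Alice | 50000 | NULL | 52000 |
HR | Bob | 52000 | 50000 | 48000 |
HR | Carol | 48000 | 52000 | NULL |
IT | David | 60000 | NULL | 65000 |
IT | Emma | 65000 | 60000 | NULL |
Finance | Frank | 55000 | NULL | 58000 |
Finance | Grace | 58000 | 55000 | NULL |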
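To see LAG and LEAD side by side on the employees data, here is a sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or newer). Partitioning by department and ordering by employee_id is my choice for the illustration; SQL NULLs come back as Python None.

```python
import sqlite3

# Rows copied from the example employees table in this article.
ROWS = [
    (1, "HR", "Alice", 50000), (2, "HR", "Bob", 52000), (3, "HR", "Carol", 48000),
    (4, "IT", "David", 60000), (5, "IT", "Emma", 65000),
    (6, "Finance", "Frank", 55000), (7, "Finance", "Grace", 58000),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, department TEXT, employee_name TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", ROWS)

# LAG looks one row back and LEAD one row ahead, within each department.
shifted = conn.execute(
    "SELECT employee_name, salary, "
    "LAG(salary) OVER (PARTITION BY department ORDER BY employee_id) AS prev_employee_salary, "
    "LEAD(salary) OVER (PARTITION BY department ORDER BY employee_id) AS next_employee_salary "
    "FROM employees ORDER BY employee_id"
).fetchall()

for row in shifted:
    print(row)
```

The first row of each department has no previous salary and the last row has no next one, so those cells are None.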
The functions FIRST_VALUE() and LAST_VALUE() return the first or last value within a group of rows in the specified order. LAST_VALUE() needs an explicit frame that extends to the end of the partition; otherwise the default frame stops at the current row.
Example:
SELECT department, employee_name, salary,
FIRST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY salary) AS lowest_paid_employee,
LAST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS highest_paid_employee
FROM employees;
This query finds the lowest- and highest-paid employees within each department.
department | employee_name | salary | lowest_paid_employee | highest_paid_employee |
HR | Alice | 50000 | Carol | Bob |
HR | Bob | 52000 | Carol | Bob |
HR | Carol | 48000 | Carol | Bob |
IT | David | 60000 | David | Emma |
IT | Emma | 65000 | David | Emma |
Finance | Frank | 55000 | Frank | Grace |
Finance | Grace | 58000 | Frank | Grace |
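One detail worth checking yourself: when a window has an ORDER BY but no frame clause, the frame ends at the current row, so LAST_VALUE hands back the current row's own value. The sketch below, using Python's built-in sqlite3 module (assuming SQLite 3.25 or newer), puts the default frame next to an explicit full-partition frame. The column aliases are mine.

```python
import sqlite3

# Rows copied from the example employees table in this article.
ROWS = [
    (1, "HR", "Alice", 50000), (2, "HR", "Bob", 52000), (3, "HR", "Carol", 48000),
    (4, "IT", "David", 60000), (5, "IT", "Emma", 65000),
    (6, "Finance", "Frank", 55000), (7, "Finance", "Grace", 58000),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, department TEXT, employee_name TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", ROWS)

rows = conn.execute(
    "SELECT employee_name, "
    # First row of the partition in salary order: the lowest-paid employee.
    "FIRST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY salary) AS lowest_paid, "
    # Default frame: ends at the current row, so this is just the current employee.
    "LAST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY salary) AS last_default_frame, "
    # Explicit frame over the whole partition: the highest-paid employee.
    "LAST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY salary "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS highest_paid "
    "FROM employees ORDER BY employee_id"
).fetchall()

for row in rows:
    print(row)
```

In the output, the third column repeats each employee's own name, while the fourth column shows Bob, Emma, and Grace for HR, IT, and Finance respectively.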
Analytic functions are versatile tools for data analysis and reporting that let you perform a wide range of calculations within specific groups or ordered sets of data.
The examples above illustrate how common analytic functions operate on a dataset and reveal its distribution, trends, and percentiles, which in turn supports better decision-making in SQL.