Tutorial (SQL): When to use GROUP BY, WHERE, and HAVING

Tools used in this project:
SQL PostgreSQL
Difficulty level:
Intermediate

This brief guide was created to provide a clear and intuitive introduction to three fundamental SQL clauses for aggregating rows in a data set.

How to demonstrate your skills in SQL.

In particular, if you are preparing for a job interview in data analysis or software development, you will likely be asked to demonstrate your skills in SQL. The most frequently discussed clauses are `GROUP BY,` `WHERE,` and `HAVING,` not only because of their importance in data aggregation but also because they are often confusing. Recruiters then use them to assess candidates’ analytical skills.

Introduction

Aggregation is a key concept when working with large data sets, especially in areas such as environmental analysis or sustainability. For example, when analyzing climate change data, it may be more informative to examine trends for different countries or socioeconomic groups rather than focusing on a global average. This tutorial will use the SQL `GROUP BY,` `WHERE,` and `HAVING` clauses to perform these detailed analyses, a topic frequently explored during job interviews in various industries.

For the practical exercises in this tutorial, we will use a database focused on climate analysis. This database contains various socio-economic and climate metrics for multiple countries worldwide. The main table, called `climate,` includes critical data such as adaptation to water technologies, trade volume, days to register a property, GDP per capita, labor employment rate, school enrollment rate, and water stress index. This dataset thus provides a comprehensive and detailed overview of the global challenges associated with climate change.

The general list of data in the climate table.

Using GROUP BY

The `GROUP BY` clause aggregates data based on specific columns. The aggregation process occurs in three stages:

  • Split: data are divided into groups based on the unique values in the specified column(s).
  • Apply: for each group, we perform an aggregate function such as average (`AVG`), minimum (`MIN`), or maximum (`MAX`).
  • Combine: the aggregated results are then combined into a single table.

For example, to identify the top five countries with the highest average water stress over the years:

SELECT 
 country, 
 AVG(water_stress_index) AS avg_water_stress
FROM climate
GROUP BY country
ORDER BY avg_water_stress DESC
LIMIT 5;
Query to identify the top five countries with the highest average water stress over the years.

In this query:

  • We divide the data into groups based on the unique values in the `country` column.
  • We apply the aggregate average function (`AVG`) to each group’s `water_stress_index` column.
  • We combine the results into a table, sorting the countries by average water stress index in descending order.
  • Limitiamo i risultati ai primi 5 paesi.

Instead, to calculate the average GDP per capita, the total number of years recorded, and the average water stress index for each country:

SELECT 
 country,
 AVG(gdp_per_capita) AS avg_gdp_per_capita,
 COUNT(year) AS total_years,
 AVG(water_stress_index) AS avg_water_stress
FROM climate
GROUP BY country
ORDER BY avg_gdp_per_capita DESC;
Use a query to calculate the average GDP per capita, total number of recorded years, and average water stress index for each country.

In this query:

  • We divide the data into groups based on the unique values in the `country` column.
  • We apply several aggregate functions: the average (`AVG`) for the `gdp_per_capita` and `water_stress_index` columns and the count (`COUNT`) for the `year` column.
  • We combine the results into a table, sorting the countries by average GDP per capita in descending order.

The limitations of WHERE

As we can see, the previous results have null values that could make the analysis incorrect, as in the case of the first query. To remove them, it is necessary to proceed with data filtering.

The `WHERE` clause is often used to filter data. However, aggregate functions cannot be combined with `WHERE.`

SELECT 
 country, AVG(water_stress_index)
FROM climate
WHERE AVG(water_stress_index) > 0.5
GROUP BY country;

This query will return the following error:

Query generating an error due to inability to pass aggregate functions in WHERE clause.

The error is due to the inability to pass aggregate functions in the `WHERE` clause. This is because, in the order of execution of SQL clauses, `WHERE` is processed before data aggregation, making it impossible to use aggregate functions at this stage.

To solve this problem, it is necessary to resort to another SQL clause: `HAVING.`

Using HAVING

`HAVING` overcomes the limitations of `WHERE` by allowing filtering of already aggregated data.
For example, to aggregate data by countries and filter out those with an average water stress index above a specific value:

SELECT 
 country,
 AVG(water_stress_index) AS avg_stress_index,
 SUM(gdp_per_capita) AS total_gdp,
 AVG(water_related_adaptation_tech) AS avg_adaptation
FROM climate
GROUP BY country
HAVING AVG(water_stress_index) > 0.5
ORDER BY avg_stress_index DESC;
Query to aggregate data by country and filter out those with an average water stress index above a specific value.

In this case, the query does not generate any error. It is important to note that, unlike the `WHERE` clause, `HAVING` allows the use of aggregate functions for filtering data.

To understand more, consider another example: We want to identify countries with an average GDP per capita above 20,000 and a water stress index below 0.4.

SELECT 
 country,
 AVG(gdp_per_capita) AS avg_gdp,
 AVG(water_stress_index) AS avg_stress
FROM climate
GROUP BY country
HAVING AVG(gdp_per_capita) > 20000 AND AVG(water_stress_index) < 0.4
ORDER BY avg_gdp;
Query to identify countries with an average GDP per capita above 20,000 and a water stress index below 0.4.

Again, the query generates no errors, further demonstrating the flexibility of the `HAVING` clause in working with aggregate data.

Recall that using aliases in aggregate functions within the `HAVING` clause can generate errors since the definition of aliases occurs only after the execution of `HAVING` during the `SELECT` phase.

SELECT 
 country, 
 AVG(gdp_per_capita) AS avg_gdp
FROM climate
GROUP BY country
HAVING avg_gdp > 20000
ORDER BY avg_gdp;
Query returning an error due to using ALIAS in the HAVING clause.

The query returns an error due to using the alias ‘avg_gdp’ in the `HAVING` clause. The alias is generated with the `SELECT` clause, however, which is executed after `HAVING,` hence the source of the error. In practice, when `HAVING` is executed, the name of the new column “avg_gdp” does not yet exist; therefore, query filtering cannot proceed. The problem is solved by modifying the query by removing the alias reference:

SELECT 
 country, 
 AVG(gdp_per_capita) AS avg_gdp
FROM climate
GROUP BY country
HAVING AVG(gdp_per_capita) > 20000
ORDER BY avg_gdp;
Query with correct processing of HAVING clause, without using ALIAS.

This time, the query returned no error, and we could filter the data correctly.
Thus, one must never forget that column aliases cannot be used in `HAVING` because their generation occurs later with `SELECT.`

Finally, going back to the first query of the exercise, by adding the clause `HAVING AVG(water_stress_index) IS NOT NULL` we can exclude from the final output the results with an average water stress value of null, obtaining the correct list:

Query used at the beginning of the exercise with the addition of the clause HAVING AVG(water_stress_index) IS NOT NULL to exclude from the final output results with a zero mean water stress value.

Execution order in SQL

Understanding the order in which SQL clauses are executed can help you avoid common mistakes, such as those discussed above, and is a detail that could also come in handy during a job interview.

The actual order of execution is as follows:

  1. `FROM`,
  2. `WHERE`,
  3. `GROUP BY`,
  4. `HAVING`,
  5. `SELECT`,
  6. `ORDER BY`,
  7. `LIMIT`.

However, when we write a query, we tend to use the following sorting, which differs from the actual sorting:

  1. `SELECT`,
  2. `FROM`,
  3. `WHERE`,
  4. `GROUP BY`,
  5. `HAVING`,
  6. `ORDER BY` ,
  7. `LIMIT`.

Query execution begins with `FROM,` which indicates the tables to be used, and `SELECT` is executed only after `HAVING.` This explains why `HAVING` cannot use aliases, while `ORDER BY` can: `ORDER BY` is executed after `SELECT,` when aliases have already been defined. This understanding of the execution order also explains why `HAVING` is used with `GROUP BY` to apply conditions on aggregated data, while `WHERE` cannot.

Conclusion

`WHERE` and `HAVING` differ when they filter the data: `WHERE` does it before aggregation, while `HAVING` does it after. This distinction, along with understanding the order of execution of SQL clauses in a query, is particularly valued in job interviews. This kind of knowledge is often a plus for “entry-level” positions, whereas for “senior” positions, a solid understanding of these concepts is generally considered a basic requirement, and errors in this area are not tolerated.

Important Note: All SQL queries presented in this tutorial were optimized and tested using PostgreSQL. If you use a different database management system, such as MySQL or SQL Server, some modifications may be necessary to ensure compatibility and proper query operation. For example, SQL Server, since version 2005, permits the use of aliases in the `HAVING` clause, while MySQL and SQL Server are not case-sensitive in table and column names.

Video


If you are interested in learning more about this topic or have specific questions, please get in touch with me using the references on my contact page. I will happily answer your questions and provide more information about my work as a Data Analyst. Thank you for taking the time to visit my website and for your interest in my work.

News tag:
Scroll to Top