Suppose we have 2 tables called Orders and Salesperson shown below:
Salesperson
|
Orders
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Now suppose that we want to write SQL that must conform to the SQL standard.
We want to retrieve the names of all salespeople that have more than 1 order from the tables above. You can assume that each salesperson only has one ID.
If that is the case, then what (if anything) is wrong with the following SQL?:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The answer and explanation to advanced SQL question 1
There is definitely something wrong with the SQL above, and it is probably something that most beginner SQL programmers may not notice. The problem is that the SQL Standard says that we can not select a column that is not part of the group by clause unless it is also contained within an aggregate function. If we try to run the SQL above in SQL Server, we would get an error that looks like this:
Column 'Name' is invalid in the select list because it is
not contained in either an aggregate function or
the GROUP BY clause.
You might be confused now, so let’s explain what that error means in plain English and through some simple examples. The most important thing you should take out of this discussion is understanding exactly why we get that error, and how to avoid it. There is a good reason for the error – read on to understand why.
You can see in the bad SQL above that the “Name” column is clearlynot also a part of the group by statement, nor is it contained within an aggregate function (like SUM, MAX, etc).
As the error above suggests, we can fix the error by either wrapping the Name column inside an aggregate function or adding it to the Group By clause.
So if we want to write SQL that complies with the standard, then we could write something like this by adding the Name column to the Group By:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id, Name
-- we added the name column to the group by, and now it works!
HAVING COUNT( salesperson_id ) >1
The SQL above will run just fine without giving any error.
We could also fix the problem by putting the Name column in any aggregate function, and then simply make that a part of our select statement. So, we could just write this SQL instead, and it would be perfectly legal according to the SQL standard. We chose to use the MAX aggregate function, but any other aggregate would work just fine as well:
SELECT MAX(Name) --put name in an aggregate function
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
Adding the Name column to the group by, or wrapping the Name column in an aggregate will certainly fix the error – but it’s very important to note that both of those things will change the data that is returned to a state that you may not want.
Why does the selected column have to be in the group by clause or part of an aggregate function?
So, now you understand how to fix the error – but do you understand why it is a problem in the first place? Well, you should – because that is the most important thing to understand! So, let’s explain some more about why SQL gives that error shown above .
First off, let’s talk a little bit more about aggregate functions. You probably know what aggregate functions in SQL are – we used one in the example above. In case you forgot, aggregate functions are used to perform a mathematical function on the values inside a given column, which is passed into the aggregate function. Here are some of the commonly used aggregate functions:
AVG() - Returns the average value
COUNT() - Returns the number of rows
FIRST() - Returns the first value
LAST() - Returns the last value
MAX() - Returns the largest value
MIN() - Returns the smallest value
SUM() - Returns the sum
To illustrate why the SQL standard says that a selected column has to be in the group by clause or part of an aggregate function, let’s use another example. Suppose we have some tables called Starbucks_Stores and Starbucks_Employees. In case you don’t already know, Starbucks is a popular coffee shop/cafe in the USA:
Starbucks_Employees
|
Starbucks_Stores
|
|||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Now, given the tables above let’s say that we write some SQL like this:
SELECT count(*) as num_employees, HourlyRate
FROM Starbucks_Employees JOIN Starbucks_Stores
ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id
GROUP BY city
It looks like the SQL above would just return the number of Starbucks employees in each city, along with the HourlyRate – because it will group the employees based on whatever city they work in (thanks to the “group by city” statement).
The problem with selecting a non-aggregate column that is not in the group by
But the real question here is what exactly would be returned for the HourlyRate in the SQL above? Would it return every employee’s hourly rate separated by commas? Since we group by city, will it return the highest hourly rate for each city? Will it return the hourly rate as a distinct list, so those 2 guys making 11 dollars an hour will have the 11 returned only once?
The problem here is that we do not know what will be returned because we are notspecific enough with what we are asking for in the SQL! If what we are asking for is not specific enough, then the SQL processor will not know what to return.
This is why almost all database implementations return an error when the SQL above is run (with the notable exception of MySQL) – and this is why the SQL does not conform to the Standard. In SQL Server running the SQL above will return the same error that we showed earlier.
Let’s explain even further in case the problem with that SQL is not crystal clear. The order of operations in which things will happen with the SQL above is:
1. The 2 tables are joined on the condition that the
Starbucks_Employees.StoreID column value is equal to the
Starbucks_Stores.store_id column values.
2. Groups are then created for each city - which means that
each distinct city will have it's own "group". So, there will
be a total of 3 groups one each for San Francisco, New York,
and Los Angeles.
3. The data we are interested in is selected from each group
that is created in step 2.
Because we end up with different groups based on the city, when we select a count(*), that will find the total count of rows in each and every group. But, the problem is that when we select HourlyRate, there will be multiple values for the HourlyRate within each group. For example, for the group created by the city of San Francisco there will be 4 different values for the HourlyRate – 14, 10, 11, and 13.
So the question is which value of the HourlyRate should be selected from each group? Well, it could be any one of those values – which is why that SQL results in an error. This is because what we are asking for is NOT specific enough – hopefully this is crystal clear now to you.
If the same HourlyRate were part of an aggregate function like MAX then it would simply return the highest HourlyRate within each group. And that is why having an aggregate function would fix the SQL error – because only one value will be selected from any given group.
So, this SQL is perfectly fine because we are more specific in what we ask for – but this SQL would only work for you if you actually want the highest HourlyRate for each city:
SELECT count(*) as num_employees, MAX(HourlyRate)
FROM Starbucks_Employees JOIN Starbucks_Stores
ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id
GROUP BY city
Fix the error by adding column to the group clause
Another way to fix the error is to simply add the HourlyRate column to the group by clause. This also means that having the HourlyRate column wrapped in aggregate function is no longer necessary. So you could write some SQL like this and it would fix the error:
SELECT count(*) as num_employees, HourlyRate
FROM Starbucks_Employees JOIN Starbucks_Stores
ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id
GROUP BY city, HourlyRate
This would then create groups based on the unique combination of the values in the HourlyRate and City columns. This means that there will be a different group for each HourlyRate and City combination – so $11, San Francisco and $11, Los Angeles will be 2 different groups.
With the SQL above, each group will only have one value for the HourlyRate, which also means that there will be no ambiguity or confusion when selecting the HourlyRate since there is only possible value to select. It is now very clear that one and only one HourlyRate value can be returned for each group.
Adding the column to the group by clause fixes the error but will alter the data that is returned
But, one very important thing to note is that even though adding the column to the group by will fix the error, it will also change the groups that are created. This means that the data returned will be completely different from what was returned before. So, the count(*) function will no longer return the count of employees in a given city, and will instead return the number of rows in each group created by the unique combination of the HourlyRate and city columns.
MySQL – selecting non-aggregate columns not in the group by
One very important thing that you should know is that MySQL actually allows you to have non-aggregated columns in the select list even if they are not a part of the group by clause (a quick side note: a non-aggregated column is simply a column that is notwrapped within an aggregate function). What this means is that you will not receive an error if you try to run any of the “bad” SQL above in MySQL. The reason it is allowed in MySQL is because MySQL assumes that you know what you are doing – and it does actually make sense in some scenarios. For instance, let’s refer back to the SQL that we started with:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The reason the original SQL code (presented above) works just fine in MySQL is because there is a 1 to 1 mapping of salesperson name to ID – meaning that for every unique salesperson ID there is only one possible name. Another way of saying that is that each salesperson can only have one name. So when we create groups (which is done in the “GROUP BY salesperson_id”) based on the salesperson ID, each group will only have one and only one name.
This SQL will also run just fine in MySQL without returning an error:
SELECT count(*) as num_employees, HourlyRate
FROM Starbucks_Employees JOIN Starbucks_Stores
ON Starbucks_Employees.StoreID = Starbucks_Stores.store_id
GROUP BY city
But, even though the code above will not return an error, the HourlyRate that is returned by MySQL will be some arbitrary (random) value within each group. This is because when we create each group based on the city, each group can have different values for the HourlyRate.
In other words, there is no one to one mapping between the HourlyRate and the city like we had before with the salesperson ID and the name. So, because we are not being specific as to which HourlyRate we want, MySQL will return an arbitrary value . For instance, in the group created by the city of San Francisco, MySQL could return the HourlyRate for any employee who works in San Francisco – whether it is 14, 10, 11, or 13 we don’t really know since it is arbitrary/random in MySQL.
No comments:
Post a Comment