SQL Antipatterns — Bill Karwin

Well, I came late, so I missed the first one….so we start with #2

#2. Ambiguous GROUP BY —

query BUGS and include details on the corresponding PRODUCT rows —

SELECT b.bug_id, p.product_name from bugs b NATURAL JOIN bugs_products NATURAL JOIN products GROUP BY b.bug_id;

We use the GROUP BY to get one row per bug, but then you lose information.

Standard SQL says that GROUP BY queries require the GROUP BY columns to be in the SELECT clause. MySQL does not enforce this. If a column is in a SELECT clause but not in the GROUP BY clause it displays a random value.

[my note, not said in the presentation this fools people when they want the groupwise maximum, they think that selecting multiple columns and grouping means that you get some particular row ]

Solution 1: Restrict yourself to standard SQL — do not allow columns in SELECT if
use GROUP BY

Solution 2: Use GROUP_CONCAT() to get a comma-separated list of distinct values in the row.

SELECT b.bug_id, GROUP_CONCAT(p.product_name) AS product_names FROM bugs b NATURAL JOIN bugs_products NATURAL JOIN products GROUP BY b.bug_id;

Performance: no worse than doing a regular group function because the concat happens after the grouping is done.

3. EAV Tables — Entity-Attribute-Value Tables.

Example: product catalog w/ attributes, too many to use one column per attribute. Not every product has every attribute. ie, DVD’s don’t have pages and books don’t have a region encoding.

Most people make an “eav” table, that has the attribute name and value and the entity name. It associates the entity name (say, “spiderman DVD”) with an attribute (“region encoding”) and value (“region 1”)

Why is this bad? It’s harder to apply constraints because the column may have many different values (ie, # of pages should be a number but region encoding is a character). This may be a sign of a bad data model.

So why is this bad?

EAV cannot require an attribute — if you were doing many columns per table, you could specify NOT NULL (ie, price). Well, you could do that with TRIGGERs, but MySQL does not raise errors or abort an operation that spawned a trigger — in other words, you can’t stop the row from being inserted, just have an event when a row is inserted.

EAV cannot have referential integrity to multiple lookup tables, or only for some values.

It’s also expensive and complex to find all the attributes for one entity. In order to get one row that looks like normalized data, you need one join per attribute, and you may not even know how many there are.

Solution: Try not to use EAV tables, defining your attributes in your data model (ie, one table per attribute type). If you do, application logic should enforce constraints. Don’t try to fetch attributes in a single row (that looks like normalized data); fetch multiple rows and use the application code to reconstruct the entity.

4. Letting users crash your server
Example: people request ability to query database flexibility. So the antipattern is to give them access to run their own SQL.

Solution: give an interface which allows parameters to queries. But watch out for SQL injection!

Filter input escaping strings, or use parameterized queries.

6. Forcing primary keys to be contiguous

Example: managers don’t like gaps in invoice #’s. Antipattern is to try to reuse primary key values to fill in the gaps. Another antipattern is to change values to close the gaps.

Solution — deal with it. Do not reuse primary keys. Also, do not use auto_increment surrogate keys for everything if you do not need to.

Well, I came late, so I missed the first one….so we start with #2

#2. Ambiguous GROUP BY —

query BUGS and include details on the corresponding PRODUCT rows —

SELECT b.bug_id, p.product_name from bugs b NATURAL JOIN bugs_products NATURAL JOIN products GROUP BY b.bug_id;

We use the GROUP BY to get one row per bug, but then you lose information.

Standard SQL says that GROUP BY queries require the GROUP BY columns to be in the SELECT clause. MySQL does not enforce this. If a column is in a SELECT clause but not in the GROUP BY clause it displays a random value.

[my note, not said in the presentation this fools people when they want the groupwise maximum, they think that selecting multiple columns and grouping means that you get some particular row ]

Solution 1: Restrict yourself to standard SQL — do not allow columns in SELECT if
use GROUP BY

Solution 2: Use GROUP_CONCAT() to get a comma-separated list of distinct values in the row.

SELECT b.bug_id, GROUP_CONCAT(p.product_name) AS product_names FROM bugs b NATURAL JOIN bugs_products NATURAL JOIN products GROUP BY b.bug_id;

Performance: no worse than doing a regular group function because the concat happens after the grouping is done.

3. EAV Tables — Entity-Attribute-Value Tables.

Example: product catalog w/ attributes, too many to use one column per attribute. Not every product has every attribute. ie, DVD’s don’t have pages and books don’t have a region encoding.

Most people make an “eav” table, that has the attribute name and value and the entity name. It associates the entity name (say, “spiderman DVD”) with an attribute (“region encoding”) and value (“region 1”)

Why is this bad? It’s harder to apply constraints because the column may have many different values (ie, # of pages should be a number but region encoding is a character). This may be a sign of a bad data model.

So why is this bad?

EAV cannot require an attribute — if you were doing many columns per table, you could specify NOT NULL (ie, price). Well, you could do that with TRIGGERs, but MySQL does not raise errors or abort an operation that spawned a trigger — in other words, you can’t stop the row from being inserted, just have an event when a row is inserted.

EAV cannot have referential integrity to multiple lookup tables, or only for some values.

It’s also expensive and complex to find all the attributes for one entity. In order to get one row that looks like normalized data, you need one join per attribute, and you may not even know how many there are.

Solution: Try not to use EAV tables, defining your attributes in your data model (ie, one table per attribute type). If you do, application logic should enforce constraints. Don’t try to fetch attributes in a single row (that looks like normalized data); fetch multiple rows and use the application code to reconstruct the entity.

4. Letting users crash your server
Example: people request ability to query database flexibility. So the antipattern is to give them access to run their own SQL.

Solution: give an interface which allows parameters to queries. But watch out for SQL injection!

Filter input escaping strings, or use parameterized queries.

6. Forcing primary keys to be contiguous

Example: managers don’t like gaps in invoice #’s. Antipattern is to try to reuse primary key values to fill in the gaps. Another antipattern is to change values to close the gaps.

Solution — deal with it. Do not reuse primary keys. Also, do not use auto_increment surrogate keys for everything if you do not need to.

Comments are closed.