Selectivity and Index Performance

Let’s look into selectivity, as it’s an important topic when looking at index performance. (Oooh, I said “performance”; watch everyone’s ears perk up!)

This will probably answer the questions “Why isn’t MySQL using my index?” or “Why is my query so slow when I have an index on it?”

Selectivity describes how different the values of a field are. It is a number from 0 to 1, though you can also think of it as a percentage. A value of 1, or 100%, means that every value in the field is unique. This is always true of UNIQUE and PRIMARY keys, although non-unique fields may also have a selectivity of 1; for example, a timestamp field in a not-often-used table.

To calculate it, you take the number of DISTINCT values in a field and divide by the total number of records.
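
In SQL, that’s a single query. A minimal sketch, using the “username” field from the Users table shown below (substitute your own table and field names):

SELECT COUNT(DISTINCT username) / COUNT(*) AS selectivity
  FROM Users;

A result of 1 means every value is unique; a result near 0 means the field is mostly duplicates.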

My company has a large Users table, so I grabbed some statistics from it:

+----------+
| count(*) |
+----------+
|   817666 |
+----------+
1 row in set (0.63 sec)

+--------------------------+
| count(distinct username) |
+--------------------------+
|                   817666 |
+--------------------------+
1 row in set (1.63 sec)

So the selectivity is 817666/817666, or 1. If this were not a UNIQUE KEY already, it would be a good candidate for one.
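
If you wanted to make it one, it’s a one-liner. A sketch, where the index name uniq_username is made up:

ALTER TABLE Users ADD UNIQUE KEY uniq_username (username);

Note that this will fail if any duplicate usernames have crept in, which is itself a useful check.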

The “created” field is a timestamp of when the user record was created:
+-------------------------+
| count(distinct created) |
+-------------------------+
|                  811227 |
+-------------------------+
1 row in set (2.04 sec)

As I expected, there are *some* duplicates, but for the most part, everyone has a different creation time. 811227/817666 ≈ 0.99.

+--------------------+
| count(distinct IP) |
+--------------------+
|             544694 |
+--------------------+
1 row in set (2.35 sec)

This is interesting: lots of people log on from public places, or use their friends’ computers, so there are duplicate IPs. (Only the last IP used to log in is stored with each user, so it’s a one-to-one relationship.) The selectivity here is 544694/817666, or about 0.67.

+-------------------------+
| count(distinct browser) |
+-------------------------+
|                   25699 |
+-------------------------+
1 row in set (1.70 sec)

This is the browser string the server records for each user (again, just the last browser the user logged in with). That gives a selectivity of 25699/817666, or about 0.03.

+---------------------+
| count(distinct age) |
+---------------------+
|                  83 |
+---------------------+
1 row in set (0.63 sec)

There are only 83 different reported ages on our site. That makes the selectivity of age 83/817666, or 0.000101508. That is very low, effectively zero.

So why is this important? I’m glad you asked….

MySQL has a cost-based optimizer. This means that MySQL calculates the costs of different ways of performing a query and then chooses the cheapest one. Sounds reasonable, right? Well, calculating costs is an inexact science: to know the exact cost, the optimizer would actually have to run the query. So it makes an estimate instead, and the estimate is sometimes wrong, though most of the time it is correct.

In contrast, some database systems use a rule-based optimizer. This means that no matter what state the data is in, the database follows fixed rules to figure out the “optimal” path to the query. In most enterprise-level database systems, a cost-based optimizer performs better than a rule-based one; in other words, there are so many exceptions to the rules that the calculation overhead is worth it.

(Just to clarify, in both systems, the correct result set will be generated. The optimizer determines the path to the information.)

This cost-based optimizer uses selectivity information when it decides whether or not to use an index.

But what does this mean for me?

Well, in the example above, it means that if you want to query folks by age or age group, it’s useless to put an index on age. It’s a waste of CPU time and disk I/O to maintain an index on something with such low selectivity; the optimizer will NEVER use it.
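
A quick way to see what the optimizer actually decided is EXPLAIN. A hypothetical example (age = 25 is just a sample value):

EXPLAIN SELECT * FROM Users WHERE age = 25\G

If the optimizer has rejected the index, you’ll see type: ALL (a full table scan) and key: NULL in the output, even though an index on age exists.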

I’ve heard that the optimizer will do a full table scan if it calculates that a query will return more than about 30% of the table. Why is that number so low? Why not more like 50 or 75%? Well, first the server has to go to the index, search it, and check whether each index record matches. Then it has to follow the index record’s pointer to the real record on disk. The MySQL gurus have decided that around 30% is the point where using the index becomes slower than just doing a full table scan.

(Note: I’m not exactly sure what the number is, but I’ve heard it’s around 30%. As well, at the user conference I saw graphs showing that for the most part this held true. I thought it was in Jay Pipes’ Performance Tuning presentation, but the graphs are not in the slides. Pointers are appreciated.)

So in this case, should we put an index on browser? Well, this is one of those cases where I’d think about how often we’d run such queries, and how much we care about server performance while generating a report versus while inserting or updating a record. If we really care one way or the other, go that way. And document it!

Another thing to consider is the nature of the data. Something like age is not going to change much; sure, we might have some 120-year-olds eventually, but there’s never going to be much variance. Browsers, on the other hand, will only multiply, considering the different version numbers and OS configurations of standard browsers as well as mobile phone browsers.

However, if it does not matter, or if it’s too difficult to decide which is more important (your boss says “we can’t be slow during normal usage OR during reporting!”), I default to MySQL, which is better at optimization than I am: I’d probably add the index and let MySQL decide whether to use it or do a full table scan, particularly since I expect the number of browsers to keep increasing.
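
If you do add it, it’s cheap to try and easy to undo. A sketch, where the index name idx_browser is made up (and if browser is a long string column, you may want to index just a prefix of it):

ALTER TABLE Users ADD INDEX idx_browser (browser);
-- and if the reports don’t speed up, drop it again:
-- ALTER TABLE Users DROP INDEX idx_browser;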

IP, on the other hand, has a selectivity of 0.67, so putting an index on it is worth it if we query on IPs a lot.

I hope this article has been helpful!

Jay Pipes has a great article on using the INFORMATION_SCHEMA to find out selectivity:
http://tinyurl.com/kfffp
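
The gist of that approach, as a rough sketch: INFORMATION_SCHEMA.STATISTICS stores an estimated CARDINALITY (number of distinct values) for each indexed column, and INFORMATION_SCHEMA.TABLES stores an estimated row count, so dividing one by the other approximates selectivity for columns that already have an index ('your_db' below is a placeholder):

SELECT s.TABLE_NAME,
       s.COLUMN_NAME,
       s.CARDINALITY / t.TABLE_ROWS AS selectivity
  FROM INFORMATION_SCHEMA.STATISTICS s
  JOIN INFORMATION_SCHEMA.TABLES t
    ON s.TABLE_SCHEMA = t.TABLE_SCHEMA
   AND s.TABLE_NAME = t.TABLE_NAME
 WHERE s.TABLE_SCHEMA = 'your_db'
   AND t.TABLE_ROWS > 0;

Keep in mind that both numbers are estimates, so treat the results as rough guides rather than exact figures.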
