Is the query cache useful?

Mark Callaghan posted a good test of the MySQL query cache in different versions. His tests clearly show that in 5.0.44 and 5.0.84 and 5.1.38, there is more query throughput when the query cache is disabled.

However, the tests are skewed — not on purpose, I am sure, and Mark admits he has not used the query cache before — but they are skewed all the same. Mark’s error was that he assumed he could just turn on the query cache and see if it works. Most features of MySQL do not work that way — you have to understand the strengths and weaknesses of the feature in order to use it properly.

Mark’s benchmark definitely reinforces that turning on the query cache without any knowledge of your system is a bad idea, and I agree with him on that. But it does not in any way mean that the query cache is always a bad idea. In fact, the MySQL manual page at http://dev.mysql.com/doc/refman/5.1/en/query-cache.html has some caveats about when the query cache is useful and when it is not useful.

It is important to know how the MySQL query cache works, so I will first explain that, and then explain why Mark’s test is not a very good broad generalization. MySQL’s query cache is not useful all the time, but it can be quite useful! If you are already familiar with the query cache, you can skip ahead to why Mark’s benchmark is skewed.


How to use the MySQL query cache
Firstly, there are a few ways to use the MySQL query cache. You can disable it, which means no queries are cached; you can set it to cache all queries, and use a special keyword to not cache queries (replace SELECT with SELECT SQL_NO_CACHE). query_cache_type=1 or ON will set the query cache in this way.

You can also set it to cache no queries except those that have a special keyword — you guessed it, replace SELECT with SELECT SQL_CACHE and a query will attempt to be cached. To set the query cache in this way, set query_cache_type=2 or DEMAND.

What queries are cached
Regardless of which method you choose, not all possible queries are cached. Only deterministic SELECT statements are cached — a statement such as SELECT count(*) from tbl is deterministic, while SELECT NOW() is non-deterministic for the purpose of the query cache.* Queries that use user-defined functions, temporary tables, LOCK IN SHARE MODE, FOR UPDATE, INFORMATION_SCHEMA or mysql system tables are not cached. Most SELECT subqueries are not cached. SELECT queries that are part of stored functions, triggers, views and events are not cached. However, in MySQL 5.1, SELECT queries that contain a view can be cached. The MySQL manual page at http://dev.mysql.com/doc/refman/5.1/en/query-cache-operation.html has more detailed information on what is allowed and what is not allowed.

In the query cache, a hash of the query is stored along with the result. There is a variable, query_cache_limit, which sets an upper limit on the size of the result set that can be cached. So even if a query can be cached, it may not be cached if the result set is larger than query_cache_limit.

How the query cache is used
The lifecycle of a non-prepared SELECT statement when the query cache is on (query_cache_type > 0) is:

1) hash the SELECT query
2) check the hash against the query cache
3) if hash matches, returned the result set from the query cache. If not, the query is sent to the parser and executed in the usual manner, and results are stored in the query cache if the query is cachable.

This means that having the query cache on has some overhead, even if no queries are ever cached.

A query (and its result) is removed from the query cache when it is invalidated or pruned. A query is invalidated when any table it refers to has been changed, even if the changes do not affect the actual result set in the query cache. A query is pruned when a new query needs to be inserted into the query cache and there is no room; the query cache prunes queries using a least recently used (LRU) algorithm.

In other words, there will be a lot of unnecessary overhead with the query cache if a table is often written to, because queries using that table will be constantly removed from the query cache. This does not mean that if you have some tables that are frequently written to, you should turn the query cache off — you can use the query_cache_type of ON with the SQL_NO_CACHE keyword to cache all except some queries, or you can use the query_cache_type of DEMAND and the SQL_CACHE keyword in SELECT queries to cache no queries except those you specify.

Why Mark’s Benchmark is skewed

It is important to note that InnoDB was set up with:
innodb_buffer_pool_size=2000M

and the sysbench program was run with:

../sysbench --test=oltp --oltp-table-size=2000000 --max-time=180 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-read-only --num-threads=$nt run

I will explain what the important sysbench variables are in context.

1) The benchmark used sysbench in OLTP mode against an InnoDB table. The InnoDB buffer pool is hot after the table is created and populated, while the query cache is empty and cold.

InnoDB has its own cache — the InnoDB buffer pool. sysbench is run with –oltp-read-only, meaning there are no INSERT, DELETE or UPDATE statements. This is good; it means that the results are not skewed by query cache invalidations, though there may be pruning.

The first thing sysbench does is preparation — it creates the table and populates it with data (2,000,000 rows as per oltp-table-size). After this is done, the InnoDB buffer pool (2000M in size) has gone from empty to containing data. I do not know how large the row size of the sysbench table is; however, I can do some calculations:

2000 Mb * 1024 = 2,048,000 Kb / 2,000,000 rows = 1.024 Kb per row

This means that in order for some data to *not* be current in the InnoDB buffer pool, the row size had to be approximately larger than 1 Kb per row.

The manual at http://sysbench.sourceforge.net/docs/#database_mode shows the schema of the table that is used.

CREATE TABLE `sbtest` ( `id` int(10) unsigned NOT NULL auto_increment, `k` int(10) unsigned NOT NULL default ‘0′, `c` char(120) NOT NULL default ”, `pad` char(60) NOT NULL default ”, PRIMARY KEY (`id`), KEY `k` (`k`)

(note that you at least need to add a final parenthesis “)” for the query to work, and should also specify ENGINE=INNODB if you want to test on your system).

So, that’s two 4-byte integers, and two character fields…for the sake of argument, let’s say you are using UTF-8 in MySQL 5.1, where utf8 is 3 bytes per character, that’s:

4 + 4 + 120*3 + 60*3 = 548 bytes

Let’s say you are using a version of MySQL that uses 4 bytes per character for utf8 (I know 6.0 was using that, not sure about 5.4):

4 + 4 + 120*4 + 60*4 = 728 bytes

So either way, all of the data in the table is already loaded in the InnoDB buffer pool when the test begins. The InnoDB buffer pool is already hot, whereas the query cache is empty.

On systems where a query can be cached and used instead of executing and having to look on disk for data, using the MySQL query cache will give you *much* better performance.

2) The query cache statistics are not given. It’s very possible that the only thing that was measured was how much overhead the query cache provides if it is never used.

The queries that are run are the default (oltp-test-mode is not set, and it defaults to “complex”) which can be seen under the “Advanced transactional” heading in the sysbench manual at http://sysbench.sourceforge.net/docs/#database_mode. Only read queries are done, so the queries are:

SELECT c FROM sbtest WHERE id=N
SELECT c FROM sbtest WHERE id BETWEEN N AND M
SELECT SUM(K) FROM sbtest WHERE id BETWEEN N and M
SELECT c FROM sbtest WHERE id between N and M ORDER BY c
SELECT DISTINCT c FROM sbtest WHERE id BETWEEN N and M ORDER BY c

The values for the variables in the query are chosen at random, though they are sensible values. The testing time is very short, only 3 minutes of running time (sysbench is called with max-time=180). It is quite possible that all the SELECT queries were put into the query cache and there was never a query cache hit. Of course there will be a lot of overhead if the query cache is never used! Particularly if it has to prune for more memory.

I would like to see Qcache_hits, Qcache_inserts, Com_select (number of SELECT statements that did not use the query cache) and Qcache_lowmem_prunes. That is what I look at for any system that has the query cache running, so I can determine the effectiveness of the query cache.

There is more in-depth information on how to find the usefulness of the query cache, query cache fragmentation, and a few ways to cache more queries in the MySQL Administrator’s Bible on pages 427-434 — part of the “Caching with MySQL” chapter. I used the book to help me write this blog post, so I do think it is fair to cite my sources….even though it happens to be a source I created (and that’s why I consider it a shameless plug, I have no shame citing my sources)!

—————-

So that all being said, the query cache is extremely useful when it is used for repeated queries. The benchmark Mark performed most likely did not actually use this. I have seen production environments fall apart when someone turned on on the query cache for all queries; Mark’s benchmark proves that it is not a good idea to randomly turn on the query cache. You have to know what you’re doing.

* for the purpose of replication, it is deterministic, but that’s another topic.

Mark Callaghan posted a good test of the MySQL query cache in different versions. His tests clearly show that in 5.0.44 and 5.0.84 and 5.1.38, there is more query throughput when the query cache is disabled.

However, the tests are skewed — not on purpose, I am sure, and Mark admits he has not used the query cache before — but they are skewed all the same. Mark’s error was that he assumed he could just turn on the query cache and see if it works. Most features of MySQL do not work that way — you have to understand the strengths and weaknesses of the feature in order to use it properly.

Mark’s benchmark definitely reinforces that turning on the query cache without any knowledge of your system is a bad idea, and I agree with him on that. But it does not in any way mean that the query cache is always a bad idea. In fact, the MySQL manual page at http://dev.mysql.com/doc/refman/5.1/en/query-cache.html has some caveats about when the query cache is useful and when it is not useful.

It is important to know how the MySQL query cache works, so I will first explain that, and then explain why Mark’s test is not a very good broad generalization. MySQL’s query cache is not useful all the time, but it can be quite useful! If you are already familiar with the query cache, you can skip ahead to why Mark’s benchmark is skewed.


How to use the MySQL query cache
Firstly, there are a few ways to use the MySQL query cache. You can disable it, which means no queries are cached; you can set it to cache all queries, and use a special keyword to not cache queries (replace SELECT with SELECT SQL_NO_CACHE). query_cache_type=1 or ON will set the query cache in this way.

You can also set it to cache no queries except those that have a special keyword — you guessed it, replace SELECT with SELECT SQL_CACHE and a query will attempt to be cached. To set the query cache in this way, set query_cache_type=2 or DEMAND.

What queries are cached
Regardless of which method you choose, not all possible queries are cached. Only deterministic SELECT statements are cached — a statement such as SELECT count(*) from tbl is deterministic, while SELECT NOW() is non-deterministic for the purpose of the query cache.* Queries that use user-defined functions, temporary tables, LOCK IN SHARE MODE, FOR UPDATE, INFORMATION_SCHEMA or mysql system tables are not cached. Most SELECT subqueries are not cached. SELECT queries that are part of stored functions, triggers, views and events are not cached. However, in MySQL 5.1, SELECT queries that contain a view can be cached. The MySQL manual page at http://dev.mysql.com/doc/refman/5.1/en/query-cache-operation.html has more detailed information on what is allowed and what is not allowed.

In the query cache, a hash of the query is stored along with the result. There is a variable, query_cache_limit, which sets an upper limit on the size of the result set that can be cached. So even if a query can be cached, it may not be cached if the result set is larger than query_cache_limit.

How the query cache is used
The lifecycle of a non-prepared SELECT statement when the query cache is on (query_cache_type > 0) is:

1) hash the SELECT query
2) check the hash against the query cache
3) if hash matches, returned the result set from the query cache. If not, the query is sent to the parser and executed in the usual manner, and results are stored in the query cache if the query is cachable.

This means that having the query cache on has some overhead, even if no queries are ever cached.

A query (and its result) is removed from the query cache when it is invalidated or pruned. A query is invalidated when any table it refers to has been changed, even if the changes do not affect the actual result set in the query cache. A query is pruned when a new query needs to be inserted into the query cache and there is no room; the query cache prunes queries using a least recently used (LRU) algorithm.

In other words, there will be a lot of unnecessary overhead with the query cache if a table is often written to, because queries using that table will be constantly removed from the query cache. This does not mean that if you have some tables that are frequently written to, you should turn the query cache off — you can use the query_cache_type of ON with the SQL_NO_CACHE keyword to cache all except some queries, or you can use the query_cache_type of DEMAND and the SQL_CACHE keyword in SELECT queries to cache no queries except those you specify.

Why Mark’s Benchmark is skewed

It is important to note that InnoDB was set up with:
innodb_buffer_pool_size=2000M

and the sysbench program was run with:

../sysbench --test=oltp --oltp-table-size=2000000 --max-time=180 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-read-only --num-threads=$nt run

I will explain what the important sysbench variables are in context.

1) The benchmark used sysbench in OLTP mode against an InnoDB table. The InnoDB buffer pool is hot after the table is created and populated, while the query cache is empty and cold.

InnoDB has its own cache — the InnoDB buffer pool. sysbench is run with –oltp-read-only, meaning there are no INSERT, DELETE or UPDATE statements. This is good; it means that the results are not skewed by query cache invalidations, though there may be pruning.

The first thing sysbench does is preparation — it creates the table and populates it with data (2,000,000 rows as per oltp-table-size). After this is done, the InnoDB buffer pool (2000M in size) has gone from empty to containing data. I do not know how large the row size of the sysbench table is; however, I can do some calculations:

2000 Mb * 1024 = 2,048,000 Kb / 2,000,000 rows = 1.024 Kb per row

This means that in order for some data to *not* be current in the InnoDB buffer pool, the row size had to be approximately larger than 1 Kb per row.

The manual at http://sysbench.sourceforge.net/docs/#database_mode shows the schema of the table that is used.

CREATE TABLE `sbtest` ( `id` int(10) unsigned NOT NULL auto_increment, `k` int(10) unsigned NOT NULL default ‘0′, `c` char(120) NOT NULL default ”, `pad` char(60) NOT NULL default ”, PRIMARY KEY (`id`), KEY `k` (`k`)

(note that you at least need to add a final parenthesis “)” for the query to work, and should also specify ENGINE=INNODB if you want to test on your system).

So, that’s two 4-byte integers, and two character fields…for the sake of argument, let’s say you are using UTF-8 in MySQL 5.1, where utf8 is 3 bytes per character, that’s:

4 + 4 + 120*3 + 60*3 = 548 bytes

Let’s say you are using a version of MySQL that uses 4 bytes per character for utf8 (I know 6.0 was using that, not sure about 5.4):

4 + 4 + 120*4 + 60*4 = 728 bytes

So either way, all of the data in the table is already loaded in the InnoDB buffer pool when the test begins. The InnoDB buffer pool is already hot, whereas the query cache is empty.

On systems where a query can be cached and used instead of executing and having to look on disk for data, using the MySQL query cache will give you *much* better performance.

2) The query cache statistics are not given. It’s very possible that the only thing that was measured was how much overhead the query cache provides if it is never used.

The queries that are run are the default (oltp-test-mode is not set, and it defaults to “complex”) which can be seen under the “Advanced transactional” heading in the sysbench manual at http://sysbench.sourceforge.net/docs/#database_mode. Only read queries are done, so the queries are:

SELECT c FROM sbtest WHERE id=N
SELECT c FROM sbtest WHERE id BETWEEN N AND M
SELECT SUM(K) FROM sbtest WHERE id BETWEEN N and M
SELECT c FROM sbtest WHERE id between N and M ORDER BY c
SELECT DISTINCT c FROM sbtest WHERE id BETWEEN N and M ORDER BY c

The values for the variables in the query are chosen at random, though they are sensible values. The testing time is very short, only 3 minutes of running time (sysbench is called with max-time=180). It is quite possible that all the SELECT queries were put into the query cache and there was never a query cache hit. Of course there will be a lot of overhead if the query cache is never used! Particularly if it has to prune for more memory.

I would like to see Qcache_hits, Qcache_inserts, Com_select (number of SELECT statements that did not use the query cache) and Qcache_lowmem_prunes. That is what I look at for any system that has the query cache running, so I can determine the effectiveness of the query cache.

There is more in-depth information on how to find the usefulness of the query cache, query cache fragmentation, and a few ways to cache more queries in the MySQL Administrator’s Bible on pages 427-434 — part of the “Caching with MySQL” chapter. I used the book to help me write this blog post, so I do think it is fair to cite my sources….even though it happens to be a source I created (and that’s why I consider it a shameless plug, I have no shame citing my sources)!

—————-

So that all being said, the query cache is extremely useful when it is used for repeated queries. The benchmark Mark performed most likely did not actually use this. I have seen production environments fall apart when someone turned on on the query cache for all queries; Mark’s benchmark proves that it is not a good idea to randomly turn on the query cache. You have to know what you’re doing.

* for the purpose of replication, it is deterministic, but that’s another topic.