Category Archives: Performance

OurSQL Episode 19: MySQL Proxy

Direct play the episode at:
http://technocation.org/content/oursql-episode-19%3A-mysql-proxy-0

Feedback:
http://odeo.com/sendmeamessage/Sheeri

Call the comment line at +1 617-674-2369.
E-mail podcast@technocation.org

News:
MySQL Focuses on Japan
http://www.mysql.com/news-and-events/news/article_1368.html

MySQL Associate Certification Now Available
http://www.mysql.com/certification/

Learning resource:
Pythian Group’s Carnival of the Vanities for the DBA community, published weekly on Fridays.
http://www.pythian.com/blogs/category/log-buffer/

Feature:
MySQL Proxy
Giuseppe Maxia’s Blog:
http://datacharmer.blogspot.com

Getting Started with MySQL Proxy article plus tutorials:
http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html

public Subversion tree:

http://forge.mysql.com/wiki/mysql_proxy

Tutorials:
Intercept and dump queries (part 1): http://forge.mysql.com/snippets/view.php?id=75

Make macros to map “cd” to “use” and “ls” to “show tables” (part 2): http://forge.mysql.com/snippets/view.php?id=76

Injection Queries (part 3): http://forge.mysql.com/snippets/view.php?id=77

Lua interpreted language:
http://www.lua.org/

OurSQL Episode 18: De-myth-tifying Indexes

Direct play this episode at:
http://technocation.org/content/oursql-episode-18%3A-de-myth-tifying-indexes-0

Feedback:

Email podcast@technocation.org

call the comment line at +1 617-674-2369

use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri

Or use the Technocation forums:
http://technocation.org/forum

Episode 18: De-Myth-tifying Indexes

Where I have been:
Wedding video (short) and photos:
http://www.sheeri.com/wedding

Honeymoon (and wedding) photos:
http://www.sheeri.com/photos

News:
FREE training for Oracle DBAs who want to learn MySQL
http://www.planetmysql.org/robin/?p=51
http://www.mysql.com/oracle/

mysqlnd (PHP native driver) needs testers and benchmarkers:
http://www.planetmysql.org/kaj/?p=117

Learning Resource:
The MySQL category at howtoforge.com:
http://www.howtoforge.com/taxonomy_menu/1/7

Podcast Promo:
http://www.themysqlguy.com

Feature:

Big O notation:
http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node8.html

BTREE Index podcast:
http://technocation.org/content/oursql-episode-13%3A-nitty-gritty-indexes-0

HASH Index podcast:
http://technocation.org/content/oursql-episode-17%3A-hashing-it-out-0

Quote:
http://www.helpothers.org/story.php?sid=6784

Music:
The main theme used is Angry Red Dwarf’s “I Dream About You”
http://tinyurl.com/26hbg6

Smallfish’s “The Thank you song”
http://www.smallfishadventures.com/Home.html

OurSQL Episode 10: How About Some Cache?

This week I talk about the MySQL Query Cache.

Direct play the podcast here:
http://technocation.org/content/oursql-episode-10%3A-how-about-some-cache%3F-0

Subscribe to the podcast by clicking:
http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=206806301

You can directly download all the OurSQL podcasts at:
http://technocation.org/podcasts/oursql/

Show notes:

Listener Feedback:
http://dev.mysql.com/doc/refman/4.1/en/mysql-commands.html

News:
Daylight Savings Time and how to check your system:
http://sheeri.net/archives/188

There’s not much more time left to register for the MySQL Users Conference & Expo before the $200 early bird discount disappears!
http://www.mysqlconf.com

Learning Resource:

Check out the 2006 MySQL conference presentation slides by the speakers! http://mysqlconf.com/pub/w/45/presentations.html

Feature: How about some cache?
The MySQL Manual has a short, very readable chapter on the Query Cache, which starts here:

http://dev.mysql.com/doc/refman/5.0/en/query-cache.html
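
A quick way to see whether the query cache is enabled on your server, and how it is being used, is to look at the standard server variables and status counters (output will vary by server, of course):

SHOW VARIABLES LIKE 'query_cache%';   -- query_cache_size, query_cache_type, etc.
SHOW STATUS LIKE 'Qcache%';           -- Qcache_hits, Qcache_inserts, Qcache_lowmem_prunes, etc.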

MySQL Queues, part II — groups of queues

I believe this is a huge optimization for a heavily implemented Web 2.0 idea.

This article makes simple work of groups of queues. An example of this would be “the most recent 10 people to view an article,” so each article has a queue of up to 10 items in it. This method eliminates the need for multiple SQL statements or using TRIGGERS to check to see if the queue is full.

I bow down to Baron Schwartz, aka Xaprb, for his article on how to implement a queue in SQL:

http://www.xaprb.com/blog/2007/01/11/how-to-implement-a-queue-in-sql/

I am very excited because this also works for groups of objects, and we’re about to implement something at work that needs this idea. The idea of “the most recent x things” or “the top x things” is huge, especially in social networking, and probably one of the most sought-after features.

The biggest issue is that in order to display, say, the most recent posts, a query has to find the time of all the posts and get only the most recent 10. This can be made easy by the logic that the 10 most recent posts are the last 10 rows in the table. Any additional logic can be layered on top, as in “the last 10 rows in the table that are viewable by this guest/login.”

What if you want to track the last 10 people to view the post? Aha, this gets trickier. Convention would say that when a person views a post, you run an SQL transaction that adds the information (person x viewed post y at time z, plus any other info such as browser type, IP, etc.) and, if there are more than 10 entries for that post, deletes the oldest ones until only 10 remain. This transaction could be done via the application code or via triggers in MySQL 5.0 and up.

However, both those methods use multiple SQL queries, and in the case that an article has been viewed fewer than 10 times, the queries are unnecessary. And given each article has a different popularity — some are viewed lots more than others — running multiple queries ends up being a waste of cycles for articles whose last 10 viewers change infrequently.

These commands were tested on MySQL 4.1.19-standard-log. I use REPLACE INTO because it’s shorter than INSERT … SELECT … ON DUPLICATE KEY UPDATE, and yes, the two aren’t exactly equivalent.

Let’s say you have a New Year’s Resolution to eat 5 servings of fruits and 5 servings of vegetables per day. The only thing that changes from Baron’s example is that we add a group field (called ‘kind’). The “fruit” field was changed to “food” and will still contain the name of the edible.

As Baron does, I will use a MySQL-specific command. However, he used INSERT … SELECT … ON DUPLICATE KEY UPDATE and I will use REPLACE, as its syntax is shorter.

use test;
CREATE TABLE q (
id int NOT NULL,            -- ever-increasing counter, per kind
modulo int NOT NULL,        -- id MOD 5, so only 5 possible values per kind
kind char(1) NOT NULL,      -- the group: 'f' for fruit, 'v' for vegetable
food varchar(10) NOT NULL,  -- the name of the edible
PRIMARY KEY(id,kind),
UNIQUE KEY(modulo,kind)     -- REPLACE overwrites the oldest row when modulo wraps around
);

The basic statement is below — I’ve added AS clauses to make the variables more clear. The modulus is 5 in this case, but in the article scenario above it would be 10. The “kind” is either “f” or “v”; these are your groups of queues. In this case they stand for “fruits” and “vegetables”, but they might be numbers referring to articles. The “food” stands for the type of food eaten, but in the article scenario it would represent the username or user id of the customer viewing the article.

REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1) AS id,
(COALESCE(MAX(id), -1) + 1) MOD 5 AS modulo,
'f' AS kind,
'apple' AS food
FROM q WHERE kind='f';
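
To read the queue back, the most recent entries for a kind are simply its highest id values; at most 5 rows exist per kind, so no LIMIT is even needed:

SELECT food
FROM q
WHERE kind='f'
ORDER BY id DESC;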


Lots of Boolean Values

So, the biggest issue from last month’s Boston MySQL Meetup was “What is the most efficient way to query across many values that are boolean in nature?”

Now, it was a bit more complex than that, as it always is. Values may be set or not set, and people may search across any number of boolean values, or all of them. There are over 1 million entries, and over 40 boolean values!

A few ways of doing so came up:

1) Simply put all the values into separate columns, with a BOOLEAN type (really TINYINT(1))
2) Using ENUMs
3) Using foreign keys to “join tables”
4) Using SET
5) Using bits and bit operations

Using real data, the original questioner and I sought to find the best answer for his case.
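
For illustration only, here is a minimal sketch of option 5 (bits and bit operations), with hypothetical table and flag names (not the actual data from the Meetup):

CREATE TABLE entries (
  id int NOT NULL PRIMARY KEY,
  flags bigint unsigned NOT NULL DEFAULT 0   -- one bit per boolean value, up to 64 of them
);

-- set flag #3 for entry 42
UPDATE entries SET flags = flags | (1 << 3) WHERE id = 42;

-- find entries where flags #3 and #7 are both set
SELECT id FROM entries
WHERE flags & ((1 << 3) | (1 << 7)) = ((1 << 3) | (1 << 7));

One trade-off to keep in mind: a bitwise test like the last query cannot use an ordinary index on flags, which is exactly why the question is interesting.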

One TRIGGER Fact, Optimizer Rewriting Stuff

  • One thing about TRIGGERs is that you can have 6 triggers per table: a BEFORE and an AFTER trigger for each of INSERT, UPDATE, and DELETE.

    That was brought up at the User Group meeting.
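
    As a sketch (MySQL 5.0+ syntax; the table t, its created column, and the t_audit table are hypothetical), two of those six might look like:

    CREATE TRIGGER t_bi BEFORE INSERT ON t
    FOR EACH ROW SET NEW.created = NOW();

    CREATE TRIGGER t_ad AFTER DELETE ON t
    FOR EACH ROW INSERT INTO t_audit (old_id, deleted_at) VALUES (OLD.id, NOW());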

    In other news, I have been reading the MySQL Internals Manual. The most interesting chapter yet is Chapter 3, the optimizer. This can be read out of order, and actually contains lots of information that folks can use. For instance:

    WHERE col1=col2 AND col2='x'
    transforms to
    WHERE col1='x' AND col2='x'

    because constant matching is faster.

    Tautologies and always-false statements are usually optimized out of the query so that something like WHERE 0=0 does not waste time by actually being tested.

    A table with 0 or 1 rows is a constant value. Another constant table scenario is WHERE col1=[constant] for all the columns of any UNIQUE or PRIMARY KEYS, assuming they are defined as NOT NULL. This is a good reason to use NOT NULL for column definitions unless you really need NULL.

    A constant table is what is used by the optimizer when you see “const” as the “join type” in the result of an EXPLAIN.
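
    As a quick illustration (hypothetical table, just to see the behavior):

    CREATE TABLE lookup (
      id int NOT NULL,
      name varchar(20) NOT NULL,
      PRIMARY KEY (id)
    );
    INSERT INTO lookup VALUES (1,'a'), (2,'b'), (3,'c');

    EXPLAIN SELECT name FROM lookup WHERE id = 3;
    -- expect "const" in the type column: an equality on the full PRIMARY KEY
    -- with a constant can match at most one row, so the table is read as a constant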

    WHERE col1 IN (1,2,3) is exactly the same expression as WHERE col1=1 OR col1=2 OR col1=3

    WHERE col1 BETWEEN 1 AND 3 is exactly the same expression as WHERE col1>=1 AND col1<=3

    The latter two examples are good to know, because the first statement in each pair is clearer.

    Semi-Dynamic Data

    http://www.artfulsoftware.com has a “Common Queries” page, which I find rather useful. I hadn’t realized its history, as described on the homepage:

    Our collection of common MySQL queries outgrew Chapter 9, and is still growing, so we turned it into a PHP page driven from a MySQL table.

    One day, I clicked on the page and got the dreaded “blank” PHP page.

    This gets into one of the fundamental flaws I find with “semi-dynamic data” (my terminology for it) — it is not completely dynamic data, because it gets updated by humans, and it is deterministic*, so it does not need a completely dynamic page.

    Part of the updating process could be a “generate the web page” script, that runs what the actual page is now, but stores the result as an HTML page. In this way, if 1000 users want the same page, there are *no* database queries done. After all, it only needs to change when content is uploaded, which isn’t very often at all.

    The “generation” script could easily be a part of a web form that uploads content, or it could be a separate form/script run after a batch of changes is done, so multiple changes do not require generating pages that will just be written over after the next update in a minute or so. As well, it could write to a temporary file, and the very last script action would move the temporary file to the right place. In this way, a generation script that takes a long time to finish would not be partially overwritten by another, simultaneous generation script.

    I have used this technique in content management systems — particularly with templates, as I’ve found you can separate different content items (such as menus) and “break apart” a template into pieces. In about an hour you can add support for a new template to an existing system, and have a user compare templates to see which they’d rather use, given their own content.

    I have also used this technique with a listing of organizations around the world. All the listings (3,000) were stored in a database. From this, I ran a weekly (but it would be easy to run it hourly or daily) script that made “browse by” pages, categorizing all of the resources by first letter of their name as well as by their area, province/state and country. The script, which took a full 10 minutes due to poor optimization, made an overall browsing page, 26 “by letter” pages, one page for each country, and a directory with one page for each state/province and area for each country. It also generated the page and compared it to the existing page, and only overwrote the page when they differed (and then put a “last updated on:” message at the end).

    Folks searching could still get truly dynamic pages, but I cut down on needless direct database calls to find out which organizations were in “England”, and more needless database calls to find out which organizations were in the “Greater London” area, when those pages change rather rarely.

    This could also be useful for large sites, such as photo sharing galleries. Sites that allow comments may or may not see a performance gain — for each new comment, generating the page again may not be the best solution. However, if there’s at least one page hit for each database write, then using this method will have better performance.

    * an example of a nondeterministic page is one that changes based on the time, such as “show me all activity that has happened today [up until this moment]” because it may change from second to second.

    Selectivity and Index Performance

    Let’s look into selectivity, as this is an important topic when looking at index performance. (Oooh, I said “performance”, watch everyone’s ears perk up!).

    This will probably answer the questions “Why isn’t MySQL using my index?” or “Why is my query so slow when I have an index on it?”

    Selectivity describes how varied the values in a field are. It is a number from 0 to 1, although you can also think of it as a percentage. A value of 1, or 100%, means that each value in the field is unique. This happens with UNIQUE and PRIMARY keys, although non-unique fields may also have a selectivity of 1 — for example, a timestamp value in a not-often-used table.

    To calculate this, you take the number of DISTINCT values in the field and divide by the total number of records.
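
    In SQL, that calculation is simply (column and table names here are placeholders):

    SELECT COUNT(DISTINCT col1) / COUNT(*) AS selectivity FROM tbl;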

    My company has a large Users table, so I grabbed some statistics off of that:


    +----------+
    | count(*) |
    +----------+
    | 817666 |
    +----------+
    1 row in set (0.63 sec)

    +--------------------------+
    | count(distinct username) |
    +--------------------------+
    | 817666 |
    +--------------------------+
    1 row in set (1.63 sec)

    So the selectivity is 817666/817666, or 1. If this were not a UNIQUE KEY already, it would be a good candidate for one.

    The “created” field is a timestamp for when the user record was created:
    +-------------------------+
    | count(distinct created) |
    +-------------------------+
    | 811227 |
    +-------------------------+
    1 row in set (2.04 sec)

    As I expect, there are *some* duplicates, but for the most part, everyone has a different creation time. 811227/817666 = 0.99.

    +--------------------+
    | count(distinct IP) |
    +--------------------+
    | 544694 |
    +--------------------+
    1 row in set (2.35 sec)

    This is interesting — lots of people log on from public places, or use their friends’ computers, so there are duplicate IPs (the last IP used to log in is associated with a user, so it’s a 1-to-1 relationship). The selectivity here is 0.67.

    +-------------------------+
    | count(distinct browser) |
    +-------------------------+
    | 25699 |
    +-------------------------+
    1 row in set (1.70 sec)

    This is what the server reports the user’s browser is. It records the last browser used by the user. This gives us about a 0.03 for selectivity.

    +---------------------+
    | count(distinct age) |
    +---------------------+
    | 83 |
    +---------------------+
    1 row in set (0.63 sec)

    There are only 83 different reported ages on our site. That makes the selectivity of age 0.000101508. That is very low, effectively zero.

    So why is this important? I’m glad you asked….

    MySQL has a cost-based optimizer. This means that MySQL calculates the costs of different ways of performing a query and then chooses the cheapest one. Sounds reasonable, right? Well, calculating the costs is an inexact science. In order to calculate the exact cost, the optimizer would actually have to run the query. So an estimate is taken, and the estimate is wrong sometimes. Most of the time the estimate is correct.

    In contrast, some database systems allow a rule-based optimizer. This means that no matter what the data state, the database uses rules to figure out the “optimal” path to the query. In most enterprise-level database systems, a cost-based optimizer performs better than a rule-based optimizer. In other words, there are so many exceptions to the rules that the calculation overhead is worth it.

    (Just to clarify, in both systems, the correct result set will be generated. The optimizer determines the path to the information.)

    This cost-based optimizer uses selectivity information when it decides whether or not to use an index.

    But what does this mean for me?

    Well, in the example above, this means if you want to query folks by age or age group, it’s useless to put an index on it. It’s a waste of cpu time and disk I/O to have an index for something with such a low selectivity. The optimizer will NEVER use it.

    I’ve heard that the optimizer will do a full table scan if it calculates it will return more than 30% of the table. Why is that number so low, why not more like 50 or 75%? Well, first the server has to go to the index, search the index, and find if the index record matches. Then it needs to follow the index record’s pointer to the real record on disk. And the MySQL gurus have decided that around 30% is the place where using the index is slower than just doing a full table scan.

    (Note: I’m not exactly sure what the # is but I’ve heard it’s around 30%. As well, at the user conference I saw graphs that showed that for the most part this was true. I thought it was in Jay Pipes’ Performance Tuning presentation, but the graphs are not in the slides. Pointers are appreciated.)

    So in this case, should we put an index on browser? Well, this is one of those cases where I’d think about how often we’d be doing queries and how much we care about server performance doing a report versus server performance while inserting or updating a record. If we really care one way or another, go that way. And document!

    Another thing to consider is the nature of the data. For something like age, that’s not going to change. Sure, we might have some 120 year olds eventually, but there’s not going to be that much variance. For browsers, there will only be more and more types put out, considering different version numbers and OS configurations of standard browsers as well as mobile phone browsers.

    However, if it does not matter, or if it’s too difficult to decide which is more important (your boss says “we can’t be slow during normal usage OR during reporting!”) I default to MySQL — it’s better at optimization than I am. I’d probably put an index so MySQL could decide whether to do a full table scan or use an index, particularly given that I expect the number of browsers to keep increasing.

    For something like IP, that has a selectivity of 0.67, so putting an index on it is worth it if we query on IPs a lot.

    I hope this article has been helpful!

    Jay Pipes has a great article on using the INFORMATION_SCHEMA to find out selectivity:
    http://tinyurl.com/kfffp
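
    Not Jay’s exact query, but one rough way to estimate per-index selectivity from the INFORMATION_SCHEMA (MySQL 5.0+; CARDINALITY is itself an estimate, so treat the result as approximate):

    SELECT s.TABLE_NAME, s.INDEX_NAME, s.COLUMN_NAME,
           s.CARDINALITY / t.TABLE_ROWS AS approx_selectivity
    FROM INFORMATION_SCHEMA.STATISTICS s
    JOIN INFORMATION_SCHEMA.TABLES t
      ON s.TABLE_SCHEMA = t.TABLE_SCHEMA AND s.TABLE_NAME = t.TABLE_NAME
    WHERE s.TABLE_SCHEMA = 'your_database'
      AND t.TABLE_ROWS > 0;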

    The Care and Feeding of MySQL Tables

    Our site went from weekly crashes during our two busiest nights to not even peeping this week (during the two busiest nights), and the only thing we changed was that we did some table maintenance. We hadn’t done table maintenance at least as long as I’ve been around, which is 6 months. We are a site with high volumes of both reads and writes. This article will talk about the care and feeding of tables; feel free to use this for justification to have a maintenance window, or even permission to run table maintenance statements.

    MySQL uses a cost-based optimizer to best translate the written query into what actually happens. This means when you write:

    SELECT foo FROM t1 INNER JOIN t2 USING (commonField);

    The optimizer looks at the statistics for tables t1 and t2 and decides which is better:
    1) To go through each item in t1, looking for a matching “commonField” in t2
    or
    2) To go through each item in t2, looking for a matching “commonField” in t1

    If t1 is very large and t2 is very small, it makes sense to follow plan #2. This is a simplified example, of course.

    From the documentation:
    ANALYZE TABLE analyzes and stores the key distribution for a table. During the analysis, the table is locked with a read lock. This statement works with MyISAM, BDB, and InnoDB tables.

    If the key distribution is off, the optimizer will be using incorrect (out-of-date) information. Therefore, the optimizations it makes will not be…well, optimal.

    ANALYZE TABLE takes a very short amount of time — less than a second for even a million rows. I tested with InnoDB and MyISAM, but I’d guess that BDB is the same. Our database of 14G took less than a minute to analyze all 112 tables in 3 databases.

    Documentation: http://dev.mysql.com/doc/refman/5.0/en/analyze-table.html

    CHECK TABLE checks tables and views for incorrectly closed tables and incorrect or deleted links, and verifies checksums for the rows. It can also check for full consistency.

    This takes a bit — checking our tables for everything (but not checking for full consistency, as that takes longer) took 11 minutes (14G, 112 tables in 3 databases). Next month I will run CHECK TABLE ... EXTENDED and see how long that takes.

    Documentation: http://dev.mysql.com/doc/refman/5.0/en/check-table.html

    And the daddy of them all:
    OPTIMIZE TABLE can be used on MyISAM, BDB, and InnoDB tables. On MyISAM tables, it repairs deleted or split rows, updates index statistics, and sorts the index pages. For InnoDB and BDB, OPTIMIZE TABLE maps to ALTER TABLE and simply rebuilds the table and its indexes, thereby getting rid of fragmentation, corruption, and incorrect statistics.

    Documentation:
    http://dev.mysql.com/doc/refman/5.0/en/optimize-table.html

    This took 36 minutes on our (14G, 112 tables in 3 databases) server.

    From the documentation:
    In most setups, you need not run OPTIMIZE TABLE at all. Even if you do a lot of updates to variable-length rows, it is not likely that you need to do this more than once a week or month and only on certain tables.

    This morning, we backed up our data. Then we ran ANALYZE TABLE on all of our tables first. Sure, OPTIMIZE TABLE performs the same function, but OPTIMIZE TABLE takes a long time, and we wanted the benefit of ANALYZE TABLE right away. Plus, if anything failed, at least the table’s index statistics are up-to-date. Then we ran CHECK TABLE, and then OPTIMIZE TABLE on each table. We did this while running live, and as I said, the entire process took less than an hour.

    Actually, it took 40 minutes, because once the script I had running to CHECK TABLE was halfway through, I started the OPTIMIZE TABLE script. I specifically set the tables to run through in alphabetical order, so there was no chance of the scripts trying to run on the same table. I will not do that again, I will just run them serially for safety’s sake.
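
    As a sketch of that maintenance pass (database and table names are placeholders; each statement accepts a comma-separated list of tables, and ANALYZE and CHECK take read locks while they run):

    -- after taking a backup:
    ANALYZE TABLE db1.t1, db1.t2, db2.t1;
    CHECK TABLE db1.t1, db1.t2, db2.t1;      -- add EXTENDED for the full consistency check
    OPTIMIZE TABLE db1.t1, db1.t2, db2.t1;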

    One thing MERGE Tables Are Good For

    Many people have some kind of reporting or auditing on their database. The problem is that the data grows very large, and lots of times there is data that can be purged. Sure, theoretically one never needs to purge data, but sometimes a “delete” flag just won’t work — when you search on the delete flag, a full table scan may be the most efficient way to go.

    Of course, that’s not acceptable. And in many cases, say when you have users who no longer use the site but did in the past (and perhaps have billing data associated with them), you never want to get rid of them.

    So what to do? Make a special reporting database, that gathers information from the production database(s). Use MyISAM tables, because a reporting server can afford to be behind the master, and MyISAM is better for reporting — better metadata. For something like a “Users” table, make 2 more tables:

    1) DeletedUsers
    2) AllUsers

    DeletedUsers is where you put information about the users you delete (something like INSERT INTO DeletedUsers SELECT * FROM Users WHERE [parameters of deletion], followed by DELETE FROM Users WHERE [parameters of deletion]) on the master. On the reporting slave, make a MERGE table called “AllUsers” and run your reports from that whenever you need to gather historical data.
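
    A minimal sketch of the reporting-side DDL (column list abbreviated; a MERGE table requires that all underlying tables be MyISAM with identical structure):

    CREATE TABLE DeletedUsers LIKE Users;

    CREATE TABLE AllUsers (
      id int NOT NULL,
      username varchar(50) NOT NULL
      -- ...the rest of the columns, identical to Users...
    ) ENGINE=MERGE UNION=(Users, DeletedUsers) INSERT_METHOD=NO;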

    Thoughts?