It has been several weeks since my last DB roundup (it was epic). We have been busy and have gotten some great, productive work done.
- Upgraded the databases behind our critical bouncer application.
- Put the bouncer databases into puppet.
- Started running pt-table-checksum regularly on the bouncer, addons, and support databases, with monitoring via our updated version of PalominoDB’s check_table_checksums Nagios check.
- Updated data in the datazilla database.
- Eliminated one of our legacy puppet MySQL modules.
- Started using pt-config-diff to detect differences between MySQL’s running configuration and its configuration file. Right now we e-mail when there is a difference; the next step is to turn this into a Nagios check.
- Created a database for an internal repository related to desktop encryption.
- Updated our developer tools master/master cluster to use auto_increment_increment and auto_increment_offset properly, via puppet.
- Added a new node to the graphs database. Also added the nodes for the panda board project.
- Debugged a problem with the plugin check databases due to our Kohana ORM issuing unnecessary SHOW COLUMNS queries upon every connection.
- Completed the cleanup of the MySQL and PostgreSQL ACLs to ensure no legacy ACLs remained from our data center move.
- Gave Selena and Paula (a metrics dashboard implementer) access to Postgres databases, as well as access for a Django role account.
- Dealt with two of our backup systems starting to run out of space; we are spinning up new VMs to accommodate the increased disk footprint.
- Did our quarterly crash-stats purge.
- Applied a fix to all our Postgres databases that corrected a possible data corruption issue.
- Created a database and users for the moztrap test case system.
- Increased max_allowed_packet on the Bugzilla databases when a mass update of 500 bugs caused the application to display the “Got a packet bigger than ‘max_allowed_packet’ bytes” error.
- Manually refreshed the graphs staging database with production data.
- Fixed a bug in a script that caused a file to copy over the script itself.
- Added an ACL in puppet for the crash-stats dev database.
- Ran a manual backfill on the crash-stats database when a weekly cron job failed.
- Added an index to a table in the Tinderbox pushlog databases, in dev and production, so that data purges go more quickly.
- Completed upgrading the drivers for all of our Fusion-io SSDs, making MySQL on those machines faster.
- Upgraded our stage database to MySQL 5.1, and converted it to be innodb_file_per_table (almost everything else in the Mozilla ecosystem is innodb_file_per_table).
- Did our monthly Tinderbox pushlog purge (sadly, the tables use foreign keys, so we cannot automate this with partitioning).
- Upgraded half of the webdev database cluster to MySQL 5.1 and put it under puppet control.
- Created an external database for our Vidyo installation to use, and imported data from our existing embedded Vidyo database.
- Put the checksumming wrapper script into puppet, which has made it much easier to deploy checksumming to more systems.
- Resynced a slave cluster of addons that is only used for checking versions. Ran into a tricky problem with puppet: each replicate_wild_do_table and replicate_do_table entry had to appear on its own line in the my.cnf, such as:
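The my.cnf entries in question look something like this (the table names below are illustrative placeholders, not our actual schema):

```ini
# Replication filter options do not accept comma-separated lists;
# each table or wildcard pattern must be listed on its own line.
replicate_do_table = addons.versions
replicate_do_table = addons.applications
replicate_wild_do_table = addons.compat_%
```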
But we solved the tricky config problem by putting the values in a puppet array that gets split into individual entries. Yay puppet!
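As a rough sketch of that fix (variable and file names here are made up for illustration): the manifest passes the replication filters as an array, and the my.cnf ERB template loops over it to emit one option per line:

```erb
<%# my.cnf.erb sketch: expand an array of replication filters,
    e.g. @replicate_do_tables = ['addons.versions', 'addons.applications'],
    into one replicate_do_table line each %>
<% @replicate_do_tables.each do |t| -%>
replicate_do_table = <%= t %>
<% end -%>
```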
- Also resynced the addons staging cluster.
- Fixed some problems with malformed UTF-8 in a few Bugzilla bugs.
- Lowered the innodb_buffer_pool_size on a cluster so it would stop continually swapping.
- Retired “Rock Your Firefox” (no link needed, it’s retired!)
- Added permissions to the new Bugzilla fields for metrics.
- Removed over 9,000 buildbot builds from a MySQL-based scheduling queue that had gotten stale.
- Started creating documentation linked from Nagios alerts, to streamline our responses to pages: in particular, actionable steps our site reliability engineers (SREs, who are on call) can take until a DBA can get to a terminal.
- Dealt with interesting undo log corruption on our dev database cluster, which left a large number of tables in an odd state: they could be read from and written to, but not dropped (some could be truncated, others could not).
- Worked on getting the Percona Toolkit installed on our puppetized machines.
I have tomorrow off for some personal fun (a sheep and wool festival), so I figured I would publish this today, lest I go another week without posting!