Archive for the ‘mysql’ Category

Percona and HandlerSocket

Percona just announced that they’re going to include HandlerSocket in the latest Percona build.

This is huge news. In case you aren’t aware, HandlerSocket is a raw protocol interface that is similar to a dictionary lookup (kind of like memcached, or a hashtable) and does NOT go through the SQL parser.
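As a rough sketch of why it's so fast: a lookup is just a couple of tab-separated lines over a socket. The db/table/column names below are hypothetical and the exact syntax should be checked against the HandlerSocket docs:

```shell
# Sketch of a HandlerSocket read: open an index, then fetch by key.
# Protocol lines are tab-separated; normally you'd pipe these to the
# HandlerSocket read port (9998) with nc.
hs_open_index() {
    # P <index id> <db> <table> <index name> <columns>
    printf 'P\t%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5"
}
hs_find() {
    # <index id> <operator> <num keys> <key> <limit> <offset>
    printf '%s\t=\t1\t%s\t1\t0\n' "$1" "$2"
}
hs_open_index 0 test user PRIMARY user_name,user_email
hs_find 0 42
# to actually run it: { hs_open_index ...; hs_find ...; } | nc dbhost 9998
```

No parsing, no optimizer, no per-statement setup. That's where the speedup comes from.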

The performance implications are huge as this thing flies.

Most web applications access data in terms of key/values. In fact, databases like Redis take the concept a bit further and implement functions you can use to manipulate keys directly.

What’s interesting here is that I think HandlerSocket is bigger than Percona and probably deserves a whole project and open source community behind it directly.

I spent the last couple of days playing with InnoDB page compression on the latest Percona build.

I’m pretty happy so far with Percona and the latest InnoDB changes.

Compression wasn’t living up to my expectations though.

I think the biggest problem is that compression can only use one core during replication and ALTER TABLE statements.

We have an 80GB database that was running on 96GB boxes filled with RAM.

I wanted to try to run this on much smaller instances (32GB-48GB boxes) by compressing the database.

Unfortunately, after 24 hours of running an ALTER TABLE which would only use one core per table, the SQL replication thread went to 100% and started falling behind fast.

I think what might be happening is that the InnoDB page buffer is full because it can’t write to the disk fast enough which causes the insert thread to force compression of the pages in the foreground.

Having InnoDB only use one core / thread to compress pages seems like a very bad idea (especially on 8-16 core boxes, I’m testing on an 8 core box now but we have 16 core boxes in production).

The InnoDB page compression documentation doesn’t seem to yield any hints about when InnoDB pages are compressed or in which thread. Nor do there seem to be any configuration variables we can change in this regard.

Perhaps a ‘compressed buffer pool only’ option could be interesting.

This way InnoDB does not have to maintain an LRU for compressed/decompressed pages. Further, it can read pages off disk, decompress them, and then leave the pages decompressed in a small buffer. Then a worker thread (executing on another core) can compress the pages and move them back into the buffer pool where they can be stored and placed back on disk.

This process could still become disk bottlenecked but at least it would use multiple cores.

It seems to be impossible to perform paged full table scans from the client side in MySQL.

For example, say you want to take a 1GB file and page through it 10MB at a time.

With a flat file I could just read 10MB off disk, read the next 10MB, etc.

This would be amazingly efficient as you could write the data sequentially, without updating indexes, and then read the data back out by byte offset.
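Here's the flat-file baseline sketched with dd (the file and sizes are just for illustration):

```shell
# Page through a file 10MB at a time by byte offset - pure sequential
# reads, no index required.
PAGE=$((10 * 1024 * 1024))                              # 10MB page size

demo=$(mktemp)
dd if=/dev/zero of="$demo" bs=1M count=30 2>/dev/null   # stand-in 30MB "table"

# page 1 = bytes 10MB..20MB; just seek and read
bytes=$(dd if="$demo" bs=$PAGE skip=1 count=1 2>/dev/null | wc -c)
echo "read $bytes bytes for page 1"
rm -f "$demo"
```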

You could page through MySQL tables by adding an index to a table:


for example … but I REALLY want to avoid an index step because it is not cheap and is only required because MySQL doesn’t support cursors.
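To be concrete, the index-based paging I'm trying to avoid looks roughly like this (hypothetical `events` table with an AUTO_INCREMENT `id` column):

```shell
# Generate keyset-paging queries: each batch resumes from the last id
# seen, which needs an index on id just to support the scan.
page_sql() {
    local table=$1 last_id=$2 page_size=$3
    echo "SELECT * FROM $table WHERE id > $last_id ORDER BY id LIMIT $page_size"
}
page_sql events 0 100000
page_sql events 100000 100000
# each batch would be piped to the mysql client
```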

This index slows down the import stage and I have to buffer the index data in memory which is just a waste.

I could use LIMIT with OFFSET, but this isn’t going to be amazingly efficient because it will either require a temporary on-disk table or force us to read each row off disk.

Technically, one could implement OFFSET efficiently with a fixed-width table, since you can compute the byte offset as row_width*N, but I don’t think MySQL implements this optimization.
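The arithmetic the server could do for a fixed-width table:

```shell
# With fixed-width rows, OFFSET N could be O(1):
# byte_offset = row_width * N, then seek straight to it.
row_width=64        # bytes per row (example)
n=1000000           # row number to page from
offset=$((row_width * n))
echo "row $n starts at byte $offset"
```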

Further, my tables are variable width.

If MySQL supported cursors then I could just tell MySQL to perform a table scan and read the data from a cursor.

I’m implementing something similar to a Map Reduce job in MySQL and Java which would work VERY well on Hadoop but is nearly impossible to implement with MySQL efficiently since I can’t page through data without over-indexing my tables.

Thoughts? I would love it if there’s some stupid solution that I’m just missing.

I suspect this will become more and more of a problem as more developers want to just flat out Map Reduce their data.


I was wrong. The solution is HANDLER. Ryan Thiessen nailed it in the comments.
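The HANDLER-based scan looks roughly like this. It reads through the storage engine directly, skipping the parser and optimizer, and pages in batches (the table name is hypothetical):

```shell
# Generate a batched HANDLER table scan.
handler_scan_sql() {
    local table=$1 batch=$2 batches=$3
    echo "HANDLER $table OPEN;"
    echo "HANDLER $table READ FIRST LIMIT $batch;"
    local i
    for ((i = 1; i < batches; i++)); do
        echo "HANDLER $table READ NEXT LIMIT $batch;"
    done
    echo "HANDLER $table CLOSE;"
}
handler_scan_sql events 100000 3
```

The statements have to go down a single connection to the mysql client, since HANDLER state is per-connection.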

Spinn3r will be hosting an Open MySQL meetup at Oracle Open World (which is right down the street).

This would be on Wed 10/14 2009 at 7pm … at 580 Howard Suite 301 (Spinn3r HQ)

Oracle owns MySQL, InnoDB, etc so I suspect a lot of Oracle people and MySQL hackers will be interested in attending more of an Open Source and community centered meetup.

We’ll just be hanging out at our offices … we’ll have beer and food.

Feel free to bring your laptops as we have Wifi :)

This is contingent on at least 10 RSVPs as I want to make sure there is interest from the community.

Please RSVP here

It’s a bit late notice, so if you could help spread the word by blogging about this that would be GREAT!

Has anyone done any more work on recompiling InnoDB with 4k pages and benchmarking under SSD?

We’re building out a new DB that uses very small records (around 32-64 bytes), so reading a whole 16k page per record should make a performance difference.

I haven’t seen any benchmarks on 16k random read IOPS on the Intel SSD but my hunch is that there will be a 20-30% penalty here.

Though even if it were a 4x penalty, that would still be about 9k transactions per second, which is pretty good.
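The back-of-the-envelope, assuming the commonly quoted ~35k random 4k read IOPS rating for the X-25M:

```shell
# Worst-case estimate for 16k reads on a drive rated ~35k 4k read IOPS.
rated_4k_iops=35000   # assumed spec-sheet number
penalty=4             # pessimistic 4x hit for 16k pages
iops_16k=$((rated_4k_iops / penalty))
echo "still roughly $iops_16k reads/sec in the worst case"
```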

On a personal note I just bought a new Mac Book Pro which will be upgraded to the Intel X-25M MLC SSD.

Needless to say I’m very excited!

We’re looking to hire a Senior MySQL DBA over at Spinn3r.

You should obviously have MySQL experience. Love SQL, hate data corruption and slow queries, and preferably live in San Francisco.

Linux experience would be nice as well but not required.

Extra points if you are excited about SSD or *huge* amounts of data, or have hacked on Drizzle or XtraDB.

Spinn3r is a GREAT place to work. We’re growing fast and have cool new offices in SOMA.

Here’s the problem I currently have.

We’re looking at deploying the Intel X-25M MLC SSD in production.

The problem is that this drive has a lower number of erase cycles, but is much cheaper than the Intel X-25E SLC drive.

However, in our situation we’re write once, read many. I’m 99% certain that we will not burn out these drives. We write data to disk once and it is never written again.

The problem is that I can’t be 100% sure that this is the case. There is btree flushing, and binary log issues that I’m worried about…

What would be really nice is an API (SMART?) through which I can enumerate the erase blocks on the drive, determine the max erase cycles, and read the current number of erase cycles.

This way, I can put an SSD into production, then determine the ETA to failure.

I can also add this to Nagios and Ganglia and trend the failure date and alert if the derivative is too high and the drive will soon fail.
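The trending math is simple; all the numbers below are made up for illustration:

```shell
# ETA to failure = remaining erase cycles / daily burn rate.
max_cycles=10000          # per erase block (MLC ballpark, assumed)
used_cycles=40            # read from the hypothetical SMART API
cycles_per_day=20         # derivative from the last 24h of samples
eta_days=$(( (max_cycles - used_cycles) / cycles_per_day ))
echo "$eta_days days until the worst block wears out"
```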

Further, I can figure out if a database design is flawed. If I deploy a new database into production and after 24 hours the ETA to failure is too soon, I know that something is wrong. Either a misconfiguration or a problem with the design.

I think this would solve a LOT of the problems with deploying SSD in enterprise environments. (MySQL, Oracle, etc)

Spinn3r is growing fast. Time to hire another engineer. Actually, we’re hiring for like four people right now so I’ll probably be blogging more on this topic.

My older post on this subject still applies for requirements.

If you’re a Linux or MySQL geek we’d love to have your help.

Did I mention we just moved to an awesome office on 2nd and Howard in downtown SF?

One of the things that has always bothered me about replication is that the binary logs are written to disk and then read from disk.

There are two threads which are, for the most part, unaware of each other.

One thread reads the remote binary logs and writes them to disk; the other reads them back off disk and executes them.

While the Linux page buffer CAN work to buffer these logs, the first write will cause additional disk load.

One strategy, which could seriously boost performance in some situations, would be to pre-read say 10-50MB of data and just keep it in memory.

If a slave is catching up, it could pull GIGABYTES of binary log data from the master and write it all to disk. The reads would then NOT come from cache.

Simply using a small buffer could solve this problem.

One HACK would be to use a ram drive or tmpfs for logs. I assume that the log thread will block if the disk fills up… if it does so intelligently, one could just create a 50MB tmpfs to store binary logs. MySQL would then read these off tmpfs, and execute them.

50MB-250MB should be fine for a pre-read buffer. Once one of the files is removed, the thread would continue reading data.
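The hack, roughly (the paths and the 50MB size are assumptions; this needs root, and the slave should be stopped first):

```shell
# Mount a 50MB tmpfs for the relay logs so binlog buffering never
# touches disk.
mount -t tmpfs -o size=50m tmpfs /var/lib/mysql/relay-logs
# then point the slave at it in my.cnf:
#   relay-log = /var/lib/mysql/relay-logs/relay-bin
```

If the IO thread blocks cleanly when the tmpfs fills, the buffer is naturally capped at 50MB and the logs never hit disk.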

This might be a bit cutting edge, but the new fallocate() call in Linux > 2.6.23 might be able to improve InnoDB performance.

When InnoDB needs more space it auto-extends the current data file by 8MB. If this is writing out zeros to the new data (to initialize it) then using fallocate() would certainly be faster.
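A sketch of the idea using the fallocate(1) wrapper from util-linux (the file name is made up; InnoDB itself would call the syscall directly):

```shell
# Allocate 8MB of file space without writing 8MB of zeros.
f=$(mktemp)
fallocate -l $((8 * 1024 * 1024)) "$f"   # blocks reserved, no data written
size=$(stat -c %s "$f")
echo "extended to $size bytes"
rm -f "$f"
```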

Apparently, XFS supports this too but needs an ioctl. XFS could support fallocate in the future as well…

I’m enamored by the middle path.

Basically, the idea is that extremism is an evil and ideological perspectives are often non-optimal solutions.

The Dalai Lama has pursued a middle path solution to the issue of Tibetan independence.

The two opposing philosophies in this situation are total and complete control of Tibet by the Chinese or complete political freedom by the Tibetan people and self governance. The Dalai Lama proposes an alternative whereby China and Tibet pursue stability and co-existence between the Tibetan and Chinese peoples based on equality and mutual co-operation.

How does this apply to Linux?

The current question of swap revolves around using NO swap at all or using swap to page out some portion of your application onto disk.

Even with the SplitLRU patch we’re still seeing problems with Linux swap.

The entire question is Mu

The solution isn’t to disable swap. The solution is to use a SMALL amount of swap (100-200MB) and monitor your application to detect when it is swapping and then tune back the memory usage of your application.

The difficulty is that you often want to use the maximum amount of memory in the system. Imagine a theoretical system that efficiently uses 100% of system memory. An overcommitted application might allocate a BIT more and cause the OOM killer to kick in.

This would be horrible. Instead, why not just page out 10-20 bytes? It’s not going to have devastating effects on the system, and if you’re monitoring it you can decide to tune back your memory usage by 100MB or so.

We have this happen from time to time with MySQL. We allocate slightly too much memory only to have the OOM killer kick in and kill MySQL – not fun.

Using a large swap file isn’t the solution either. If you overtune you can swap out a large portion of your application. This can essentially yield a denial of service as the load on your server becomes so high that you can’t even SSH in after hours of usage. The only solution is to reboot the box.

Using a small swap file avoids this. If you’re swapping more than say 50MB you can have your monitoring system kick in and alert you before the limit is hit and your OOM killer kicks in.
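The check itself is trivial; the thresholds below are assumptions, and the live numbers would come from /proc/meminfo (SwapTotal - SwapFree):

```shell
# Alert once swap used crosses a threshold, well before the OOM killer.
check_swap() {
    local used_kb=$1 threshold_kb=$2
    if [ "$used_kb" -gt "$threshold_kb" ]; then
        echo "WARN: ${used_kb}kB swapped - tune memory down"
    else
        echo "OK: ${used_kb}kB swapped"
    fi
}
check_swap 10240 51200   # 10MB used against a 50MB threshold
check_swap 76800 51200   # 75MB used - time to tune back
```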

We’re experimenting with this idea this week and I’m optimistic about the results.

I’ve been thinking about SaaS (in the form of Spinn3r) and how it relates to Open Source for the past few months and I think I’ve come to some interesting conclusions. I think SaaS might be a strong competitor to Open Source in that it’s cheaper and higher quality in a number of situations.

Apparently, I’m not the only one:

Open source is always driven by some organisation – a central body that leads community development efforts to support developers and build revenue streams. In essence, that body gives away the base code and knowledge of the community version to encourage development of the service and expand distribution; to make the software go ‘viral’.

However, I believe software as a service (SAAS) has undermined this model.

SAAS offers ready access to beautifully crafted applications and services through the browser for little or no initial cost. These applications supersede centrally-held open source projects since a. they are finished products (rather than base codes, which must be developed into end-user services) and b. can be easily found, used and shared by the end users of the application/service.

More thoughts on the subject are floating in the blogosphere as well:

Opensource tends to build passionate users that consider themselves, to a certain extent “owners” and “developers” of the product in question. These communities tend to be rabidly loyal and have a tendency towards evangelisation. This is clearly a hugely powerful aspect of OSS and should be harnessed.

SaaS on the other hand tends to build networks or communities of individuals that share a commonality – be it use, interest whatever. SaaS users tend to be loyal to a point, but not nearly as loyal as opensource-ers.

To a certain extent SaaS enterprises have attempted to create the opensource level of community by embracing the concepts of beta-testing and user feedback and development. This however has been reasonably limited (mainly due to the fact that opensource is free, at some point a free beta-test of a SaaS product will generally swing over to a subscription based service).

My experience running Spinn3r has led me to similar conclusions.

First, we don’t compete with Open Source crawlers in our interactions with customers. Why? They’re amazingly expensive in comparison.

We run a cluster of 20-30 boxes and handle all aspects of running the crawler. We’re about 1/10th of the cost vs doing it yourself since we can amortize our operations across dozens of customers.

In our situation, Open Source isn’t free. It’s 10x the cost of using Spinn3r. It seems counterintuitive, but TCO really comes into play here.

Second, we’re profitable and have no problem paying our developers, buying hardware, outsourcing development, buying tools, etc. Open Source (at least in its purest form) has traditionally had problems raising capital and has often depended on the patronage model. What’s worse, if they follow the MySQL/RedHat model they often put themselves at odds with their original community, which can lead to tension.

This isn’t to say that Open Source is going to go away. We’re big fans of Open Source. Most of our architecture is OSS. Heck, even our reference client is Open Source.

It just seems that SaaS is going to grow to push Open Source out of certain areas due to price, efficiency, and quality issues.

In the end I think this is good for the market and the industry as a whole.

Certainly, our customers are very happy.

PS. As an aside, I’ve always felt that free market economics and Open Source go hand in hand. When I was doing NewsMonster (which was both Open Source and Free Software) I would joke that it wasn’t “free as in beer”, it was “free as in $29.95.”

We made it easy to checkout NewsMonster directly from CVS and build your own version if you wanted but if you wanted the easy one click install (which included support) then you needed to pay $29.95. Most of our users (99%) opted to pay…

The more databases you have in your cluster the greater the probability they’re going to fail.

It’s basically MTBF/N until you have one of these boxes crash. If you have 10k machines expect multiple failures per day.
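The arithmetic, with an assumed per-box MTBF:

```shell
# Expected failures per day across a cluster = N / MTBF.
mtbf_days=1000    # assumed per-box MTBF (~3 years)
n=10000           # machines in the cluster
failures_per_day=$((n / mtbf_days))
echo "$failures_per_day failures/day expected"
```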

If you’re performing realtime writes to these databases you can lose a DB in the middle of a write.

Now what? Do you lose transactions?

Most people using InnoDB/MyISAM have used write caching controllers to solve this problem.

You buy this expensive card with 128MB of memory which temporarily caches writes. The card has a battery, so if you lose power the cached writes survive until the disks spin back up and the data can be written out.

But as I mentioned before, these devices are expensive. Expect them to add another 20% on the price of your cluster.

This doesn’t sound like a ton of cash if you have one or two machines but if you’re buying 50-1000 it’s a significant chunk of change.

Another way around this is to make sure your write hits at least two nodes (three if you’re paranoid).

If you’re using a sharded database you might have 2-3 replicas per shard.

What you do is then make sure your transaction made it to one more node.

This is essentially what Google does. They have a distributed write ahead log for an entire shard.

If one of their replicas in a shard fails the transaction is on two more boxes.

For Google, saving that extra 20% at their scale is a LOT of cash.

Hopefully the semi-sync patch is isolated soon. This means you can now use the same fault recovery mechanism. It doesn’t matter if a master fails as long as you have semi-sync, have the transaction data already on another box, and bring a new replica online quickly.

Now we can shave these write caching controllers out of our budget.

The main question is what type of performance hit will we take?

Fortunately, InnoDB has a few variables you can set.

If you set innodb_flush_log_at_trx_commit to zero then InnoDB will flush its log less often. Meaning more data is buffered (but remember it’s already replicated to a slave).

Another is sync_binlog… You should set this to zero. This way the binary logs aren’t flushed on every commit.

The main question is which is faster:

– InnoDB with innodb_flush_log_at_trx_commit=1 and sync_binlog=1 WITH a write caching controller


– InnoDB with innodb_flush_log_at_trx_commit=0 and sync_binlog=0 WITHOUT a write caching controller but with semi-sync.

I haven’t had time to run the benchmark but I suspect that the write caching controller is faster. By how much is my main question.

Of course you’re going to need the global transaction ID patch too. Without this you can’t really run three replicas per node.

Update :

Some more thoughts.

Yes. You can lose an entire datacenter and then lose transactions. That’s why you need more than one datacenter. We’ve lost power three times this year (it isn’t fun).

While InnoDB would probably perform faster on BBU hardware, a Bigtable-style or log-structured DB without fsync would write at almost the full disk head write speed (80-120MB/s).

I still want a hybrid flash/supercapacitor based controller. Hopefully on the drives. Then I can just power off the drives and the DRAM cache on the drives is written quickly to the flash.

Does this exist yet? HDDs are going to be around past 2010 for bulk storage (TB-sized arrays), so I think they’re worth the investment.

They should only add $20-50 to the cost of your disks so probably worth the investment at that point.

The Sun Fishworks guys were nice enough to invite me for a demo of their new 7000 series storage device.

We bang the heck out of our IO systems here at Spinn3r so having more options is always welcome.

Bryan Cantrill, one of the original DTrace developers, worked on this bad boy so there’s obviously going to be an emphasis on performance analysis.

This is one of the main competitive advantages of the 7000 series.

Out of the box you have a full admin console for performance tuning. It doesn’t stop at just raw IOs because they’ve instrumented it with a bunch of dtrace scripts.

You can view IOPS per file, CPU, make runtime tuning and configuration changes. Basically, the entire device can be monitored and configured with just a few clicks.

Apparently, they can’t really open source this thing because it hooks too far into OpenSolaris. I’m very sympathetic to this problem though. At Spinn3r we have a TON of infrastructure I would rather Open Source but I can’t because it’s too integrated into our full stack.

Of course I don’t have one of these guys in production. The devil is ALWAYS in the details. In the past I’ve thrown hardware I was really excited about into production only to have it fail a week later.

If you’re already an OpenSolaris shop this is probably a really easy purchase. Linux? Maybe not so much. If you’re on MySQL it won’t really be able to introspect your InnoDB buffer pool. It might be able to look at the individual files which would be nice.

In Linux you can run lsof but I really want an iotop so that I can see which files are being used and which are getting iops and where. It seems you can do this with the 7000 but a curses based app would be nice as well.

This thing also has flash which is nice. It’s basically meant as a front end read cache. Blocks are moved from the disk and written sequentially to the flash to satisfy reads.

Looks like their flash is pretty much OEM STEC. A stock Zeus IOPS drive at 16GB. Apparently, they’re the only ones that actually work.

It’s still not what I really want for our in-memory database. I want it to be flash all the way and be able to handle more than 10k random write IOPS.

The Fusion IO stuff is interesting but the driver isn’t Open Source. The Micron stuff looks awesome as well.

These guys are either going to have to go open source or standardize on a new Flash-aware SATA/SAS/FC interface.

Destroying MySQL


This post hit Reddit and now has 10k views. This post was written for the MySQL community who pretty much already read my blog and know that I support MySQL (we use it in production, document bugs, contribute patches, etc).

While Postgres is a great database, MySQL still has a number of features that Postgres doesn’t have. One good example is the use of multiple storage engines (PBXT, InnoDB, MyISAM, etc).

The entire point of this post was to encourage Sun/MySQL to focus on the quality and stability of their releases.

Everyone I talk to who works for LARGE MySQL installations is complaining about the lack of quality and scalability. These problems need to be fixed not routinely swept under the rug.

Original post:

Let’s say we wanted to destroy MySQL. Basically, make sure no one EVER uses it again.

Further, we want to simultaneously con as many people as possible into falling into the MySQL trap and releasing products, only to have MySQL crash and dump core. (This would have the added benefit of completely destroying the MySQL brand as new developers complain how they migrated to MySQL 5.0 only to have it crash.)

Well, I would perform the following:

– Release MySQL 4.1 with crash bugs. Further, release features like subqueries that are pathologically designed to not actually use any indexes.

– When the community complains, promise to never do it again and that you’ve learned your lesson. Spend the next 6-12 months fixing these bugs.

– Release MySQL 5.0 with crash bugs. Basically, cram as many features as possible and only begrudgingly accept patches.

– When the community complains, promise to never do it again and that you’ve learned your lesson. Spend the next 6-12 months fixing these bugs.

– Further, release TWO versions of MySQL and screw over your original community by keeping the latest and greatest releases from them. Also, screw over your new community by making sure the new releases aren’t tested.

– Spend years talking about the quality of the next release and the massive new (and awesome) feature set.

– Slip your release date by 12 months.

– Allow crash bugs to remain in the new release

– The new features you were talking about? Yeah. They should be beta quality at best. Basically, you want them to try the new features but have them crash when deployed into production.

– Further, new features you’ve been talking about this entire time (row based replication), shouldn’t be enabled by default because they’re unstable.

– When the community complains, promise to never do it again and that you’ve learned your lesson. Spend the next 6-12 months fixing these bugs.

This might be a little harsh but I’m trying to make a point.

I don’t think MySQL deliberately set out to accomplish this goal but they seem to be doing a good job.

MySQL is trying to ship features so they can sell to the market. It’s a catch-22: you’re not going to be able to sell an unstable product.

Further, these are anti-features for my company.

I’m not going to be buying MySQL Enterprise anytime soon. Further, I’m actively discouraging people from purchasing it…

Why? It’s just a bad deal. They’re going to give you a binary which isn’t tested by the community and has limited development. How is this a good idea?

I’m going to be throwing down $20-30k in 2009 on MySQL development but it isn’t going to be given to Sun, MySQL, or Oracle. It’s going to be given to companies like Percona, Open Query, or Proven Scaling that actually care about their customers and release stable fixes to MySQL community.

We’re still on MySQL 4.1. We’re just now migrating to 5.0. Why? Because we depend on performance improvements and fixes present in the community fork of 5.0.

These are not features/patches that are present in official Sun/MySQL binaries.

The Percona and OurDelta builds are good examples. They include VERY important patches that are not present in MySQL 5.1 despite years of development.

About a year ago I needed a partitioning engine. I played with MySQL 5.1’s partitioning engine which looked decent but it crashed when put it into production.

I gave them the benefit of the doubt because it wasn’t GA yet. The bug that caused our crash – yeah, it’s still not fixed.

I waited two months and decided that a partitioning engine was too important to the success of my company to wait around considering MySQL’s poor history.

We now have our own partition engine. It’s in production, reliable, and serving our customers. It’s also stable and has more features than the stock MySQL partitioning engine.

This is a recipe for failure.

I have faith that MySQL can turn the boat around, but they’re going to need to totally halt development for a year and concentrate on stability.

Heck, if you make the right decisions, I might even become a customer!

Desirable InnoDB Features

These are a few features I want to see implemented in InnoDB this year.

We need them for production so we’re probably going to throw down some cash to see this happen. The patches will have to be Open Source of course.

If there are any companies out there that would like to co-sponsor these features feel free to add a note in the comments or contact me via email.

Native InnoDB warmup

We have about 100GB of InnoDB which needs to run 100% out of memory. Our per-DB boxes have 32GB of RAM and the databases on them fit into about 25GB (leaving a bit of room for growth).

Right now we have a hacked InnoDB warmup script that runs a number of SELECTs into a blackhole table, which warms up the DB.

This MUST be amazingly inefficient, as InnoDB could just scan the DB, moving pages into memory as it reads forward. One sequential scan of the table would be a LOT faster than our current random reads. Right now it takes us about 1.5 hours to warm up a 25GB database image.
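For the record, our hack boils down to generating statements like these (the blackhole table and the table names are site-specific):

```shell
# Emit warmup statements: force a full read of each table by inserting
# into a BLACKHOLE-engine table created up front.
warmup_sql() {
    local table
    for table in "$@"; do
        echo "INSERT INTO warmup_blackhole SELECT * FROM $table;"
    done
}
warmup_sql content feeds permalinks
# piped to the mysql client against the target database
```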

This is slightly related to a persistent and restored LRU buffer pool which Jeremy Cole suggested but different enough to suggest a separate architecture.

Faster recovery

Right now InnoDB recovery is DOG slow and 100% CPU bound. Apparently, this is due to a poorly designed algorithm that does a full scan of a linked list on recovery.

I need to see this fixed because when a DB crashes I need to get it up and online ASAP.

This is also needed to improve our usage of InnoDB hotcopy for performing slave clones. We take a snapshot of a running InnoDB database and perform recovery on a slave with the raw binary image. Works great other than the fact that recovery takes ages.

Improved Operation for 100% Memory Databases

If we improve the performance of recovery, there exists some very amazing potential for ULTRA fast performance when running entirely out of memory.

One would use LARGE write ahead logs (20-50GB). Enough to hold a few hours of transactions.

Then we could enable a mode for InnoDB to disable fuzzy checkpointing and instead continually write the full DB image sequentially to disk. The DB would be continually doing a full checkpoint and deleting the previous database image. If a box crashed the DB would just be restored from the write ahead logs.

The write transaction throughput in this situation would be seriously improved. In theory this would be about 1/2 the full sequential throughput of the drive. Fifty percent of the write performance would go to the WAL and the other 50% would be to the data files.
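The 50/50 split in numbers, assuming a 100MB/s sequential drive:

```shell
# Every byte is written twice: once to the WAL, once to the data files,
# so net throughput tops out at half the sequential speed.
seq_mb_s=100                    # assumed sequential drive throughput
wal_share=$((seq_mb_s / 2))     # MB/s going to the write ahead log
data_share=$((seq_mb_s / 2))    # MB/s going to the data files
echo "net: ${wal_share}MB/s of new transaction data"
```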

This would make InnoDB semi-log structured.

Another alternative is to just continually flush starting from the beginning of the buffer pool and on to the end. Then updating the WAL checkpoint so that the oldest block seen by the flush thread is cleaned up.

You would need to be 100% certain that your logs were sized large enough to store plenty of transactions so that it doesn’t force a checkpoint.

It could also completely eliminate InnoDB non-contiguous data and disk fragmentation.

Of course this might be a completely insane idea and perhaps I should just be using a log structure storage engine to begin with.

Violin’s Flash Device

Well this is awesome:

The flash version has an eight times larger capacity of 4TB, starting at 320GB, and consists of single level cell (SLC) NAND flash. Its latency is around 23 times slower at 70 microseconds. It supports more than 100,000 sustained random write IOPS and 200,000 read IOPS (4K blocks) and can do so for ten years. According to Violin, users would need 500 15,000 rpm Fibre Channel drives to deliver this level of performance.

The biggest barrier to running MySQL/InnoDB on a device like this is that it’s going to be 100% CPU bound.

My advice to the guys at Fusion IO or Violin is to give/loan one of your lower end machines to Percona, MySQL AB, or even just reach out to one of the alpha geeks.

Of course over time this stuff will just get fixed.

Your devices are going to be CPU bound at first, but with motivation they will quickly fix the problem and you’ll win more customers.

We’re migrating from MySQL 4.1.x to 5.0.x at work and one of the key features we need is the ability to freeze InnoDB and prevent it from writing to disk.

We do this to aid in syncing masters and slaves and performing backups.

Basically we freeze a master, copy the data to a new slave, unfreeze the master, bring up the new slave, and then setup replication to start from right after we froze the master.

It’s MUCH faster than performing a mysqldump (20x faster). A mysqldump tends to both do a ton of random seeks on disk as well as burn up a single core.

This type of ‘innodb hot copy’ is only bottlenecked on disk and gigabit ethernet IO.

In theory you can sync at 125MB/s…
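Where the 125MB/s comes from, and what it means for a 25GB image (wire-speed best case, ignoring protocol overhead):

```shell
# Gigabit ethernet = 1000Mbit/s = 125MB/s of payload, best case.
mbit=1000
mb_per_s=$((mbit / 8))
image_mb=$((25 * 1024))            # 25GB slave image
secs=$((image_mb / mb_per_s))
echo "clone a 25GB slave in about $secs seconds"
```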

David ended up spending the time to isolate, test, and retarget the patch for various MySQL versions.

We tested this and one of the interesting properties is that it can actually continue to execute UPDATES as long as you’re using innodb_flush_log_at_trx_commit=0.

Anyway, I’d really like to see this land in Drizzle and in MySQL 5.0.69+.

It only modifies eight lines of code, so it seems like a no-brainer.

Up until this point we were using xfs_freeze but ran into a nasty bug where read() would block during a file copy. Apparently, xfs_freeze was designed to block for writes but not for reads.

My theory is that there’s a bug or a race condition between sync and xfs_freeze which is leading to a partially dirty page cache preventing us from reading while the filesystem is frozen.

Pretty InnoDB Buffer Pool Stats

The output from SHOW INNODB STATUS isn’t very pretty or easy to use. Seriously, other than Heikki who thinks in 16k block sizes? :-)

A few lines of bash magic fixes this problem.

Now I can quickly see buffer pool stats in the following format:

buffer pool size: 27999076352 bytes (27G)
used: 16563568640 bytes (16G) (59.00%)
modified db pages: 4747952128 bytes (4G) (16.00%)

Here’s what the code looks like:


# Pretty print InnoDB buffer stats.
# SHOW INNODB STATUS looks like:
# Buffer pool size   1708928
# Free buffers       1142066
# Database pages     565676
# Modified db pages  123467

# humanize a byte count
pp() {
    local value=$1
    if [ $value -gt 1000000000 ]; then
        value=$(expr $value / 1000000000)G
    elif [ $value -gt 1000000 ]; then
        value=$(expr $value / 1000000)M
    elif [ $value -gt 1000 ]; then
        value=$(expr $value / 1000)K
    fi
    echo $value
}

# percentage of $1 over $2, two decimal places
perc() {
    local nr=$1
    local total=$2
    echo "scale = 2; ($nr / $total) * 100" | bc
}

stats=$(echo "SHOW ENGINE INNODB STATUS\G" | mysql -N)

#echo "$stats"

# don't let bash screw up multiline parsing (I hate this bug) --
# $stats has to be quoted below so the newlines survive, and noglob
# keeps the * in expr from globbing.
set -o noglob

buffer_pool_size=$(echo "$stats" | grep -E '^Buffer pool size' | grep -Eo '[0-9]+$')
buffer_pool_size=$(expr $buffer_pool_size \* 16384)

free_pool_size=$(echo "$stats" | grep -E '^Free buffers' | grep -Eo '[0-9]+$')
free_pool_size=$(expr $free_pool_size \* 16384)

modified_db_pages=$(echo "$stats" | grep -E '^Modified db pages' | grep -Eo '[0-9]+$')
modified_db_pages=$(expr $modified_db_pages \* 16384)

used=$(expr $buffer_pool_size - $free_pool_size)

echo "buffer pool size: $buffer_pool_size bytes ($(pp $buffer_pool_size))"
echo "used: $used bytes ($(pp $used)) ($(perc $used $buffer_pool_size)%)"

echo -n "modified db pages: $modified_db_pages bytes "
echo -n "($(pp $modified_db_pages)) "
echo    "($(perc $modified_db_pages $buffer_pool_size)%)"


I want a new command in MySQL/InnoDB to freeze all filesystem activity in InnoDB… until I unfreeze it.

Why? I want mysqlhotcopy semantics for InnoDB.

Right now you can run FLUSH TABLES WITH READ LOCK and then xfs_freeze but it’s not very efficient or easy to code, and it’s pretty fragile.

INNODB FREEZE should not require a connection to remain open and should require an INNODB UNFREEZE to return IO to its normal state.

Further, when you copy away an underlying InnoDB database to a new server, crash recovery should NOT be required.

What really bothers me about the recent desire for MySQL features (triggers, etc) is that these types of features mean a LOT more to me. Operational competence in MySQL is very important and it seems the MySQL developers often forget this…

Drizzle and the InnoDB plugin are starting to show the way though….


I also want the ability to tell InnoDB to import a tablespace from a given .ibd created with innodb-file-per-table REGARDLESS of transaction identifiers and other information.

I know the table is from another server; just import it already…