Using O_DIRECT on Linux and INNODB to Fix Swap Insanity

When we migrated to 64bit Debian Etch about two months ago we noticed that some of our database boxes were paging even though they had plenty of memory to perform the tasks at hand.

The boxes had 8G and we told INNODB to allocate 7G to the buffer pool.

The problem is that after using about 6G the machine would start to swap constantly. Swapping out 1G of ram to disk is NOT going to be fast.

I couldn’t figure out what was going on. The swappiness setting was set to zero. As far as I was concerned the OS should NOT start to use the swap file until virtual memory uses exceeds the amount of physical memory. Boy was I wrong.

What appears to be happening is that the Linux kernel decides that the filesystem cache is more important than your process. It then makes the brilliant observation that sticking your process in swap, so that it can allocate more memory to the filesystem cache, is the right choice for performance.

With INNODB this is INSANITY because it’s ALREADY buffering the data.

Duplicating the problem was easy. I just created a new blackhole table and inserted data into the table.

This tricks MySQL to buffer all the rows in the buffer pool.

The second trick is to convince MySQL to also buffer the indexes. This can be done with:


INSERT INTO FOO_BLACKHOLE
SELECT * FROM FOO FORCE INDEX (MY_INDEX)
WHERE MY_COLUMN IS NOT NULL ORDER BY MY_COLUMN ;

One trick is to tell INNODB to open the file with O_DIRECT:

with O_DIRECT the kernel will do DMA directly from/to the physical memory pointed [to] by the userspace buffer passed as [a] parameter to the read/write syscalls. So there will be no CPU and memory bandwidth spent in the copies between userspace memory and kernel cache, and there will be no CPU time spent in kernel in the management of the cache (like cache lookups, per-page locks etc..).

Here’s what it looks like before running our tests:

200709291314

EVIL! 102M in swap even though the box has 1.2G free.

Now check this out:

200709291316

Awesome! The box is NOT swapping and still has 500M free to run other processes. I couldn’t be happier!

If you’re on MyISAM or another storage engine you might be out of luck.

This is pretty much insanity on the part of the Linux kernel. This needs to be fixed.

Update

This is Linux 2.6.18…

Update 2

Another way to solve this problem is to run without swap. The problem here is that if we go one byte above the amount of physical memory the system uses the OOM killer will kick in.

Of course the OOM killer is probably going to kill mysqld since it’s using so much memory.

While some people are running without swap it just seems like a bad idea. Basically you’re walking around with a gun pointed at your head waiting to go off if you accidentally sneeze.

Update 3

I was wrong. It looks like O_DIRECT will bypass the page cache for READs but NOT for writes. I ran another benchmark doing a mysqldump and restore and the system started paging.

Hm.

Update 4

I’m still skeptical about the stability of memlock. There are a few MySQL notes about it being unstable on Linux but no specific details.

I can only conclude that not many people are using this setting.

I’m going to run a slave in replication but keep it out of production for a few days to see if anything crashes.

Update 5

We’re running memlock along with O_DIRECT on one of our production servers right now and so far no crashes or swap file activity.

I think this is the way forward on this issue.

Jeremy Cole noted that it’s acceptable to page some of the innodb buffer pool but I don’t agree. The buffer pool should simply remove blocks that are inactive before paging is necessary.


  1. vvhung

    Another way to get rid of swap is using Linux HugePage, especially when you have very large memory.
    Use MySQL option: –large-pages to enable it using HugePage.

    There’s a awesome reference here: http://www.puschitz.com/TuningLinuxForOracle.shtml

  2. Hm…. yeah. I’ve been looking at large pages. Why would these bypass the page cache?

  3. diegobel

    GREAT!

    I have EXACTLY the same problem and for months I have to reboot the process twice a month!

    Now, after reading some replies on this page: http://www.mysqlperformanceblog.com/2007/10/26/heikki-tuuri-innodb-answers-part-i/ at the Q19, I was looking for more information about memlock, to try to solve the problem.

    If you look in search engine it doesn’t seems to be a common problem… anyone know the reason? I’ll try this way and let you know!

  4. angus

    you should restart your bench w/ the new SplitLRU kernel patch (available in latest 2.6.28 linux kernel) with or without O_DIRECT & libaio






%d bloggers like this: