Estimating Performance Gains with Smaller Tables

Tailrank has a few large tables in our MySQL database that could benefit from a bit of truncation for (hopefully) some performance gain. The theory goes that if you have less data you’ll have a faster system. The relationship isn’t linear, of course: delete enough data and MySQL will eventually buffer the whole table in memory, at which point you’ll see an order of magnitude more performance.

All things being equal, though, what kind of performance boost would you get for SELECTs if you were to take a 20M-row table and truncate it by 80%?

The answer is that you’d only receive about a 9.6% performance boost. Since btree lookups are O(log N), you can compute the boost with the equation:

((log(N) - log(N_after)) / log(N)) * 100
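
Here’s that estimate as a quick Python sketch (the 20M/4M row counts are just the example above; note the ratio comes out the same for any log base):

```python
import math

def select_boost_pct(n_before, n_after):
    """Estimated SELECT speedup from shrinking a btree-indexed table,
    assuming lookup cost is proportional to log(N)."""
    return (math.log(n_before) - math.log(n_after)) / math.log(n_before) * 100

# A 20M-row table truncated by 80% leaves 4M rows:
print(round(select_boost_pct(20_000_000, 4_000_000), 1))  # 9.6
```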

This boost obviously isn’t worth the hassle, so we’re just going to migrate to a new table schema, which should greatly improve performance.


  1. I have had a very different experience with MySQL and table size. With GrabPERF, the performance of the main page can vary dramatically based on the number of rows and distinct measurements it has to consider to build the Top/Bottom 20 tables.

    Now, I don’t claim to be the best MySQL query or database optimizer in the world, but when I culled out a large number of sites being measured 10 days ago, the generation of the Top 20 tables dropped to under one second.

    Now that people have contacted me and said that they were actually looking at the data, I re-instated a number of measurements. Table generation moved back into the 1.4-1.5 second range.

    I think that query optimization (Hey! This EXPLAIN thing is really cool!), proper indexing, and proper table construction are key to ensuring that queries remain fast as your database grows. (There’s a short EXPLAIN sketch after the comments.)

    smp

  2. The question you REALLY answered in your post is “how much shorter is the path from the root to the leaf nodes in an index.” That is not at all the same thing as “what kind of performance boost would you get for SELECTs.”

  3. Kevin,

    This is the right estimate in theory, though you should consider the base of your log(), which is often quite large since many keys fit in the same page.

    In practice caching comes into play, which changes a lot. It is wrong to say it just happens all at once – that your table was not fitting in cache and now it does. Normally you get a gradual efficiency change which depends on your working set – i.e. it was 100% coming from cache, later it becomes 95% – and this could happen when your data is 10% more than the available cache, or just as well when it is double it; the same holds for many workloads.

    The other interesting thing for an IO-bound workload: you normally need 2 IOs to retrieve a row (1 for an InnoDB clustered key) – one to read the key block at the bottom of the btree and one to read the row itself.

    All upper levels of the btree will be fully cached except for extremely large tables, as they are quite likely only a few percent of the index size.

  4. Xaprb:

    “The question you REALLY answered in your post is ‘how much shorter is the path from the root to the leaf nodes in an index.’ That is not at all the same thing as ‘what kind of performance boost would you get for SELECTs.’”

    Hm… I don’t think so. The search cost in a btree is O(log N), which is *also* the depth from the root to the leaf nodes.

    So the performance boost is a function of that search cost.

    But I think Peter might have a point. There’s probably a function to compute the reduced disk seeks due to the smaller data size and denser btree – I just don’t know what it is. (There’s a rough sketch of one after the comments.)

    Kevin
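
Pulling Peter’s and Kevin’s points together, here is one plausible way to sketch that “reduced disk seeks” function. It’s only a rough model under assumed numbers: the log base is taken as the page fanout (the 500 keys per page is an assumption, not a measurement), the upper levels of the btree stay cached as Peter describes, and a secondary-index lookup pays one extra read for the row itself:

```python
import math

def btree_depth(n_rows, fanout=500):
    """Approximate btree depth when each page holds `fanout` keys.
    fanout=500 is an assumed figure; real fanout depends on key size."""
    return max(1, math.ceil(math.log(n_rows, fanout)))

def lookup_page_reads(n_rows, fanout=500, cached_levels=2, clustered=False):
    """Rough count of uncached page reads for a point lookup, assuming
    the top `cached_levels` of the tree sit in the buffer pool and a
    secondary-index lookup needs one extra read for the row itself."""
    uncached = max(0, btree_depth(n_rows, fanout) - cached_levels)
    return uncached + (0 if clustered else 1)

# With a realistic fanout the tree is so shallow that an 80% truncation
# often removes no levels at all, so the per-lookup IO count is unchanged:
print(btree_depth(20_000_000), btree_depth(4_000_000))              # 3 3
print(lookup_page_reads(20_000_000), lookup_page_reads(4_000_000))  # 2 2
```

Under this model the truncation saves nothing per lookup until the working set starts fitting in the buffer pool – which is exactly Peter’s gradual-caching point.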
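As for smp’s EXPLAIN tip in the first comment, the mechanics are just a matter of prefixing the query. A minimal sketch using pymysql – the connection details, the measurements table, and its columns are all hypothetical stand-ins:

```python
import pymysql

# Host, credentials, database, table, and columns below are
# hypothetical placeholders; substitute your own.
conn = pymysql.connect(host="localhost", user="app",
                       password="secret", database="grabperf")
with conn.cursor() as cur:
    cur.execute(
        "EXPLAIN SELECT site_id, AVG(latency) AS avg_latency "
        "FROM measurements GROUP BY site_id "
        "ORDER BY avg_latency LIMIT 20"
    )
    for row in cur.fetchall():
        print(row)  # watch the type, key, and rows columns for full scans
conn.close()
```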