MySQL Cluster and the Death of Secondary Indexes

Relational databases use indexes to speed up their performance usually due to the fact that their underlying storage is inherently slow (which has historically meant disk).

I’ve often created secondary indexes with MyISAM which aren’t used in day-to-day operation but that are needed for debugging or exploration of the database.

For example, if you were to have a USER table with a handle (their account name) and an ID (the primary key) you could index by the handle but usually these are long strings and take up a good amount of CPU time to build the index and memory to cache the index.

It dawned on me the other day that this isn’t necessary with MySQL Cluster (NDB). With MyISAM you need to use an index because a full table scan might take 10-20 minutes and hold back any INSERTs in the process which could become a big problem.

It also has secondary problems including confusing the OS level page cache and even with INNODB will slow down the system due to the additional IO.

Since NDB is an all memory database no disk seeks are involved which means the entire query can happen in memory. If you have engine_condition_pushdown enabled this is even faster because you can use the CPU on all your data nodes in parallel to find the result.

This has two main wins for NDB. First, you can avoid the CPU costs with continually maintaining a redundant index. Second, you can save a great deal of memory and use it for storing data instead of index. This could probably give you 10-20% additional memory (and memory isn’t free).

As a test I ran this on a table with 1M rows. I think a full table scan would have taken 1-5 minutes without an index using MyISAM. On NDB with engine_condition_pushdown this only too 1.2 seconds. Without engine_condition_pushdown it took 30 seconds since the data needed to be sent to the SQL node and computed there.

  1. Maybe this isn’t quite related to what you are talking about, but I would imagine an index on the handle (username for example) would be useful in order to serve web applications that accept the username in requests.

  2. Martin Skold

    For equality and range searches of a large table using an index can be much faster than a full table scan, even if you use engine_condition_pushdown.
    Note that Ndb creates an ordered index for primary keys (and unique indexes) which might be used instead, transparently, when you think you are doing a full table scan (so be careful when benchmarking).
    If you don’t want these you need to specify USING HASH, then you only get the hash indexes for primary keys and unique secondary indexes. Also a equality search of a unique index is just one hash lookup, similar in performance to a primary key access.

  3. David Hersey

    With a sufficiently large in-memory database platform, one might question the need for a relational query engine in the first place.

    Relational databases are basically a compensating technology for 2-dimensional storage. When we achieve sufficiently large in-memory storage where every location is directly addressable, we won’t need them any more; we will only need the same in-memory hashing and indirection mechanisms we use today.

%d bloggers like this: