SSD Vendors: Please let developers obtain extended health and # of erase cycle stats on your SSDs.

Here’s the problem I currently have.

We’re looking at deploying the Intel X-25M MLC SSD in production.

The problem is that this drive is much cheaper than the Intel X-25E SLC drive, but it supports far fewer erase cycles.

However, our workload is write once, read many. I'm 99% certain that we will not burn out these drives: we write data to disk once and it is never written again.

The problem is that I can't be 100% sure that this is the case. There are btree flushes and binary log writes that I'm worried about…

What would be really nice is an API (SMART?) with which I could enumerate the erase blocks on the drive, determine the maximum erase cycles, and read the current erase cycle count.
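In practice, some of this already surfaces through SMART vendor attributes (as the comments below note, 232 and 233 on the Intel drives). Here's a minimal sketch of pulling those attributes out of `smartctl -A` output; the attribute IDs and their meanings are assumptions taken from this thread, so verify them against your drive's documentation before trusting the numbers:

```python
# Hypothetical sketch: parse the output of `smartctl -A /dev/sdX`
# and pull out the normalized values of the Intel SSD wear attributes.
# Attribute 232 (available reserved space) and 233 (media wearout
# indicator) are assumptions based on the discussion in this post.

def parse_smart_attributes(smartctl_output):
    """Return {attribute_id: normalized_value} from smartctl -A text."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; column 4 is the
        # normalized VALUE (100 = new, falls toward THRESH with wear).
        if len(fields) >= 4 and fields[0].isdigit():
            attrs[int(fields[0])] = int(fields[3])
    return attrs

# Sample rows, as they appear later in this post:
sample = """\
232 Unknown_Attribute 0x0003 100 100 010 Pre-fail Always - 0
233 Unknown_Attribute 0x0002 099 099 000 Old_age Always - 0"""

attrs = parse_smart_attributes(sample)
print(attrs[232], attrs[233])  # 100 99
```

On a live box you'd feed this the output of `subprocess.run(["smartctl", "-A", "/dev/sda"], ...)` instead of a literal string.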

This way, I can put an SSD into production, then determine the ETA to failure.

I can also add this to Nagios and Ganglia, trend the failure date, and alert if the wear rate is too high and the drive will fail soon.

Further, I can figure out if a database design is flawed. If I deploy a new database into production and the failure ETA is too short after 24 hours, I know that something is wrong: either a misconfiguration or a problem with the design.

I think this would solve a LOT of the problems with deploying SSDs in enterprise environments (MySQL, Oracle, etc.).

  1. Look at the SMART info:

    232 Unknown_Attribute 0x0003 100 100 010 Pre-fail Always – 0
    233 Unknown_Attribute 0x0002 099 099 000 Old_age Always – 0

    232 is the reserved space available. This will fall off a cliff when the drive is old.

    233 is the wearout indicator. It will steadily fall as the drive ages.

  2. Hey Greg.

    Thanks… I actually did see this but I didn’t see it documented anywhere…..

    I’ll have to dive in ….

    Note that 232 is already in the ‘worst’ stage? Is your drive about to fail? :)

  3. WORST is the drive’s recollection of the worst it’s seen, not necessarily a problem. When WORST < THRESH there's a problem.

  4. Ah.. that makes sense… I think I’ve actually seen this before.

    We’re about to deploy a large number of SSDs… so should be fun :)

  5. I was thinking it would be fun to see how much writing it took to cause 233 to fall by one…

  6. What we were thinking of doing is to put this metric in Ganglia so that we can compute an ETA to failure.

    In theory it should be like 4-15 years on a highly utilized SSD…. but if something is wrong and it’s only 3 months we want to detect it.

    We’re considering using MLC drives for some situations and we don’t THINK we will burn them out but this way we can find out sooner rather than later.

  7. We have a client running with the Intel X-25E SLC SSDs in production, in a hardware RAID config.

    If you also have SLCs, why not put one MLC in the RAID and see how it performs – if it fails, you’re still good. Particularly if it’s on a slave rather than a master.

  8. Interesting info. For fun, here is the output from one of our SSDs which has been running in production for 130 days. It is serving the content for this very blog actually :)

    232 Unknown_Attribute       0x0003   100   100   010    Pre-fail  Always       -       0
    233 Unknown_Attribute       0x0002   099   099   000    Old_age   Always       -       0
  9. Sweet…. that’s awesome Barry :)

    This is an MLC?

  10. Yep, it’s an X-25M MLC 80GB drive. Also, you should enable threaded comments on your blog (Settings –> Discussion).
