Powerset + Hadoop @ Rapleaf

Rapleaf was nice enough to host an hbase and bigtable meetup tonight at their offices in downtown San Francisco.

Progress is being made but my gut tells me that this thing won’t be ready for real-world use for another year or two. I’ll stick to MySQL in Spinn3r, at least for the short term.

Not that people aren’t trying to move it forward as fast as possible.

Yahoo has a huge cluster that they’re using to find bugs. Facebook has put 1.3B rows into hbase as a side project, peaking at 25k writes per second. Not a lot of data; the point was just to stress the system to see if it would break – and they succeeded.

There are also core design flaws. For example, they use threaded IO, which means that every client connection requires a dedicated thread. This just won’t scale. Most VMs will dump core at about 2k threads, and Java threads require about 128k of memory each. Need 1k threads? Better allocate 128M of memory.

There’s no reason not to handle 10k concurrent connections.
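
For reference, this is the kind of event-driven design I mean. Here’s a minimal sketch of a single-threaded epoll loop in C – the names and buffer sizes are illustrative, not from any real implementation:

    /* One thread multiplexing many client sockets with epoll.
       Assumes 'listen_fd' is a non-blocking, listening TCP socket. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 1024

    void event_loop(int listen_fd) {
        struct epoll_event ev, events[MAX_EVENTS];
        int epfd = epoll_create(1024);      /* size is just a hint */

        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    /* New connection: register it. No new thread needed. */
                    int client = accept(listen_fd, NULL, NULL);
                    ev.events = EPOLLIN;
                    ev.data.fd = client;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
                } else {
                    /* Readable client: consume and hand off to a handler. */
                    char buf[4096];
                    ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                    if (r <= 0)
                        close(events[i].data.fd);
                }
            }
        }
    }

One loop like this can watch 10k sockets in a few megabytes of memory, versus the gigabyte-plus of stacks that 10k dedicated threads would need.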

Rapleaf is apparently playing with it as well. Sounds like they have about 20-100G of data dumped into hbase. They’ve also built out a REST interface on top of it.

Other choices here would be KFS or federated MySQL. I’d recommend MySQL if you want to ship this year.

There’s still a lot of progress to be made here. Myself? I want a bigtable clone in C from the bottom up. I want it to use async IO, and I want it to be aware of cluster topology.

Also, where does this leave MySQL? I’m starting to see more and more Open Source database mind share drift into the GFS/Bigtable realm, specifically because everyone knows that MySQL doesn’t scale unless you have a federated backend. The problem is that there isn’t an Open Source federated backend to choose from.


  1. cl3m

    What about couchdb? Probably not if you want to ship this year… but could it scale, or do you see any design flaws? ^^

  2. brianaker

    Hi!

    Do you think a bigtable really needs to be written in C? Is it that important?

    What about building one into Apache?

    I would not worry about MySQL. There will always be a need for fast meta data repositories. Relational databases are not going to go away, they have their niche.

    Is the world finally ready to move past SQL?

    Cheers,
    -Brian

  3. Yeah. I think it needs to be in C.

    O_DIRECT, mlock, sendfile, etc.

    None of this is really available in Java.
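
    Roughly the kind of thing I mean, as a C sketch – the file path and sizes here are made up, and error handling is elided:

        /* Bypass the OS page cache with O_DIRECT and pin our own cache
           buffer with mlock so the kernel can't page it out. Sketch only. */
        #define _GNU_SOURCE                  /* exposes O_DIRECT */
        #include <fcntl.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void) {
            /* Reads go straight to disk; no double caching by the kernel. */
            int fd = open("/var/db/data.tbl", O_RDONLY | O_DIRECT);

            /* O_DIRECT needs aligned buffers; 512 bytes is a common minimum. */
            void *buf;
            posix_memalign(&buf, 512, 1 << 20);  /* 1M application cache block */

            /* Forbid the kernel from paging this buffer out. May require
               root or a raised RLIMIT_MEMLOCK. */
            mlock(buf, 1 << 20);

            pread(fd, buf, 1 << 20, 0);
            /* ... serve reads from 'buf' from here on ... */

            munlock(buf, 1 << 20);
            close(fd);
            return 0;
        }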

    You could of course write JNI bindings for the critical pieces, but then you’re stuck with Java and that’s no fun. :)

    Also, it’s such a CRITICAL piece of your infrastructure, you might as well keep it tight.

    Java databases right now lose about 20% of their memory on Linux because of the VM pressure bug I found. You basically have to under-provision your app’s caching layer because Linux decides it needs the memory for your filesystem cache.

    Innodb gets past this problem by using O_DIRECT.

    Kevin

  4. chadwa

    Kevin,

    First, please realize that Hbase is only at an alpha stage and our current focus is on stability. It is definitely not fully-baked and there is certainly more work to be done but there are some folks using it now and getting value out of it.

    WRT some of your specific concerns:

    1. The threaded IO is not a design flaw, just a feature of the current implementation. When we swap in a different RPC mechanism, async IO should come along for free.

    2. Cluster topology awareness is definitely somewhere on the roadmap. It requires coordination with HDFS and Map/Reduce, though, and it is a performance enhancement, so it comes after stability in the priority queue.

    3. C vs Java: Frankly our focus is on getting this kind of functionality in place at all, rather than squeezing out the last bits of performance. We believe that Hbase has significant advantages by leveraging the great community and growing profile of the Hadoop project that far outweigh the perceived downsides of working in Java. When we see a C-based implementation up and stable, we can certainly compare to see how much difference it makes. We’ve talked with the folks building Hypertable (http://code.google.com/p/hypertable/) about a certain level of API compatibility, so when they are ready to roll it will hopefully be easy to drop it in place and make comparisons.

    4. WRT MySQL, it’s probably a little premature to make much of a comparison and the two systems generally target different use cases. I will say that for use cases where the two could overlap, I’d rather spend some more money on machines than pay the price of operating a federated MySQL system, with all the attendant hassles of repartitioning, schema maintenance, replication, etc.

    At the end of the day, it’s not about raw performance but about scalability — and Hbase should have no problem getting really, really big. Once your table is spread across enough spindles, the performance should be plenty good enough to handle most workloads.

    Chad Walters
    Powerset

  5. askpaul

    I agree. Wouldn’t it be neat?

  6. See my thoughts below:

    > First, please realize that Hbase is only at an alpha stage and our current
    > focus is on stability. It is definitely not fully-baked and there is certainly
    > more work to be done but there are some folks using it now and getting value
    > out of it.

    Agreed. I was just trying to serialize my thoughts WRT hbase in general under
    the assumption that people reading the post wanted a quick status update on
    porting their applications.

    I have a lot of people who read my blog from MySQL planet who come from a
    different background.

    > WRT some of your specific concerns:
    >
    > 1. The threaded IO is not a design flaw, just a feature of the current
    > implementation. When we swap in a different RPC mechanism, async IO should
    > come along for free.

    Fair enough. This is a bit of a pet peeve of mine though. MySQL does threaded
    IO (which drives me crazy).

    The #1 thing prompting me to ditch everything and write my own bigtable
    implementation from scratch in C would be async IO. :)

    > 2. Cluster topology awareness is definitely somewhere on the roadmap. It
    > requires coordination with HDFS and Map/Reduce, though, and it is a
    > performance enhancement, so it comes after stability in the priority queue.

    yup… it’s a significant one though.

    The reason I brought it up is that most MySQL partitioning code I’ve seen has
    this integrated because they’ve been able to focus on the architecture from the
    ground up.

    This will take a while though. I expect it won’t really be complete for
    another 12-18 months.

    > 3. C vs Java: Frankly our focus is on getting this kind of functionality in
    > place at all, rather than squeezing out the last bits of performance.

    I hear you, but I’m not really talking about a 1-2% performance boost here. I’m
    talking about real world issues that can result in a 30-50% performance boost.

    When you have a cluster of 1000 nodes, a 50% performance boost is a LOT of
    computing power.

    We’re seeing this a lot in innodb for example. It can be tuned a LOT more
    than hbase right now but it’s still FAR from perfect. The CPU scalability right
    now is horrible and will only get worse as we see more systems with multiple
    cores.

    There are performance optimizations that are only possible when you consider
    them as part of a single holistic design. These are things like using async IO,
    sendfile, O_DIRECT, etc.
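
    For instance, here’s roughly what the sendfile path looks like in C – the
    descriptor names are assumed, and real code would check errno:

        /* Zero-copy transfer: the kernel moves file pages straight to the
           socket without copying them through userspace. Sketch only. */
        #include <sys/sendfile.h>
        #include <sys/stat.h>
        #include <sys/types.h>

        /* 'file_fd' is an open data file, 'sock_fd' a connected socket. */
        void send_file_region(int sock_fd, int file_fd) {
            struct stat st;
            fstat(file_fd, &st);

            off_t offset = 0;                /* the kernel advances this */
            while (offset < st.st_size) {
                ssize_t sent = sendfile(sock_fd, file_fd, &offset,
                                        st.st_size - offset);
                if (sent <= 0)
                    break;
            }
        }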

    > We believe that Hbase has significant advantages by leveraging the great
    > community and growing profile of the Hadoop project that far outweigh the
    > perceived downsides of working in Java. When we see a C-based implementation
    > up and stable, we can certainly compare to see how much difference it makes.

    You could potentially squeeze more performance out of a Java implementation but
    C just has far too many unfair advantages by being closer to the metal.

    This from a guy who’s written more Java code than I care to remember. :)

    > We’ve talked with the folks building Hypertable
    > (http://code.google.com/p/hypertable/) about a certain level of API
    > compatibility, so when they are ready to roll it will hopefully be
    > easy to drop it in place and make comparisons.

    That would be more than awesome. Knock on wood.

    > 4. WRT MySQL, it’s probably a little premature to make much of a comparison and
    > the two systems generally target different use cases. I will say that for use
    > cases where the two could overlap, I’d rather spend some more money on
    > machines than pay the price of operating a federated MySQL system, with all
    > the attendant hassles of repartitioning, schema maintenance, replication, etc.

    A good federated MySQL install would handle repartitioning, replication, and
    schema maintenance. The one we’re working on does. :)

    That said, I *would* rather have a scalable bigtable database that was ready for
    use in the real world, so I really hope I’m 100% wrong on all these issues.

    > At the end of the day, it’s not about raw performance but about scalability —

    Actually, I think it’s about both. If you have excellent scalability but each
    node is highly inefficient, you’re going to be spending a lot of cash on
    hardware.

    > and Hbase should have no problem getting really, really big. Once your table
    > is spread across enough spindles, the performance should be plenty good enough
    > to handle most workloads.

    Yup. My worry right now is the per-machine performance of hbase.
    No reason not to start from scratch with all the right decisions!

    Kevin

  7. Kevin,

    Could you elaborate on the bug you found? You write:

    Java databases right now lose about 20% of their memory on Linux because of the VM pressure bug I found. You basically have to under-provision your app’s caching layer because Linux decides it needs the memory for your filesystem cache.

    I’m very curious to learn more about this.

  8. robertengels

    I think your assessment of the O_DIRECT problems under Java is not correct.

    If the Java program implements its own cache level, then the user program is going to read from the cache, not from the disk. If this is the only (primary) application on the server, the OS will end up shrinking the OS cache buffers as it sees a low hit percentage on reads (as long as you don’t fix the OS disk cache size).

    If the Java program doesn’t use a cache layer, then it relies on the OS disk cache, and using NIO makes accessing these very efficient.

    I have run disk speed benchmarks in Java and it matches the C speed (and the rated drive/bus speeds) almost exactly.

    The problem with many Java applications is that they don’t implement the cache layer correctly, and either lose performance due to GC, or cache the wrong structures, wasting memory and not helping performance.

  9. Robert,

    You bring up some good points.

    You’re correct that if an application performs 100% of its own caching then O_DIRECT will prevent the OS from swapping.

    The problem comes when there are parts of an application that read from disk or different subsystems that don’t use the caching layer.

    For example, say you only buffer 90% of the workload. Eventually your DB will start paging.

    In our situation we buffer 100% of our DB, but MySQL can’t open the binary logs with O_DIRECT, so this eventually causes VM pressure and causes the OS to cache.

    It would be NICE if there were patches for this so that the OS didn’t HAVE to cache but I haven’t had a chance to dive into the kernel to check things out.
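
    One hint-level mechanism that does already exist is posix_fadvise. A sketch
    in C – applying it to MySQL’s binary logs is my assumption, not something
    MySQL does:

        /* Ask the kernel to drop cached pages for a log file we've already
           written, so they don't squeeze out the DB's buffer pool. */
        #define _XOPEN_SOURCE 600        /* exposes posix_fadvise on glibc */
        #include <fcntl.h>

        void drop_log_cache(int binlog_fd) {
            /* len == 0 means "to end of file". DONTNEED is a hint to evict
               these pages from the filesystem cache, not a guarantee. */
            posix_fadvise(binlog_fd, 0, 0, POSIX_FADV_DONTNEED);
        }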

  10. robertengels

    In general, MOST database systems do some of their own caching (index root pages, etc.) since it can be more efficient/intelligent than generic OS caching. It is also not susceptible to being discarded from the cache by other processes, BUT, the process may have its data paged – under Java – essentially causing the same issue…

    This is why a “real” database will request some memory and lock it from paging. But if your Java app is having frequently used “cached data” paged to disk, then the server/OS/application is probably not configured for adequate performance anyway…

  11. Robert,

    Yes… but Java is prone to being paged because madvise, mlock, and mlockall aren’t available from Java.

    O_DIRECT potentially fixes this but it’s still problematic.

    One COULD port them by writing JNI bindings. I would assume they would work…
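
    The native side might look something like this – a sketch only, and the
    MemLock class name is made up:

        /* Native half of a hypothetical Java binding for mlockall().
           The Java side would declare: public static native int lockAll(); */
        #include <jni.h>
        #include <sys/mman.h>

        JNIEXPORT jint JNICALL
        Java_MemLock_lockAll(JNIEnv *env, jclass cls) {
            /* Pin all current and future pages of this JVM in RAM.
               Requires root or CAP_IPC_LOCK. */
            return mlockall(MCL_CURRENT | MCL_FUTURE);
        }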

    The bug that I found, which causes Linux to page processes by incorrectly evicting memory, is STILL a bug of course.

    Kevin

  12. robertengels

    It is true that you can’t control it from a Java application directly – but you can essentially control it via locality of access – which the JVM is VERY good at. To get the OS to not page your user cache, set the swappiness (see below).

    Many would argue that if the user cache was effective (used a lot), then it would not be paged to begin with, and if it is being paged it is to load data blocks vs. index blocks (which depending on the query might lead to better performance!) The OS is smart enough to not page out pages that it THINKS will be used again just to buffer writes and reads that may not be used again. The problem is that the locality is such that the OS does not think those pages are worth keeping in memory.

    As an example, imagine you had a query that displays every record in a table and this query was run a lot. If only the db’s index cache was used, and no OS cache was available, the index cache would not be useful, and every record would be read from disk, leading to poor performance of the same query on each invocation. (Often the database will cache datablocks as well, but not always).

    I also don’t think you can say “Java is prone to being paged”… any non-root application would have similar problems, OR it would cause other applications to run so slowly (the first app grabs and locks all but a little memory, and all of the other apps would be paging constantly as they ran).

    As for the Linux “bug”…

    The following gives a great detailed report on what is happening (btw, it is with Oracle, which can easily use O_DIRECT…).

    http://web.cs.wpi.edu/~claypool/mqp/linux-oltp/mqp.pdf

    Also, the following

    http://kerneltrap.org/node/3000

    states that setting the swappiness should do what you desire. It would also work with a Java based database.
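
    For example, on a 2.6 kernel, something like this biases the kernel strongly
    against swapping out process pages in favor of the filesystem cache:

        echo 0 > /proc/sys/vm/swappiness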

    Just some thoughts.

  13. robertengels

    Also, it seems there may be differences in the implementation of swappiness based on kernel version.

    see

    http://lwn.net/Articles/100978/

  14. robertengels

    Lastly, here’s a kernel bug report that seems to track the issue:

    http://bugzilla.kernel.org/show_bug.cgi?id=7372

  15. > The OS is smart enough to not page out pages that it THINKS will be used again
    > just to buffer writes and reads that may not be used again. The problem is
    > that the locality is such that the OS does not think those pages are worth
    > keeping in memory.

    Yes. But the problem is that the OS can make a wrong decision. If you KNOW
    your application doesn’t need to EVER page then allocating the full amount of
    memory to buffer the DB seems to make sense.

    Problems can arise when the OS thinks it’s smarter than your application and
    pages it to disk anyway.

    Of course, in a perfect world, the OS *would* be smart, so keeping the
    filesystem cache live under the covers would work just fine.

    > As an example, imagine you had a query that displays every record in a table
    > and this query was run a lot. If only the db’s index cache was used, and no OS
    > cache was available, the index cache would not be useful, and every record
    > would be read from disk, leading to poor performance of the same query on each
    > invocation. (Often the database will cache datablocks as well, but not
    > always).

    Yes, but a filesystem cache would be poor at this too, since they run similar
    algorithms.

    > I also don’t think you can say “Java is prone to being paged”… any non-root
    > application would have similar problems, OR they would cause other
    > applications to run so slowly (first apps grabs and locks all but a little
    > memory, all of the other apps would be paging constantly as they ran).

    Sure…. I should probably say that any application without native OS hooks like
    mlock or mlockall is prone to being paged.

    > http://web.cs.wpi.edu/~claypool/mqp/linux-oltp/mqp.pdf
    >
    > http://kerneltrap.org/node/3000

    Yup. I should note that my blog is #3 for a search for ‘swappiness’ on google :)

    >
    > Also, it seems there may be differences in the implementation of swappiness based on kernel version.
    >
    > see
    >
    > http://lwn.net/Articles/100978/

    OK….. THAT is just plain frightening!

    >
    > Lastly, here’s a kernel bug report that seems to track the issue:
    >
    > http://bugzilla.kernel.org/show_bug.cgi?id=7372

    Awesome. I’m going to have to dive into this one.





