Java and In-Memory Data Structures – A Flat File Map Proposal

In the past I’ve often used in-memory data structures (vs. on disk) in situations where allocating, say, 5-10MB of data in the local VM is much better than continually hitting a database.

This has only worked because I’m not using very large data structures. My language classification code uses about 8MB of memory stored in a TreeMap for its trained corpus based on Wikipedia.

Recently, I’ve been playing with a much larger corpus for a new feature for Spinn3r.

It originally started off using 500MB, which is WAY too much. Bumping it down to 350MB by using a TreeMap was a start, but that’s nowhere NEAR far enough.

What’s happening is that Java itself has per-object overhead. In my analysis it takes 38 bytes for just a basic java.lang.Object.

This is way too much overhead. If you’re using 8-byte keys (64-bit ints) then using 46 bytes to represent them is nearly 6x additional overhead.

In my situation I’m seeing nearly 10x overhead. The 350MB in-memory image of my data structure can be represented in just 35MB on disk, which would be more than fine for my application.

Which is when I had an idea – why am I using a pre-indexed data structure at all? Why am I keeping millions of java.lang.String and java.lang.Integer instances around?

Here’s what I’m going to do: write a memory-mapped file implementation of java.util.Map. Using the NIO support for mmapped files, I’m going to write out a file with all the keys sorted, each followed by its data.

So the on-disk format will look like:

magic_number
nr_keys
key:data
key:data
...

I read the number of keys and then perform a binary search on the key region for my key based on its byte[] encoding.

The first pass will only work with fixed-width keys and values, but it should be easy to add variable-width support (though this might require a bit more overhead).

This should yield a 10x memory savings with only a slight CPU hit for converting the key to a byte array and parsing the data bytes back into an actual Java primitive.
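
Here’s a rough sketch of what I have in mind, assuming 8-byte long keys and 4-byte int values (12 bytes per entry on disk). The class name, magic number, and widths below are placeholders, not the final FlatMap code:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of the proposal: a header, then fixed-width key:data entries
// sorted by key, binary searched through a memory-mapped buffer.
public class FlatMapSketch {

    static final int MAGIC_NUMBER = 0xCAFEF00D; // placeholder magic
    static final int HEADER_SIZE  = 8;          // magic (4) + nr_keys (4)
    static final int KEY_WIDTH    = 8;          // 64-bit keys
    static final int VALUE_WIDTH  = 4;          // 32-bit values
    static final int ENTRY_WIDTH  = KEY_WIDTH + VALUE_WIDTH;

    // Write out entries, which must already be sorted by key.
    public static void write(String path, long[] keys, int[] values)
            throws IOException {
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(path)));
        out.writeInt(MAGIC_NUMBER);
        out.writeInt(keys.length);      // nr_keys
        for (int i = 0; i < keys.length; i++) {
            out.writeLong(keys[i]);     // key
            out.writeInt(values[i]);    // data
        }
        out.close();
    }

    private final MappedByteBuffer buffer;
    private final int nrKeys;

    public FlatMapSketch(String path) throws IOException {
        RandomAccessFile file = new RandomAccessFile(path, "r");
        buffer = file.getChannel()
                     .map(FileChannel.MapMode.READ_ONLY, 0, file.length());
        if (buffer.getInt(0) != MAGIC_NUMBER)
            throw new IOException("bad magic number");
        nrKeys = buffer.getInt(4);
    }

    // Binary search over the sorted key region; null if not found.
    // Offsets are ints, which is fine for files under 2GB.
    public Integer get(long key) {
        int low = 0, high = nrKeys - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int offset = HEADER_SIZE + mid * ENTRY_WIDTH;
            long candidate = buffer.getLong(offset);
            if (candidate < key) low = mid + 1;
            else if (candidate > key) high = mid - 1;
            else return Integer.valueOf(buffer.getInt(offset + KEY_WIDTH));
        }
        return null;
    }
}

Since every read uses an absolute offset, a lookup never mutates the buffer’s position, which fits the plan of traversing the sorted keys without synchronization.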


  1. Hmm, this looks interesting. I’m looking for something like this for a Java project I’ve been working on. It currently uses a live database, but that is unsuitable with large amounts of data. I was looking for good ways to store large amounts of data in Java memory and stumbled upon this :) (damn, Google is fast).

    I hope this project will be open source? If so I will keep a close eye on it :)

  2. Hey Ozafy,

    Yeah… I’m going to open source it :)

    Kevin

  3. Hm. I don’t know where the 38 bytes figure is coming from. A default Java object uses two words’ worth of data (8 bytes on a 32-bit machine, 16 on a 64-bit one). Fields create the obvious amounts of additional overhead.

    If you’re using longs as keys and strings as values, why not use one of the many perfectly good implementations of long keyed hash maps (I use the ones from colt – http://dsd.lbl.gov/~hoschek/colt/) and intern all your strings? You’ll probably find that cuts down on memory use a lot.
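
    For example (from memory; check the Colt javadocs for the exact names):

    import cern.colt.map.OpenLongObjectHashMap;

    public class ColtExample {
        public static void main(String[] args) {
            // Primitive long keys: no Long boxing, no Entry objects.
            OpenLongObjectHashMap map = new OpenLongObjectHashMap();
            map.put(42L, "some value".intern());
            System.out.println((String) map.get(42L));
        }
    }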

  4. Daniel Yokomizo

    Hmm, if the data is sorted, why not go all the way and implement java.util.SortedMap? Do you plan to support mutation (e.g. put, remove and friends)? You would also need to lock the file against concurrent updates while traversing it. As a whole it looks like an interesting project; if you need any help on it I’m interested ;)

  5. Hey David,

    The 38 bytes comes from measuring it directly: creating a new VM, creating say 1M objects, and then measuring the actual real-world memory size of the resulting VM.

    String interning in this situation won’t help as they’re all unique.

  6. Daniel,

    java.util.SortedMap is an interface with only one implementation in the VM – TreeMap.

    TreeMap is actually pretty good but the per-object overhead is just far too high.

    The file once written will be immutable. Further, I should be able to get around any thread safety issues and keep state on a per-thread basis. This way no synchronization will be needed to traverse the list of sorted keys.

  7. Also, I’ll check out Colt… I’ve looked at Trove and Javolution and none of them have done what I’ve wanted them to do.

    Kevin

  8. Further, I wonder if multiple VMs can share the memory-mapped data… hm.

  9. That’s not an accurate test though. The JVM will generally reserve a larger heap size than it needs to hold all the objects you’ve allocated. It adopts a policy of grabbing as much memory as it needs to be fast and only plays nice with it once it starts reaching its set limits. Allocating more probably wouldn’t cause it to grow linearly. The tests I’ve performed (which support the theoretical numbers) involved allocating objects until the JVM runs out of memory. It’s not a perfect test, but it does a bit better. If you want more detailed information you’ll probably need to do some sort of plot of objects to memory usage.

    If string interning wouldn’t help, are your strings ascii? If so, unless they’re really short strings the overhead is going to be much more the result of the two byte characters than the object itself. Replacing them with byte arrays might help.

    If you’re really set on the on disk thing, go for it. I certainly won’t try to stop you. :-) But I suspect you can go a long way towards reducing the in memory size.

  10. David,

    I’m aware of the VM overhead of heap allocations.

    I loaded my code in JProfiler to look at the actual used in-memory footprint as well as compared this to Runtime.totalMemory – Runtime.freeMemory.

    The in memory maps are just not as efficient in Java as they could be due to overhead.

    The strings are 3-8 chars in length. They’re also already interned by the fact that they’re in a Map as the key. :)

    You do bring up a good point though. Replacing String usage with just plain ascii byte[] arrays would be more of a fair comparison for my on disk representation.

    I still expect TreeMap to lose :)

    Kevin

  11. Sure. The in-memory maps have a large memory footprint. (I’m not suggesting you use TreeMap anyway.) I’m specifically disputing the 38 bytes per object claim. Can you post the code you used to demonstrate that?

    Also, why make it on disk? If you’re looking for an in memory structure, a large byte array could quite happily do the same thing.

  12. Hey David.

    Here’s the source.

    http://pastebin.ca/954705

    Run it yourself….
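
    (Roughly, the test has this shape; a sketch of the approach, not the exact pastebin source:)

    public class MemoryTest {

        static long usedMemory() {
            System.gc(); // force a collection before sampling
            return Runtime.getRuntime().totalMemory()
                 - Runtime.getRuntime().freeMemory();
        }

        public static void main(String[] args) {
            int count = 1000000;
            long before = usedMemory();
            // Note: this measures the Object[]'s pointer slots as
            // well as the objects themselves.
            Object[] objects = new Object[count];
            for (int i = 0; i < count; i++)
                objects[i] = new Object();
            long after = usedMemory();
            System.out.println("bytes per object: "
                + (after - before) / count);
        }
    }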

    System.gc() is just advice to the VM of course, but in practice it does kick in the GC on my end.

    Matches up with JProfiler too.

    I ran the test multiple times to start off with so I don’t measure classes and other data stored in the VM that aren’t relevant.

    Also, looks like there’s about a 13% boost by using byte[] vs String.

    Kevin

  13. Ok. But those are tests for the size of maps in memory. Indeed, the fact that you only get a 13% decrease in memory use by halving the size of your keys shows just how much of that memory usage is the map structure itself.

    I’m entirely prepared to believe your results for those. I’ve not said anything about the (in)efficiency of the standard collections for in memory use. I’m objecting to “In my analysis it takes 38 bytes for just a basic java.lang.Object”, which I don’t see any evidence of in there.

  14. akuhn

    David … an object without a reference gets garbage collected, so you must add a third word to your minimal size, which totals 12 / 24 bytes. And don’t forget that each Java String keeps a start and end index internally, so an empty(!) String totals 20 / 40 bytes. Kevin’s solution is very elegant, as he avoids keeping headers, references, and silly indices around. This will save 20 bytes for each string.

  15. Hmmm, I only skimmed this but it sounds an awful lot like a BTree and there are a number of implementations out there already.

    And you could maybe also consider using something like Sleepycat Db/Java.

  16. Anonymous

    Have you tried using a map with a lower memory footprint? The best I know is from Trove (THashMap).

  17. Hey guys.

    I posted another blog post here:

    http://feedblog.org/2008/03/23/on-hashtables-and-java-storage-efficiency/

    With some stats on per object usage.

    I did benchmark Trove and it was about 30 bytes per entry (which is good but I can trim it down to 12 bytes).

    BerkeleyDB is also an option, but then I have to support a new and complicated DB in my system. The FlatMap is only about 150 lines of code.

    Kevin

  18. Also, David.

    It takes 12 bytes of memory for a simple java.lang.Object. They must have trimmed it up a bit since my last tests on Java 1.4. It takes 20 bytes for Integer objects vs 4 bytes for ints.

    Just create an Object[] or Integer[] array if you’d like to test this.

    Kevin

  19. Gary, you might want to spend some time reading my blog.

    There’s nothing wrong with my architecture :)

    I know what I’m doing :)

    It’s 10MB worth of static data. Perfectly reasonable to keep it on the client.

  20. I’m afraid your test is still wrong. :-)

    A full Object[n] uses up 4 + word size * n bytes, even when it’s empty. It needs a length field and a pointer for each of its entries – remember the objects are not stored inline in the array! That’s where the additional 4 bytes per object you’re seeing are coming from.

    Your test for Integers is similarly off by 4 bytes, which brings it down to 16 (which my tests confirm). That’s still 4 bytes higher than I’d expect it to be. I don’t know what’s up with that. Could just be some sort of padding for alignment, but I wouldn’t have thought so. I’ll look into it.

  21. Oh, of course. The extra 8 bytes comes from the fact that Integer extends Number. I’d forgotten that quirk of the Java object layout.

  22. Extra 4 bytes. Sigh.

    Incidentally, in your TreeMap test with byte[] keys you’d be better off providing a custom Comparator to the TreeMap constructor rather than wrapping the byte arrays. You’ll use 12 bytes less memory per entry. (Yes, it will still lose. Just saying is all. :-) )
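
    Something like this (a quick sketch, untested):

    import java.util.Comparator;
    import java.util.TreeMap;

    public class ByteArrayKeys {

        // Unsigned lexicographic order over raw byte[] keys, so the
        // arrays go into the TreeMap without a wrapper object each.
        static final Comparator<byte[]> ORDER = new Comparator<byte[]>() {
            public int compare(byte[] a, byte[] b) {
                int n = Math.min(a.length, b.length);
                for (int i = 0; i < n; i++) {
                    int x = a[i] & 0xFF, y = b[i] & 0xFF;
                    if (x != y) return x - y;
                }
                return a.length - b.length;
            }
        };

        public static void main(String[] args) {
            TreeMap<byte[], Integer> map =
                new TreeMap<byte[], Integer>(ORDER);
            map.put("abc".getBytes(), Integer.valueOf(1));
            // Lookups work because TreeMap uses the comparator,
            // not equals()/hashCode().
            System.out.println(map.get("abc".getBytes()));
        }
    }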

  23. 16 per entry even. 2 words for the java.lang.Object, 1 word for the inheritance, 1 word for the pointer to the array (which is the one I forgot). I seem to keep getting numbers wrong too.

    I’ll stop spamming your blog with mini comments now. :-)

  24. Ok. I lied.

    Looks like I’m wrong about subclasses having an extra word for a parent reference. That’s a relief. :-) But there seems to be some weird padding going on that doesn’t behave like I’d expect.

  25. Hey David…. I’ll try to respond to all your posts at once.

    You’re DEAD RIGHT about using a Comparator in the constructor. I had checked but didn’t see that as an option. I must have been looking at the wrong class… doh.

    I’ll re-run some tests with that option…

    But yeah… I agree that the assumptions one makes about per-object overhead in Java aren’t really straightforward.

    I still don’t think Java will be able to come close to my FlatMap proposal as it’s ideal and uses only 12 bytes per entry.

    Further, most hashtable implementations in Java land seem tailored to saving time, not space.

    Kevin

  26. You may have a look at my MemoryMeasurer tool: http://icstrac.ics.forth.gr:3027/hudson/job/MemoryMeasurer/lastSuccessfulBuild/artifact/MemoryMeasurer/trunk/main/dist/MemoryMeasurer.zip

    Just pass -javaagent:path/to/MemoryMeasurer.jar
    And then do:

    long usedMemory = MemoryMeasurer.count(new Object());

    (The tool itself correctly handles recursive structures, arrays, statics, enums, etc.)

    I concur that the memory overhead of an object is 8 bytes on a 32-bit architecture, and certainly not 38. Also of interest: due to alignment (I think), fields only add cost in 8-byte steps. That is, 1 or 2 fields need 8 bytes, 3 or 4 fields need 16 bytes, and so on.

  27. Jeyendran Balakrishnan

    Kevin,

    Have you taken a look at the Hadoop MapFile class? It is a disk based map with sorted keys. A fixed-size [in terms of number of keys] sparse in-memory index is used to implement a [relatively] fast two-stage look up of values based on keys. First the nearest key is looked up via binary search on the in-memory index, and then sequential disk reads are carried out on the disk file starting with this nearest key, to access the matching entry.
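
    A lookup would look roughly like this (API names from memory; check the Hadoop javadocs):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileLookup {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);

            // A MapFile is a directory holding a sorted "data" file
            // plus a sparse "index" file for the first-stage lookup.
            MapFile.Reader reader =
                new MapFile.Reader(fs, "/path/to/mapfile", conf);
            Text value = new Text();
            if (reader.get(new LongWritable(42L), value) != null)
                System.out.println("value = " + value);
            reader.close();
        }
    }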

  28. Jeyendran,

    Thanks. Good pointer. I’m going to have to take a look.

    I wish I had seen this earlier.

    In fact, while I can’t really use Hadoop and HBase, I should audit the code again to see if there’s anything I can use in there.

    Though I seem to already have too many things to do :)

    Kevin

  29. Jeyendran Balakrishnan

    Kevin,

    >> Though I seem to already have too many things to do
    I know how that is!

    Good luck with your Hadoop code audit.

    It is possible to use the Hadoop io classes without really using the Hadoop distributed processing [Map Reduce] framework.

    One possible way to implement something like what you have is to first create a disk-based Hadoop SequenceFile [unsorted version of MapFile], then use one of the sort methods [in either SequenceFile or MapFile, I forgot], to convert the SequenceFile into a MapFile.
    The MapFile can then be used as a read-only disk-based lookup table.

    Keep those posts going!

    Cheers:
    Jeyendran

  30. Nice…. that’s very similar to what I need.

    Now I need to figure out if I should throw my FlatMap code out the window.

    Especially since I spent a few hours implementing it already :-/

    Kevin

  31. Maria

    Hi Kevin,

    Sounds like a good idea…. I am working on a similar project to map flat files into memory. I hope you can send me a link to your project so I can look into it in more detail.

    Thank you in advance,
    Maria





