Java and In-Memory Data Structures – A Flat File Map Proposal
In the past I’ve often used in-memory data structures (versus on disk) in situations where allocating, say, 5-10MB of data in the local VM is much better than continually hitting a database.
This has only worked because I’m not using very large data structures. My language classification code uses about 8MB of memory, stored in a TreeMap, for its trained corpus based on Wikipedia.
Recently, I’ve been playing with a much larger corpus for a new feature for Spinn3r.
It originally started off using 500MB, which is WAY too much. Bumping it down to 350MB by using a TreeMap was a start, but that’s nowhere NEAR far enough.
What’s happening is that Java itself has per-object overhead. In my analysis it takes 38 bytes just for a basic java.lang.Object.
This is way too much. If you’re using 8-byte keys (64-bit ints), then spending 46 bytes to represent each one is nearly 6x overhead.
In my situation I’m seeing nearly 10x overhead. This 350MB in-memory image of my data structure can be represented as just 35MB on disk. That would be more than fine for my application.
Which is when I had an idea – why am I using a pre-indexed data structure at all? Why am I keeping millions of copies of java.lang.String and java.lang.Integer instances?
Here’s what I’m going to do. I’m going to write a memory-mapped file implementation of java.util.Map. Using the new NIO support for mmapped files, I’m going to write out a file with all the keys sorted, followed by their data.
So the on-disk format will look like:
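One plausible fixed-width encoding would put the key count first, then the sorted keys, then the values in matching order. Here is a minimal sketch of a writer for that layout; the FlatFileWriter class, the int-count header, and the 8-byte long values are my assumptions, not a final design:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.SortedMap;

public class FlatFileWriter {
    // Assumed layout: [int keyCount][keyCount x 8-byte keys, sorted][keyCount x 8-byte values]
    static void write(Path path, SortedMap<Long, Long> map) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4 + map.size() * 16);
        buf.putInt(map.size());
        for (long key : map.keySet()) {
            buf.putLong(key);       // key region, sorted (TreeMap iterates in key order)
        }
        for (long value : map.values()) {
            buf.putLong(value);     // value region, same order as the keys
        }
        buf.flip();
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(buf);
        }
    }
}
```

Because every field is fixed width, the offset of any key or value can be computed arithmetically, which is what makes the binary search below possible.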
I read the number of keys and then perform a binary search on the key region for my key, based on its byte encoding.
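That search can run directly against the buffer with absolute reads, so no per-entry objects are created. A sketch, assuming the fixed-width layout above (a MappedByteBuffer returned by FileChannel.map() works the same way as a heap ByteBuffer here):

```java
import java.nio.ByteBuffer;

public class FlatFileSearch {
    // Binary search over the sorted 8-byte key region.
    // Assumed layout: [int keyCount][keys][values].
    static int indexOf(ByteBuffer buf, long key) {
        int count = buf.getInt(0);
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long midKey = buf.getLong(4 + mid * 8L);  // absolute read, no allocation
            if (midKey < key) {
                lo = mid + 1;
            } else if (midKey > key) {
                hi = mid - 1;
            } else {
                return mid;  // value lives at offset 4 + count*8 + mid*valueWidth
            }
        }
        return -1;  // not found
    }
}
```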
The first pass will only work on fixed-width keys and values, but it should be easy to add variable-width support (though this might require a bit more overhead).
This should yield a 10x memory savings with only a slight CPU hit for converting the key to its byte encoding and parsing the value bytes back into an actual Java primitive.
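Putting the pieces together, the read side could be as small as the following sketch. The FlatFileMap name and the fixed 8-byte value width are assumptions for illustration; the only state held on the Java heap is the mapped buffer and the entry count:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlatFileMap {
    private final MappedByteBuffer buf;
    private final int count;

    // Map the file once; the mapping stays valid after the channel is closed.
    FlatFileMap(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
        count = buf.getInt(0);
    }

    // Each lookup is a binary search over the key region plus one value read.
    Long get(long key) {
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long midKey = buf.getLong(4 + mid * 8L);
            if (midKey < key) {
                lo = mid + 1;
            } else if (midKey > key) {
                hi = mid - 1;
            } else {
                return buf.getLong(4 + count * 8L + mid * 8L);  // value region
            }
        }
        return null;  // not present
    }
}
```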