Caching Strings

Why Cache Strings

If you have ever looked at a heap dump of a MongoDB application that reads and writes lots of documents, you will see a huge number of String instances. They are everywhere and seem to multiply like rabbits. While the values within each document can be expected to be unique, there are still all of those copies of the field names in all of those documents. Even if there are only 10 unique field names, every time you read a document you get another copy of each one. Read 1,000 documents/second and you generate 10,000 duplicate strings per second.

The truth is that most of those strings live for a very short period of time and are quickly cleaned up by the garbage collector. Even so, all of those duplicate objects and their backing char[]s cause extra memory pressure and more frequent garbage collection cycles. What we need is to remove the duplicate strings as much as possible.
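To see the duplication concretely: decoding the same bytes twice produces two distinct String instances, each with its own backing char[] (a standalone illustration, not driver code; the field name is arbitrary).

import java.nio.charset.Charset;

public class DuplicateStrings {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] fieldName = "timestamp".getBytes(utf8);

        // Every decode allocates a brand new String.
        String first = new String(fieldName, utf8);
        String second = new String(fieldName, utf8);

        System.out.println(first.equals(second)); // true  -- same characters
        System.out.println(first == second);      // false -- distinct objects
    }
}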

A Naive Cache

We could simply intern() all of the field names and let the JVM manage the cache.

private static final Charset UTF8 = Charset.forName("UTF-8");

public static String cacheLookup(byte[] fieldNameBytes, 
                                 int offset, int length) {
    return new String(fieldNameBytes, offset, length, UTF8)
                 .intern();
}

This could actually work for a system that does not use dynamic field names. The problem is that the driver has no way of knowing which document field names are dynamic. We would also like the cache to support caching field values: there are many cases where a value is one of a small set of possibilities (think of an enumeration of states), and caching those values would also be beneficial. Lastly, the naive cache still places extra pressure on the garbage collector, since we have to create the String just to have it intern()'d.

pre-2.0: The Static Cache

In the pre-2.0 version of the driver there were a couple of places where the driver used a static cache mapping between Strings and their encoded bytes. The cache was statically loaded with entries for all of the ASCII characters a-z, A-Z and 0-9.

private static final Map<Byte, String> CACHE = new HashMap<Byte, String>();

public static String cacheLookup(byte[] fieldNameBytes, 
                                 int offset, int length) {
    if( length == 0 ) {
        return "";
    } else if ( length == 1 ) {
        byte value = fieldNameBytes[offset];
        return CACHE.get(Byte.valueOf(value));
    }
    return null;
}
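The static loading might look something like this (a sketch; the driver's actual initialization may differ):

static {
    // Pre-load an entry for each single-character ASCII string.
    String chars = "abcdefghijklmnopqrstuvwxyz"
                 + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 + "0123456789";
    for (int i = 0; i < chars.length(); ++i) {
        char c = chars.charAt(i);
        CACHE.put(Byte.valueOf((byte) c), String.valueOf(c));
    }
}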

This is an easy cache to implement and validate. The problem is that the cache was only effective for applications that used nothing but one-character field names and values. One of the guiding principles on our team is that software will be read many more times than it is ever written or modified, so we should always attempt to increase clarity whenever possible. Using one-character field names runs completely counter to that principle.

2.0.0: The Linked Trie

For the 2.0.0 release we wanted to do better, so we set out to design a better cache. The end result was a cross between a trie and a LinkedHashMap.

private int myMaxCacheLength;
private TrieNode myRoot;

public String cacheLookup(byte[] fieldNameBytes, 
                          int offset, int length) {
    if( length == 0 ) {
        return "";
    } else if ( length < myMaxCacheLength ) {
        TrieNode current = myRoot;

        for( int i = 0; (current != null) && i < length; ++i ) {
            current = current.child( fieldNameBytes[ offset + i ] );
            if( current != null ) {
                current.used(); // Remove and add to front 
                                // of the LRU list.
            }
        }

        if( current != null ) {
            return current.value();
        }
    }
    return null;
}

private class TrieNode {
    // Lots of pointers
}

The basic data structure was a trie built from the byte sequence of each string. To track which nodes in the trie are oldest we used the same model as the LinkedHashMap: a doubly linked list threaded through all of the nodes in the trie. As we access a node we remove it from its current position in the list and add it to the front. When there are too many entries in the trie we simply remove the entry at the end of the list, since it is the LRU (Least Recently Used) entry.
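To make the "lots of pointers" concrete, a node might look roughly like this (a sketch with assumed names, including a hypothetical myLruHead field on the enclosing cache; the driver's actual class differs):

// Hypothetical: the head of the LRU list, owned by the enclosing cache.
private TrieNode myLruHead;

private class TrieNode {
    // Trie pointers: the byte this node matches, the cached String if
    // this node completes one, and the node's children.
    private final byte myByte;
    private String myValue;
    private TrieNode[] myChildren = new TrieNode[0];

    // LRU pointers: a doubly linked list threaded through every node.
    private TrieNode myPrevious;
    private TrieNode myNext;

    public TrieNode(byte b) {
        myByte = b;
    }

    public TrieNode child(byte b) {
        for (TrieNode child : myChildren) {
            if (child.myByte == b) {
                return child;
            }
        }
        return null;
    }

    public String value() {
        return myValue;
    }

    public void used() {
        if (myLruHead == this) {
            return; // Already at the front.
        }
        // Unlink from the current position in the LRU list...
        if (myPrevious != null) {
            myPrevious.myNext = myNext;
        }
        if (myNext != null) {
            myNext.myPrevious = myPrevious;
        }
        // ...and relink at the front. Several pointers must change
        // together, which is why a lock is required.
        myPrevious = null;
        myNext = myLruHead;
        if (myLruHead != null) {
            myLruHead.myPrevious = this;
        }
        myLruHead = this;
    }
}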

This is obviously a more complicated data structure and cache to maintain. The property we liked the most was that the cache would automatically learn which strings should be cached as it was used. If the application's usage of documents shifted over time, so would the cached set of strings. Due to the number of pointers we have to maintain, however, it is impossible to keep everything consistent without holding a lock, and if we had made the cache concurrent with a single lock we would have put a huge bottleneck in the center of the driver. That was not acceptable. The tradeoff we made instead was to have each of the connections maintain its own cache. That allowed the caches to avoid locks but still gave us automatic caching of strings.

2.0.0 Release!

The 2.0.0 release went out the door, and pretty quickly the reports started coming in about excess memory use, supported by heap stats showing a large number of StringDecoder$Node (the trie node class) objects and arrays. While the number of strings within the applications was better, we had just shifted the memory pressure to a new class. When we dug deeper into the applications' configurations we discovered they had been using the driver with hundreds and, in one case, 1,000 connections. We had designed the driver to keep the connection count low, but apparently that message got lost. We worked with some users to reduce their connection counts, but it was pretty obvious we needed to do something to fix the cache. Things had gotten worse, not better.

Taking a Step Back

The new cache was actually working as designed: the String object counts in the applications were way down. The issue was that the overhead of deciding which strings to keep in the cache prevented the applications from leveraging a single cache (per MongoClient). To enable that, we needed to make the cache concurrent.

2.0.1: The Concurrent Learning Cache

The strategy for making the cache concurrent was actually fairly simple. We split the process of learning which entries to cache from the process of doing lookups in the cache. Once the cache has decided on the entries to cache, it can create a read-only (i.e., no locking required) data structure and replace the existing cache with the new data structure using the wasp-nest or RCU (Read-Copy-Update) pattern.

private int myMaxCacheLength;
private volatile TrieNode myRoot; // Replaced wholesale when rebuilt.

public String cacheLookup(byte[] fieldNameBytes, 
                          int offset, int length) {
    if( length == 0 ) {
        return "";
    } else if ( length < myMaxCacheLength ) {
        TrieNode current = myRoot; // Read the volatile root once per lookup.

        for( int i = 0; (current != null) && i < length; ++i ) {
            current = current.child( fieldNameBytes[ offset + i ] );
        }

        if( current != null ) {
            return current.value();
        }
    }
    return null;
}

private class TrieNode {
    // Pointers to children only!
}
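Because myRoot is volatile and the trie is never modified once published, the replacement is a single write (a sketch; replaceCache is an assumed name):

private void replaceCache(TrieNode newRoot) {
    // The RCU "swap": one volatile write publishes the fully built
    // trie to all reader threads. Readers still walking the old trie
    // finish safely, and the old structure is reclaimed by the GC.
    myRoot = newRoot;
}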

By breaking the cache logic in two we can use the best data structure for each half. We stick with the trie for the read-only cache. For building up the state we use a simple ConcurrentHashMap<String, SeenString>. Each SeenString contains the cached String, its encoded byte[], and a count field. We use an AtomicIntegerFieldUpdater to increment the count field without holding a lock.
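A sketch of what that bookkeeping might look like (class and field names assumed; the field updater requires the count to be a volatile int):

private final ConcurrentHashMap<String, SeenString> mySeen =
        new ConcurrentHashMap<String, SeenString>();

private static class SeenString {
    // A shared, static updater avoids allocating a separate
    // AtomicInteger per entry -- fewer objects is the whole point.
    private static final AtomicIntegerFieldUpdater<SeenString> COUNT_UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(SeenString.class, "myCount");

    private final String myValue;
    private final byte[] myBytes;

    // Must be a volatile int for the field updater to work.
    private volatile int myCount;

    public SeenString(String value, byte[] bytes) {
        myValue = value;
        myBytes = bytes;
    }

    public void used() {
        // Lock-free increment of the use count.
        COUNT_UPDATER.incrementAndGet(this);
    }
}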

The cache trims entries that are not accumulating enough counts after every N uses of any string, where N is currently the same as the maximum number of cached strings. Every 100 x N uses of any string triggers the rebuild and replacement of the look-up cache.
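Putting those thresholds together, the per-use bookkeeping might look like this (a sketch; myMaxCachedEntries, trimRareEntries() and rebuildLookupTrie() are assumed names):

private final AtomicLong myTotalUses = new AtomicLong();
private int myMaxCachedEntries; // The maximum number of cached strings.

private void stringUsed(SeenString seen) {
    seen.used();

    long uses = myTotalUses.incrementAndGet();
    if ((uses % (100L * myMaxCachedEntries)) == 0) {
        // Hypothetical: build a fresh read-only trie from the
        // highest-count entries and publish it via replaceCache(...).
        rebuildLookupTrie();
    } else if ((uses % myMaxCachedEntries) == 0) {
        // Hypothetical: drop entries that are not accumulating counts.
        trimRareEntries();
    }
}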

The new cache takes longer to respond to changes in the strings the application is reading, but it is now thread safe and we can use a single cache across all of the driver's connections. We reused the same configuration fields for the new cache, but since there is only one instance of the cache we needed to make other changes so the configuration could be observed, which allows you to dynamically change the number of cached entries and other settings at runtime. Lastly, we have spent most of this post talking about caching the decoded strings read by the driver; we have applied similar logic to cache and speed up the encoding of the strings written by the driver.