Thursday, November 19, 2009

Consistent hashing for Mnesia fragments - part 2

Using the consistent hashing scheme I described in my last post, I created a table with 400 fragments and ran a bunch of writes against it. Data distribution was very good in this test, but performance was really crappy, so I suspected Mnesia was doing some copying around of the hash_state (which contains the tree I use for consistent hashing).

As Ulf Wiger commented here, Mnesia is actually storing the structure in an ets table and looking it up at every operation, which results in a copy of the whole tree even for a single read operation!

Not surprisingly, if I decrease the number of entries per fragment in the consistent hashing tree to 10, performance becomes good, but then data distribution becomes worse than with the default mnesia_frag_hash.

Lesson learned: if you implement your own hashing module for Mnesia, be careful with the size of the hash_state, as this record is fully copied at every operation!
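
For context, a custom hashing module plugs into Mnesia through the mnesia_frag_hash callback interface, and every callback receives the hash_state, so whatever you put in that record travels with every lookup. A minimal skeleton (with a trivial modulo scheme as a placeholder, not the consistent hashing module discussed here):

-module(simple_frag_hash).
-export([init_state/2, add_frag/1, del_frag/1,
         key_to_frag_number/2, match_spec_to_frag_numbers/2]).

%% Whatever goes in this record is copied out of Mnesia's internal ets
%% table on every single operation on the fragmented table.
-record(hash_state, {n_fragments = 1}).

init_state(_Tab, _InitialState) ->
    #hash_state{n_fragments = 1}.

add_frag(#hash_state{n_fragments = N} = State) ->
    %% Returns {NewState, IterFrags, AdditionalLockFrags}; with a plain
    %% modulo scheme every existing fragment may lose records.
    {State#hash_state{n_fragments = N + 1}, lists:seq(1, N), []}.

del_frag(#hash_state{n_fragments = N} = State) ->
    {State#hash_state{n_fragments = N - 1}, lists:seq(1, N), []}.

key_to_frag_number(#hash_state{n_fragments = N}, Key) ->
    erlang:phash2(Key, N) + 1.

match_spec_to_frag_numbers(#hash_state{n_fragments = N}, _MatchSpec) ->
    %% Without inspecting the match spec, all fragments must be searched.
    lists:seq(1, N).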

But let's not give up on consistent hashing yet: an idea that comes naturally is to store the entries in a mutable structure, like an ets table, and keep only a "pointer" to that structure in the hash_state. To avoid losing the fragmentation info, I'm going to create a Mnesia disc_copies table and access it directly through ets, as suggested by Ulf Wiger.
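
The core of the idea, as a sketch (assuming the hash_state only carries the ring table's name, ring entries keyed by {HashValue, FragmentNumber} tuples, and erlang:phash2/1 as the hash function):

-module(chash_lookup_sketch).
-export([key_to_frag_number/2]).

%% The hash_state now holds only a "pointer" (the ring table's name), so
%% it stays tiny no matter how many entries the ring contains.
-record(hash_state, {ring_tab = person_chash}).

key_to_frag_number(#hash_state{ring_tab = RingTab}, Key) ->
    H = erlang:phash2(Key),
    %% Raw ets read on the disc_copies ring table: take the first entry
    %% at or after the key's hash...
    case ets:next(RingTab, {H, 0}) of
        '$end_of_table' ->
            %% ...wrapping around to the start of the ring.
            {_Hash, Frag} = ets:first(RingTab),
            Frag;
        {_Hash, Frag} ->
            Frag
    end.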

Things are a little less automatic now: I have opted to leave the creation of the consistent hashing table outside of mnesia_frag_chash. For example, if you are going to create a table called person, you first need to create a person_chash table with the following characteristics:
- NOT fragmented.
- Type: ordered_set.
- Storage: disc_copies.
- Attributes: 2, including the key (minimum required by Mnesia - I don't use the second attribute).
- Replicated on every node that is going to access the person table (for performance reasons).

Here is the code for that example, using one node and 100 fragments:
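
In outline, something like this sketch (the person attributes are placeholders):

-module(chash_setup_sketch).
-export([create_tables/0]).

%% A sketch of the setup: first the auxiliary person_chash table (not
%% fragmented, ordered_set, disc_copies, two attributes, replicated on
%% every node that will use person), then the fragmented person table
%% itself, pointing Mnesia at the custom hash module.
create_tables() ->
    {atomic, ok} = mnesia:create_table(person_chash,
        [{type, ordered_set},
         {disc_copies, [node()]},
         {attributes, [key, value]}]),    % value is unused; Mnesia requires at least 2 attributes
    {atomic, ok} = mnesia:create_table(person,
        [{attributes, [id, name]},        % placeholder attributes
         {frag_properties,
          [{n_fragments, 100},
           {node_pool, [node()]},
           {hash_module, mnesia_frag_chash}]}]),
    ok.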


Not so difficult. If you are thinking of using this in your next system, I can tell you that performance for "normal" write operations (writing/deleting records) will not be affected, and that reading records will be a little slower (you read from an additional ets table before you get to your data). Adding and deleting fragments, on the other hand, will take more time. How much more? I could guess a number (say, 100% more time), but it depends on so many factors that I strongly encourage you to run your own tests, in your own scenario.
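
A micro-benchmark can be as simple as this sketch (it assumes the person table from the example above, with placeholder record fields; run it once with the default mnesia_frag_hash and once with mnesia_frag_chash to compare):

-module(chash_bench_sketch).
-export([avg_write_time/1, do_writes/1, add_frag_time/1]).

-record(person, {id, name}).    % placeholder record matching the sketch above

%% Average time in microseconds per write through the mnesia_frag
%% access module.
avg_write_time(N) ->
    {Micros, ok} = timer:tc(?MODULE, do_writes, [N]),
    Micros / N.

do_writes(N) ->
    lists:foreach(
      fun(I) ->
              Rec = #person{id = I, name = integer_to_list(I)},
              ok = mnesia:activity(sync_dirty, fun mnesia:write/1, [Rec], mnesia_frag)
      end,
      lists:seq(1, N)).

%% Time in microseconds to add one fragment, which is where the
%% consistent hashing module pays its extra cost.
add_frag_time(Tab) ->
    {Micros, {atomic, ok}} =
        timer:tc(mnesia, change_table_frag, [Tab, {add_frag, [node()]}]),
    Micros.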

As for the distribution of records among fragments, you really will get better results. Check this distribution of records after inserting 10 million random short strings using mnesia_frag_chash:
[215275,196101,211857,176795,189177,190949,210562,173166,155642,199126,
174730,201721,185447,224594,176798,165405,192294,206536,210361,201988,
168968,203098,255504,201828,162979,257371,211712,205513,200564,229616,
211546,218419,179620,210396,234415,202573,192363,198798,171920,181552,
204337,237357,215541,199831,210681,203614,195590,191930,200881,182959].
Average: 200,000.
Variance: 452,627,581 (standard deviation: 21,275).
Biggest fragment: 257,371.

And then the same 10 million records using mnesia_frag_hash:
[155936,155724,155470,155973,156038,155962,156407,156726,156739,156873,
156362,156388,156306,156502,156278,156292,155672,156087,312118,313014,
313462,312910,311754,311992,312984,311970,311902,312021,312190,312429,
312329,313125,156414,155736,156920,156362,156708,155696,156881,156233,
156171,156341,156353,156373,156292,156460,156158,156548,156001,156418].
Average: 200,000.
Variance: 4,917,043,219 (standard deviation: 70,122).
Biggest fragment: 313,462.
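
If you want to reproduce these figures for your own tables, the per-fragment sizes can be read straight from the fragments (Mnesia names them person, person_frag2, ..., person_fragN) and the statistics take only a few lines. A sketch (purely illustrative, not part of mnesia_frag_chash):

-module(frag_stats_sketch).
-export([frag_sizes/2, stats/1]).

%% Record counts per fragment: fragment I of table Tab is itself a table
%% named Tab_fragI (the first fragment keeps the name of the table).
frag_sizes(Tab, NFrags) ->
    Names = [Tab | [list_to_atom(atom_to_list(Tab) ++ "_frag" ++ integer_to_list(I))
                    || I <- lists:seq(2, NFrags)]],
    [mnesia:table_info(Name, size) || Name <- Names].

%% Average, variance, standard deviation and biggest fragment.
stats(Sizes) ->
    N = length(Sizes),
    Avg = lists:sum(Sizes) / N,
    Var = lists:sum([(S - Avg) * (S - Avg) || S <- Sizes]) / N,
    {Avg, Var, math:sqrt(Var), lists:max(Sizes)}.

For a table with 50 fragments, stats(frag_sizes(person, 50)) returns the average, variance, standard deviation and biggest fragment in one call.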

The source code is just mnesia_frag_chash.erl, and you can find more useful stuff here.

If you want to run these tests with your own data, send me a private message: I may have some useful functions for you.
