Thursday, November 19, 2009

Consistent hashing for Mnesia fragments - part 2

Using the consistent hashing scheme I described in my last post, I created a table with 400 fragments and did a bunch of writes to it. Although data distribution was very good in this test, performance was really crappy, so I suspected Mnesia was doing some copying around of the hash_state (which contains the tree I use for consistent hashing).

As Ulf Wiger commented here, Mnesia is actually storing the structure in an ets table and looking it up at every operation, which results in the whole tree being copied even for a single read operation!

Not surprisingly, if I decrease the number of entries per fragment in the consistent hashing tree to 10, performance becomes good, but then data distribution becomes worse than with the default mnesia_frag_hash.

Lesson learned: if you implement your own hashing module for Mnesia, be careful with the size of the hash_state, as this record is fully copied at every operation!

But let's not give up consistent hashing yet: an idea that comes naturally is to store the entries in a mutable structure, like ets, and then have a "pointer" to that structure in the hash_state. To avoid losing the fragmentation info, I'm going to create a Mnesia disc_copies table and make raw accesses to it using ets, as suggested by Ulf Wiger.
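To give an idea of what the lookup becomes, here is a sketch (not the module's actual code; I'm assuming the auxiliary table is an ordered_set whose keys are {HashValue, Fragment} tuples, and reading it directly through ets works because a disc_copies table keeps its ram copy in an ets table of the same name):

key_to_frag_number(ChashTab, Key) ->
    H = erlang:phash2(Key),
    %% On an ordered_set, ets:next/2 returns the smallest key greater than
    %% the one given, which is the "walk clockwise around the circle" step.
    case ets:next(ChashTab, {H, 0}) of
        '$end_of_table' ->
            %% Past the last entry: wrap around to the first one.
            {_Hash, Frag} = ets:first(ChashTab),
            Frag;
        {_Hash, Frag} ->
            Frag
    end.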

Things are a little bit less automatic now. I have opted to leave the creation of the consistent hashing table outside of mnesia_frag_chash. For example: if you are going to create a table called person, you first need to create a person_chash table, with the following characteristics:
- NOT fragmented.
- Type: ordered_set.
- Storage: disc_copies.
- Attributes: 2, including the key (minimum required by Mnesia - I don't use the second attribute).
- Replicated in every node that is going to access the person table (for performance reasons).

Here is the code for that example, using one node and 100 fragments:
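(The snippet below is a sketch of that setup rather than the exact code: it just combines the options listed above with Mnesia's documented frag_properties; adapt the node list to your own cluster.)

create_person_tables() ->
    Node = node(),
    %% Auxiliary table read by mnesia_frag_chash through raw ets access:
    %% not fragmented, ordered_set, disc_copies, two attributes (the second
    %% one is unused), replicated on every node that accesses person.
    {atomic, ok} = mnesia:create_table(person_chash,
        [{type, ordered_set},
         {disc_copies, [Node]},
         {attributes, [key, unused]}]),
    %% The fragmented table itself, hashed by the custom module.
    {atomic, ok} = mnesia:create_table(person,
        [{attributes, [id, name]},
         {frag_properties, [{n_fragments, 100},
                            {node_pool, [Node]},
                            {hash_module, mnesia_frag_chash}]}]).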


Not so difficult. If you are thinking of using this on your next system, I can tell you that the performance of "normal" write operations (writing/deleting records) will not be affected and that reading records will be a little bit slower (you are reading from an additional ets table before you get your data). Adding and deleting fragments, on the other hand, will take more time. How much more? I could guess some number (say, 100% more time), but that depends on so many factors that I strongly encourage you to do your own tests, in your own scenario.

As for distribution of records among fragments, you will get better results, really. Check this distribution of records after inserting 10 million random short strings using mnesia_frag_chash:
[215275,196101,211857,176795,189177,190949,210562,173166,155642,199126,
174730,201721,185447,224594,176798,165405,192294,206536,210361,201988,
168968,203098,255504,201828,162979,257371,211712,205513,200564,229616,
211546,218419,179620,210396,234415,202573,192363,198798,171920,181552,
204337,237357,215541,199831,210681,203614,195590,191930,200881,182959].
Average: 200,000.
Variance: 452,627,581 (standard deviation: 21,275).
Biggest fragment: 257,371.

And then the same 10 million records using mnesia_frag_hash:
[155936,155724,155470,155973,156038,155962,156407,156726,156739,156873,
156362,156388,156306,156502,156278,156292,155672,156087,312118,313014,
313462,312910,311754,311992,312984,311970,311902,312021,312190,312429,
312329,313125,156414,155736,156920,156362,156708,155696,156881,156233,
156171,156341,156353,156373,156292,156460,156158,156548,156001,156418].
Average: 200,000.
Variance: 4,917,043,219 (standard deviation: 70,122).
Biggest fragment: 313,462.

Source code is just mnesia_frag_chash.erl and you can find more useful stuff here.

If you want to run your tests with your own data, send me a private message: maybe I have some useful functions for you.

Monday, November 16, 2009

Consistent hashing for Mnesia fragments

If you work with Mnesia and use the mnesia_frag module for partitioning tables, you may know that Mnesia's default hashing module deals very well with the situation in which you need to add one more fragment to your table. Actually, the rehashing scheme has two important properties:
  1. No data will need to be moved to any fragment that is not the new one.
  2. Only the keys in one of the old fragments have to be checked (and possibly moved to the new fragment).

But there are also two characteristics that you might not want for your application:
  1. Data is not very well distributed among the table fragments.
  2. If you are using disc_only_copies and one of your fragments is reaching the 2 GB limit, you may need to add a lot of fragments to make it shrink (this is, in a way, a consequence of advantage 2, above).
As Mnesia allows you to change the hashing scheme without having to patch OTP's code, I believe there's an easy way to overcome the two disadvantages above and keep advantage 1 (though losing part of advantage 2), by implementing a form of consistent hashing for distribution of data between table fragments.
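For reference, all a custom hashing module has to do is implement the mnesia_frag_hash callback behaviour; a skeleton looks roughly like this (the module name and the bodies are placeholders, not my implementation):

%% Skeleton of the mnesia_frag_hash callback behaviour; bodies are
%% placeholders only.
-module(my_frag_hash).
-behaviour(mnesia_frag_hash).

-export([init_state/2, add_frag/1, del_frag/1,
         key_to_frag_number/2, match_spec_to_frag_numbers/2]).

%% Called when the table is created; returns the initial hash_state.
init_state(_Tab, State) -> State.

%% Both return {NewState, IterFrags, AdditionalLockFrags}: the fragments
%% whose keys must be iterated over and the extra fragments to lock.
add_frag(State) -> {State, [], []}.
del_frag(State) -> {State, [], []}.

%% Map a key to the number of the fragment that stores it.
key_to_frag_number(_State, _Key) -> 1.

%% Which fragments a match spec may need to touch.
match_spec_to_frag_numbers(_State, _MatchSpec) -> [1].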

I have implemented that in a module called mnesia_frag_chash, along with a modified version of Erlang's gb_sets module. The key piece of functionality is the geq_iterator that Richard O'Keefe provided in this post. I named the modified module ok_gb_sets.

I'm creating 100 entries for each fragment, calculating a hash value for each entry and then storing each one in the circular hash table used for consistent hashing (which is actually a tree - an ok_gb_sets). To find the fragment for a specific key, all I have to do is calculate a hash value H for the key, find the first entry whose hash value is greater than or equal to H and then pick that entry's fragment (the actual implementation is just a little bit more complicated, as we have to avoid collisions):
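In sketch form, assuming the tree holds {HashValue, Fragment} entries and that ok_gb_sets keeps the usual gb_sets API (next/1, smallest/1) in addition to the geq_iterator, the lookup boils down to this:

%% Simplified sketch; collision handling is omitted and the exact
%% geq_iterator argument order may differ in ok_gb_sets.
key_to_frag_number(Tree, Key) ->
    H = erlang:phash2(Key),
    %% Position an iterator at the first entry whose hash value is >= H ...
    case ok_gb_sets:next(ok_gb_sets:geq_iterator({H, 0}, Tree)) of
        {{_Hash, Frag}, _Iter} ->
            Frag;
        none ->
            %% ... or, past the last entry, wrap around to the smallest one.
            {_Hash, Frag} = ok_gb_sets:smallest(Tree),
            Frag
    end.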



You could use a smaller number of entries per fragment: that would make the tree smaller and you would need to check a smaller number of fragments when adding a new one. On the other hand, creating more entries per fragment would provide even better distribution of data between fragments.

The hash state is defined as follows, where chash_table is the ok_gb_sets and n_frag_pieces is the number of entries created in the set for each fragment:
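A sketch of that record (the real definition in the source may carry a couple of extra fields, such as the current number of fragments):

%% Sketch reconstructed from the description above.
-record(hash_state,
        {chash_table,            %% the ok_gb_sets holding the circle's entries
         n_frag_pieces = 100}).  %% entries created in the set per fragment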



My main concern is about the size of chash_table. I need to understand Mnesia's code better to be sure that the use of this hash_state will not cause performance problems.

Another problem introduced is that, when adding a new fragment, the keys in several fragments (100 in the worst case, but that can be tuned, as I said above) will have to be scanned for Mnesia to find out which ones need to be moved to the new fragment. Although the total amount of data to be moved will still be small, the number of fragments locked and the number of keys verified will be considerably bigger, so beware.

If my first concern turns out to be unimportant and the second one is not a big problem for you, you should consider using this mnesia_frag_chash, as it solves the two problems I discussed at the beginning of this post (I have run some simple experiments and it seems to do a good job of distributing the data).

Full source code (just two files) is here.