Sunday, May 23, 2010

Mnesia - one year later (part 3)

This is the third part of my series of posts about Mnesia. From this third post on, I'll probably start to include some information that you can find elsewhere without too much trouble. After all, it's good to have them arranged together. I'm also talking about stuff that might be a little bit obvious for some people, like this next one...

Automate table management
As you certainly know, once tables are scaled horizontally, you will need several administration procedures that a traditional RDBMS won't. If your pool of servers deals with no more than some tens of fragments for each table, maybe you can still manage the data distribution by directly calling functions like mnesia:move_table_copy/3mnesia:del_table_copy/2, mnesia:add_table_copy/3, mnesia:change_table_frag/2, etc. But, once you arrive at the hundreds of fragments, it becomes hard to build the desired topology for you pool manually, especially if you have at least three servers and expect the pool to grow eventually.

I can't tell you how to automate the management of the tables' fragments in general (and Mnesia also can't do it for you), because that's totally dependent on how you need your pool to be built, considering characteristics like:
  • How many copies does each table need?
  • What's the span of each table across the pool?
  • Do you need any colocation for keys on different tables?
Once you decide what kinds of replication and partitioning you need for each of your tables, the functions above, together with the tools I described on my last post, will be almost everything you will use to administer the fragments automatically.

The number of fragments should be just "big enough"
In this section, I will only talk about disc_only_copies tables, since I haven't experimented intensely with disc_copies. Nevertheless, some of my comments may also apply to other kinds of tables (e.g. you obviously can run performance tests with disc_copies tables too).

The important (and somewhat obvious) message that shouldn't be forgotten is: with dets (Mnesia's disc_only_copies), you lose performance when you raise the number of fragments, even if the total amount of data doesn't change. If you create too many fragments, more resources (more disks, for example) will be required by Mnesia to keep the throughput. So, if you think you will turn the dets limits into a non-issue just by "creating a million fragments and never thinking about it again", you're wrong.

As you may know, with disc_only_copies, you have a 2 GB limit on each fragment, so, at the very minimum, you should have enough fragments to keep each one with less than 2 GB. But people will also tell you that the throughput on a fragment will fall drastically way before the fragment's size approaches 2 GB. Thus, you need to create enough fragments to keep the size of each one below a given "performance threshold". Fine, but what's that threshold?

I will only harm you if I try to teach you numbers, with respect to good sizes for dets tables, since you can find your numbers by yourself and you should do that. Just open an Erlang shell, create a dets table (dets:open_file/2), insert some of your data in it, perform insert, read, and delete operations, insert more data, perform more operations and so on, until you can draw a good performance curve that applies to the kind of data you're dealing with.

In short, these are my general advices:
  • Test dets with your data, considering each of the main tables you will need. This is the way to determine the maximum size (or maximum number of records) you can have in each fragment of each table without losing too much performance. Don't forget to test all kinds of operations (read, write, and delete), with an emphasis on what you need to do more often.
  • Create enough fragments to keep you comfortable (with a high probability) for a long time.
  • Don't forget 2 GB is the breaking point. Never allow any disc_only_copies fragment to reach that limit!
  • If your fragments grow too much, add more fragments. Adding one fragment is not so expensive, since Mnesia uses linear_hashing, and you don't need to stop Mnesia to do that. However, with linear hashing, you will need to double the number of fragments to make every one of them shrink (you are already using a power of 2, right?  ;-)). Therefore, you must start adding fragments as soon as possible.
Be careful with the transaction log
This is just a reference. The main idea here is: just because Mnesia is giving you the required throughput for write operations, it doesn't mean it's not overloaded. Follow this link to learn more about this issue. Some searches will also point you to other sources of information.

Mnesia copies entire fragments from the replicas when it starts
If you have cloned tables (or table fragments) and, for example, you restart your node cool_node@interwebs, it will copy each modified table (considering each fragment as a table, of course) entirely from some other node to cool_node@interwebs. That fact implies that, if you have a node down for a long time and then you restart it, you might see a huge amount of data being read in remote nodes and a big raise in network traffic, until the new node is fully consistent again.

There's not very much you can do to make this "synchronization" procedure better. I suggest you play with two Mnesia options that may make this issue less painful: no_table_loaders and send_compressed. The first one regulates the maximum "rate" for the copies and the second one activates compression (thus reducing network traffic and raising CPU utilization). You can find more information about them in the documentation for the mnesia module.

To be continued (I guess)...


  1. This is a great series, looking forward to future articles.

  2. The ideas you share helped me very much, waiting for the new info.. Thank You

  3. How you repare tables from network partitions?

  4. We adopted a reasonably simple approach that chooses strong consistency over availability. I guess it would be much harder to implement eventual consistency over Mnesia, but I agree it would be the best option, for most applications.