Thursday, May 20, 2010

Mnesia - one year later

I've been working for a little more than a year now on a private back-end project at my company, making heavy use of Erlang and Mnesia to build a distributed system that stores high volumes of data. With the system in production for some months and working really well since its first deployment, I decided it was a good moment to write down the most important lessons I have learned about Mnesia, especially those that are not explicitly stated anywhere else.

Before I start, I'd like to repeat what many people know nowadays: Erlang is an absolutely marvelous language. I'm extremely grateful to Ericsson and the whole Erlang community, having no relevant complaints with respect to Erlang and the OTP (except for logging, but you can easily find long discussions on the subject, so I'm not talking about it here). I won't try to give you any advice with respect to the language itself, since there's already great material available here and there to lead you to a successful experience with Erlang.

Mnesia is also a great tool for developing distributed, scalable storage systems. It gives you data partitioning, replication, ACID transactions and many more important characteristics that just work and - you can bet on it - are stable. I'm sure you can find - as I did - lots of good advice on using Mnesia, so I'll try not to repeat too much of what other people have already said. Obviously, I'm also not giving a tutorial on the basics of Mnesia here: I assume you are going to study elsewhere in case you decide to actually use Mnesia. There are good quick overviews of table fragmentation (partitioning) out there, for example.

One last warning: some of what I say below may be wrong (you know... I didn't develop Mnesia) or may simply not apply to every Mnesia deployment, since my experience is mostly based on a system with the following characteristics:
  • Distributed, running on several servers.
  • Employing fragmented tables, distributed across several servers.
  • Employing cloned (replicated) fragments.
  • Holding high volumes of data that don't fit in RAM.
  • Serving a large number of read requests per second (and also some write requests).
Here we go...

Mnesia's table fragmentation uses linear hashing
You decided to create a fragmented table and distribute the data among 3 servers, so that about 1/3 of the data goes to each server. You then created the table with 6 fragments and placed 2 fragments on each of your servers, which gives you uniformly distributed data, right? No!

This is something I could only find in the source code (and by "source code" I don't mean "comments" ;-)): Mnesia's default hashing module (mnesia_frag_hash) uses linear hashing to distribute keys among fragments. That hashing scheme does a perfect job of making the number of fragments cheap to grow and shrink, but it doesn't distribute data well if the number of fragments is not a power of 2.

In the example above, with 6 fragments, you are very likely to get a data distribution that resembles these quantities of keys per fragment: (X, X, 2X, 2X, X, X). If you distribute those 6 fragments among 3 servers naively, you may end up with one server holding 100% more data than each of the others!
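
To see why, here is a toy sketch of the linear hashing idea - this is not Mnesia's actual mnesia_frag_hash code, and the module and function names are made up. Keys are hashed into the next power of 2 above the fragment count, and keys that land on a fragment that doesn't exist yet fall back to the previous power of 2:

  -module(linear_hash_demo).
  -export([bucket/2]).

  %% Map Key to one of N buckets, linear-hashing style: hash into the next
  %% power of 2 and fall back to the previous power of 2 when the target
  %% bucket doesn't exist yet. Buckets that haven't been split yet end up
  %% covering twice as much of the hash space as the ones that have.
  bucket(Key, N) ->
      Next = next_pow2(N),
      H = erlang:phash2(Key) rem Next,
      case H < N of
          true  -> H;
          false -> H rem (Next div 2)
      end.

  next_pow2(N) -> next_pow2(N, 1).
  next_pow2(N, P) when P >= N -> P;
  next_pow2(N, P) -> next_pow2(N, P * 2).

With N = 6, keys hashing to the two "missing" buckets (6 and 7) fall back onto buckets 2 and 3, which is exactly where the (X, X, 2X, 2X, X, X) shape comes from. On a live table you can check the real distribution with the frag_size item, for example mnesia:activity(async_dirty, fun() -> mnesia:table_info(foo, frag_size) end, [], mnesia_frag).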

The best solution is probably to create a larger number of fragments that is a power of 2 and then distribute those fragments as evenly as possible. If the math still doesn't work for your pool, you can consider implementing another hashing module (Mnesia supports that), possibly based on consistent hashing.
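
As a sketch of the first option - the table, record fields and node names here are made up - creating a table with 8 fragments spread over a pool of 3 nodes, with every fragment replicated on 2 of them, looks roughly like this:

  NodePool = ['a@host1', 'b@host2', 'c@host3'],
  mnesia:create_table(foo,
      [{attributes, [key, value]},
       {frag_properties,
        [{n_fragments, 8},             %% power of 2 -> even key distribution
         {node_pool, NodePool},
         {n_disc_copies, 2}            %% each fragment stored on 2 nodes
         %% {hash_module, my_frag_hash} %% optional: plug in your own hashing
        ]}]).

The commented-out hash_module property is where a custom hashing module (implementing the mnesia_frag_hash callbacks) would go if you decide to roll your own scheme.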

Load-balancing read operations: do it yourself
You have created a foo table (or table fragment, for that matter), with copies on servers bar and baz, and you are reading that table's records from a number of different Erlang nodes in the same Mnesia pool. You read data by calling mnesia:activity(..., ..., ..., mnesia_frag), so that Mnesia transparently finds the data on one of the servers that hold it.
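
For reference, a minimal read through the fragmentation access module looks something like this (the table name and key are, of course, just placeholders):

  read_foo(Key) ->
      mnesia:activity(transaction,
                      fun() -> mnesia:read(foo, Key) end,
                      [],
                      mnesia_frag).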

The example above is totally functional, but Mnesia won't load-balance those read operations: given a steady Mnesia pool, each node will always read the foo table from the same server (always bar or always baz). I've seen cases where all of the servers would read the foo table from the bar node (except baz, since local tables are always read locally, it seems). You can check where node bam is reading the foo table from by calling mnesia:table_info(foo, where_to_read) on node bam.
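
If you want to see the whole picture for a fragmented table, the frag_names item (available through mnesia_frag) gives you all fragment names, and each fragment is a regular table you can ask about. Something along these lines - the table name is again a placeholder:

  Frags = mnesia:activity(async_dirty,
                          fun() -> mnesia:table_info(foo, frag_names) end,
                          [], mnesia_frag),
  [{Frag, mnesia:table_info(Frag, where_to_read)} || Frag <- Frags].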

If you need the read operations to be well distributed among all clones of each table or fragment, you need to explicitly set the where_to_read property on each node of the Mnesia pool, for each remote table it needs to access. I like to do this by selecting a random node from the list of hosting nodes (and repeating that process often, on each node, for each remote table), but you may want to choose some other strategy (for example: you might have 3 clones of a table and decide to have the whole pool always read from the same 2 clones - who knows?). The important point here is that Mnesia will certainly not do this for you automatically.
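
Here is a minimal sketch of that idea, relying on the undocumented mnesia_lib:set/2 call discussed in the comments below (the function name is made up; the random-node strategy is the one described above): for each fragment, pick a random node among those currently holding an active copy and point where_to_read at it.

  %% Point where_to_read for every fragment of Tab at a randomly chosen
  %% node that holds an active copy of that fragment. mnesia_lib:set/2 is
  %% undocumented - see the warnings in the comments below.
  balance_reads(Tab) ->
      Frags = mnesia:activity(async_dirty,
                              fun() -> mnesia:table_info(Tab, frag_names) end,
                              [], mnesia_frag),
      lists:foreach(
        fun(Frag) ->
                case mnesia:table_info(Frag, where_to_write) of
                    []    -> ok;  %% no active copy right now, leave it alone
                    Nodes ->
                        Node = lists:nth(rand:uniform(length(Nodes)), Nodes),
                        mnesia_lib:set({Frag, where_to_read}, Node)
                end
        end, Frags).

You would run something like this periodically on every node of the pool and, as noted in the comments below, avoid running it while schema changes (such as mnesia:move_table_copy/3) are in progress.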

To be continued...

8 comments:

  1. Looking forward to your next article.

  2. > Looking forward to your next article

    Same here. I'm a total newbie to erlang - teach me yoda :-).

  3. Igor,
    How does the management of your cluster look? Do you perform any DBA tasks, and if so, what do you need to do? What about backups? Is it even necessary to back up, since you have redundancy in the replication?

    Thanks,
    Marcel

  4. Marcel, my personal opinion is that backups are an "extra" that should be used mostly to protect against software bugs.
    Against other kinds of mishaps, I prefer to rely on redundancy, since backups need recovery procedures and put the system back in an old state.
    However, backups are cheaper.

  5. Thanks for the reply! That makes sense.
    Another question: data segregation.
    Imagine that I have a multi-tenant app and I need to segregate their data. Imagine I have a "Products" table and my "tenants/clients" are retail stores. Would it be possible to give each client a fragment of the "Products" table? So, all Acme products always get written to the Acme fragments, of which there might be 3, to ensure redundancy? I guess the real question is: out of say 100 fragments, can I write data to only 3 of them? I thought that the hashing algorithm might ensure that you read or write to and from those tables consistently?

    What are your thoughts? Would this be using fragments in an unintended way?

    Regards,
    Marcel

  6. Hi, Marcel.
    With Mnesia's default hashing, I don't think you can segregate data that way (with a different subset of fragments for each client). Maybe it's possible by implementing your own hashing scheme and having the client included in the key, but, if you don't know the number of clients in advance, it gets more complicated: maybe you'll need a table to hold the list of clients and, based on that, you would add fragments to the table that actually holds the data.
    However, going for that solution is no better than creating a new (fragmented) table each time you have a new client. The latter scheme should also make it easier to store each client's data on separate storage (which is the point of segregating data, I guess).

  7. You mentioned setting the where_to_read node manually. However, I couldn't find anything in the docs about this. How do you do it?

    Replies
    1. I don't think there is a documented way to do that, but it can be done by using the mnesia_lib module directly:

      mnesia_lib:set({Table, where_to_read}, Node).
      Where Table can be an individual fragment.

      Of course, you should do this with care. For example: I suspect that performing it concurrently with mnesia:move_table_copy/3 may cause fragment copies to vanish!

      Today, I have processes calling mnesia_lib:set/2 frequently in production, but I make sure this is never done if there are any schema changes going on.
