Friday, May 21, 2010

Mnesia - one year later (part 2)

This is my second post about the not-so-well-known relevant facts I have learned by working on a project based on Mnesia. Check my last post before reading this one. Let's go...

Do not use mnesia_frag to list records
Suppose your foo table, properly fragmented and distributed across dozens of servers, contains a secondary index on its second field. I guess you will eventually want to list all the records that have some bar value on that indexed field, since that's the reason for creating indices (;-)). After checking some Mnesia tutorial, you realize it is ridiculously easy to make that query. Here is an example:
mnesia:activity(..., fun qlc:e/1, [qlc:q([K || #foo{key = K, field2 = bar, _ = '_'} <- mnesia:table(foo)])], mnesia_frag).

The operation above works well. But what happens if one of the fragments of foo is unavailable (e.g. because one of your servers is down)? What Mnesia does is what a generic DBMS should do: it throws an exception. However, depending on your application's domain, there's a good chance that what you want is to retrieve all of the available records (in other words: you may prefer a partial list than no data at all). This issue was pointed out by one of my colleagues (Ivan) a long time ago.

Mnesia gives you all the necessary tools to list data from a fragmented table in different ways:
  • You can check the fragmentation properties of the foo table (including its number of fragments) with mnesia:table_info(foo, frag_properties).
  • You can access fragments individually (they are just tables) by appending a suffix to the table's name. Fragment 1 is called foo, fragment 2 is called foo_frag2, fragment 10 is called foo_frag10 and so on.
  • You can locate the nodes that hold, for instance, foo_frag14 by calling mnesia:table_info(foo_frag14, where_to_write).
  • You can easily make remote procedure calls to other nodes (it's Erlang...).
Just put all of the above tools together to make the best solution for your case.

There is no sorted iteration of fragmented tables
This is very important to keep in mind. If your tables don't fit in memory and, as a consequence, you are using disc_only_copies, you don't have sorted data inside each fragment (since dets tables aren't sorted) and you also don't have sorted data between fragments (remember? It's linear hashing). Those facts imply that, if you have 100 million records in your fragmented table and you just need to retrieve the smallest key, you absolutely need to read 100 million entries from storage. Of course, in this example, you can keep the desired result elsewhere and even try to keep it updated. But, if you need to find an arbitrary interval of keys considering the table in ascending order, the problem starts to get hard.

If your full table fits in RAM and its fragments are disc_copies or ram_copies, you are in better shape: those storage types support the ordered_set table type. But you still don't have sorting between the fragments (linear hashing again), what makes it necessary to read something from every fragment whenever you need to perform an operation that depends on global ordering.

If you need globally sorted keys (and your fragmented table fits in RAM), maybe you can try to implement another hashing module (according to the mnesia_frag_hash behaviour) to keep global ordering. However, you need to be careful with the distribution of data between fragments (predict your keys, indeed) and perhaps consider what the performance will be when you raise or decrease the number of fragments (how much data will need to be moved between fragments?). There's no way to tell Mnesia to "split" some fragment that has grown too much (like Bigtable does with its tablets), so keeping global ordering is not easy.

Adopt the key-value approach
Mnesia records are Erlang records. By default, you have all of the comfort of the Erlang record syntax at your disposal when writing code for Mnesia. When you create a user table with fields nickname, email, and name, its entries will be tuples like this: {user, "igorrs", "igorrs@example.org", "Igor"}. When working with that table, you'll be able to use an Erlang record called user to write very clean code. For instance, you can retrieve the name field of the record whose key is Nickname with this call:
[#user{name = Name}] = mnesia:dirty_read({user, Nickname}).

So, now, you have the beautiful code above running in your big pool of servers and someone asks you to add the gender field to the user table. Easy: that's what the function mnesia:transform_table/3 was made for. You just have to run it and... watch your code break. But wait! You can also update the record definition in your code, so that it doesn't break! You surely can do that, once all of the records on your big table with millions of entries are updated. And please don't run any code (old or new) while the table is being transformed.

Conclusion: basically, if you want to use that convenient Erlang record syntax, you have to agree to stop your whole system every time you need to add a field to a table, what's a very big price to pay. If you want to avoid it, you may try to write an ugly style of code that expects the shape of the records to change. I've never used this technique, but it should work.

My favorite way of dealing with the "column-addition" problem, however, is to use a fixed record with only two fields: one for the key and one for the value, which is an Erlang term (which I like to treat as a property list). With this strategy, the record fields for the table will never change. In the above example, you would probably end up with entries like this: {user, "igorrs", [{"e-mail", "igorrs@example.org"}, {"name", "Igor"}]}. You can handle the value field very naturally with the lists:key* family of functions, for example.

There are a couple of disadvantages in the key-value approach I've just described: it wastes some space (to identify the "columns") and it complicates the creation of Mnesia secondary indices (you would have to create another record field and deal with the related problems). In exchange, you get flexible code that supports transformations to the values.

Whatever approach you decide to adopt, just don't forget that you may always need to add columns. Hence, you should write all of your code with that plan in mind.

To be continued...

No comments:

Post a Comment