Sunday, May 23, 2010

Mnesia - one year later (part 3)

This is the third part of my series of posts about Mnesia. From this third post on, I'll probably start to include some information that you can find elsewhere without too much trouble. After all, it's good to have it all gathered in one place. I'm also talking about stuff that might be a little bit obvious to some people, like this next one...

Automate table management
As you certainly know, once tables are scaled horizontally, you will need several administration procedures that a traditional RDBMS won't. If your pool of servers deals with no more than a few tens of fragments for each table, maybe you can still manage the data distribution by directly calling functions like mnesia:move_table_copy/3, mnesia:del_table_copy/2, mnesia:add_table_copy/3, mnesia:change_table_frag/2, etc. But, once you arrive at the hundreds of fragments, it becomes hard to build the desired topology for your pool manually, especially if you have at least three servers and expect the pool to grow eventually.

I can't tell you how to automate the management of the tables' fragments in general (and Mnesia also can't do it for you), because that's totally dependent on how you need your pool to be built, considering characteristics like:
  • How many copies does each table need?
  • What's the span of each table across the pool?
  • Do you need any colocation for keys on different tables?
Once you decide what kinds of replication and partitioning you need for each of your tables, the functions above, together with the tools I described in my last post, will be almost everything you need to administer the fragments automatically.
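
Just to make that concrete, here is the kind of thin wrapper an automated fragment manager ends up building around those functions. The helper names below are made up, and disc_only_copies is just an assumption; the point is that fragments are ordinary tables, so the regular table-copy functions apply to them directly:

%% Hypothetical helpers around Mnesia's table-copy primitives; a fragment name
%% like foo_frag7 is just an ordinary table name to these functions.

%% Give fragment Frag one more replica on Node (disc_only_copies assumed here).
replicate_fragment(Frag, Node) ->
    {atomic, ok} = mnesia:add_table_copy(Frag, Node, disc_only_copies),
    ok.

%% Move the replica of Frag living on node From to node To.
move_fragment(Frag, From, To) ->
    {atomic, ok} = mnesia:move_table_copy(Frag, From, To),
    ok.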

The number of fragments should be just "big enough"
In this section, I will only talk about disc_only_copies tables, since I haven't experimented intensely with disc_copies. Nevertheless, some of my comments may also apply to other kinds of tables (e.g. you obviously can run performance tests with disc_copies tables too).

The important (and somewhat obvious) message that shouldn't be forgotten is: with dets (Mnesia's disc_only_copies), you lose performance when you raise the number of fragments, even if the total amount of data doesn't change. If you create too many fragments, more resources (more disks, for example) will be required by Mnesia to keep the throughput. So, if you think you will turn the dets limits into a non-issue just by "creating a million fragments and never thinking about it again", you're wrong.

As you may know, with disc_only_copies, you have a 2 GB limit on each fragment, so, at the very minimum, you should have enough fragments to keep each one with less than 2 GB. But people will also tell you that the throughput on a fragment will fall drastically way before the fragment's size approaches 2 GB. Thus, you need to create enough fragments to keep the size of each one below a given "performance threshold". Fine, but what's that threshold?

I would only do you a disservice if I tried to teach you numbers with respect to good sizes for dets tables, since you can (and should) find your own numbers. Just open an Erlang shell, create a dets table (dets:open_file/2), insert some of your data into it, perform insert, read, and delete operations, insert more data, perform more operations and so on, until you can draw a good performance curve that applies to the kind of data you're dealing with.
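
To make that a bit more concrete, here is a rough sketch of such a measurement session. The module name, record shape, sizes, and batch counts are all placeholders you should replace with something representative of your own data (and you will want to add delete operations with dets:delete/2 in the same style):

-module(dets_bench).
-export([run/0]).

-define(STEP, 100000).    %% records inserted per growth step (placeholder)
-define(STEPS, 20).       %% number of growth steps (placeholder)
-define(SAMPLE, 10000).   %% operations timed at each step (placeholder)

run() ->
    {ok, T} = dets:open_file(bench, [{file, "bench.dets"}, {type, set}]),
    lists:foreach(
      fun(Step) ->
              fill(T, (Step - 1) * ?STEP, Step * ?STEP),
              Bytes = dets:info(T, file_size),
              {WriteUs, ok} = timer:tc(fun() -> ops(T, Step * ?STEP, fun write/2) end),
              {ReadUs, ok}  = timer:tc(fun() -> ops(T, Step * ?STEP, fun read/2) end),
              io:format("~p records, ~p bytes: ~.2f us/write, ~.2f us/read~n",
                        [Step * ?STEP, Bytes, WriteUs / ?SAMPLE, ReadUs / ?SAMPLE])
      end, lists:seq(1, ?STEPS)),
    dets:close(T).

fill(T, From, To) ->
    lists:foreach(fun(K) -> ok = dets:insert(T, {K, payload()}) end,
                  lists:seq(From + 1, To)).

ops(T, MaxKey, F) ->
    lists:foreach(fun(_) -> F(T, rand:uniform(MaxKey)) end, lists:seq(1, ?SAMPLE)),
    ok.

write(T, K) -> dets:insert(T, {K, payload()}).
read(T, K)  -> dets:lookup(T, K).

payload() -> binary:copy(<<"x">>, 200).   %% placeholder record size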

In short, this is my general advice:
  • Test dets with your data, considering each of the main tables you will need. This is the way to determine the maximum size (or maximum number of records) you can have in each fragment of each table without losing too much performance. Don't forget to test all kinds of operations (read, write, and delete), with an emphasis on what you need to do more often.
  • Create enough fragments to keep you comfortable (with a high probability) for a long time.
  • Don't forget 2 GB is the breaking point. Never allow any disc_only_copies fragment to reach that limit!
  • If your fragments grow too much, add more fragments (see the sketch after this list). Adding one fragment is not so expensive, since Mnesia uses linear hashing, and you don't need to stop Mnesia to do that. However, with linear hashing, you will need to double the number of fragments to make every one of them shrink (you are already using a power of 2, right?  ;-)). Therefore, you should start adding fragments as early as possible.
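
Adding a single fragment to a running table looks roughly like the sketch below. It follows the pattern from the Mnesia User's Guide, asking Mnesia (via frag_dist) for a suggested node distribution for the new fragment; the function name is made up:

%% Minimal sketch: add one fragment to the already-fragmented table Tab,
%% letting Mnesia suggest which nodes should hold the new fragment.
add_one_fragment(Tab) ->
    Info = fun(Item) -> mnesia:table_info(Tab, Item) end,
    Dist = mnesia:activity(sync_dirty, Info, [frag_dist], mnesia_frag),
    {atomic, ok} = mnesia:change_table_frag(Tab, {add_frag, Dist}),
    ok.
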
Be careful with the transaction log
This is just a reference. The main idea here is: just because Mnesia is giving you the required throughput for write operations, it doesn't mean it's not overloaded. Follow this link to learn more about this issue. Some searches will also point you to other sources of information.

Mnesia copies entire fragments from the replicas when it starts
If you have cloned tables (or table fragments) and, for example, you restart your node cool_node@interwebs, it will copy each modified table (considering each fragment as a table, of course) entirely from some other node to cool_node@interwebs. That fact implies that, if you have a node down for a long time and then you restart it, you might see a huge amount of data being read on remote nodes and a big rise in network traffic, until the new node is fully consistent again.

There's not much you can do to make this "synchronization" procedure better. I suggest you play with two Mnesia options that may make this issue less painful: no_table_loaders and send_compressed. The first one sets the number of parallel table loaders during startup (effectively regulating the "rate" of the copies) and the second one activates compression of the copies being sent (thus reducing network traffic and raising CPU utilization). You can find more information about them in the documentation for the mnesia module.
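
For reference, both are ordinary application environment parameters for the mnesia application, so one way to set them (the values below are just placeholders) is in sys.config or on the command line, before Mnesia starts:

%% sys.config entry (placeholder values):
[{mnesia, [{no_table_loaders, 4},     %% number of parallel table loaders at startup
           {send_compressed, 6}]}].   %% compression level (0..9) for table copies sent to other nodes

%% Equivalent command-line flags:
%% erl -mnesia no_table_loaders 4 -mnesia send_compressed 6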

To be continued (I guess)...

Friday, May 21, 2010

Mnesia - one year later (part 2)

This is my second post about the not-so-well-known relevant facts I have learned by working on a project based on Mnesia. Check my last post before reading this one. Let's go...

Do not use mnesia_frag to list records
Suppose your foo table, properly fragmented and distributed across dozens of servers, contains a secondary index on its second field. I guess you will eventually want to list all the records that have some bar value on that indexed field, since that's the reason for creating indices (;-)). After checking some Mnesia tutorial, you realize it is ridiculously easy to make that query. Here is an example:
mnesia:activity(..., fun qlc:e/1, [qlc:q([K || #foo{key = K, field2 = bar, _ = '_'} <- mnesia:table(foo)])], mnesia_frag).

The operation above works well. But what happens if one of the fragments of foo is unavailable (e.g. because one of your servers is down)? What Mnesia does is what a generic DBMS should do: it throws an exception. However, depending on your application's domain, there's a good chance that what you want is to retrieve all of the available records (in other words: you may prefer a partial list to no data at all). This issue was pointed out by one of my colleagues (Ivan) a long time ago.

Mnesia gives you all the necessary tools to list data from a fragmented table in different ways:
  • You can check the fragmentation properties of the foo table (including its number of fragments) with mnesia:table_info(foo, frag_properties).
  • You can access fragments individually (they are just tables) by appending a suffix to the table's name. Fragment 1 is called foo, fragment 2 is called foo_frag2, fragment 10 is called foo_frag10 and so on.
  • You can locate the nodes that hold, for instance, foo_frag14 by calling mnesia:table_info(foo_frag14, where_to_write).
  • You can easily make remote procedure calls to other nodes (it's Erlang...).
Just put all of the above tools together to make the best solution for your case.
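
As an illustration, one way to combine those tools is sketched below. It assumes a record definition like -record(foo, {key, field2}) with a secondary index on field2 (as in the query above), and it simply settles for a partial result when none of a fragment's replicas can be reached:

-record(foo, {key, field2}).

%% List every available record with field2 == bar, skipping fragments whose
%% replicas cannot be reached (instead of aborting like mnesia_frag would).
list_available() ->
    Props = mnesia:table_info(foo, frag_properties),
    N = proplists:get_value(n_fragments, Props, 1),
    lists:append([read_frag(frag_name(foo, I)) || I <- lists:seq(1, N)]).

frag_name(Tab, 1) -> Tab;
frag_name(Tab, I) -> list_to_atom(atom_to_list(Tab) ++ "_frag" ++ integer_to_list(I)).

read_frag(Frag) ->
    read_frag(Frag, mnesia:table_info(Frag, where_to_write)).

read_frag(_Frag, []) ->
    [];                                  %% no reachable replica: accept a partial result
read_frag(Frag, [Node | Rest]) ->
    case rpc:call(Node, mnesia, dirty_index_read, [Frag, bar, #foo.field2]) of
        Records when is_list(Records) -> Records;
        _Error                        -> read_frag(Frag, Rest)   %% try the next replica
    end.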

There is no sorted iteration of fragmented tables
This is very important to keep in mind. If your tables don't fit in memory and, as a consequence, you are using disc_only_copies, you don't have sorted data inside each fragment (since dets tables aren't sorted) and you also don't have sorted data between fragments (remember? It's linear hashing). Those facts imply that, if you have 100 million records in your fragmented table and you just need to retrieve the smallest key, you absolutely need to read 100 million entries from storage. Of course, in this example, you can keep the desired result elsewhere and even try to keep it updated. But, if you need to find an arbitrary interval of keys considering the table in ascending order, the problem starts to get hard.

If your full table fits in RAM and its fragments are disc_copies or ram_copies, you are in better shape: those storage types support the ordered_set table type. But you still don't have sorting between the fragments (linear hashing again), which makes it necessary to read something from every fragment whenever you need to perform an operation that depends on global ordering.

If you need globally sorted keys (and your fragmented table fits in RAM), maybe you can try to implement another hashing module (according to the mnesia_frag_hash behaviour) to keep global ordering. However, you need to be careful with the distribution of data between fragments (predict your keys, indeed) and perhaps consider what the performance will be when you increase or decrease the number of fragments (how much data will need to be moved between fragments?). There's no way to tell Mnesia to "split" some fragment that has grown too much (like Bigtable does with its tablets), so keeping global ordering is not easy.
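
For the adventurous, the skeleton of such a module looks roughly like the untested sketch below. It assumes integer keys spread over a known range (the "predict your keys" caveat above), and note how add_frag has to declare that every existing fragment may need to hand records over, which is exactly the rebalancing cost just mentioned:

-module(range_frag_hash).
-behaviour(mnesia_frag_hash).

-export([init_state/2, add_frag/1, del_frag/1,
         key_to_frag_number/2, match_spec_to_frag_numbers/2]).

-define(MAX_KEY, 1000000000).            %% assumed (exclusive) upper bound on keys

-record(state, {n_fragments = 1}).

init_state(_Tab, _InitialState) ->
    #state{n_fragments = 1}.

add_frag(#state{n_fragments = N} = S) ->
    %% With range partitioning, adding a fragment can move records out of every
    %% existing fragment: iterate them all and also lock the new one.
    {S#state{n_fragments = N + 1}, lists:seq(1, N), [N + 1]}.

del_frag(#state{n_fragments = N} = S) ->
    %% Deleting the last fragment also remaps keys everywhere.
    {S#state{n_fragments = N - 1}, lists:seq(1, N), []}.

key_to_frag_number(#state{n_fragments = N}, Key) when is_integer(Key), Key >= 0 ->
    %% Contiguous key ranges map to consecutive fragments, preserving global order.
    min(N, 1 + (Key * N) div ?MAX_KEY).

match_spec_to_frag_numbers(#state{n_fragments = N}, _MatchSpec) ->
    %% Without analyzing the match spec, any fragment may hold matching records.
    lists:seq(1, N).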

Adopt the key-value approach
Mnesia records are Erlang records. By default, you have all of the comfort of the Erlang record syntax at your disposal when writing code for Mnesia. When you create a user table with fields nickname, email, and name, its entries will be tuples like this: {user, "igorrs", "igorrs@example.org", "Igor"}. When working with that table, you'll be able to use an Erlang record called user to write very clean code. For instance, you can retrieve the name field of the record whose key is Nickname with this call:
[#user{name = Name}] = mnesia:dirty_read({user, Nickname}).

So, now, you have the beautiful code above running in your big pool of servers and someone asks you to add the gender field to the user table. Easy: that's what the function mnesia:transform_table/3 was made for. You just have to run it and... watch your code break. But wait! You can also update the record definition in your code, so that it doesn't break! You surely can do that, once all of the records on your big table with millions of entries are updated. And please don't run any code (old or new) while the table is being transformed.
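
For reference, such a transformation looks roughly like this (the field names follow the example above, and the undefined default for the new field is just an assumption):

%% Sketch: add a gender field (defaulting to undefined) to every user record.
%% Mnesia applies the fun to each stored tuple and updates the attribute list.
add_gender_field() ->
    Transform = fun({user, Nickname, Email, Name}) ->
                        {user, Nickname, Email, Name, undefined}
                end,
    {atomic, ok} = mnesia:transform_table(user, Transform,
                                          [nickname, email, name, gender]),
    ok.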

Conclusion: basically, if you want to use that convenient Erlang record syntax, you have to agree to stop your whole system every time you need to add a field to a table, which is a very big price to pay. If you want to avoid it, you may try to write an ugly style of code that expects the shape of the records to change. I've never used this technique, but it should work.

My favorite way of dealing with the "column-addition" problem, however, is to use a fixed record with only two fields: one for the key and one for the value, an arbitrary Erlang term (which I like to treat as a property list). With this strategy, the record fields for the table will never change. In the above example, you would probably end up with entries like this: {user, "igorrs", [{"e-mail", "igorrs@example.org"}, {"name", "Igor"}]}. You can handle the value field very naturally with the lists:key* family of functions, for example.
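
Here is a minimal sketch of that style. The record and helper names are made up, and the table is assumed to be fragmented, hence the mnesia_frag access module:

%% Sketch of the key-value record style; names are illustrative only.
-record(user, {key, value = []}).

%% Read one "column" of a user, or Default if it is absent.
get_field(Nickname, Field, Default) ->
    Read = fun() -> mnesia:read({user, Nickname}) end,
    case mnesia:activity(async_dirty, Read, [], mnesia_frag) of
        [#user{value = Props}] ->
            case lists:keyfind(Field, 1, Props) of
                {Field, Val} -> Val;
                false        -> Default
            end;
        [] ->
            Default
    end.

%% Add or replace one "column"; new columns never require a table transformation.
set_field(Nickname, Field, Val) ->
    Write = fun() ->
                    Props = case mnesia:read({user, Nickname}) of
                                [#user{value = P}] -> P;
                                []                 -> []
                            end,
                    NewProps = lists:keystore(Field, 1, Props, {Field, Val}),
                    mnesia:write(#user{key = Nickname, value = NewProps})
            end,
    mnesia:activity(transaction, Write, [], mnesia_frag).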

There are a couple of disadvantages to the key-value approach I've just described: it wastes some space (to identify the "columns") and it complicates the creation of Mnesia secondary indices (you would have to create another record field and deal with the related problems). In exchange, you get flexible code that copes well with changes to the values.

Whatever approach you decide to adopt, just don't forget that you may always need to add columns. Hence, you should write all of your code with that plan in mind.

To be continued...

Thursday, May 20, 2010

Mnesia - one year later

I've been working for a little more than a year now on a private back end project at my company, making heavy use of Erlang and Mnesia to build a distributed system that stores high volumes of data. With the system in production for some months and working really well since its first deployment, I decided it was a good moment to try to remember the most important lessons I have learned about Mnesia, especially those that are not explicitly stated anywhere else.

Before I start, I'd like to repeat what many people already know: Erlang is an absolutely marvelous language. I'm extremely grateful to Ericsson and the whole Erlang community, and I have no relevant complaints with respect to Erlang and OTP (except for logging, but you can easily find long discussions on the subject, so I won't talk about it here). I won't try to give you any advice with respect to the language itself, since there's already great material available here and there to lead you to a successful experience with Erlang.

Mnesia is also a great tool for developing distributed, scalable storage systems. It gives you data partitioning, replication, ACID transactions and many more important characteristics that just work and - you can bet it - are stable. I'm sure you can find - as I did - lots of good advice on using Mnesia, so I'll try not to repeat too much of what other people have already said. Obviously, I'm also not giving a tutorial on the basics of Mnesia here: I assume you are going to study elsewhere in case you decide to actually use Mnesia. Here is a good quick overview of table fragmentation (partitioning), for example.

One last warning: some of what I'm saying below may be wrong (you know... I didn't develop Mnesia) or might just not apply to every Mnesia database deployment, since my experiences are mostly based on a system with the following characteristics:
  • Distributed, running on several servers.
  • Employing fragmented tables, distributed across several servers.
  • Employing cloned (replicated) fragments.
  • Holding high volumes of data that don't fit in RAM.
  • Serving a big number of read requests per second (and also some write requests).
Here we go...

Mnesia's table fragmentation uses linear hashing
You decided to create a fragmented table and distribute the data among 3 servers, so that about 1/3 of the data goes to each server. You then created the table with 6 fragments and placed 2 fragments on each of your servers, which gives you uniformly distributed data, right? No!

This is a piece of information I could only find in the source code (and by "source code" I don't mean "comments"  ;-)): Mnesia's default hashing module (mnesia_frag_hash) uses linear hashing to distribute keys among fragments. That hashing scheme does a perfect job of making the number of fragments cheap to expand and shrink, but it doesn't distribute data well if the number of fragments is not a power of 2.

In the example above, with 6 fragments, you are very likely to have a data distribution that resembles these quantities of keys per fragment: (X, X, 2X, 2X, X, X). If you distribute those 6 fragments among 3 servers naively, you may end up having one server with 100% more data than each of the others!
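
You can see the effect without touching Mnesia at all. The snippet below is a rough approximation of the linear hashing scheme (not a copy of mnesia_frag_hash's code), just to show where the skew comes from:

-module(frag_skew).
-export([simulate/2]).

%% simulate(NFrags, NKeys) -> [{FragNumber, KeyCount}]
simulate(NFrags, NKeys) ->
    L = trunc(math:log2(NFrags)),        %% completed doublings: 2^L =< NFrags < 2^(L+1)
    Pow = 1 bsl (L + 1),
    Counts = lists:foldl(
               fun(Key, Acc) ->
                       H = erlang:phash(Key, Pow),              %% bucket in 1..Pow
                       Frag = if H =< NFrags -> H;              %% already-split bucket
                                 true -> H - (Pow div 2)        %% unsplit bucket folds back
                              end,
                       maps:update_with(Frag, fun(C) -> C + 1 end, 1, Acc)
               end, #{}, lists:seq(1, NKeys)),
    lists:sort(maps:to_list(Counts)).

%% 1> frag_skew:simulate(6, 600000).
%% Expect fragments 3 and 4 to hold roughly twice as many keys as the others:
%% the (X, X, 2X, 2X, X, X) shape mentioned above.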

The best solution is probably to create a larger number of fragments that is a power of 2 and then distribute those fragments as evenly as possible. If the math still doesn't work for your pool, you can consider implementing another hashing module (Mnesia supports that), possibly with consistent hashing.

Load-balancing read operations: do it yourself
You have created a foo table (or table fragment, for that matter), with copies on servers bar and baz, and you are reading that table's records from a number of different Erlang nodes in the same Mnesia pool. You read data by calling mnesia:activity(..., ..., ..., mnesia_frag), so that Mnesia transparently finds the data on one of the servers that hold it.

The example above is totally functional, but Mnesia won't load-balance those read operations: given a steady Mnesia pool, each node will always read the foo table from the same server (always bar or always baz). I've seen cases where all of the servers would read the foo table from the bar node (except baz, since local tables are always read locally, it seems). You can check where node bam is reading the foo table from by calling mnesia:table_info(foo, where_to_read) on node bam.

If you need the read operations to be well distributed among all clones of each table or fragment, you need to explicitly set the where_to_read property on each node of the Mnesia pool, for each remote table it needs to access. I like to do this by selecting a random node from the list of hosting nodes (and repeating that process often, on each node, for each remote table), but you might want to choose some other strategy (for example: you might have 3 clones of a table and decide to have the whole pool always reading data from the same 2 clones - who knows?). The important information here is that Mnesia will certainly not do what you want automatically.
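
As one illustration of the do-it-yourself spirit, here is a sketch that does not touch where_to_read at all: it simply bypasses it by routing each read to a randomly chosen replica. The helper name is made up, and dirty_read is used only for brevity:

%% Hypothetical helper: pick a random replica of Tab for each read instead of
%% relying on Mnesia's fixed where_to_read choice. Falls back to the local
%% default if the chosen node does not answer.
balanced_dirty_read(Tab, Key) ->
    Nodes = mnesia:table_info(Tab, where_to_write),   %% all nodes holding a copy of Tab
    Node = lists:nth(rand:uniform(length(Nodes)), Nodes),
    case rpc:call(Node, mnesia, dirty_read, [Tab, Key]) of
        Records when is_list(Records) -> Records;
        _BadRpc -> mnesia:dirty_read(Tab, Key)
    end.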

To be continued...