Saturday, September 1, 2012

Remember the Milk vs. Producteev


I've been using Remember the Milk as a personal task manager for 5 years now, but I ended up quite frustrated with the poor usability of its web interface (the Android app is fine). For that reason, I decided to do some research on the alternatives and found very positive opinions about Producteev. In fact, after trying it for a few hours, I had to agree that Producteev's web interface is more beautiful and practical than Remember the Milk's.

However, Producteev is far from being as flexible as Remember the Milk. Actually, almost nothing is configurable in Producteev. One annoying example of this is the fact that any task defined as "all day" (no specific time) will have a reminder set for 5 p.m. by default. Since most of my tasks are "all day" (actually, "anytime" would be a better definition for them, but there's no such option), I would have to manually unset reminders all the time if I used Producteev. Not being able to do it while creating the task (reminders can only be changed after the task has been added) makes the problem worse.

Another problem with Producteev is that you can't set a task description. This is also not possible in RTM, but you can use notes as a good replacement for that. Although you can also create notes in Producteev, they are not editable (!), so they can't really be used as descriptions.

Finally, there's something about Producteev's Android app that may look correct to some people, but that for me is reason enough to give up on Producteev: the notifications go away after you see them. This simple scenario would be enough to make me forget the milk:

  1. I set a reminder for 4 p.m., concerning a quick task X that I must definitely start between 4 p.m. and 6 p.m.
  2. I see the notification at 4 p.m., but I'm still busy with some other task Y, so I can't do X at that moment.
  3. I finish Y at 4:30 p.m. and don't remember X anymore.
  4. Even if I look at my phone several times between 4:30 p.m. and 6 p.m., I won't be reminded of X, unless I look at my full to-do list at some moment.
For now, I'm staying with the good ol' Remember the Milk.

Saturday, August 4, 2012

Dual-booting Ubuntu 12.04 with EFI in a Sony Vaio SVS13115FBB

Hello. Today I just wanted to share this tip from another blog.
I spent many hours trying to figure out a way to dual-boot Ubuntu on a Sony Vaio SVS13115FBB laptop with Windows 7 64-bit. Of course, the difficulties arose because I didn't want to use the legacy BIOS mode (I also kept GPT on the disk, by the way - no intention of switching to MBR).
This comment was also useful, as it contains some very practical basic information on installing Ubuntu in laptops with UEFI.
But the Vaio-specific tip really nailed it (and wasn't easy to find), so here is the link again:
http://bubu-online.blogspot.com.br/2012/07/sony-vaio-with-insyde-h2o-efi-bios.html
See you.

Saturday, May 21, 2011

Software developers and superuser privileges in the production environment

Oh, yes: I have decided to take part in that holy war, instead of humbly writing another post about Mnesia (which may come some day...).

First of all, let's make clear that I am a software developer (currently in a technical lead position, but still a developer), so that you can read this post with all of your prejudices on.

Like most people, I believe that, in general, developers should only have read-only access to the production environment. But I am not here to discuss that issue of read-only access vs. no access. I know there are situations where developers shouldn't have any account in production servers (yeah, yeah... financial institutions... I know) and there are situations where you want to give superuser powers to software developers. I am here to talk about a proper subset (:-)) of the latter.

There are two crucial points you need to consider before deciding to give root access to a software developer (or even to a new sysadmin) in the production environment:
  1. Ownership/accountability/responsibility. This guy who you are planning to give root access to production... does he care? What happens if anyone breaks anything, be it in the code or in the environment? Will he work 20 hours in a row until it's fixed? Is he going to be personally asked to explain what happened, how it was solved and what will be done so that it does not happen again? Does he want the operations team to be trained, because he will have crappy quality of life if they aren't? Does he want disciplined source code versioning, so that tracking problems won't turn into a nightmare? In other words: is the company's health related to this guy's well-being? No? Hum...
  2. Some systems administration skills. Is your developer one of those guys who are clueless about the production environment and do not want to take part in non-programming decisions about the system? Do you think he will accidentally do stuff like forgetting to close a root shell after he is done with it, filling up all the disk space in a partition by copying a huge file to his home directory, running rm -r /tmp/* to restore some disk space, running an "analysis" that severely affects CPU/memory/bandwidth etc.? Yes? Hum...

Now, suppose you have some developers who are "eligible" (according to the rules above) to have superuser privileges in production and you are planning to give them that. Some people will argue, based on their one-rule-fits-all beliefs, that developers should never, ever have it. But there's a chance they are wrong in your case, so let me analyse three of their most common arguments, starting with the most (only?) relevant one...

"If your developers need access to the production environment, you are doing it wrong. They are making bad code and/or you don't have a good operations team"

That's a good point, but let's be more specific about it. Developers with the two qualities I listed above will interfere with the production environment if there's something wrong that they don't believe can be solved in the minimum possible time by the operations team alone. What this actually means is that at least one of these is happening:
  1. Developers are not giving operations all the necessary tools to deal with problems.
  2. Systems administrators were not given the necessary training or documentation to use the available tools.
  3. Systems administrators are not competent enough to use all of the available tools to solve hard problems.
  4. The operations team does not have enough resources (sysadmins with available time to handle big problems whenever they happen).

So, it's true that you have other problems and you need to solve them. However, the problems above (except the second) may arise for fair reasons that leave you no smarter option than giving root access to developers when big incidents happen, at least until you have those problems sorted out. Two of those "fair reasons" can be:
  1. The operations team needs more people (with competence, of course), but they are not available in the local market.
  2. The system is so damn complex and big that the developers cannot even anticipate the majority of the problems that may happen, let alone build a complete set of tools for them.

One example of this last kind of system comes from my personal experience of working with great developers on distributed systems (where the servers in the pool communicate and coordinate to accomplish several goals, including sharding) that run on hundreds of servers and affect millions of users. The time it takes to develop administration tools and adequate error handling for these systems is much higher than the time it takes to develop their functionality, and it doesn't matter how skilled you are: you won't get it right the first time.
To be reasonable, I should add that, even if your system is extremely complex, having a dedicated, correctly-sized and competent operations team, ready to immediately follow all instructions given by the developers (unless they disagree, of course), may eliminate the requirement of having developers with superuser privileges. But if the two "fair reasons" above apply to your case, the most relevant effect of not allowing developers to fully access the production environment will be: higher downtime when a big problem happens.

"If people have access to production, eventually they will break something"

That's another inaccurate claim: if you accept it as it is, then you shouldn't have sysadmins either, because some day they will break something. The truth in the sentence lies in the fact that you really should make every possible effort to minimize the number of situations where somebody needs to log in to a production server (on that topic, see section 5.4 - Steady State - of the great book Release It!). But one day a big intervention may be needed, and allowing one more person (a developer) to modify the production environment may actually result in a lighter total intervention (under the situations I described before). The mistake in the claim above is that it tries to associate "breaking things" with the number of people having access to production, when it would be more accurate to relate it to the number of interventions performed in the environment.

"Deploying fresh code to production is bad"

This argument is so dishonest that the only reason I am commenting on it is that it is astoundingly common. What it says is, basically, that developers are irresponsible people who, if given the chance, will start coding directly in production and deploying without testing, because they are stupid and do not know what the consequences of that may be. If you think your developers are like that, just have them all fired and hire another team. Decent developers won't use superuser access as developers: they will use it as sysadmins with development knowledge who are there to help the more skilled sysadmins, coordinating with them. I am a developer who has personally had root access to production on several occasions (although not for all of my systems/servers), since a time when I wasn't so experienced. I have done a relevant number of important interventions in production, but I have never:
  • Broken anything (it would affect my quality of life, because I'm accountable. Remember?).
  • Changed something without: telling the operations team, deciding/discussing the reason it had to be changed manually, and planning to incorporate it to the system or to the deployment docs (I want everybody to know as much as possible, because that raises my quality of life. Remember?).
  • Deployed code to production without testing (actually, I did it once, but I explained the reasons before I did, asked for permission and let a sysadmin do that). It doesn't matter if I am root: I respect people's roles (as long as they are important, like systems administrators), I don't want to do other people's jobs, I want them to know everything and I don't want good professionals to think I am irresponsible. I just want quality of life.  :-)

In the end, this discussion is, like many others, much more complex than most people believe it to be. By now, I just hope I have helped you decide for your own situation, using your reasoning and not only your faith.

Sunday, May 23, 2010

Mnesia - one year later (part 3)

This is the third part of my series of posts about Mnesia. From this third post on, I'll probably start to include some information that you can find elsewhere without too much trouble. After all, it's good to have them arranged together. I'm also talking about stuff that might be a little bit obvious for some people, like this next one...

Automate table management
As you certainly know, once tables are scaled horizontally, you will need several administration procedures that a traditional RDBMS won't require. If your pool of servers deals with no more than some tens of fragments for each table, maybe you can still manage the data distribution by directly calling functions like mnesia:move_table_copy/3, mnesia:del_table_copy/2, mnesia:add_table_copy/3, mnesia:change_table_frag/2, etc. But, once you arrive at the hundreds of fragments, it becomes hard to build the desired topology for your pool manually, especially if you have at least three servers and expect the pool to grow eventually.

I can't tell you how to automate the management of the tables' fragments in general (and Mnesia also can't do it for you), because that's totally dependent on how you need your pool to be built, considering characteristics like:
  • How many copies does each table need?
  • What's the span of each table across the pool?
  • Do you need any colocation for keys on different tables?
Once you decide what kinds of replication and partitioning you need for each of your tables, the functions above, together with the tools I described in my last post, will be almost everything you will use to administer the fragments automatically, as in the sketch below.
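To make this concrete, here is a minimal sketch of what such automation can look like: it walks over the fragments of a table and adds disc_only_copies replicas until each fragment exists on a chosen number of nodes, picked round-robin from a pool. The function names and the placement policy are just an example, so adapt them to your own topology.

%% Sketch: ensure every fragment of Tab has NCopies disc_only_copies replicas,
%% picking nodes round-robin from Pool.
ensure_copies(Tab, Pool, NCopies) ->
    Props = mnesia:table_info(Tab, frag_properties),
    NFrags = proplists:get_value(n_fragments, Props),
    lists:foreach(
      fun(I) ->
              Frag = frag_name(Tab, I),
              Wanted = pick_nodes(Pool, I, NCopies),
              Current = mnesia:table_info(Frag, where_to_write),
              [{atomic, ok} = mnesia:add_table_copy(Frag, Node, disc_only_copies)
               || Node <- Wanted -- Current]
      end, lists:seq(1, NFrags)).

%% Fragment 1 is the table itself; fragment I is <table>_fragI.
frag_name(Tab, 1) -> Tab;
frag_name(Tab, I) -> list_to_atom(atom_to_list(Tab) ++ "_frag" ++ integer_to_list(I)).

%% Choose NCopies consecutive nodes from Pool, starting at an offset derived from I.
pick_nodes(Pool, I, NCopies) ->
    Len = length(Pool),
    [lists:nth(((I + K - 2) rem Len) + 1, Pool) || K <- lists:seq(1, NCopies)].

Deleting or moving copies to rebalance the pool follows the same pattern, using mnesia:del_table_copy/2 and mnesia:move_table_copy/3.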

The number of fragments should be just "big enough"
In this section, I will only talk about disc_only_copies tables, since I haven't experimented intensely with disc_copies. Nevertheless, some of my comments may also apply to other kinds of tables (e.g. you obviously can run performance tests with disc_copies tables too).

The important (and somewhat obvious) message that shouldn't be forgotten is: with dets (Mnesia's disc_only_copies), you lose performance when you raise the number of fragments, even if the total amount of data doesn't change. If you create too many fragments, more resources (more disks, for example) will be required by Mnesia to keep the throughput. So, if you think you will turn the dets limits into a non-issue just by "creating a million fragments and never thinking about it again", you're wrong.

As you may know, with disc_only_copies, you have a 2 GB limit on each fragment, so, at the very minimum, you should have enough fragments to keep each one with less than 2 GB. But people will also tell you that the throughput on a fragment will fall drastically way before the fragment's size approaches 2 GB. Thus, you need to create enough fragments to keep the size of each one below a given "performance threshold". Fine, but what's that threshold?

I would only harm you by trying to teach you numbers for good dets table sizes, since you can find your own numbers, and you should do that. Just open an Erlang shell, create a dets table (dets:open_file/2), insert some of your data in it, perform insert, read, and delete operations, insert more data, perform more operations and so on, until you can draw a good performance curve that applies to the kind of data you're dealing with.
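A rough sketch of such a test session (the file name, the dummy payload, and the batch size are placeholders; use records shaped like your real data):

%% Sketch: grow a dets file in batches and print how the per-operation cost evolves.
bench_dets(Steps, Batch) ->
    {ok, T} = dets:open_file(bench, [{file, "bench.dets"}, {type, set}]),
    Payload = lists:duplicate(100, $x),
    lists:foreach(
      fun(Step) ->
              Base = Step * Batch,
              {WriteUs, _} = timer:tc(fun() ->
                      [dets:insert(T, {Base + I, Payload}) || I <- lists:seq(1, Batch)]
              end),
              {ReadUs, _} = timer:tc(fun() ->
                      [dets:lookup(T, Base + I) || I <- lists:seq(1, Batch)]
              end),
              io:format("~p records: ~p us/write, ~p us/read~n",
                        [Base + Batch, WriteUs div Batch, ReadUs div Batch])
      end, lists:seq(0, Steps - 1)),
    dets:close(T).

Calling something like bench_dets(1000, 10000) takes you past 10 million records; don't forget to also measure deletes and whatever access pattern dominates your workload.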

In short, this is my general advice:
  • Test dets with your data, considering each of the main tables you will need. This is the way to determine the maximum size (or maximum number of records) you can have in each fragment of each table without losing too much performance. Don't forget to test all kinds of operations (read, write, and delete), with an emphasis on what you need to do more often.
  • Create enough fragments to keep you comfortable (with a high probability) for a long time.
  • Don't forget 2 GB is the breaking point. Never allow any disc_only_copies fragment to reach that limit!
  • If your fragments grow too much, add more fragments. Adding one fragment is not so expensive, since Mnesia uses linear hashing, and you don't need to stop Mnesia to do that. However, with linear hashing, you will need to double the number of fragments to make every one of them shrink (you are already using a power of 2, right?  ;-)). Therefore, you must start adding fragments as soon as possible.

Be careful with the transaction log
This is just a reference. The main idea here is: just because Mnesia is giving you the required throughput for write operations, it doesn't mean it's not overloaded. Follow this link to learn more about this issue. Some searches will also point you to other sources of information.

Mnesia copies entire fragments from the replicas when it starts
If you have cloned tables (or table fragments) and, for example, you restart your node cool_node@interwebs, it will copy each modified table (considering each fragment as a table, of course) entirely from some other node to cool_node@interwebs. That fact implies that, if you have a node down for a long time and then you restart it, you might see a huge amount of data being read on remote nodes and a big rise in network traffic, until the new node is fully consistent again.

There's not much you can do to make this "synchronization" procedure better. I suggest you play with two Mnesia options that may make this issue less painful: no_table_loaders and send_compressed. The first one limits how many tables are copied in parallel (effectively capping the transfer rate) and the second one activates compression (thus reducing network traffic and increasing CPU utilization). You can find more information about them in the documentation for the mnesia module.
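Both are regular Mnesia configuration parameters, so they can be set on the command line or in the application environment before Mnesia starts. The values below are only an illustration (as far as I know, send_compressed takes a compression level, with 0 meaning no compression; double-check the documentation of your OTP release):

%% On the command line:
%%   erl -mnesia no_table_loaders 1 -mnesia send_compressed 9
%% Or programmatically, before Mnesia starts:
application:set_env(mnesia, no_table_loaders, 1),
application:set_env(mnesia, send_compressed, 9),
mnesia:start().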

To be continued (I guess)...

Friday, May 21, 2010

Mnesia - one year later (part 2)

This is my second post about the not-so-well-known relevant facts I have learned by working on a project based on Mnesia. Check my last post before reading this one. Let's go...

Do not use mnesia_frag to list records
Suppose your foo table, properly fragmented and distributed across dozens of servers, contains a secondary index on its second field. I guess you will eventually want to list all the records that have some bar value on that indexed field, since that's the reason for creating indices (;-)). After checking some Mnesia tutorial, you realize it is ridiculously easy to make that query. Here is an example:
mnesia:activity(..., fun qlc:e/1, [qlc:q([K || #foo{key = K, field2 = bar, _ = '_'} <- mnesia:table(foo)])], mnesia_frag).

The operation above works well. But what happens if one of the fragments of foo is unavailable (e.g. because one of your servers is down)? What Mnesia does is what a generic DBMS should do: it throws an exception. However, depending on your application's domain, there's a good chance that what you want is to retrieve all of the available records (in other words: you may prefer a partial list to no data at all). This issue was pointed out by one of my colleagues (Ivan) a long time ago.

Mnesia gives you all the necessary tools to list data from a fragmented table in different ways:
  • You can check the fragmentation properties of the foo table (including its number of fragments) with mnesia:table_info(foo, frag_properties).
  • You can access fragments individually (they are just tables) by appending a suffix to the table's name. Fragment 1 is called foo, fragment 2 is called foo_frag2, fragment 10 is called foo_frag10 and so on.
  • You can locate the nodes that hold, for instance, foo_frag14 by calling mnesia:table_info(foo_frag14, where_to_write).
  • You can easily make remote procedure calls to other nodes (it's Erlang...).
Just put all of the above tools together to build the best solution for your case, as in the sketch below.
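As an illustration, here is a sketch of a "best effort" listing that combines those tools, silently skipping fragments whose hosting nodes don't answer (the function names are mine; frag_name/2 just follows the naming convention above):

%% Sketch: dirty-match a pattern against every fragment of Tab, returning
%% whatever is reachable and skipping unavailable fragments.
available_records(Tab, Pattern) ->
    Props = mnesia:table_info(Tab, frag_properties),
    NFrags = proplists:get_value(n_fragments, Props),
    lists:append([frag_records(frag_name(Tab, I), Pattern) || I <- lists:seq(1, NFrags)]).

frag_records(Frag, Pattern) ->
    try_nodes(mnesia:table_info(Frag, where_to_write), Frag, Pattern).

try_nodes([], _Frag, _Pattern) ->
    [];  % no reachable copy: accept a partial result
try_nodes([Node | Rest], Frag, Pattern) ->
    case rpc:call(Node, mnesia, dirty_match_object, [Frag, Pattern]) of
        Records when is_list(Records) -> Records;
        _Error -> try_nodes(Rest, Frag, Pattern)
    end.

frag_name(Tab, 1) -> Tab;
frag_name(Tab, I) -> list_to_atom(atom_to_list(Tab) ++ "_frag" ++ integer_to_list(I)).

For the example above, available_records(foo, #foo{field2 = bar, _ = '_'}) returns the matching records from every fragment that is still reachable - a partial list when some server is down, instead of an exception.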

There is no sorted iteration of fragmented tables
This is very important to keep in mind. If your tables don't fit in memory and, as a consequence, you are using disc_only_copies, you don't have sorted data inside each fragment (since dets tables aren't sorted) and you also don't have sorted data between fragments (remember? It's linear hashing). Those facts imply that, if you have 100 million records in your fragmented table and you just need to retrieve the smallest key, you absolutely need to read 100 million entries from storage. Of course, in this example, you can keep the desired result elsewhere and even try to keep it updated. But, if you need to find an arbitrary interval of keys considering the table in ascending order, the problem starts to get hard.

If your full table fits in RAM and its fragments are disc_copies or ram_copies, you are in better shape: those storage types support the ordered_set table type. But you still don't have sorting between the fragments (linear hashing again), which makes it necessary to read something from every fragment whenever you need to perform an operation that depends on global ordering.
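For example, even with ordered_set fragments, finding the globally smallest key still means touching every fragment. A sketch (reusing frag_name/2 from the snippet above):

%% Sketch: the global "first" key of a fragmented ordered_set table.
global_first(Tab) ->
    Props = mnesia:table_info(Tab, frag_properties),
    NFrags = proplists:get_value(n_fragments, Props),
    Firsts = [mnesia:dirty_first(frag_name(Tab, I)) || I <- lists:seq(1, NFrags)],
    case [K || K <- Firsts, K =/= '$end_of_table'] of
        []   -> '$end_of_table';
        Keys -> lists:min(Keys)
    end.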

If you need globally sorted keys (and your fragmented table fits in RAM), maybe you can try to implement another hashing module (according to the mnesia_frag_hash behaviour) to keep global ordering. However, you need to be careful with the distribution of data between fragments (you really have to predict your keys) and perhaps consider what the performance will be when you increase or decrease the number of fragments (how much data will need to be moved between fragments?). There's no way to tell Mnesia to "split" some fragment that has grown too much (like Bigtable does with its tablets), so keeping global ordering is not easy.

Adopt the key-value approach
Mnesia records are Erlang records. By default, you have all of the comfort of the Erlang record syntax at your disposal when writing code for Mnesia. When you create a user table with fields nickname, email, and name, its entries will be tuples like this: {user, "igorrs", "igorrs@example.org", "Igor"}. When working with that table, you'll be able to use an Erlang record called user to write very clean code. For instance, you can retrieve the name field of the record whose key is Nickname with this call:
[#user{name = Name}] = mnesia:dirty_read({user, Nickname}).

So, now, you have the beautiful code above running in your big pool of servers and someone asks you to add the gender field to the user table. Easy: that's what the function mnesia:transform_table/3 was made for. You just have to run it and... watch your code break. But wait! You can also update the record definition in your code, so that it doesn't break! You surely can do that, once all of the records on your big table with millions of entries are updated. And please don't run any code (old or new) while the table is being transformed.
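For reference, the call in question looks roughly like this (a sketch based on the example table above; the default value for the new field is arbitrary):

%% Sketch: add a gender field with a default value to every existing record.
mnesia:transform_table(user,
    fun({user, Nickname, Email, Name}) ->
            {user, Nickname, Email, Name, undefined}
    end,
    [nickname, email, name, gender]).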

Conclusion: basically, if you want to use that convenient Erlang record syntax, you have to agree to stop your whole system every time you need to add a field to a table, which is a very big price to pay. If you want to avoid it, you may try to write an ugly style of code that expects the shape of the records to change. I've never used this technique, but it should work.

My favorite way of dealing with the "column-addition" problem, however, is to use a fixed record with only two fields: one for the key and one for the value, which is an Erlang term that I like to treat as a property list. With this strategy, the record fields for the table will never change. In the above example, you would probably end up with entries like this: {user, "igorrs", [{"e-mail", "igorrs@example.org"}, {"name", "Igor"}]}. You can handle the value field very naturally with the lists:key* family of functions, for example.
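A sketch of what this looks like in practice (the record and property names are just the ones from the example above):

%% Sketch: a fixed two-field record, with the value treated as a property list.
-record(user, {key, value = []}).

get_prop(Nickname, Prop) ->
    case mnesia:dirty_read({user, Nickname}) of
        [#user{value = Props}] -> lists:keyfind(Prop, 1, Props);
        []                     -> false
    end.

set_prop(Nickname, Prop, Value) ->
    mnesia:activity(transaction, fun() ->
        Props = case mnesia:read({user, Nickname}) of
                    [#user{value = P}] -> P;
                    []                 -> []
                end,
        mnesia:write(#user{key = Nickname,
                           value = lists:keystore(Prop, 1, Props, {Prop, Value})})
    end).

With data like the example entry above, get_prop("igorrs", "name") returns {"name", "Igor"}, and adding a new "column" is just another call to set_prop: the table definition never changes.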

There are a couple of disadvantages in the key-value approach I've just described: it wastes some space (to identify the "columns") and it complicates the creation of Mnesia secondary indices (you would have to create another record field and deal with the related problems). In exchange, you get flexible code that supports transformations to the values.

Whatever approach you decide to adopt, just don't forget that you may always need to add columns. Hence, you should write all of your code with that plan in mind.

To be continued...

Thursday, May 20, 2010

Mnesia - one year later

I've been working for a little more than a year now on a private back-end project at my company, making heavy use of Erlang and Mnesia to build a distributed system that stores high volumes of data. With the system in production for some months and working really well since its first deployment, I decided it was a good moment to try to remember the most important lessons I have learned about Mnesia, especially those that are not explicitly stated anywhere else.

Before I start, I'd like to repeat what many people know nowadays: Erlang is an absolutely marvelous language. I'm extremely grateful to Ericsson and the whole Erlang community, having no relevant complaints with respect to Erlang and the OTP (except for logging, but you can easily find long discussions on the subject, so I'm not talking about it here). I won't try to give you any advice with respect to the language itself, since there's already great material available here and there to lead you to a successful experience with Erlang.

Mnesia is also a great tool for developing distributed, scalable storage systems. It gives you data partitioning, replication, ACID transactions and many more important characteristics that just work and - you can bet it - are stable. I'm sure you can find - as I did - lots of good advice on using Mnesia, so I'll try not to repeat too much of what other people have already said. Obviously, I'm also not giving a tutorial on the basics of Mnesia here: I assume you are going to study elsewhere in case you decide to actually use Mnesia. Here is a good quick overview of table fragmentation (partitioning), for example.

A last warning: some of what I'm saying below may be wrong (you know... I didn't develop Mnesia) or might just not apply to every Mnesia database deployment, since my experiences are mostly based on a system with the following characteristics:
  • Distributed, running on several servers.
  • Employing fragmented tables, distributed across several servers.
  • Employing cloned (replicated) fragments.
  • Holding high volumes of data that don't fit in RAM.
  • Serving a big number of read requests per second (and also some write requests).
Here we go...

Mnesia's table fragmentation uses linear hashing
You decided to create a fragmented table and distribute the data among 3 servers, so that about 1/3 of the data goes to each server. You then created the table with 6 fragments and placed 2 fragments on each of your servers, which gives you uniformly distributed data, right? No!

This is a piece of information that I could only find in the source code (and by "source code" I don't mean "comments"  ;-)): Mnesia's default hashing module (mnesia_frag_hash) uses linear hashing to distribute keys among fragments. That hashing scheme does a perfect job of making the number of fragments cheap to expand and shrink, but it doesn't distribute data well if the number of fragments is not a power of 2.

In the example above, with 6 fragments, you are very likely to have a data distribution that resembles these quantities of keys per fragment: (X, X, 2X, 2X, X, X). If you distribute those 6 fragments among 3 servers naively, you may end up having one server with 100% more data than each of the others!
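You can see this effect without touching Mnesia at all by simulating the fold-back step of linear hashing (this is a simplification of the idea behind mnesia_frag_hash, not its actual code):

%% Sketch: count how many of NKeys keys land on each of NFrags fragments when
%% keys are hashed over the next power of 2 and folded back, as in linear hashing.
simulate(NFrags, NKeys) ->
    P = next_pow2(NFrags),
    Counts = lists:foldl(
               fun(Key, Acc) ->
                       H = erlang:phash2(Key, P) + 1,          % 1..P
                       Frag = case H =< NFrags of
                                  true  -> H;
                                  false -> H - P div 2          % fold back
                              end,
                       maps:update_with(Frag, fun(C) -> C + 1 end, 1, Acc)
               end, #{}, lists:seq(1, NKeys)),
    [maps:get(I, Counts, 0) || I <- lists:seq(1, NFrags)].

next_pow2(N) -> next_pow2(N, 1).
next_pow2(N, P) when P >= N -> P;
next_pow2(N, P) -> next_pow2(N, P * 2).

simulate(6, 600000) gives something close to (75000, 75000, 150000, 150000, 75000, 75000): fragments 3 and 4 also receive the keys that would have gone to the non-existent fragments 7 and 8.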

The best solution is probably to create a bigger number of fragments that is a power of 2 and to then distribute those fragments as evenly as possible. If the math still doesn't work for your pool, you can think of implementing another hashing module (Mnesia supports that), possibly with consistent hashing.

Load-balancing read operations: do it yourself
You have created a foo table (or table fragment, for that matter), with copies on servers bar and baz, and you are reading that table's records from a number of different Erlang nodes in the same Mnesia pool. You read data by calling mnesia:activity(..., ..., ..., mnesia_frag), so that Mnesia transparently finds the data on one of the servers that hold it.

The example above is totally functional, but Mnesia won't load-balance those read operations: given a steady Mnesia pool, each node will always read the foo table from the same server (always bar or always baz). I've seen cases where all of the servers would read the foo table from the bar node (except baz, since local tables are always read locally, it seems). You can check where node bam is reading the foo table from by calling mnesia:table_info(foo, where_to_read) on node bam.

If you need the read operations to be well distributed among all clones of each table or fragment, you need to explicitly set the where_to_read property on each node of the Mnesia pool, for each remote table it needs to access. I like to do this by selecting a random node from the list of hosting nodes (and repeating that process often, on each node, for each remote table), but you might want to choose some other strategy (for example: you might have 3 clones of a table and decide to have the whole pool always reading data from the same 2 clones - who knows?). The important information here is that Mnesia will certainly not do what you want automatically.
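Here is a sketch of the random-node strategy. Be warned that actually changing where_to_read goes through Mnesia internals: mnesia_lib:set/2 below is undocumented, and I'm assuming it behaves well when poked from the outside, so test it carefully against your Erlang/OTP version.

%% Sketch: point this node's reads for each remote table at a random replica.
%% mnesia_lib:set/2 is internal and undocumented; use at your own risk.
balance_reads(Tabs) ->
    lists:foreach(
      fun(Tab) ->
              Nodes = mnesia:table_info(Tab, where_to_write),
              case lists:member(node(), Nodes) of
                  true  -> ok;   % local copies are always read locally
                  false ->
                      Node = lists:nth(rand:uniform(length(Nodes)), Nodes),
                      mnesia_lib:set({Tab, where_to_read}, Node)
              end
      end, Tabs).

Re-running something like this periodically, on every node and for every fragment (not only the base tables), keeps the reads spread across the replicas.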

To be continued...

Thursday, November 19, 2009

Consistent hashing for Mnesia fragments - part 2

Using the consistent hashing scheme I described in my last post, I have created a table with 400 fragments and done a bunch of writes to it. Although data distribution was very good in this test, the performance was really crappy, so I supposed Mnesia was doing some copying around of the hash_state (which contains the tree I use for consistent hashing).

As Ulf Wiger commented here, Mnesia is actually storing the structure in an ets table and looking it up at every operation, which results in a copy of the whole tree even for one read operation!

Not surprisingly, if I decrease the number of entries per fragment in the consistent hashing tree to 10, performance becomes good, but then data distribution becomes worse than with the default mnesia_frag_hash.

Lesson learned: if you implemented your own hashing module for Mnesia, be careful with the size of the hash_state, as this record is fully copied at every operation!

But let's not give up consistent hashing yet: an idea that comes naturally is to store the entries in a mutable structure, like ets, and then have a "pointer" to that structure in the hash_state. To avoid losing the fragmentation info, I'm going to create a Mnesia disc_copies table and make raw accesses to it using ets, as suggested by Ulf Wiger.

Things are a little bit less automatic now. I have opted to leave the creation of the consistent hashing table outside of mnesia_frag_chash. For example: if you are going to create a table called person, you first need to create a person_chash table, with the following characteristics:
- NOT fragmented.
- Type: ordered_set.
- Storage: disc_copies.
- Attributes: 2, including the key (minimum required by Mnesia - I don't use the second attribute).
- Replicated in every node that is going to access the person table (for performance reasons).

Here is the code for that example, using one node and 100 fragments:
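What follows is a sketch of that setup rather than the original snippet: the table names follow the convention above, and the exact option values are assumptions based on the characteristics I listed.

%% Sketch: create the companion ordered_set table first, then the fragmented
%% table that uses the custom hash module. Values here are assumptions.
Node = node(),
{atomic, ok} = mnesia:create_table(person_chash,
                   [{type, ordered_set},
                    {disc_copies, [Node]},
                    {attributes, [key, unused]}]),
{atomic, ok} = mnesia:create_table(person,
                   [{attributes, [name, data]},
                    {frag_properties, [{n_fragments, 100},
                                       {node_pool, [Node]},
                                       {n_disc_only_copies, 1},
                                       {hash_module, mnesia_frag_chash}]}]).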


Not so difficult. If you are thinking of using that on your next system, I can tell you the performance for "normal" write operations (writing/deleting records) will not be affected and that reading records will be a little bit slower (you are reading from an additional ets table before you get your data). Adding and deleting fragments, on the other hand, will take more time. How much more? I could guess some number (say, 100% more time), but that depends on so many factors that I strongly encourage you to do your own tests, in your own scenario.

As for distribution of records among fragments, you will get better results, really. Check this distribution of records after inserting 10 million random short strings using mnesia_frag_chash:
[215275,196101,211857,176795,189177,190949,210562,173166,155642,199126,
174730,201721,185447,224594,176798,165405,192294,206536,210361,201988,
168968,203098,255504,201828,162979,257371,211712,205513,200564,229616,
211546,218419,179620,210396,234415,202573,192363,198798,171920,181552,
204337,237357,215541,199831,210681,203614,195590,191930,200881,182959].
Average: 200,000.
Variance: 452,627,581 (standard deviation: 21,275).
Biggest fragment: 257,371.

And then the same 10 million records using mnesia_frag_hash:
[155936,155724,155470,155973,156038,155962,156407,156726,156739,156873,
156362,156388,156306,156502,156278,156292,155672,156087,312118,313014,
313462,312910,311754,311992,312984,311970,311902,312021,312190,312429,
312329,313125,156414,155736,156920,156362,156708,155696,156881,156233,
156171,156341,156353,156373,156292,156460,156158,156548,156001,156418].
Average: 200,000.
Variance: 4,917,043,219 (standard deviation: 70,122).
Biggest fragment: 313,462.

Source code is just mnesia_frag_chash.erl and you can find more useful stuff here.

If you want to run your tests with your own data, send me a private message: maybe I have some useful functions for you.