Tame Wild Systems: Software developers and superuser privileges in the production environment

Oh, yes: I have decided to take part in that holy war, instead of humbly writing another post about Mnesia (which may come some day...).

First of all, let's make clear that I am a software developer (currently in a technical lead position, but still a developer), so that you can read this post with all of your prejudices on.

Like most people, I believe that, in general, developers should only have read-only access to the production environment. But I am not here to discuss the issue of read-only access vs. no access. I know there are situations where developers shouldn't have any account on production servers (yeah, yeah... financial institutions... I know) and there are situations where you want to give superuser powers to software developers. I am here to talk about a proper subset (:-)) of the latter.

There are two crucial points you need to consider before deciding to give root access to a software developer (or even to a new sysadmin) in the production environment:

Ownership/accountability/responsibility. This guy you are planning to give root access to in production... does he care? What happens if anyone breaks anything, be it in the code or in the environment? Will he work 20 hours in a row until it's fixed? Is he going to be personally asked to explain what happened, how it was solved and what will be done so that it does not happen again? Does he want the operations team to be trained, because he will have crappy quality of life if they aren't? Does he want disciplined source code versioning, so that tracking problems won't turn into a nightmare? In other words: is the company's health related to this guy's well-being? No? Hum...
Some systems administration skills. Is your developer one of those guys who are clueless about the production environment and do not want to take part in non-programming decisions about the system? Do you think he will accidentally do stuff like forgetting to close a root shell after he is done with it, filling up a partition by copying a huge file to his home directory, running rm -r /tmp/* to free up some disk space, or running an "analysis" that severely affects CPU, memory or bandwidth? Yes? Hum...
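
The disk-space accident in the second point is the kind of thing a small habit can prevent. As a minimal sketch (the function names and the 10% reserve are illustrative assumptions, not taken from any particular tool), a check like this could run before copying a big file onto a production partition:

```python
import os
import shutil


def fits_with_headroom(size_bytes, total_bytes, free_bytes, min_free_ratio=0.10):
    """Return True if writing size_bytes still leaves at least
    min_free_ratio of the partition free (hypothetical 10% default)."""
    remaining = free_bytes - size_bytes
    return remaining >= total_bytes * min_free_ratio


def safe_to_copy(src, dest_dir, min_free_ratio=0.10):
    """Check the partition holding dest_dir before copying src into it."""
    usage = shutil.disk_usage(dest_dir)  # named tuple: total, used, free (bytes)
    return fits_with_headroom(
        os.path.getsize(src), usage.total, usage.free, min_free_ratio
    )
```

Calling safe_to_copy("/path/to/huge.dump", os.path.expanduser("~")) before the copy (the path is made up) turns "oops, the partition is full" into a question answered in advance. The reserve ratio is a judgment call and should be agreed with the operations team.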

Now, suppose you have some developers who are "eligible" (according to the rules above) to have superuser privileges in production and you are planning to give them that. Some people will argue, based on their one-rule-fits-all beliefs, that developers should never, ever have it. But there's a chance they are wrong in your case, so let me analyse three of their most common arguments, starting with the most (only?) relevant one...

"If your developers need access to the production environment, you are doing it wrong. They are making bad code and/or you don't have a good operations team"

That's a good point, but let's be more specific about it. Developers with the two qualities I listed above will interfere with the production environment if there's something wrong that they don't believe can be solved in the minimum possible time by the operations team alone. What this actually means is that at least one of these is happening:

1. Developers are not giving operations all the necessary tools to deal with problems.
2. Systems administrators were not given the necessary training or documentation to use the available tools.
3. Systems administrators are not competent enough to use all of the available tools to solve hard problems.
4. The operations team does not have enough resources (sysadmins with available time to handle big problems whenever they happen).

So, it's true that you have other problems and you need to solve them. However, the problems above (except the second) may happen for fair reasons that will leave you no smarter option than giving root access to developers when big situations happen, until you have the above sorted out. Two of those "fair reasons" can be:

1. The operations team needs more people (with competence, of course), but they are not available in the local market.
2. The system is so damn complex and big that the developers cannot even anticipate the majority of the problems that may happen, and so cannot build a complete set of tools.

One example of this last kind of system comes from my personal experience of working with great developers on distributed systems (where the servers in the pool communicate and coordinate to perform tasks) that serve several goals, including sharding, run on hundreds of servers and affect millions of users. The time it takes to develop administration tools and adequate error handling for these systems is much greater than the time it takes to develop their functionality, and it doesn't matter how skilled you are: you won't get it right the first time.
To be reasonable, I should add that, even if your system is extremely complex, having a dedicated, correctly-sized and competent operations team, ready to immediately follow all instructions given by the developers (unless they disagree, of course), may eliminate the requirement of having developers with superuser privileges. But if the last items above (1 and 2) apply to your case, the most relevant effect of not allowing developers to fully access the production environment will be: higher downtime when a big problem happens.

"If people have access to production, eventually they will break something"

That's another inaccurate claim: if you accept it as it is, then you shouldn't have sysadmins, because some day they will break something. The truth in the sentence lies in the fact that you really should make every possible effort to minimize the number of situations where somebody needs to log in to a production server (on that topic, see section 5.4 - Steady State - of the great book Release It!). But one day a big intervention may be needed, and allowing one more person (a developer) to modify the production environment may actually result in a lighter total intervention (in the situations I described before). The mistake in the claim above is that it tries to associate "breaking things" with the number of people having access to production, when it would be less imprecise to relate it to the number of interventions performed in the environment.
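
A toy model can make the last point concrete. Assume, purely for illustration, that every intervention carries the same small, independent chance of breaking something (real interventions are neither equal nor independent, and the 1% default below is invented). Then the chance of at least one breakage depends only on how many interventions happen:

```python
def chance_of_breakage(interventions, p_break=0.01):
    """Probability that at least one of `interventions` changes breaks
    something, assuming each breaks things independently with
    probability p_break (an illustrative simplification)."""
    if interventions < 0 or not 0.0 <= p_break <= 1.0:
        raise ValueError("need interventions >= 0 and 0 <= p_break <= 1")
    return 1.0 - (1.0 - p_break) ** interventions


# Headcount never appears in the formula: only the number of interventions.
few_big_fixes = chance_of_breakage(3)
many_small_fixes = chance_of_breakage(30)
```

Under these assumptions, thirty interventions are riskier than three no matter how many people hold root, which is why reducing the need to log in matters more than reducing the list of people who can.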

"Deploying fresh code to production is bad"

This argument is so dishonest that the only reason I am commenting on it is that it is astoundingly common. What it says is, basically, that developers are irresponsible people who, if given the chance, will start coding directly in production and deploying without testing, because they are stupid and do not know what the consequences of that may be. If you think your developers are like that, just have them all fired and hire another team. Decent developers won't use superuser access as developers: they will use it as sysadmins with development knowledge who are there to help the more skilled sysadmins, coordinating with them. I am a developer who has personally had root access to production on several occasions (although not for all of my systems/servers), since a time when I wasn't so experienced. I have done a relevant number of important interventions in production, but I have never:

Broken anything (it would affect my quality of life, because I'm accountable. Remember?).
Changed something without: telling the operations team, deciding/discussing the reason it had to be changed manually, and planning to incorporate it into the system or into the deployment docs (I want everybody to know as much as possible, because that raises my quality of life. Remember?).
Deployed code to production without testing (actually, I did it once, but I explained the reasons before I did, asked for permission and let a sysadmin do that). It doesn't matter if I am root: I respect people's roles (as long as they are important, like systems administrators), I don't want to do other people's jobs, I want them to know everything and I don't want good professionals to think I am irresponsible. I just want quality of life. :-)

In the end, this discussion is, like many others, much more complex than most people believe it to be. For now, I just hope I have helped you decide for your own situation, using your reasoning and not only your faith.

Saturday, May 21, 2011
