Xentac's Xenophilia

A love of the unknown and alliteration

2012 Jan 19

The Real Difference Between Git and Mercurial

I have a friend who is quite proficient with git but recently started a job that uses mercurial for its development, and he’s been learning how to use it. I took that opportunity to do some research into the history of git and mercurial and into why they’ve turned out to be such different (yet similar) tools. I will say up front that I am a fan of git, but my intention is not to show that one is better than the other, only to highlight the differences between the two and why they are the way they are. I will try to include references where I have found them.

There are a number of commonalities between git and mercurial: they are both version control systems; they both refer to their revisions with hashes; they both represent history as Directed Acyclic Graphs (DAGs); they both offer a lot of high level functionality like bisect, history rewriting, branches, and selective commits.

Both git and mercurial were developed to solve a large problem that arose in 2005: the Linux kernel could no longer use BitKeeper for free as its version control system. Having used BitKeeper for three years, the kernel developers had become accustomed to a distributed workflow. No longer were patches emailed between people, lost, resubmitted, and managed personally by a series of shell scripts. Now patches and features were recorded, pulled, and merged by a fancy piece of software that made it possible to track history over long periods and hunt down regressions.

It also strengthened the kernel development workflow where there was one Dictator and multiple Lieutenants, each responsible for their own subsystems. Each Lieutenant vetted and accepted patches to their subsystem, and Linus pulled their changes and made the official Linux repository available. Anything that replaced BitKeeper would have to enable this workflow.

Not only did any replacement need to support a distributed workflow, it also had to be fast for a large number of changes and files. The Linux kernel is a very large project that has thousands of changes each day contributed by thousands of people.

Lots of tools were evaluated and none quite passed muster. Matt Mackall decided to create mercurial to solve the problem around the same time [1] that Linus decided to create git. Both borrowed some ideas from the monotone project. I will try to identify those where I recognize them.

Both git and mercurial identify versions of files with hashes. File hashes are combined in manifests (git calls them trees and git trees can also point to other trees). Manifests are pointed to by revisions/commits/changelogs (commits from now on). The key to how the various tools differ is how they represent these concepts.

Mercurial decided to solve the performance problem by developing a specialized storage format: the revlog [2]. Every tracked file is stored as an index plus a data file. Data files contain snapshots and deltas; a new snapshot is only created when the number of deltas needed to represent a file goes over a threshold. The index is the key to efficient access to the data file. Changes to files are only ever appended to the data file, and because files aren’t always changed sequentially, the index is used to group parts of the data file into coherent chunks that represent a particular file version.
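If you have a mercurial checkout handy, you can peek at a revlog directly; mercurial ships debug commands that dump the index and reconstruct a revision from its snapshot-plus-deltas chain (the file name here is just an example, and the exact column layout varies between versions):

    # Show the revlog index for one tracked file: one row per revision,
    # with offsets into the data file and the base of its delta chain.
    hg debugindex README

    # Rebuild and print the full text of revision 0 of that file.
    hg debugdata README 0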

From file revisions, manifests are created, and from manifests, commits are created. Creating file revisions, finding them, and calculating differences between them are all very efficient with this scheme, and it takes a relatively small amount of space on disk to represent the changes. The network protocol used to transfer changes is similarly efficient.

Git takes the opposite approach: file blobs [3]. To store revisions quickly, each new file revision is a complete copy of the file. These copies are compressed, but there is still a lot of duplication. The developers of git have created ways to reduce the storage requirements by packing data, essentially creating something like a revlog for a given point in time. These packs are not the same thing as a revlog, but they serve a similar purpose of storing data in a space-efficient format.
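A quick way to see the blob model in action from the command line (the object hash is whatever git prints for your content):

    # Compress a file's current contents into a loose object and print its hash.
    git hash-object -w README

    # Dump any object (blob, tree, or commit) back out by hash.
    git cat-file -p <hash>

    # Count the loose objects, then let git pack them into a space-efficient pack file.
    git count-objects
    git gc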

Because git stores every object in its own file, its history is a lot more fluid. Object files can be copied in from anywhere using any method (e.g. rsync). Commits can be created or destroyed. Just as history isn’t linear in the distributed version control world, git’s data model doesn’t depend on linear files. Mercurial’s file format is to git as compressed files are to sparse files.

Both tools have the notion of branches, but they mean different things. A mercurial branch is a name that is recorded in a commit and sticks around forever. Anyone who pulls from you will see all the branches that are in your repository and which commits are in each one. There are ways to get git-style branches in mercurial, but we will get into that later when we talk about extensions.

Git branches are just pointers to commits. That’s it. They do nothing other than tell the git client, “when I’m on this branch, this is what my working copy looks like”. They can point to different commits, they can be deleted, and they can be passed around (each one is uniquely identified by the local name of the repository it came from). There is one convenience the git client offers you: when you make a commit, the current branch pointer is automatically updated.
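You can see how thin a branch really is by poking around .git yourself (this assumes a branch named master whose ref hasn’t been packed yet):

    # A branch is literally a little file containing a commit hash.
    cat .git/refs/heads/master

    # Creating a branch just writes another one of those files...
    git branch experiment

    # ...and deleting it only removes the pointer; the commits stay put.
    git branch -d experiment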

Generally, when people want git-style branches in mercurial, they create a new clone. That’s great if all you want is to create commits on two concurrent development streams, but if you want to start merging between them or comparing histories, you need tools that understand that these two directories are related in some way (I’m sure extensions exist to do that, but I’m getting to that).

Mercurial branches serve a different purpose than git branches: they represent a shared place for development to happen outside the default branch. Because everyone shares branch names, they are reserved for long-lived versions of your project.

Given these differences, it’s no wonder that git and mercurial have different interfaces. Mercurial makes it easy to create commits, push and pull them, and generally move history forward. Git doesn’t care about history moving forward; all it cares about is creating commits and pointing at them. It doesn’t matter what the commits represented previously or what the pointers used to point at; this is what they mean now.

There do exist safeguards to make sure that git branch changes don’t silently destroy history that was previously pulled down into a local repository: the fast-forward check. Git will complain if a fetch tries to pull down commits that can’t be reconciled by simply moving a branch forward, but those errors can be overridden if you expect the changes.
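As a rough sketch of what that looks like (the branch names and the rejection message are illustrative):

    # By default git refuses to move a ref backwards during a fetch...
    git fetch origin master:refs/remotes/origin/master
    # ! [rejected]  master -> origin/master  (non-fast-forward)

    # ...but a '+' on the refspec (or --force) says you expect the change.
    git fetch origin +master:refs/remotes/origin/master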

This points to one of the main differences I’ve found between git and mercurial. When a git user runs into a problem, they look at the tools they have on hand and ask, “how can I combine these ideas to solve my problem?” When a mercurial user runs into a problem, they look at the problem and ask, “what code can I write to work around this?” They are very different approaches that may end up at the same place, but they follow alternate routes.

To roll back a commit or a pull/merge, git just points the branch pointer at the old commit. In fact, any time you want to go back to a previous state, git keeps a reflog to tell you what the commit hash was at that point. As long as something has been committed, you can always get it back in git.
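For example, undoing a bad reset or merge looks something like this (the hashes and reflog entries shown are made up):

    # git records every place HEAD and each branch used to point.
    git reflog
    # 3f2a1bc HEAD@{0}: merge topic: Merge made by recursive.
    # 9d8e7f6 HEAD@{1}: commit: last good state

    # Point the current branch back at wherever it was one step ago.
    git reset --hard HEAD@{1}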

As far as I know, there are cases in mercurial where you can’t get back to where you were. Because solving a problem in mercurial generally creates another commit, it can be hard in some cases to say, “put me back to the moment exactly before I screwed everything else up”.

To solve problems in mercurial, you end up with a lot of extensions. Each extension solves its particular problem well, but without the benefit of the underlying data model, and combining features and functionality across extensions gets complicated.

Here is a great example of that. In git, if you want to record your current working directory state without creating a local branch/commit, you can use a stash. What is a stash? It’s a commit (and a ref pointing at it) that isn’t stored in the standard place. It doesn’t show up when you ask for all the branches (git branch), but all of the tools can treat it like a branch. Once you’ve created a few stashes (they stack up under a single special ref), it’s possible to do things like compare them to existing files, or refer to them by the time they were created, using standard syntax.
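A rough sketch of that workflow (the file name is hypothetical):

    # Squirrel away the current working directory and index.
    git stash

    # Stashes stack up under a single special ref and can be listed...
    git stash list

    # ...compared against like any other commit...
    git diff stash@{0} -- somefile.c

    # ...and brought back when you are ready.
    git stash pop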

If you want to do the same thing in Mercurial, you can use the attic or shelve extension (or the pbranch extension, says the attic extension page). Both store the stashed patches as files in the repository that can be committed if necessary. Each one solves a slightly different problem in a slightly different way instead of being able to use the underlying “plumbing” [4] to store data in a consistent manner.

Another great example is git commit --amend. If you want to modify the most recent commit, to add something you forgot or just change the commit message, git commit --amend will create a whole new set of file objects, tree objects, and a commit object. After it’s done those things, it updates the branch pointer. If you then decide that that wasn’t really what you wanted to do, you can just point the branch pointer back at the previous commit with git reset --hard HEAD@{1} (or by looking through the reflog for the commit hash that the branch used to point at).

To do the same thing in Mercurial, there are a few options: you can roll back the commit and then create a new one, but then all record of the original commit is gone; or you can use the queue extension to import the last commit, modify it with your current changes, and then create a new commit. Neither of these options benefits from any feature of mercurial’s data store; they exist solely to work around it.
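For the record, the two mercurial routes look roughly like this (the second one assumes the mq extension is enabled):

    # Option 1: throw the last commit away entirely, then redo it.
    hg rollback
    hg commit -m "Second try"

    # Option 2: pull the last commit into the patch queue, fold in the
    # new changes, then turn it back into a regular commit.
    hg qimport -r tip
    hg qrefresh
    hg qfinish --applied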


  1. I have seen it said that mercurial was an older and more mature project than git, but Matt Mackall says that Linus had a few days’ head start.

  2. Matt Mackall released a paper on Revlog and Mercurial at the Ottawa Linux Symposium, 2006.

  3. The Git Object Model from the Git Community Book.

  4. Git refers to the underlying code as “plumbing” and the user interface code as “porcelain”.

 
2011 Oct 26

"They download it off my hard drive?" and other misconceptions about internet security

“They download it off my hard drive?” was the question I received after explaining the basics of how BitTorrent works to a layperson.

For those of you who don’t know, BitTorrent is a protocol that lets many people share large files over the internet quickly. It was unique when it came out because everyone who wanted to download a “torrent” also had to upload it to other people. Instead of having one person send all the data to everyone who requests it, a torrent is broken up into chunks and you can download a chunk from anyone who has it, just as others can download the chunks you have from you.

This person’s concern was that if someone can access the torrent chunks on their hard drive, what’s stopping them from accessing any file? The short (and alarmist) answer is nothing, but it’s much more nuanced than that. It also points to a misconception about how computers, data, and applications interact.

There are programs that let remote people download any file from your computer; they’re called malware. Malware is a program that runs on your computer and gives access to remote users. Depending on what the program is written to do, it may upload files to remote locations, delete files, install other programs, log all the keys you type, or any of a host of other things.

The main difference between malware and regular, safe programs is intention. What is the intent of the software? Any software on your computer could upload your files, you just trust that most don’t. That new word processor you installed, the driver for your video card, every little game you download, or something sent to your email could contain malicious code that intends to do bad things to your computer. That’s why we have to be careful when opening attachments that end up in our inboxes.

Just because a particular program’s primary intention is to download from and upload to other people on the internet doesn’t make it inherently more or less safe; once you run it on your computer it can do just about anything. Instead you have to ask if you trust a particular program to do what you think it does. In the case of a popular BitTorrent client, you’re probably pretty safe.

 
2011 Oct 05

Node.js is Candy

This is a response to Node.js is Cancer. While I may agree with certain points, I take issue with the tone and a number of the implications and inferences.

The article starts off talking about how all function calls that do CPU work also block. This is true. Just as the LinkedIn developers found that “[they] have a recommendation engine that does a ton of data crunching; Node isn’t going to be the best for that”, so too will you find that Node.js isn’t the hottest for purely CPU-bound load.

How often do you do lots of data crunching in your product? More often it’s a load of db queries, possibly some calls to remote APIs, maybe even disk access. In all of these cases, an event loop-based service won’t be waiting around for responses before moving on to other work.

The author says he’s “God Damned terrified of any ‘fast systems’ that ‘less-than-expert programmers’ bring into this world” (referring to a quote from the Node.js homepage). Generally I prefer to test/monitor any system written by any programmer, instead of just assuming that expert or less-than-expert programmers can implement systems of any speed. It helps to improve all experiences, user and programmer alike.

After the straightforward critique of Node.js for data processing, the article goes over the deep end.

  A long time ago, the original neckbeards decided that it was a good idea to chain together small programs that each performed a specific task, and that the universal interface between them should be text.

This statement, while true, ignores the change in perspective between then and now. As long as your process is serial and the input and output is well defined, text pipes are generally ok. Once you start having more complex data or want to run multiple processes at once, serialized text interfaces become more difficult to work with.

  If you develop on a Unix platform and you abide by this principle, the operating system will reward you with simplicity and prosperity. As an example, when web applications first began, the web application was just a program that printed text to standard output. The web server was responsible for taking incoming requests, executing this program, and returning the result to the requester. We called this CGI, and it was a good way to do business until the micro-optimizers sank their grubby meathooks into it.

I’m not sure, but I think he just said that any web application that doesn’t use a CGI interface is sunk full of grubby micro-optimizer meathooks. That pretty much describes all web applications written in the past six years, since PHP fell out of popularity. One of the reasons CGI has fallen out of favour is the long startup time while binaries load and libraries are linked/imported. The reason we run mod_php, FastCGI, SCGI, or standalone HTTP servers is so we don’t have to pay the process startup cost on every request.

I do find it interesting that he used CGI as The Right Way to do web applications the Unix way. CGI depends on environment variables more than text pipes to send data between web server and web application.

  Conceptually, this is how any web application architecture that’s not cancer still works today: you have a web server program that’s job is to accept incoming requests, parse them, and figure out the appropriate action to take. That can be either serving a static file, running a CGI script, proxying the connection somewhere else, whatever. The point is that the HTTP server isn’t the same entity doing the application work. Developers who have been around the block call this separation of responsibility, and it exists for a reason: loosely coupled architectures are very easy to maintain.

Just as in the last paragraph, now every web application architecture that doesn’t work this way is a cancer. I don’t know how this guy develops, but I know that Django has code to handle static files even though it’s not meant to be used in production.

I totally agree that loosely coupled architectures are very easy to maintain. What’s stopping you from writing loosely coupled Node.js services? How is that any different than any other application architecture? It’s just another tool in the toolbox.

  And yet, Node seems oblivious to this. Node has (and don’t laugh, I am not making this shit up) its own HTTP server, and that’s what you’re supposed use to serve production traffic. Yeah, that example above when I called http.createServer(), that’s the preferred setup.

This part is true. I think the fact that the bundled HTTP server is the preferred setup is mostly an indication of the immaturity of the project, not an indictment of the effort going into it. Your Node.js application does not need to run an HTTP server. I’d argue that a Node.js application should be written such that it can have multiple interfaces: HTTP, message queue, pipes (wait-a-minute…). This is an example of a loosely coupled architecture. I feel like this has been mentioned before. Oh yeah, it has because it’s very easy to maintain.

Many other web frameworks (Rails, Django, Pyramid/Pylons/Turbogears) include their own HTTP servers. I don’t understand how this is somehow something that makes node different and worthy of ridicule.

  If you search around for “node.js deployment”, you find a bunch of people putting Nginx in front of Node, and some people use a thing called Fugue, which is another JavaScript HTTP server that forks a bunch of processes to handle incoming requests, as if somebody maybe thought that this “nonblocking” snake oil might have an issue with CPU-bound performance.

Fugue isn’t actually another JavaScript HTTP server. It’s just a library that uses the http module to accept requests and send them to child processes (usually one per CPU). In fact, there is only one HTTP server in that whole bunch: the main process. All the child processes just handle requests that were originally accepted by the main Fugue process. Because Fugue implements the same API as the http module, your application doesn’t have to be written specifically to support it.

Since Node.js uses an event loop, it’s single threaded by default. Fugue is an easy way to take a single threaded application that works great and spread it across multiple cores. Sounds like a great separation of responsibility to me: don’t make your application worry about multiple processors itself, let another library worry about it.

  If you’re using Node, there’s a 99% probability that you are both the developer and the system administrator, because any system administrator would have talked you out of using Node in the first place. So you, the developer, must face the punishment of setting up this HTTP proxying orgy if you want to put a real web server in front of Node for things like serving statics, query rewriting, rate limiting, load balancing, SSL, or any of the other futuristic things that modern HTTP servers can do. That, and it’s another layer of health checks that your system will need.

Aren’t developers who are also system administrators called devops? Don’t they exist in places like Flickr, Etsy, and Amazon? I’m not sure that system administrators would try to convince you not to use Node, unless they’re the same ones that think CGI is a viable deployment architecture.

I’m also not sure what adds “another layer of health checks”. Does he mean nginx or node? Any layer you add to your application needs to be health checked. Including CGI.

The author also complains that “It’s Fucking JavaScript”. If you hate JavaScript that much, then don’t use node.js. If you like the idea of event loops and node.js but don’t like the code example he offers, perhaps you should look at CoffeeScript or Closure.

His eventual conclusion is that node.js is unpleasant software and he won’t use it. That’s fine with me. My problem is that I don’t understand what he does use. He’s really not very clear about what is better than node, except that it follows the Unix Way and is most probably CGI. I’m curious how he addresses slow process startup times, parallel workloads, multiple cores, and communication between loosely coupled services within a larger product architecture. Maybe he just writes perl, ignores process metrics, runs a process for each request, and implements entire web applications as a series of modules. That’s all I can conclude from this article.

 
2011 Sep 27

Nodejs chef cookbook update

I’m not a chef or nodejs expert. I’m sure there are better ways to do this, so don’t be afraid to post a comment.

Digitalbutter has a nodejs chef cookbook that installs nodejs. Since nodejs is so new, you need to build it from source on most distros to get the most recent version. It also supports installing npm dependencies and configuring a nodejs service using upstart. All fairly useful functionality as devops goes.

Since starting a new project and integrating vagrant into it, I noticed a limitation with the cookbook: it builds nodejs from a github clone every time it runs. If you’re running this periodically on a powerful server, it’s probably not that big of a deal. It is a big deal if you’re running it in a virtual machine every time you want to do some quick development. Every time I ran vagrant up, I had to wait 5 to 10 minutes for nodejs to build. This just wouldn’t do.

I had two problems to solve: how do I know what version of nodejs is currently installed, and how do I bail out of the build as early as possible when it isn’t needed? The former was easy; the latter was a little more difficult.

At the end of a successful build, I saved the output of git show -s --format=%H to /usr/local/share/node_version. Now I always knew the hash of the last successful build.

Initially, I compared the working directory hash with the previously built hash and skipped the build if they matched. This shaved 3-8 minutes off the entire chef-solo run. Most of the remaining time in this recipe was spent on the actual clone: it seemed that /tmp was cleared when vagrant restarted, so the clone itself didn’t exist anymore and needed to be re-downloaded every time.
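The guard ended up being the moral equivalent of this (written here as plain shell rather than the actual chef resource, and the /tmp/node path is just an example):

    # Hash recorded at the end of the last successful build.
    installed=$(cat /usr/local/share/node_version 2>/dev/null)

    # Hash of the checkout we are about to build.
    wanted=$(cd /tmp/node && git show -s --format=%H)

    # Only build and record the new hash if they differ.
    if [ "$installed" != "$wanted" ]; then
      (cd /tmp/node && ./configure && make && make install)
      echo "$wanted" > /usr/local/share/node_version
    fi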

How can we see what hash a particular ref points at without doing a full clone? Back to the git manpages we go. git ls-remote can connect to any remote repository and give us the hashes and names of every remote ref, including ones you might not want to see (remember that a “ref” is anything inside the .git/refs directory; notes are stored in refs/notes, for example). Luckily there are -t and -h options to show only tags and heads. This works great for any head, because a head ref points directly at a commit. It didn’t work so well for tags: tags can be objects too, which have different hashes than the commits they point at.

Back to the git manpages. This time I looked at git-rev-parse and found that ref^{} means “dereference the tag to the commit it points at”. This works with git ls-remote, so with an extra call we are able to check both the ref itself and the dereferenced tag without even cloning the repo.
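Putting it together, a couple of ls-remote calls get both hashes without a clone (the repository URL and tag name are just examples):

    # A head points straight at a commit...
    git ls-remote -h git://github.com/joyent/node.git master

    # ...but an annotated tag gives back the hash of the tag object itself,
    git ls-remote -t git://github.com/joyent/node.git v0.4.12

    # ...so ask for the dereferenced form to get the commit it points at.
    git ls-remote git://github.com/joyent/node.git 'v0.4.12^{}'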

The commit that implements this is uploaded to my fork.

 