Migrating to GIT with Reposurgeon



I’ve recently worked a lot with reposurgeon, a tool by Eric S. Raymond for doing surgery on version control data. With this tool it is possible to migrate from almost any version control system to almost any other, although these days the most feature-complete system is GIT, which is the recommended and best-supported target system (and the only one I’ve tested). Reposurgeon comes with a migration guide, the DVCS migration HOWTO.

Beyond just converting data, reposurgeon can be used to clean up artefacts. The simplest such cleanup is reformatting commit comments to conform to the established conventions for GIT commit messages.
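
To give an idea of what such a reformatting involves, here is a minimal Python sketch. It is not how reposurgeon does it. The function name reformat_commit_message and the 72-column width are my own assumptions for illustration. It only shows the usual GIT convention of a summary line, a blank line, and a wrapped body.

```python
import textwrap


def reformat_commit_message(message: str, width: int = 72) -> str:
    """Reshape a free-form commit comment into summary, blank line, body.

    Assumption: the first non-empty line of the old comment serves as
    the summary line; everything after it becomes the body.
    """
    lines = [line.rstrip() for line in message.strip().splitlines()]
    if not lines:
        return ""
    summary, rest = lines[0], lines[1:]

    # Group the remaining lines into paragraphs separated by blank lines.
    paragraphs = []
    current = []
    for line in rest:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    # Re-wrap each paragraph to the target width.
    body = "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)
    return summary + ("\n\n" + body if body else "") + "\n"
```

Note that this naive re-wrapping would also flatten bullet lists or indented text in a comment, so a real cleanup needs more care than this.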

My conversion of the history of the pyst project, a Python library to connect to different network interfaces of the Asterisk telephony engine, is a good example of what can be done with reposurgeon. The project was originally started by Karl Putland, who kept it in CVS, and was later taken over by Matthew Nicholson, who used Monotone for version control. When I took over maintenance in 2010, I used Subversion. So we had three separate source code repositories in different version control systems. No effort was ever made to convert the version history from one system to another: each new maintainer simply imported the last release version into the new version control system and continued from there.

Fortunately, Monotone has a GIT export feature and reposurgeon can natively read the other two formats. So I used separate reposurgeon scripts to clean up the three repositories and then used the reposurgeon graft command to unite them into one. The Subversion repo started with release 0.2, but there had been some commits after 0.2 in Monotone (which were later merged into Subversion), so the commits after 0.2 were put on a branch in the new repository. The last step was to write out the resulting repository in GIT fast-export format and import it into a GIT repository.

What artefacts did I clean up? Let me give two examples, both of which are problems I have when using Subversion.

Subversion doesn’t have the concept of a tag like other version control systems have. Instead, tags are emulated by copying the to-be-tagged content to a new location in the repository, effectively creating a new branch. That this copy is called a tag is just a naming convention.

The first example deals with last-minute changes when doing a release. When doing a release in GIT and something in the release process doesn’t work (before the release has actually been published), we can still move the release tag, as long as we haven’t pushed our changes to the public repository. This isn’t possible in Subversion, because a tag can’t really be removed as in other systems. So I frequently have the situation that a release tag isn’t just one commit but several, and the changes are either merged from the trunk to the tag-branch or the other way round from the tag-branch to the trunk. An example of a commit on a tag (r22 “fix PACKAGE definition for SF release”) and the subsequent merge back to trunk can be seen in the following illustration, created with the graph command of reposurgeon. The tag, originally Subversion commit r21, has already been tagified by reposurgeon. But the commit on the tag is now on the branch V_0_3.

Figure: First example

This can be fixed with reposurgeon with the following commands:


debranch V_0_3 trunk
[/^V_0_3\//] paths sup
:70 unmerge
:70 tagify --canonicalize
tag emptycommit-23 delete
tag V_0_3-root rename V_0_3
tag V_0_3 move :69

This puts the branch V_0_3 back onto the trunk, where its files end up in a new subdirectory V_0_3. This subdirectory prefix is then removed with the paths reposurgeon command. Next we use unmerge to make the merge commit a normal commit with only one predecessor, and turn this commit into a tag, which is possible because it no longer changes any files. Finally we delete the tag just created (emptycommit-23), rename the V_0_3-root tag to V_0_3, and move it to its new position.
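
The effect of the path selection combined with paths sup can be pictured with a small Python toy model. This is only my simplified reading of the command, not reposurgeon code: commits are modelled as plain lists of paths, and the function name strip_first_component and the file names in the usage example are made up for illustration.

```python
import re


def strip_first_component(commits, pattern=r"^V_0_3/"):
    """Toy model: in every commit touching a path that matches the
    pattern, drop the first directory component from each path."""
    selector = re.compile(pattern)
    rewritten = []
    for paths in commits:
        if any(selector.search(p) for p in paths):
            # Paths without a directory component are left unchanged.
            paths = [p.split("/", 1)[1] if "/" in p else p for p in paths]
        rewritten.append(paths)
    return rewritten


# Hypothetical file names, only for illustration.
commits = [["V_0_3/setup.py", "V_0_3/pyst/__init__.py"], ["README"]]
print(strip_first_component(commits))
```

Only the commit that touches something below V_0_3/ is rewritten; the second commit is not selected and keeps its paths.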

The second example involves accidental commits on a release tag. This frequently happens to me when using Subversion for doing a release and happens as follows:

  • Create new tag by copying to a subdirectory in tags
  • Switch to this new tag using svn switch
  • Do the release
  • Forget to switch back to trunk
  • Come back later, do some accidental commits on the tag
  • Merge accidental changes back to trunk
  • Revert the changes on the tag

An example of this problem can be seen in the following figure.

Figure: Second example

This example is from my svnpserver project and shows a series of commits on the V_0_4 tag. Just before the next release I noticed the mistake, merged the commits from the tag to trunk, and reverted the erroneous commits on the tag. This is fixed with reposurgeon as follows:

:14063 delete
debranch V_0_4 trunk
[/^V_0_4\//] paths sup
tag V_0_5-root rename V_0_5
tag V_0_5 move :14061
:14062 unmerge
:14062 tagify --canonicalize
tag emptycommit-4734 delete

First we delete the commit that reverts the changes on the tag. Then we move the commits from the tag to trunk and remove the resulting path prefix V_0_4. The new V_0_5 tag (renamed from V_0_5-root) is moved to what was previously the last commit on the tag, because we’re going to eliminate the merge commit next: we first make the merge commit a normal commit by removing the earlier ancestor using unmerge. The last step is to convert this commit into a tag (which is possible because it now doesn’t modify anything) and remove the resulting tag.

Modifying history is usually a bad idea when converting repositories. After all, the version control system is there to preserve the history. My rule is to remove only artefacts of the version control system in use that would never have occurred with another system. All the problems above would have been avoided by using, e.g., GIT in the first place: with GIT we can simply move a tag (if we haven’t pushed yet), and the erroneous commits on the tag could never have happened because we don’t have to switch branches to do a release with GIT, so forgetting to switch back from the branch is not possible.

So by using reposurgeon we now have a GIT repository for pyst that spans the entire history of the project united from three different version control systems in use over the duration of the project.