A friend of mine has been telling me for some time that I should try using Git / GitHub to keep track of my files. In this post I’ll give a brief overview of both (or at least my understanding of them so far!) and describe some of my experiences of using them. Some of the terminology is (in my experience) a bit confusing, so I’ll attempt to give an intuitive introduction. For any Git experts out there, please put me straight (in a comment) if I’ve got something wrong!
Git and GitHub
Git is described as a ‘distributed version control system’. The basic idea is as follows. You start with a folder (directory) on your computer, and tell Git that you want to set up a new repository there. Next, you tell Git which files in the folder you want it to keep track of. You then make some changes to the files. At some point of your choosing, you ‘commit’ these changes, giving a text description of the changes you have made. At the point of the commit, Git records the current state of the tracked files, together with your description, in a special hidden folder (named .git) inside your folder. Strictly speaking, Git stores a snapshot of the tracked files at each commit rather than a list of differences, but it can show you the differences between any two commits, so in practice you can think of the history as a sequence of changes.
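To make the steps above concrete, here is a minimal sketch of the basic cycle as I understand it, run from a terminal. The scratch folder, the file name analysis.R, its contents, and the author name and email are all made up for illustration.

```shell
set -eu

# Hypothetical scratch folder, so the sketch does not touch any real files
repo=$(mktemp -d)
cd "$repo"

git init -q                                  # set up a new repository in this folder
git config user.name "Example Author"        # placeholder identity, needed before committing
git config user.email "author@example.com"

echo "x <- rnorm(100)" > analysis.R          # hypothetical file name and contents
git add analysis.R                           # tell Git to keep track of this file
git commit -q -m "Add first version of analysis script"

echo "summary(x)" >> analysis.R              # make a change to the tracked file
git diff --stat                              # summarise what differs from the last commit
git commit -q -a -m "Add summary of x"       # commit the change, with a text description

git log --oneline                            # one line per commit, newest first
```

The -a flag on the second commit tells Git to include all changes to files it is already tracking, which saves a separate git add.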
The obvious (and simplest) benefit of doing the above, compared to just editing your set of files on a continuous basis, is that you have a traceable history of the changes you have made to your files, along with your text comments describing the changes made at each stage (each commit). Furthermore, before you commit your new changes, if you decide you want to roll back the edits you’ve just made, you can ask Git to revert any files back to their previous state (as after the last commit).
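As a rough sketch of that last point, the commands below make an unwanted edit to a tracked file and then ask Git to put the file back as it was at the last commit. Again the folder, file name and identity are invented for the example. Newer versions of Git also offer git restore for the same job, but the older git checkout form shown here should work everywhere.

```shell
set -eu

repo=$(mktemp -d)                            # hypothetical scratch folder
cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

echo "x <- rnorm(100)" > analysis.R          # hypothetical file
git add analysis.R
git commit -q -m "Add analysis script"

echo "an edit I now regret" >> analysis.R    # a change that has not been committed
git status --short                           # lists analysis.R as modified

git checkout -- analysis.R                   # discard the edit: back to the last commit
git status --short                           # prints nothing, as nothing is modified
cat analysis.R
```

Note that this throws the uncommitted edit away for good, so it is worth checking git status first.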
Suppose that you want to develop a new aspect/module of your program, or data analysis, or whatever it is you’re working on in your repository. You could do this by modifying the files in your folder, and committing the changes as you go along. Suppose that halfway through this process, you decide that the new aspect/module you’ve worked on is actually not a great idea, and you want to revert back to where you were before you embarked on the changes. At this point, my natural guess was that Git would allow you to roll back to this earlier version. Although this is possible, a better approach would have been to do your development work on a new branch.
So, we ask Git to generate a new branch. What this means is that when we then proceed to make changes (i.e. adding our new aspect/module), the master version of our project (which itself is a branch) remains as it was, i.e. unchanged. If you like, when we created the new development branch, a snapshot of our master repository was taken, and is stored in Git’s files. We can now proceed to complete our new additions to the repository.
Suppose, as before, that we decide to give up on the new developments we were adding to our repository. In this case we first switch back to the master branch, at which point the files in our folder revert to their original state, i.e. as they were before we created the development branch. We then ask Git to delete the development branch (Git will not delete the branch you are currently on). Alternatively, let’s suppose we go on and complete the new additions to our repository. We decide that we like them, and want them to remain. In this case the development branch, which we have been working on, must now be merged into the master branch. In essence, the new additions we have made on our development branch are merged into the original version of our files on the master branch.
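Both outcomes can be sketched in a few commands. This is only an illustration under my own assumptions: the branch names experiment and development, the file notes.txt and the identity are all made up. I look up the name of the starting branch rather than hard-coding it, because it is master in this post but newer Git installations may call it main.

```shell
set -eu

repo=$(mktemp -d)                              # hypothetical scratch folder
cd "$repo"
git init -q
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.com"

echo "version 1" > notes.txt                   # hypothetical file
git add notes.txt
git commit -q -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)   # "master" here; may be "main" on newer Git

# Outcome 1: the new work turns out to be a bad idea, so we abandon it
git checkout -q -b experiment                  # create the branch and switch to it
echo "a bad idea" >> notes.txt
git commit -q -a -m "Try out an idea"
git checkout -q "$main_branch"                 # files return to their state on the main branch
git branch -D experiment                       # -D forces deletion of work that was never merged

# Outcome 2: the new work is worth keeping, so we merge it
git checkout -q -b development
echo "a good idea" >> notes.txt
git commit -q -a -m "Add new module"
git checkout -q "$main_branch"
git merge -q development                       # bring the new commits into the main branch
git branch -d development                      # safe delete: the work is now merged

cat notes.txt
```

The lowercase -d refuses to delete a branch whose work has not been merged, which is a useful safety net. The uppercase -D is needed only when you really do want to throw the work away.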
GitHub and collaboration
So far, everything we’ve talked about can be performed on one computer, or locally. But Git gains substantially when used with a central online repository. GitHub is a website which allows you to do this. Specifically, it allows you to set up online versions of your repositories. The obvious advantage is that you have a remote third party backup of your repository (or at least the version as per your last update to it). But there are many other potential benefits. If you have collaborators who you wish to contribute to the repository, they can. There are a number of different ways of collaborating using Git/GitHub, and I’d recommend this page to read further about the different approaches.
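The push and pull cycle with a central repository can be tried out without an internet connection. In the sketch below a ‘bare’ repository in a scratch folder stands in for GitHub, which is my own substitution for the sake of a self-contained example. With GitHub, the address passed to git remote add and git clone would instead be the URL shown on the repository’s page. The folder names, file name and identities are all hypothetical.

```shell
set -eu

work=$(mktemp -d)                              # hypothetical scratch folder
git init -q --bare "$work/central.git"         # stand-in for the online repository

# My local repository
git init -q "$work/mine"
cd "$work/mine"
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.com"
echo "shared analysis" > analysis.R            # hypothetical file
git add analysis.R
git commit -q -m "First commit"
branch=$(git symbolic-ref --short HEAD)        # "master" or "main", depending on Git version

git remote add origin "$work/central.git"      # tell Git where the central repository is
git push -q -u origin "$branch"                # upload my commits to it

# A collaborator takes a copy, makes a change, and pushes it back
git clone -q "$work/central.git" "$work/theirs"
cd "$work/theirs"
git config user.name "Example Collaborator"    # placeholder identity
git config user.email "collaborator@example.com"
echo "collaborator's addition" >> analysis.R
git commit -q -a -m "Add collaborator's change"
git push -q origin "$branch"

# Back in my copy, I fetch and apply the collaborator's change
cd "$work/mine"
git pull -q --ff-only origin "$branch"
cat analysis.R
```

The --ff-only option makes the pull stop rather than create a merge if my copy and the central one have diverged, which seemed the safest choice for a sketch.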
Using GitHub to host your repositories is also attractive if you want to make the files available publicly. In fact, at the time of writing, the default free account at GitHub only allowed public repositories, and you had to pay if you wanted them to host repositories privately (GitHub has since made private repositories available on free accounts too). However, particularly for package developers for say R or Stata, it may be a distinct advantage to use GitHub and have your package files available as a public repository.
Using Git/GitHub for R/Stata package development
I first started using Git and GitHub recently, while developing a new R package. To do this, I made extensive use of Hadley Wickham’s excellent website (and accompanying ‘book’). As well as making the actual development of an R package a lot, lot easier, Hadley explains why using Git and GitHub can be advantageous when developing an R (or other software) package:
- Sharing/distributing your package is easy. R users can directly install your package from GitHub. Even if you host your package on CRAN, GitHub allows users to install the current development version of your package, containing the latest bug fixes and updates, before this is available on CRAN.
- GitHub allows users or collaborators to post issues they have discovered. As well as being useful to enable users to report bugs, you and your collaborators can use issues to make notes of things which are to be added or fixed in the future.
- Free website: GitHub will turn your public repository into a website (if you wish), which you can use to give information to users about your package.
- RStudio, a GUI for R, has Git and GitHub integrated into it, meaning that you can perform most of the Git functions you need from within it.
I will admit that at first (and certainly still now to some extent) I struggled with the overwhelming quantity of terminology in Git. However, once I had got my repository set up and my basic workflow sorted out, which is essentially just committing changes and pushing them to my GitHub repository, it was pretty easy. I am aware though that I still have a lot to learn, and I know that the way I am using it does not follow best practice (I have so far only used a development branch once, for example!).
Using Git/GitHub for statistical analyses
So far I’ve only used Git/GitHub for development of packages, which is akin to a computer programming development task. I can also see the potential for it to be used in the context of performing statistical analyses. In most stats packages, one uses a script file, which contains the commands needed to perform analyses of the data in question. For all except the simplest of data analyses, these scripts often become quite long and complex. Historically my personal approach to this was to make a new copy of my script file every day, so that I could always roll back to a previous version. For most situations this may be entirely satisfactory. However, in the common situation where multiple people are involved in the analysis, I can see that it could be useful to allow the different collaborators to work on the script independently, on separate branches, and then to merge their work back into the central online repository.
Using Git/GitHub for writing scientific papers
Most people, particularly those involved in writing scientific papers, will be familiar with Microsoft Word’s track changes system. This is of course very useful, but when, as is often the case, multiple co-authors want to make additions and changes to a draft scientific paper, it is not ideal. I have been involved in papers where the primary author receives multiple copies of the original Word file, each with the corresponding collaborator’s tracked changes. It is then a formidable task to amalgamate all of these changes and additions.
I can see the potential for using Git/GitHub, or similar systems, for collaboratively working on scientific papers. One does not need to constantly construct new versions of the document with different names. All changes are commented when they are committed. Conflicts, where two people have edited the same section, would be clearly identified, allowing one to then resolve them.
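To see what a conflict looks like in practice, here is a small sketch in which a lead author and a co-author both rewrite the same sentence of a plain-text draft. The file name paper.txt, the branch name coauthor, the sentences and the identity are all invented. It also assumes the paper is written in a plain-text format (e.g. LaTeX or Markdown), since as far as I understand Git cannot merge Word files line by line.

```shell
set -eu

repo=$(mktemp -d)                              # hypothetical scratch folder
cd "$repo"
git init -q
git config user.name "Example Author"          # placeholder identity
git config user.email "author@example.com"

echo "We find a strong effect." > paper.txt    # hypothetical draft
git add paper.txt
git commit -q -m "First draft"
main_branch=$(git symbolic-ref --short HEAD)   # "master" or "main", depending on Git version

# The co-author edits the sentence on their own branch
git checkout -q -b coauthor
echo "We find a modest effect." > paper.txt
git commit -q -a -m "Co-author: tone down the claim"

# Meanwhile the lead author edits the same sentence on the main branch
git checkout -q "$main_branch"
echo "We find a strong, robust effect." > paper.txt
git commit -q -a -m "Lead author: strengthen the claim"

# Git cannot combine the two edits automatically, so the merge stops
if ! git merge coauthor; then
    git diff --name-only --diff-filter=U       # list the files with unresolved conflicts
    cat paper.txt                              # both versions, separated by conflict markers
fi

# Resolve by writing the agreed sentence, then complete the merge
echo "We find a modest but robust effect." > paper.txt
git add paper.txt
git commit -q -m "Merge co-author's edits, resolving the conflict"
```

Until the conflict is resolved and committed, the file contains both versions of the sentence between lines beginning with <<<<<<<, ======= and >>>>>>>, so nothing is silently lost.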
There’s a good discussion of the pros/cons of this at StackExchange. The main barrier I see to using it is that there is an initial investment of time needed from each collaborator to get up to speed with the version control system, and that this just might not be worth it or feasible in some situations. Moreover, from what I’ve read, Git/GitHub in particular is somewhat more technical than competing version control systems.
Further reading and links
I’ve found a number of resources extremely useful in learning about Git and GitHub:
- Git’s website, and particularly the Getting Started and Git Basics sections of their documentation page. I use the Git terminal shell for Windows, downloadable here, to run those Git commands which aren’t embedded in RStudio.
- The GitHub website. Sign up for a free account here, which gives you an unlimited number of public repositories. You need to use GitHub or a competitor to act as your online central repository, to enable collaboration or make your files easily available to others.
- Particularly in the context of developing an R package, Hadley Wickham’s R Packages pages.
- The tutorial pages at Atlassian. These are particularly useful because they explain the key Git concepts in extremely clear figures/diagrams.
There are of course many other great Git and GitHub resources out there too.
I’d be keen to hear from other statisticians who use Git/GitHub, or another version control system, in their work.