Git is, without question, the most popular version control system in software development. While most developers run Git from the command line, the GitPython library makes it possible to perform Git operations in code. This article explores the library's classes and methods and shows how they can be accessed in code.
I've spent several years programming in object-oriented languages, and my brain has become wired to think about problems in terms of classes, objects, methods, and properties. I don't mind performing simple actions on the command line, but for complex operations, I prefer to write code.
When it comes to version control, I appreciate the flexibility and performance of Git, but I'd never call it simple. For this reason, I like to interact with repositories programmatically using the GitPython library. This reduces the amount of text I have to enter on the command line, and it allows me to integrate Git operations into regular programs.
Another advantage of GitPython is that it provides a deeper understanding of Git. The library's classes and methods clarify how Git really works. The goal of this article is to explain how to code with these classes and methods, and the first section presents the fundamental classes. The second section discusses the important topic of submodules, which enable one repository to include other repositories.
1. Fundamental Classes
According to its GitHub page, the GitPython library was developed by Sebastian Thiel and Harmon. They have graciously released their package under the BSD 3-Clause license, and if you have Python installed, you can install GitPython with the following command:
pip install gitpython
Once you've installed the package and imported it in a Python script, you can access its data structures. This section looks at five fundamental classes:
- Repo – represents a Git repository
- Reference – superclass of repository references (Head, Tag, RemoteReference)
- Remote – represents a remote repository
- IndexFile – represents the repository's staging area
- Commit – represents a commit in the repository
Once you have a solid understanding of these classes, you'll have no trouble implementing complex Git operations in Python.
1.1 The Repo Class
In general, the first class used in a GitPython program is the Repo class, which represents a repository. There are two main ways to create a Repo instance:
- To clone a repository, call Repo.clone_from(...) with the repository's URL and the name of the local directory to hold the cloned files.
- To access a local repository, call Repo(...) with the path to the local repository.
The following code demonstrates how both types of Repo objects can be created. It checks whether a directory named local_dir exists, and if it doesn't, it clones the repository at the URL given by remote_url. If the local_dir folder exists, it creates a Repo from the directory.
import os
from git import Repo

# local_dir and remote_url are assumed to be defined earlier
if not os.path.isdir(local_dir):
    repo = Repo.clone_from(remote_url, local_dir)
else:
    repo = Repo(local_dir)
Once a Repo object has been created, its members can be accessed in code. Table 1 lists twelve of the properties that can be read:
Table 1: Properties of the Repo Class
Property | Data Type | Description |
working_tree_dir | str | Path of the repo's working directory |
common_dir | str | Path of the repo's Git (.git) folder |
bare | bool | Identifies if the repo is bare |
untracked_files | List[str] | Untracked files in the working directory |
refs | List[Reference] | List of Reference objects |
heads | List[Head] | The repo's branch heads |
head | HEAD | Pointer to the current head reference |
branches | List[Head] | Alias of heads |
active_branch | Head | Head object of the current branch |
tags | List[Tag] | List of the repo's tags |
remotes | List[Remote] | List of Remote objects |
index | IndexFile | Represents the repo's staging area |
The first four properties are easy to understand. working_tree_dir identifies the path of the working directory and common_dir identifies the path to the Git database (.git folder). The bare property identifies if the repository is bare (lacks a working directory). If the repo has a working directory containing untracked files (files that Git isn't tracking yet), their paths can be accessed through the untracked_files property.
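As a rough illustration of the first four properties, the sketch below collects them into a dictionary. The name describe_repo is hypothetical and not part of GitPython; the function only reads the properties discussed above and assumes it is handed a Repo instance.

```python
def describe_repo(repo):
    """Collect a few Repo properties into a plain dict.

    `repo` is expected to be a git.Repo instance; describe_repo itself
    is a hypothetical helper, not a GitPython function.
    """
    info = {
        'git_dir': repo.common_dir,
        'bare': repo.bare,
    }
    # A bare repository has no working directory, so the working-tree
    # properties are only read when one exists.
    if not repo.bare:
        info['working_tree_dir'] = repo.working_tree_dir
        info['untracked_files'] = list(repo.untracked_files)
    return info
```

With GitPython installed, calling describe_repo(repo) on the Repo created earlier should return the values for that repository.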
The fifth property, refs, is particularly important because it provides access to all the references in the repo. I'll discuss the important Reference class and its subclasses shortly.
In addition to the properties listed, the Repo class provides several methods that access and modify the repository. Many of these methods start with create or delete:
- create_head/create_tag – adds a new reference of the given type
- delete_head/delete_tag – deletes a reference of the given type
The Repo class doesn't have create_branch or delete_branch methods. As the next section will make clear, branches are represented by instances of the Head class.
1.2 The Reference Class and Subclasses
To understand the Reference class, it helps to be familiar with the structure of a Git repository. If you look in a repository's .git/refs folder, you'll find three subfolders:
- heads - contains references to the heads of branches
- tags - contains references to tags
- remotes - contains references to the branches of remote repositories
The refs property of a Repo is a list of Reference objects. The subclasses of Reference resemble the subfolders of the .git/refs directory, so each Reference is either a Head, Tag, or RemoteReference object. The following discussion explores each of these subclasses.
1.2.1 The Head Class
Instead of a Branch class, GitPython has a Head class that represents the branches in a repository. You can obtain a list of the Head objects by reading the heads property of a Repo instance, and every Head has two important properties:
- name – a string identifying the name of the branch
- commit – the latest Commit object of the branch
Every repository has a single active branch (the one currently checked out), and you can access its Head through the active_branch property of the Repo. To demonstrate, the following code accesses the active branch and prints its name:
print(repo.active_branch.name)
The Head class provides four useful methods:
- checkout(force: bool=False) – switches to the branch
- rename(str, force: bool=False) – changes the branch's name
- set_tracking_branch(RemoteReference) – sets the branch to track the given remote branch
- tracking_branch() – returns the RemoteReference tracked by the branch or None
The easiest way to create and delete branches is to call the Repo's create_head and delete_head methods. To demonstrate how they're used, the following code creates a branch, checks it out, and then deletes it.
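# Remember the current branch so it can be restored
old_branch = repo.active_branch
# Create a branch at the current commit and switch to it
new_branch = repo.create_head('new_branch')
new_branch.checkout()
print('The active branch is ' + repo.active_branch.name)
# Switch back before deleting; the checked-out branch can't be deleted
old_branch.checkout()
print('Now the active branch is ' + repo.active_branch.name)
repo.delete_head(new_branch)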
The set_tracking_branch and tracking_branch methods relate to RemoteReference instances, which will be discussed shortly.
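As a small sketch of how tracking_branch might be used, the function below reports which remote branch, if any, each local branch tracks. The name tracking_summary is hypothetical and not part of GitPython; it assumes it receives a Repo instance.

```python
def tracking_summary(repo):
    """Return {branch name: tracked remote branch name or None}.

    `repo` is expected to be a git.Repo instance; tracking_summary is a
    hypothetical helper, not part of GitPython.
    """
    summary = {}
    for head in repo.heads:
        remote_ref = head.tracking_branch()  # RemoteReference or None
        summary[head.name] = remote_ref.name if remote_ref is not None else None
    return summary
```

For a freshly cloned repository, the result would typically contain a single entry whose value names a branch of origin.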
1.2.2 The HEAD Class
It may seem confusing, but GitPython provides a HEAD class in addition to the Head class, and neither is a subclass of the other. While the current Head can be accessed with the Repo's active_branch property, the HEAD can be accessed with the Repo's head property.
A Head can represent any branch, but HEAD always refers to whatever is currently checked out, which is normally the latest commit of the current branch. It has three properties:
- abspath – location of the HEAD file in the .git directory
- commit – the Commit object referenced by HEAD
- is_detached – a bool that identifies if HEAD is detached (points to a commit instead of a branch)
In addition, HEAD provides a method named orig_head(), which returns a reference to the previous value of HEAD.
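The difference between head and active_branch matters most when HEAD is detached, because there is then no active branch to read. The sketch below checks is_detached first; describe_head is a hypothetical name, and hexsha is the Commit attribute that holds the full hash string.

```python
def describe_head(repo):
    """Describe what HEAD currently refers to.

    `repo` is expected to be a git.Repo instance; describe_head is a
    hypothetical helper, not part of GitPython.
    """
    head = repo.head
    if head.is_detached:
        # With a detached HEAD there is no active branch to ask for,
        # so only the commit is reported.
        return 'detached at ' + head.commit.hexsha[:7]
    return 'on branch ' + repo.active_branch.name
```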
1.2.3 The Tag Class
A tag is a name assigned to a specific commit, and tags become particularly helpful when assigning version numbers. In code, a tag can be created with the Repo's create_tag method, which accepts the tag's name and, optionally, the commit to tag. Tags can be deleted with the Repo's delete_tag method.
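Here is a minimal sketch of tagging the current commit with a version name. The helper tag_head is hypothetical; it relies on the Repo's tags property and on create_tag's ref and message keyword arguments, and it skips creation when a tag with the same name already exists.

```python
def tag_head(repo, tag_name, message=None):
    """Tag the commit that HEAD points to, unless the tag already exists.

    `repo` is expected to be a git.Repo instance; tag_head is a
    hypothetical helper, not part of GitPython.
    Returns True if a tag was created and False if the name was taken.
    """
    existing = {tag.name for tag in repo.tags}
    if tag_name in existing:
        return False
    # ref selects the commit to tag; passing a message asks Git for an
    # annotated tag instead of a lightweight one.
    repo.create_tag(tag_name, ref=repo.head.commit, message=message)
    return True
```

A call such as tag_head(repo, 'v1.0', message='First release') would tag the latest commit, where the tag name and message are only examples.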
1.2.4 The RemoteReference Class
Despite its name, a RemoteReference represents a branch of a remote repository, not the repository itself. You can set a local branch to track a remote branch by calling the set_tracking_branch method with a RemoteReference. As far as I can tell, this is the only purpose the RemoteReference serves.
There's no way to create a RemoteReference directly. Instead, you'll need to access a remote repository through the Remote class. The following section discusses this class in detail.
1.3 The Remote Class
A remote repository (or just remote) is a version of the codebase stored on a server. When you clone a repository, Git automatically creates a remote named origin and associates it with the repository's URL. Just as the git branch command lists the branches of the local repository, git remote lists the remotes available for tracking.
In GitPython, remote repositories are represented by instances of the Remote class, which is not a subclass of Reference. A Remote can be created by calling the create_remote method of the Repo class and deleted by calling delete_remote. The remote method returns the Remote with the given name, and the remotes property returns a list of Remote instances.
To demonstrate this, the following code creates a branch and a remote. Then it obtains a list of the remote's branches by reading the refs property of the Remote instance. If this list isn't empty, the last line configures the new branch to track the first remote branch.
new_branch = repo.create_head('new_branch')
new_remote = repo.create_remote('new_remote',
    'https://github.com/mattscar/opencl_book.git')
# refs stays empty until the remote has been fetched
if new_remote.refs:
    new_branch.set_tracking_branch(new_remote.refs[0])
The Remote class provides three particularly important methods that transfer data between a local branch and the remote repository:
- fetch(...) – downloads updates from the remote repository to the local repository without affecting the working directory
- pull(...) – downloads updates from the remote repository to the local repository and merges them into the current branch
- push(...) – uploads commits from the local repository to the remote repository
Each of these methods accepts optional parameters and returns a data structure that provides information about the operation. You can find more detailed information on the GitPython documentation site.
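As one example of the returned data, fetch produces a list of FetchInfo objects, one per reference. The sketch below turns that list into simple tuples; fetch_and_report is a hypothetical name, and it relies on the name and commit properties of FetchInfo.

```python
def fetch_and_report(remote):
    """Fetch from a remote and list the references the fetch reported.

    `remote` is expected to be a git.Remote instance; fetch_and_report
    is a hypothetical helper, not part of GitPython.
    """
    report = []
    # fetch() returns one FetchInfo object per fetched reference.
    for info in remote.fetch():
        report.append((info.name, info.commit.hexsha[:7]))
    return report
```

Calling fetch_and_report(repo.remote()) would contact the default remote, so it needs network access and valid credentials where the remote requires them.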
1.4 The IndexFile Class
Before performing a commit, Git stores updates in its staging area. In the .git folder, the index file serves as the staging area. For this reason, GitPython provides the IndexFile class to represent the repository's staging area. This can be accessed in code through the index property of the Repo instance.
The IndexFile class has a property called entries, a dictionary that maps a (path, stage) key to an IndexEntry for each tracked file in the repository. You can change the staging area by updating this dictionary and then writing the index back with the write method.
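For a quick look at the staging area, the sketch below lists the paths found in entries. It assumes each key of the dictionary is a (path, stage) tuple, and staged_paths is a hypothetical name rather than a GitPython function.

```python
def staged_paths(index):
    """Return the sorted paths of all entries in the staging area.

    `index` is expected to be a git.IndexFile; staged_paths is a
    hypothetical helper, not part of GitPython.
    """
    # Each key of index.entries is a (path, stage) tuple. Stage 0 is the
    # normal case; other stages appear during merge conflicts, so a set
    # removes duplicate paths.
    return sorted({str(path) for path, _stage in index.entries})
```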
IndexFile also provides several methods that modify the staging area or move data between the working directory, staging area, and local repository. Table 2 lists eight of these methods.
Table 2: Methods of the IndexFile Class
Method | Description |
add(...) | Adds files from the working tree to the index |
checkout(...) | Checks out the files/paths into the working tree |
commit(...) | Commits the index file and creates a Commit instance |
diff(...) | Compares the index to the working copy or Commit |
move(...) | Renames/moves the given items |
remove(...) | Removes items from the index and optionally from the working tree |
reset(...) | Resets the index to reflect the tree at the given commit |
update() | Rereads the index file and discards cached information |
To demonstrate how these methods can be used, the following code creates a text file in the working directory, adds it to the staging area, commits the index file, and pushes it to the remote repository.
index = repo.index
remote = repo.remote()
file_name = os.path.join(repo.working_tree_dir, 'example.txt')
# Create a new file in the working directory
with open(file_name, 'w') as example_file:
    example_file.write('example')
# Stage the file, commit the index, and push to the default remote (origin)
index.add([file_name])
index.commit(message='Example commit')
remote.push()
The IndexFile also provides methods that interact with blobs and trees. You can find a full description of this class in the GitPython documentation.
1.5 The Commit Class
A Git repository stores committed changes and associates each with a secure hash (SHA-1) value. In GitPython, each commit is represented by an instance of the Commit class. You can create a Commit by calling the commit method of the IndexFile with a message. You can also access Commits by calling the commit method of a Repo instance.
The Commit class has several properties and Table 3 lists eleven of them.
Table 3: Properties of the Commit Class
Property | Data Type | Description |
name_rev | str | The commit's SHA-1 hash followed by a name derived from the closest reference |
message | str | The commit message |
encoding | str | Encoding of the message (UTF-8 by default) |
summary | str | The first line of the commit message |
stats | Stats | Information about the commit |
author | Actor | The commit's author (name and email) |
authored_date | int | Time the commit was authored, in seconds since the epoch |
author_tz_offset | int | Time zone offset of the author, in seconds west of UTC |
committer | Actor | The commit's committer (name and email) |
committed_date | int | Time the commit was committed, in seconds since the epoch |
committer_tz_offset | int | Time zone offset of the committer, in seconds west of UTC |
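The date and offset properties are plain integers, so they usually need converting before they're displayed. The sketch below combines the two into a timezone-aware datetime. It assumes the offset is expressed in seconds west of UTC, which is how the GitPython documentation describes it; commit_time is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

def commit_time(epoch_seconds, tz_offset_seconds):
    """Convert a commit's date and offset fields into an aware datetime.

    Assumes the offset is given in seconds west of UTC. commit_time is a
    hypothetical helper, not part of GitPython.
    """
    # "Seconds west of UTC" has the opposite sign of a UTC offset.
    tz = timezone(timedelta(seconds=-tz_offset_seconds))
    return datetime.fromtimestamp(epoch_seconds, tz)
```

For a Commit named commit, the call commit_time(commit.authored_date, commit.author_tz_offset) should give the authoring time in the author's own time zone.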
The stats property has a field called total, a dictionary with four keys:
- insertions – Number of inserted lines
- deletions – Number of deleted lines
- lines – Number of lines changed
- files – Number of files changed
To demonstrate how this property can be used, the following code accesses the latest commit of the repository and prints its statistics:
commit = repo.commit()
st = commit.stats
print(st.total)
The last line of code displays the number of inserted lines, deleted lines, changed lines, and changed files.
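Printing the raw statistics works, but a formatted line is often easier to read. The following sketch builds one from the total dictionary, assuming it contains the insertions, deletions, and files keys; format_stats is a hypothetical name.

```python
def format_stats(total):
    """Format the commit.stats.total dictionary as a one-line summary.

    format_stats is a hypothetical helper, not part of GitPython.
    """
    # Extra keys such as 'lines' are simply ignored by str.format.
    template = ('{files} file(s) changed, {insertions} insertion(s), '
                '{deletions} deletion(s)')
    return template.format(**total)
```

With the variables from the code above, print(format_stats(st.total)) would display the summary for the latest commit.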
2. Submodules
Submodules make it possible to access external repositories as directories in the main repository. Many developers don't use submodules, but they can dramatically improve modularity, collaboration, and code reusability.
In GitPython, submodules are represented by instances of the Submodule class. You can create a submodule for a repository by calling the create_submodule method of the Repo instance. This accepts several parameters and Table 4 lists them all.
Table 4: Submodule Creation Parameters
Parameter | Type | Description |
name | str | Identifier for the submodule |
path | str | Relative/absolute path where the submodule should be stored |
url | str | URL of the submodule's repository |
branch | str | Name of the submodule's repo branch to be checked out |
no_checkout | bool | If True, the submodule's repo branch is not checked out |
depth | int | Number of commits to be downloaded |
env | dict | Dictionary of environment variables for the submodule |
clone_multi_options | list | Options used during the clone operation |
allow_unsafe_protocols | bool | Whether unsafe protocols can be used |
allow_unsafe_options | bool | Whether unsafe options can be used |
An example will clarify how submodules are created. The following code creates a submodule named submod from the main branch of the repository at https://github.com/submod.git. The repository will be accessed from a directory named submod:
path = os.path.join(repo.working_tree_dir, 'submod')
submod = repo.create_submodule('submod', path, 'https://github.com/submod.git', 'main')
In this code, the submod variable is an instance of the Submodule class. This provides many useful properties, including branch_name, branch_path, url, and parent_commit. It also has many useful methods, including move, remove, and update. The children method provides a list of the submodule's submodules.
One particularly useful method is module, which returns a Repo instance for the submodule's repository. Once this is called, an application can use regular Repo methods to perform Git operations on the submodule.
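As a short sketch of this idea, the function below visits every submodule of a repository and records the commit each one has checked out. It relies on the Repo's submodules property, and submodule_commits is a hypothetical name. It also assumes every submodule has already been checked out, because module raises an error otherwise.

```python
def submodule_commits(repo):
    """Return {submodule name: short hash of its checked-out commit}.

    `repo` is expected to be a git.Repo instance; submodule_commits is a
    hypothetical helper, not part of GitPython.
    """
    commits = {}
    for submodule in repo.submodules:
        # module() returns a Repo for the submodule's own repository.
        sub_repo = submodule.module()
        commits[submodule.name] = sub_repo.head.commit.hexsha[:7]
    return commits
```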
3. History
This article was initially submitted on August 20, 2024.