
Performing Git Operations in Python

20 Aug 2024 | CPOL | 11 min read
This article explains how to access Git repositories in Python using the GitPython library.
Git is, without question, the most popular version control system in software development. While most developers run Git from the command line, the GitPython library makes it possible to perform Git operations in code. This article explores the library's classes and methods and shows how they can be accessed in code.

I've spent several years programming in object-oriented languages, and my brain has become wired to think about problems in terms of classes, objects, methods, and properties. I don't mind performing simple actions on the command line, but for complex operations, I prefer to write code.

When it comes to version control, I appreciate the flexibility and performance of Git, but I'd never call it simple. For this reason, I like to interact with repositories programmatically using the GitPython library. This reduces the amount of text I have to enter on the command line, and it allows me to integrate Git operations into regular programs.

Another advantage of GitPython is that it provides a deeper understanding of Git. The library's classes and methods clarify how Git really works. The goal of this article is to explain how to code with these classes and methods, and the first section presents the fundamental classes. The second section discusses the important topic of submodules, which enable one repository to include other repositories.

1. Fundamental Classes

According to the project's GitHub page, the GitPython library was developed by Sebastian Thiel and Harmon. They have graciously released their package under the 3-Clause (revised) BSD license, and if you have Python installed, you can install GitPython with the following command:

pip install gitpython

Once you've installed the package and imported it in a Python script, you can access its data structures. This section looks at five fundamental classes:

  1. Repo – represents a Git repository
  2. Reference – superclass of repository references (Head, Tag, RemoteReference)
  3. Remote – represents a remote repository
  4. IndexFile – represents the repository's staging area
  5. Commit – represents a Git commit

Once you have a solid understanding of these classes, you'll have no trouble implementing complex Git operations in Python.

1.1  The Repo Class

In general, the first class used in a GitPython program is the Repo class, which represents a repository. There are two main ways to create a Repo instance:

  1. To clone a repository, call Repo.clone_from(...) with the repository's URL and the name of the local directory to hold the cloned files.
  2. To access a local repository, call Repo(...) with the path to the local repository.

The following code demonstrates how both types of Repo objects can be created. It checks whether a directory named local_dir exists, and if it doesn't, it clones the repository at the URL given by remote_url. If the local_dir folder exists, it creates a Repo from the directory.

Python
import os
from git import Repo

# local_dir and remote_url are assumed to be set before this point
if not os.path.isdir(local_dir):
    repo = Repo.clone_from(remote_url, local_dir)
else:
    repo = Repo(local_dir)

Once a Repo object has been created, its members can be accessed in code. Table 1 lists twelve of the properties that can be read:

Table 1: Properties of the Repo Class
Property          Data Type        Description
working_tree_dir  str              Path of the repo's working directory
common_dir        str              Path of the repo's Git (.git) folder
bare              bool             Identifies if the repo is bare
untracked_files   List[str]        Files in the working directory that Git isn't tracking
refs              List[Reference]  List of Reference objects
heads             List[Head]       The repo's branch heads
head              HEAD             Pointer to the current head reference
branches          List[Head]       List of Head objects representing the branch heads
active_branch     Head             Head object for the current branch
tags              List[Tag]        List of the repo's tags
remotes           List[Remote]     List of Remote objects
index             IndexFile        Represents the repo's staging area

The first four properties are easy to understand. working_tree_dir identifies the path of the working directory and common_dir identifies the path to the Git database (.git folder). The bare property identifies if the repository is bare (lacks a working directory). If the repo has a working directory containing files that Git isn't tracking, the paths of those files can be accessed through the untracked_files property.

The fifth property, refs, is particularly important because it provides access to all the references in the repo. I'll discuss the important Reference class and its subclasses shortly.

In addition to the properties listed, the Repo class provides several methods that access and modify the repository. Many of these methods start with create or delete:

  • create_head/create_tag – adds a new reference of the given type 
  • delete_head/delete_tag – deletes a reference of the given type 

The Repo class doesn't have create_branch or delete_branch methods. As the next section will make clear, branches are represented by instances of the Head class.

1.2  The Reference Class and Subclasses

To understand the Reference class, it helps to be familiar with the structure of a Git repository. If you look in a repository's .git/refs folder, you'll find three subfolders:

  • heads - contains references to the heads of branches
  • tags - contains references to tags
  • remotes - contains references to remote repositories

The refs property of a Repo is a list of Reference objects. The subclasses of Reference resemble the subfolders of the .git/refs directory, so each Reference is either a Head, Tag, or RemoteReference object. The following discussion explores each of these subclasses.

1.2.1  The Head Class

Instead of a Branch class, GitPython has a Head class that represents the branches in a repository. You can obtain a list of the Head objects by reading the heads property of a Repo instance, and every Head has two particularly useful properties:

  1. name – a string identifying the name of the branch
  2. commit – the latest Commit object of the branch

At any given time, one branch of the repository is the active (checked-out) branch, and you can access its Head through the active_branch property of the Repo. To demonstrate, the following code accesses the active branch and prints its name:

Python
print(repo.active_branch.name)

The Head class provides four useful methods:

  • checkout(force: bool=False) - switches to the branch
  • rename(str, force: bool=False) - changes the branch's name
  • set_tracking_branch(RemoteReference) - sets the branch to track the remote branch
  • tracking_branch() - returns the RemoteReference tracked by the branch or None

The easiest way to create and delete branches is to call the Repo's create_head and delete_head methods. To demonstrate how they're used, the following code creates a branch, checks it out, and then deletes it.

Python
# Access the active branch
old_branch = repo.active_branch

# Create a new branch and check it out
new_branch = repo.create_head('new_branch')
new_branch.checkout()
print('The active branch is ' + repo.active_branch.name)

# Check out the old branch
old_branch.checkout()
print('Now the active branch is ' + repo.active_branch.name)

# Delete the new branch
repo.delete_head(new_branch)

The set_tracking_branch and tracking_branch methods relate to RemoteReference instances, which will be discussed shortly.

1.2.2  The HEAD Class

It may seem confusing, but GitPython provides a HEAD class in addition to the Head class, and neither is a subclass of the other. While the current Head can be accessed with the Repo's active_branch property, the HEAD can be accessed with the Repo's head property.

A Head can represent any branch, but HEAD always refers to whatever is currently checked out: normally the current branch (and therefore its latest commit), or a single commit when HEAD is detached. It has three properties:

  • abspath – location of the HEAD file in the .git directory
  • commit – the Commit object referenced by HEAD
  • is_detached – a bool that identifies if HEAD is detached (points to a commit instead of a branch) 

In addition, HEAD provides a method named orig_head(), which points to the previous value of HEAD.

1.2.3  The Tag Class

A tag is a name assigned to a specific commit, and tags become particularly helpful when assigning version numbers. In code, a tag can be created with the Repo's create_tag method, which accepts a string and a Commit. Tags can be deleted with the Repo's delete_tag method.

1.2.4  The RemoteReference Class

Despite its name, a RemoteReference represents a branch of a remote repository, not the repository itself. You can set a local branch to track a remote branch by calling the set_tracking_branch method with a RemoteReference. As far as I can tell, this is the only purpose the RemoteReference serves.

There's no way to create a RemoteReference directly. Instead, you'll need to access a remote repository through the Remote class. The following section discusses this class in detail.

1.3  The Remote Class

A remote repository (or just remote) is a version of the codebase stored on a server. When you clone a repository, Git automatically creates a remote named origin and associates it with the repository's URL. Just as the git branch command lists the branches of the local repository, git remote lists the remotes available for tracking.

In GitPython, remote repositories are represented by instances of the Remote class, which is not a subclass of Reference. A Remote can be created by calling the create_remote method of the Repo class and deleted by calling delete_remote. The remote method returns the Remote with the given name (origin by default) and the remotes property returns a list of Remote instances.

To demonstrate this, the following code creates a branch and a remote. Then it obtains a list of the remote's branches by reading the refs property of the Remote instance. A newly created remote has no remote-tracking branches until its fetch method is called, so the code checks the list first. If the list isn't empty, the last line configures the new branch to track the first remote branch.

Python
# Create a new branch
new_branch = repo.create_head('new_branch')

# Create a new remote
new_remote = repo.create_remote('new_remote', 
    'https://github.com/mattscar/opencl_book.git')

# Set the branch to track the first remote branch
if new_remote.refs:
    new_branch.set_tracking_branch(new_remote.refs[0])

The Remote class provides three particularly important methods that transfer data between a local branch and the remote repository:

  • fetch(...) – downloads updates from the remote repository to the local repository without affecting the working directory
  • pull(...) – downloads updates from the remote repository to the local repository and merges them into the current branch
  • push(...) – uploads commits from the local repository to the remote repository

Each of these methods accepts optional parameters and returns a data structure that provides information about the operation. You can find more detailed information on the GitPython documentation site.

1.4  The IndexFile Class

Before performing a commit, Git stores updates in its staging area. In the .git folder, the index file serves as the staging area. For this reason, GitPython provides the IndexFile class to represent the repository's staging area. This can be accessed in code through the index property of the Repo instance.

The IndexFile class has a property called entries, a dictionary that maps a (path, stage) key to an IndexEntry for each tracked file in the repository. You can change the staging area by updating this dictionary.

IndexFile also provides several methods that modify the staging area or move data between the working directory, staging area, and local repository. Table 2 lists eight of these methods.

Table 2: Methods of the IndexFile Class
Method         Description
add(...)       Add files from the working tree to the index
checkout(...)  Check out the files/paths into the working tree
commit(...)    Commits the index file and creates a Commit instance
diff(...)      Compares the index to the working copy or Commit
move(...)      Renames/moves the given items
remove(...)    Removes items from the index and optionally from the working tree
reset(...)     Resets the index to reflect the tree at the given commit
update()       Rereads the index file and discards cached information

To demonstrate how these methods can be used, the following code creates a text file in the working directory, adds it to the staging area, commits the index file, and pushes it to the remote repository.

Python
# Access the index file and remote repository
index = repo.index
remote = repo.remote()

# Create a text file in the working directory
file_name = os.path.join(repo.working_tree_dir, 'example.txt')
with open(file_name, 'w') as example_file:
    example_file.write('example')

# Update the staging area (after the file has been closed and flushed)
index.add([file_name])

# Commit the update
index.commit(message='Example commit')

# Push the commit
remote.push()

The IndexFile also provides methods that interact with blobs and trees. You can find a full description of this class in the GitPython documentation.

1.5  The Commit Class

A Git repository stores committed changes and associates each with a secure hash (SHA-1) value. In GitPython, each commit is represented by an instance of the Commit class. You can create a Commit by calling the commit method of the IndexFile with a message. You can also access Commits by calling the commit method of a Repo instance.

The Commit class has several properties and Table 3 lists eleven of them.

Table 3: Properties of the Commit Class
Property             Data Type  Description
name_rev             str        The commit's SHA-1 hash followed by a name based on the closest reference (hexsha holds the hash alone)
message              str        The commit message
encoding             str        Encoding of the message (UTF-8 by default)
summary              str        The first line of the commit message
stats                Stats      Information about the commit
author               Actor      The commit's author (name and email)
authored_date        int        Time the commit was authored, in seconds since the Unix epoch
author_tz_offset     int        Time zone offset of the author, in seconds west of UTC
committer            Actor      The commit's committer (name and email)
committed_date       int        Time the commit was committed, in seconds since the Unix epoch
committer_tz_offset  int        Time zone offset of the committer, in seconds west of UTC

The stats property has a field called total, a dictionary with four keys:

  • insertions – Number of inserted lines
  • deletions – Number of deleted lines
  • lines – Number of lines changed
  • files – Number of files changed

To demonstrate how this property can be used, the following code accesses the latest commit of the repository and prints its statistics:

Python
# Access the latest commit
commit = repo.commit()

# Access and print the commit statistics
st = commit.stats
print(st.total)

The last line of code displays the number of inserted lines, deleted lines, changed lines, and changed files.

2. Submodules

Submodules make it possible to access external repositories as directories in the main repository. Many developers don't use submodules, but they can dramatically improve modularity, collaboration, and code reusability.

In GitPython, submodules are represented by instances of the Submodule class. You can create a submodule for a repository by calling the create_submodule method of the Repo instance. This accepts several parameters and Table 4 lists them all.

Table 4: Submodule Creation Parameters
Parameter               Type  Description
name                    str   Identifier for the submodule
path                    str   Relative/absolute path where the submodule should be stored
url                     str   URL of the submodule's repository
branch                  str   Name of the submodule's repo branch to be checked out
no_checkout             bool  If True, the submodule's repo branch is not checked out
depth                   int   Number of commits to be downloaded
env                     dict  Dictionary of environment variables for the submodule
clone_multi_options     list  Options used during the clone operation
allow_unsafe_protocols  bool  Whether unsafe protocols can be used
allow_unsafe_options    bool  Whether unsafe options can be used

An example will clarify how submodules are created. The following code creates a submodule named submod from the main branch of the repository at https://github.com/submod.git. The repository will be accessed from a directory named submod:

Python
# Set path for the submodule
path = os.path.join(repo.working_tree_dir, 'submod')

# Create submodule
submod = repo.create_submodule('submod', path, 'https://github.com/submod.git', 'main')

In this code, the submod variable is an instance of the Submodule class. This provides many useful properties, including branch_name, branch_path, url, and parent_commit. It also has many useful methods, including move, remove, and update. The children method provides a list of the submodule's submodules.

One particularly useful method is module, which converts the Submodule into a new Repo instance. Once this is called, an application can use regular Repo methods to perform Git operations on the submodule.

3. History

This article was initially submitted on August 20, 2024.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)