This technical exploration delves into the inner workings of `git stash` within the isomorphic-git project. This stash implementation highlights the elegance of git object model, demonstrating the Merkle tree's power in organizing data, tracking differences, and preserving data states with core git operations.
When evaluating git integration solutions for a CDE (Cloud Development Environments), I’m impressed by the technical aspects of isomorphic-git. The codebase is well-organized, documentation is in-sync, and test suite is abundant. The project boasts a robust feature set and delivers great performance. It seemed to fit the project needs perfectly until a feature gap for stash was discovered.
The feature request for git stash has tagged 'help needed' for years, apparently bridging the gap would benefit more people. A deep dive into the git object model and the intricate details of isomorphic-git is required to add the stash functionality to the project.
Now that having a PR ready, this post shares the learnings and findings as well as the implementation techniques with the community, hoping to benefit those who are looking for a similar solution or are simply interested in the object model and data structure in general.
Stash vs Branch
Git object model employs Merkle tree to ensure data integrity, enable efficient storage and retrieval, and facilitate data verification and synchronization. This recursive tree structure, composed of hashes, represents snapshots of arbitrarily large hierarchical data sets – namely, our project's working directory’s revision history.
In Git, both branch
and stash
function as "refs
" (references) pointing to commits. They share the ability to be manipulated via Git commands for navigating commit history. However, their commonality are limited.
Conceptually, branches are references to the terminal points with a commit tree, they are integrated intrinsically into the project's permanent history. In contrast, stashes reside within a separate namespace (refs/stash
). Git generates a commit object to encapsulate staged and working directory modifications when you create a stash
, but this commit neither integrates into the commit history nor associates with a specific branch. Furthermore, refs/stash
entries are local and excluded from git push
operations.
A stash is a specialized ref
that preserves a work-in-progress (WIP) state as a commit. This allows restoration to that state without altering the branch history. In essence, it is a dangling commit possessing two parents: the commit at HEAD when the stash
was created and the index state at that moment. If unstaged changes are included via stash push
(the default), a third parent is added to represent a snapshot of the working directory tree at the time of stash creation.
A stash can be seen as a temporary branch that is not meant to be merged into the main branch. It is useful for saving changes that are not ready to be committed or pushed, or for switching between branches without losing work. Unlike branches, stashes are not named and are stored in a stack-like structure, where the most recent stash
is at the top. Stashes have their own way to be accessed, including stash list
, stash show
, and stash apply
commands.
While branches and stashes are both implemented as ref
s, their intended use and impact on the repository are distinct. Implementing git stash
does not benefit from the functionality of branch
. Instead, direct manipulation of low-level Git objects is required to generate parent commits, trees, and reflogs. This task involves interaction with loose objects, but not packfiles, which are optimized for object transport with efficient storage and indexing.
Before we start to implement our own, let's inspect what regular git client creates for git stash
. We can poke into what's inside refs/stash
from command line:
$ cat .git/refs/stash
0f47ba5d10ed30779c078d31b3fc47331e4bb98b <-- current stash commit SHA
From the commit SHA, we can find the stash commit object:
$ git cat-file -t 0f47ba5d10ed30779c078d31b3fc47331e4bb98b
commit <-- type of the object
and the commit object has two parents:
$ git cat-file -p 0f47ba5d10ed30779c078d31b3fc47331e4bb98b
tree a0d51b4516951595dc4fd6ecb9beb50da495e210
parent b72fb56d1f53d4120ee6fd206ea1e77ee4071c5f
parent 41d7e412a828f5c6838f6d064881803316afeea3
author modesty <pdf2json@github.com> 1707931505 -0800
committer modesty <pdf2json@github.com> 1707931505 -0800
It points to a tree and has two parents, each represents index and working directory changes, respectively. Our task is to create a same structured commit object for `git stash
` in isomorphic-git, it's the core for other stash
operations, including apply
, drop
, pop
, list
and clear
, etc.
Git Objects: Recap
Now that we understand the unique nature of stashes, a concise review of Git object types will illuminate their internal representation. Stashes leverage three of the four fundamental Git object types: commits, trees, and blobs (we will omit tags for this discussion).
Recall the relationship between commits and trees: a tree can be conceptualized as a directory snapshot, embedding both subdirectories (represented as tree objects) and references to any number of file content blobs. Blob objects contain compressed file content, their SHA-1 hashes serving as references within their parent tree. This "children do not know parent" model makes sharing object very efficient, if a file exists in multiple branches (which is common case), only its SHA is referenced, not the file itself.
To examine the internal structure of a stash, we can inspect its associated tree object:
$ git cat-file -p a0d51b4516951595dc4fd6ecb9beb50da495e210 <-- the tree SHA above
100644 blob 1bff7bfa72823e7cd90a6e9304156571e53b0988 .eslintrc.json
100644 blob 3a15fc6e4f3d3b359f9f2900fd6e1d61824fe5c2 .gitignore
040000 tree ddfdb61937565b864aaa210a187890a7dfefdf0d .husky
100644 blob bcea5db4c05c21cbd19ea6a5fae4cb6c71e8dc19 README.md
100644 blob cb00c3f18ee24b4e75abdf9883fd6f284f048ec2 build.properties
100644 blob 0594052072a108a4446678e35e035bb0955ffb23 package-lock.json
100644 blob 3ef98a63860e011e1b573ab36302542d68bc320a package.json
040000 tree 8c4401ac77ef562385f401d90feeca7a9d6ed304 packages
100644 blob 35259cddd2bb0c6f17e40edbaa03f4feba19d142 pom.xml
040000 tree 806243fa28d66c2904f8d4eb0623a0718383f63a scripts
100644 blob 3ac4170137102fd0d14e9e2e345a794f28b21b4f tsconfig.base.json
100644 blob 2e7ef9708ca1cd337a8a57ab101b986a7a2ed8a2 webpack.base.js
The inherent elegance of the Merkle tree model is evident in its efficient and concise representation of filesystem directory contents. Each tree entry encapsulates file system permissions, filename, and the SHA-1 hash of any file or subdirectory it contains.
Below is a simplified diagram illustrating the relationships between these Git object types:
Commit Object
│
├─ Tree Object 1 ─ Blob Object (File content)
│ │
│ ├─ Tree Object 2 ─ Blob Object (File content)
│ │
│ └─ Blob Object (File content)
│
└─ Parent Commit Object (Previous commit)
│
└─ Tree Object ─ Blob Object (File content)
Crucially, the Git object model does not explicitly store "diffs" or "logs." When executing commands like git show
, git log
, or git status
, the output is dynamically generated by analyzing and comparing trees and their constituent objects. This fundamental principle is essential to the implementation of git stash
: by constructing objects and establishing relationships in accordance with the Git object model, we can seamlessly introduce new stash functionality.
Create a Stash
Implementing git stash
within isomorphic-git necessitates direct utilization of low-level APIs. Functions such as writeCommit
(distinct from the higher-level commit
), writeTree
, TREE
, WORKDIR
, STAGE
, and crucially, walk
facilitate this process.
Below are the general steps involved in generating a stash, which by default encapsulates tracked changes from both the index (staged) and working directory (unstaged):
- Obtain the HEAD of the current branch: This commit serves as the first parent of the stash commit, embodying the repository state at the time of stash creation.
const headCommit = await GitRefManager.resolve({fs, gitdir, ref: 'HEAD'})
- Construct a tree representing index changes: If staged alterations exist, generate a commit object rooted in the index tree. This commit is the second parent of the stash commit.
const indexTree = await writeTreeChanges
(fs, dir, gitdir, [ TREE({ref: 'HEAD'}),'stage'])
if (indexTree) {
const stashCommitOne = await stashMgr.writeStashCommit({
message: `stash-Index: WIP on ${branch} - ${new Date().toISOString()}`,
tree: indexTree,
parent: stashCommitParents,
})
stashCommitParents.push(stashCommitOne)
stashCommitTree = indexTree
}
- Construct a tree representing working directory changes: If unstaged modifications to tracked files are present, generate a commit object with the working tree as its root. This commit forms the third parent of the stash commit.
const workingTree = await writeTreeChanges(fs, dir, gitdir, [workDirCompareBase,'workdir'])
if (workingTree) {
const workingHeadCommit = await stashMgr.writeStashCommit({
message: `stash-WorkDir: WIP on ${branch} - ${new Date().toISOString()}`,
tree: workingTree,
parent: [stashCommitParents[stashCommitParents.length - 1]],
})
stashCommitParents.push(workingHeadCommit)
stashCommitTree = workingTree
}
- Create the stash commit: With all parent commits prepared, assemble the final stash commit.
const stashCommit = await stashMgr.writeStashCommit({
message: message.trim() || `stash: WIP on ${branch} - ${new Date().toISOString()}`,
tree: stashCommitTree,
parent: stashCommitParents,
})
- Record the stash in .git/refs/stash: Persist the newly created stash in the Git object store.
await stashMgr.writeStashRef(stashCommit)
- Append a reflog entry: document the stash creation commits within the reflog for historical tracking.
await stashMgr.writeStashReflogEntry({
stashCommit,
message: `WIP on ${branch}: ${headCommit.substring(0, 7)} ${headMsg}`,
})
- Restore the working directory to a clean state: Revert the working directory to its pre-stash condition.
await checkout({fs, dir, gitdir, ref: branch, track: false,
force: true,
})
Apply a Stash
While a stash is represented internally as a commit, simply checking out the associated commit tree does not fully replicate the intended stash behavior. Notably, a standard checkout would leave the working directory in a detached HEAD state and neglect to restore modifications to the index (staging area). A tailored approach is necessary for comprehensive stash application.
The core strategy is straightforward: we analyze the stash commit and its parent commit trees, applying any detected changes to the working directory and index as appropriate. Here's the procedural breakdown:
- Retrieve parent commits from the stash ref: Access the parent commits associated with the stash commit stored within .git/refs/stash.
const stashCommit = await stashMgr.readStashCommit()
const { parent: stashParents = null } = stashCommit.commit ? stashCommit.commit : {}
if (!stashParents || !Array.isArray(stashParents)) {
return
}
- Iterate over parent commits and apply differences: For each parent commit, compare its tree representation with the preceding parent. Apply any identified changes to either the working directory or staging area.
for (let i = 0; i < stashParents.length - 1; i++) {
const applyingCommit = await _readCommit({fs, cache: {}, gitdir, oid: stashParents[i + 1]})
const wasStaged = applyingCommit.commit.message.startsWith('stash-Index')
await applyTreeChanges(fs, dir, gitdir, stashParents[i + 1], stashParents[i], wasStaged)
}
That's it!
We can skip other common git stash operations, like pop (it's really apply + drop), drop (remove the one refs from .git/refs/stash and the top entry in reflogs), list (read and format reflogs), and clear (remove both .git/refs/stash and reflogs), nothing interesting about them.
However, notice the invocation of writeTreeChanges
when we create either index or work directory trees? plus applyTreeChanges
above? These are the essence of stash, also where I spent my most time on, let's take a closer look.
Tree Walker
At the heart of our stash implementation lies a customized tree comparison mechanism. This enables us to generate trees representing the index (staged changes) and working directory changes, essential for encapsulating the stash state. Isomorphic-git's walk
API provides the foundation for this functionality.
1. Tree Entry Filtering
We strategically employ the WalkerIterate
callback to filter out irrelevant tree entries, ensuring the resulting tree reflects the intended stash contents. The isStage
variable helps discern between index and working directory tree processing, while the changedEntries
variable captures the detected differences.
const iterate = async (walk, children) => {
const filtered = []
for (const child of children) {
const [head, stage] = child
if (isStage) {
if (stage) {
if (await fs.exists(`${dir}/${stage.toString()}`)) {
filtered.push(child)
} else {
changedEntries.push([null, stage])
}
}
} else {
if (head) {
if (!stage) {
changedEntries.push([head, null])
} else {
filtered.push(child)
}
}
}
}
return filtered.length ? Promise.all(filtered.map(walk)) : []
}
2. Analyzing Changes with Tree Mapping
The WalkerMap
callback pinpoints changes at the individual file and subtree level. Like in WalkerIterate
, head
and stage
variables represent the trees being compared. It leverages the changedEntries
structure to track modifications. This callback returns a TreeEntry
-like object, a crucial building block for tree construction.
const map = async (filepath, [head, stage]) => {
if (filepath === '.' ||
(await GitIgnoreManager.isIgnored({ fs, dir, gitdir, filepath }))
) {
return
}
if (stage) {
if ( !head ||
((await head.oid()) !== (await stage.oid()) &&
(await stage.oid()) !== undefined)
) {
changedEntries.push([head, stage])
}
return {
mode: await stage.mode(),
path: filepath,
oid: await stage.oid(),
type: await stage.type(),
}
}
}
3. Composing Tree Structures
The WalkerReduce
callback facilitates the hierarchical composition of trees. It assembles TreeEntry
instances and their children, ultimately determining the final tree structure.
const reduce = async (parent, children) => {
children = children.filter(Boolean)
if (!parent) {
return children.length > 0 ? children : undefined
} else {
parent.children = children
return parent
}
}
4. Validation and Flattening
An additional recursive process verifies that each TreeEntry
possesses a valid oid
. This step handles children entries for tree-type objects, writing them using the _writeTree
function. Blob entries undergo validation through the checkAndWriteBlob
function. This stage prepares the entries for seamless integration with the writeTree
API.
async function processTreeEntries(fs, dir, gitdir, entries) {
async function processTreeEntry(entry) {
if (entry.type === 'tree') {
if (!entry.oid) {
const children = await Promise.all(entry.children.map(processTreeEntry))
entry.oid = await _writeTree({
fs,
gitdir,
tree: children,
})
entry.mode = 0o40000
}
} else if (entry.type === 'blob') {
entry.oid = await checkAndWriteBlob(
fs,
gitdir,
dir,
entry.path,
entry.oid
)
entry.mode = 0o100644
}
entry.path = entry.path.split('/').pop()
return entry
}
return Promise.all(entries.map(processTreeEntry))
}
5. Tree Generation
The walk
function, configured with our custom callbacks, orchestrates the tree generation process. The final step involves invoking writeTree
to physically create the tree objects that will be associated with stash
commits.
const entries = await _walk({
fs,
cache: {},
dir,
gitdir,
trees,
map,
reduce,
iterate,
})
if (changedEntries.length === 0 || entries.length === 0) {
return null
}
const processedEntries = await processTreeEntries(fs, dir, gitdir, entries)
const treeEntries = processedEntries.filter(Boolean).map(entry => ({
mode: entry.mode,
path: entry.path,
oid: entry.oid,
type: entry.type,
}))
return _writeTree({ fs, gitdir, tree: treeEntries })
That's all to create the tree objects that would associate with stash
commits.
Applying a stash
necessitates another tree walk operation. However, the iterate
and reduce
callbacks are not required here. Instead, the map
callback directly handles applying the changes. It encapsulates the logic for deletions, additions, and updates to both files (blobs) and directories (trees
).
export async function applyTreeChanges(
fs,
dir,
gitdir,
stashCommit,
parentCommit,
wasStaged
) {
return _walk({
fs,
cache: {},
dir,
gitdir,
trees: [TREE({ ref: parentCommit }), TREE({ ref: stashCommit })],
map: async (filepath, [parent, stash]) => {
if (
filepath === '.' ||
(await GitIgnoreManager.isIgnored({ fs, dir, gitdir, filepath }))
) {
return
}
const type = stash ? await stash.type() : await parent.type()
if (type !== 'tree' && type !== 'blob') {
return
}
const currentFilepath = join(dir, filepath)
if (!stash && parent) {
if (type === 'tree') {
await fs.rmdir(currentFilepath)
} else if (type === 'blob') {
await fs.rm(currentFilepath)
}
return
}
const oid = await stash.oid()
if (!parent || (await parent.oid()) !== oid) {
if (type === 'tree') {
await fs.mkdir(currentFilepath)
} else {
const { blob } = await readBlob({ fs, dir, gitdir, oid })
await fs.write(currentFilepath, blob)
}
if (wasStaged) {
await add({ fs, dir, gitdir, filepath })
}
}
return {
path: filepath,
oid: await stash.oid(),
type: await stash.type(),
}
},
})
}
Conclusion
Git's stash functionality provides a compelling use case for the Merkle tree, demonstrating its versatility beyond data validation and synchronization into the realm of difference tracking and state preservation. Git's efficient data organization, indexing, querying, and data compression techniques – all facilitated by Merkle trees – underscore the model's power for managing large, complex datasets at scale. Furthermore, the concept of multiple-parent commits ingeniously enables the stash workflow without disrupting core Git operations.
Potential avenues for refinement and enhancement include:
- Conflict Resolution: Incorporating conflict detection and resolution mechanisms for
stash apply
and pop
. - Streamlined Implementation: Exploring optimizations, such as combining tree walks for the default stash use case.
- Specialized Stash Modes: Supporting advanced scenarios like "stash staged only" and "stash with untracked files" while retaining the flexibility of separate tree walks.
This exploration into the internals of Git's stash implementation has been highly instructive. I anticipate continuing these investigations and refinements beyond the initial pull request.
History
- 9th March, 2024: Initial version