Introduction and Background
First off, if you're a developer in a hurry and here for a quick solution (i.e. you're not really interested in reading the article), you can skip down to the "Using the Code" section towards the bottom. That should provide an adequate Quick Start to implement this class & get your problem solved in a matter of minutes. That said, the rest of you wise and patient people, read on...
What follows is an analysis of what should be a dead-simple task: Insert text into an existing file. This all came about because a project of mine required good, byte-at-a-time file handling. Specifically, given a potentially very large file, I needed to be able to insert, overwrite, find, replace and delete text within that file. Overwriting, truncating and appending are easy, but conflicting opinions abound about how best to insert text into existing files. The obvious problem is that the BinaryWriter lacks an "insert" mode, so it overwrites anything in its path when you use it. Sure, Microsoft could have cobbled together a special-purpose StreamWriter that transparently handled all the mechanics of this, but they didn't, and it's probably just as well. Who knows... we might learn a thing or two as we create our own.
Problem Summary
For a simple example, say we have a file containing "eggyung". We want to insert "foo" between "egg" and "yung." For convenience, I'll refer to the text before the insertion point ("egg") as the left-hand or LH segment, and the text after the insertion point ("yung") as the right-hand or RH segment.
All we need to do is move "yung" over to make room for "foo" in the target file. Some people do this by holding that RH text in some sort of buffer -- a temp file, a MemoryStream, a StringBuilder, maybe just a plain old string. Others do it by creating a new file and writing to it: first the original LH segment, then the text being inserted, and finally the original RH segment. After all that, the original file is deleted or moved, and the new file is renamed to take over the original's identity.
Since my project needs to deal with potentially very large files, and because I rather dislike the thought of instantiating multi-gigabyte Strings and MemoryStreams, I dismissed the String buffer and RegEx search/replace approaches from the get-go. I also have an irrational (maybe even psychotic) aversion to temp files and file-name swapping; maybe I was attacked by a temp file as a child and I've suppressed the memory... I don't know. At any rate, I don't want to write any more data to disk than is necessary, which means not touching the LH segment at all and handling the RH segment as little as possible.
That said, let's dig into the solution...
Solution Strategy
A word about the "without temp files" bit in this article's title... these statements have not been evaluated by the FDA, and there's a bit of fine print: In some cases, data can be written to a temporary location that's neither in a temp file, a local variable, nor a MemoryStream. Where is it then? How about at the end of the target file itself! But the goal here is to accomplish the file modification with a minimum number of bytes written, meaning we buffer only as a last resort. So, after illustrating this simple "target-file-as-its-own-temp-space" trick, I'll describe several other techniques that eliminate the need for any temp buffering at all.
If the worst-case scenario involves handling the RH data twice, then the good news is that it's not always necessary to do that, even when using conventional file & stream techniques. If the text to be inserted is longer than the RH segment, then we know that the RH segment's post-insertion destination does not overlap its original position. This is good; it lets us take a shortcut around buffering: (a) Extend the EOF by the length of the inserted text, (b) copy the RH segment into the newly added space, and (c) write the text we're inserting at the insertion point, overwriting any remnants of the original RH data (Illustration 1):
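Here's a minimal sketch of that shortcut. It's not the FileHelper code from the download (the method name and the single-FileStream approach are mine), but it shows the three steps:
// Illustration 1 as code: the insert is at least as long as the RH
// segment, so the RH segment's source and destination ranges can't
// overlap, and a plain forward copy is safe. Hypothetical helper name.
static void InsertWithoutOverlap(string path, long insertPos, byte[] insertBytes)
{
    using (FileStream fs = new FileStream(path, FileMode.Open,
        FileAccess.ReadWrite, FileShare.None))
    {
        long rhLen = fs.Length - insertPos;
        // (a) extend the EOF by the length of the inserted text
        fs.SetLength(fs.Length + insertBytes.Length);
        // (b) copy the RH segment into the newly added space
        for (long i = 0; i < rhLen; i++)
        {
            fs.Position = insertPos + i;
            int b = fs.ReadByte();
            fs.Position = insertPos + insertBytes.Length + i;
            fs.WriteByte((byte)b);
        }
        // (c) write the inserted text over the remnants of the old RH bytes
        fs.Position = insertPos;
        fs.Write(insertBytes, 0, insertBytes.Length);
    }
}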
Unfortunately, this doesn't work when the inserted text is shorter than the right-hand segment - at least not if we're copying byte-by-byte as above. In that situation, the RH segment's source and destination ranges overlap, which means we'd eventually start overwriting characters that we hadn't yet read if we continued along in this manner (Illustration 2):
The "normal" solution is to buffer the RH chunk somewhere, and for small to medium files, that absolutely makes sense. For very large files though, this isn't practical. Here's another approach: (a) add space equal to the size of the RH chunk to the end of the target file, (b) copy the RH chunk there, (c) write the text we're inserting at the insertion point, (d) copy the RH chunk back to where the newly inserted text ends, and lastly (e) truncate the file to the correct post-insertion length (Illustration 3):
This approach isn't ideal, but it keeps us from having to worry about managing temp files, dealing with lock collisions, or swapping filenames. But can't we do any better here? The better news is that, yes, we can do away with temp buffering altogether. Sound good? (Note: there's still an example routine in the source code for Illustration 3, called InsertBytesUsingEOFTemp.)
The method for doing this is easier than you might expect; it simply involves using a StreamReader and BinaryWriter to move backwards through a byte range to be copied. By stepping through the bytes in right-to-left order, we can copy any range to a destination overlapping its own right edge without ever overwriting yet-to-be-read source bytes. Illustration 4 shows how: Start with the Reader positioned at the end of the source range. The corresponding writer starts at the end of the destination range. Each read/write operation decrements the position of both, rather than advancing them as would usually be the case (Illustration 4):
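A minimal version of that backwards copy looks like this, sketched with a single FileStream for brevity rather than the StreamReader/BinaryWriter pair the FileHelper code uses:
// Illustration 4: copy a range onto an overlapping destination to its
// right by walking right-to-left, one byte per step.
static void CopyBackwards(FileStream fs, long src, long dest, long len)
{
    for (long i = len - 1; i >= 0; i--)
    {
        fs.Position = src + i;    // read the not-yet-overwritten source byte
        int b = fs.ReadByte();
        fs.Position = dest + i;   // write it to its final home
        fs.WriteByte((byte)b);
    }
}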
So we've eliminated temp buffering at this point, but the forward-only nature of fast, non-cached StreamReaders imposes a hefty performance penalty when we reposition them backwards. The class conceals an internal buffer that has to be cleared before each backwards seek operation. However, a modest read/write buffer size will drastically cut down on the number of these expensive operations. Add that to our performance gains from never having to write the RH segment to disk more than once, and we're in pretty good shape. Illustration 5 shows the modified technique:
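Sketched the same way, the buffered variant moves one chunk per seek instead of one byte:
// Illustration 5: chunked right-to-left copy. Each pass seeks once, reads
// a chunk, and writes it, so the expensive backwards repositioning drops
// from once per byte to once per chunk.
static void CopyBackwardsBuffered(FileStream fs, long src, long dest,
    long len, int bufSize)
{
    byte[] buf = new byte[bufSize];
    long remaining = len;
    while (remaining > 0)
    {
        int n = (int)Math.Min(bufSize, remaining);
        remaining -= n;                 // step left by one chunk
        fs.Position = src + remaining;
        fs.Read(buf, 0, n);
        fs.Position = dest + remaining;
        fs.Write(buf, 0, n);
    }
}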
A quick word to those who are about to get pedantic with me about the read/write buffer: Yes, there is a "buffer" there, but a read/write buffer is very different from the temp buffers we set out to avoid, and I'm counting on the reader to recognize that distinction. For those who just can't go there with me, there's always the option of going back to Illustration 4 and doing things a byte at a time. You'll take a mean performance hit, but hopefully it will be worth feeling vindicated about that pesky read/write buffer.
We're doing better than before at this point, but we're still pitting memory allocation against seeking backwards with StreamReaders. When I started examining the performance of this approach, I found that peak performance depended on the machine's available physical memory exceeding the number of bytes to be moved. We can't count on this being true, so let's take another crack at improving performance some other way.
The main drags on performance in Illustration 5 come from two operations: allocating large memory blocks, and repositioning a StreamReader backwards. Therefore, our solution parameters should include:
- Continue to avoid redundant disk caching of the RH segment;
- Replace large memory allocations with several smaller ones;
- Eliminate any backward repositioning of stream readers;
- Read/write buffers are permissible;
This leads to a pretty simple solution: Use two read/write buffers instead of one. We read into the 2nd buffer n bytes ahead of the write position, where n is the buffer's size, then copy the 2nd buffer's contents into the 1st buffer, then write the 1st buffer's (new) contents to the file. Lather, rinse and repeat until done, as Illustration 6 shows:
With this strategy, by the way, there's not much use in making your read/write buffer much bigger than half the bytes to be copied. The risks of using too large a buffer are far worse than those of using one that's too small. If the buffer's too small, you may or may not see a bit of drag on performance. If the buffer's too large though, you could (a) run into an OutOfMemoryException, or (b) have the runtime give you the memory only to page a bunch of data to disk in the process (just what we were trying to avoid in the first place).
Fortunately, this routine happens to perform best with surprisingly small buffer sizes. Now comes the task of finding the "optimal" size, and the demo code contains a number of performance tests to help us do just that. To make this easy to try at home, I will quickly explain the demo program before giving you the results of my own tests.
Demonstration
To run the demos, simply open the solution and hit F5. You don't even need any test files to get started; by default, the demo routines will create them on the fly. The first set of tests simply creates a small file and walks through the different Insert methods, using a Find() routine to make sure the inserted text actually appears in the file afterwards.
After this, the demo program will ask if you want to run the long performance tests. If you respond with a "Y," you'll see a screen like the following:
"328" alt="Screenshot - FileOpsDemoScreen2.png" src="InsertTextInCSharp/FileOpsDemoScreen2.png" width="640" />
Use one of the options above to create a large test file, or provide your own if you wish. The next question prompts you for an alternate drive if desired. If you have a separate physical drive from your system drive - especially if you have a RAID-0 array available - I highly recommend using it. A SATA-II RAID-0 array without contention outperformed my system drive by a 2:1 margin or better in almost every test I ran. If you don't have a PC with a fast RAID though, any halfway decent disk on a separate spindle from your system drive can still speed things up significantly.
Once your test file is ready, the next screen will prompt you for one final option before the tests run: the insert strategy. Option [1] represents the "target-as-its-own-temp-file" strategy from Illustration 3, Option [2] represents the "reverse traversal" method from Illustration 5, and Option [4] represents the "buffer swapping with forward transposition" technique in Illustration 6. (If you make an invalid entry here, it will default to [2], using the transposeReverse method.)
Once you've made your choice and hit the [Enter] key, the tests will start running. Option 1 is not affected by buffer size, so only a single insert will be performed. Options 2 and 4 will run a series of tests using different buffer sizes to insert text at the very beginning of the target file. Note that the time estimate the program gives is a rough one, and it's really only calibrated for Option 2. Still, on most systems the Option 2 tests will finish in the estimated time or less, and the other options will finish even faster... Option 1 because it doesn't test multiple buffer sizes, and Option 4 because it's a speed demon.
When the test routines finish, the demo will first give you the opportunity to view the head of the modified test file, and then it will clean up after itself by deleting the temp file if you wish. After that, you'll have the choice to run another test series or quit.
If you inspect the sample code, you'll see a few routines default to a 32 KB read/write buffer when none is specified by the caller. Feel free to experiment with that size; your mileage may vary, but I found that the transposeForward double-buffering approach was consistently the best performer, and 32 KB was that algorithm's best or second best performing buffer size on every file size tested.
The following graph shows the average of about 25 different test runs of each algorithm on my system. My basic hardware setup was an AMD Athlon X2 3800 dual-core processor, 2GB RAM, and I wrote to a 1TB RAID-0 made of Seagate SATA II drives. The speed is shown in megabytes written per second for each given buffer size, and represents an average across target files from 32MB to 1GB.
transposeReverse's nature is to gain speed as buffer size increases, up to a ceiling that is a function of available memory relative to the amount of data being moved. I found it fascinating that transposeForward always - regardless of file size - had its best performance when the buffer was between 16 KB and 256 KB, with a peak around 32 KB, and that it always had a pronounced valley with buffers between 512 KB and 2 MB.
The "target-file-as-its-own-temp-space" strategy doesn't use an iterative read/write buffer, so its performance was a pretty flat 10 MB/second across all trials.
A few observations about coding and performance tuning:
- Watch out for the different signatures between BinaryWriter's Seek(int position, SeekOrigin origin) and FileStream's Seek(long position, SeekOrigin origin). You can only use BinaryWriter's native Seek method on positions up to 2GB (see the sketch after this list);
- Buffer size is limited to 2GB (or int.MaxValue) by virtue of the fact that neither the reader's Read method nor the writer's Write method has an overload that accepts a long for the count parameter;
- Always set the file's length (see SetFileLen in the sample code) before doing a bunch of write operations that increase the file's size. I found that file writes were about three times faster without the overhead of implicit file size increases during each write iteration;
- transposeForward could gain some efficiency by using a buffer size that's evenly divisible into the length of bytes to be copied, when possible. As of this writing, the code makes no attempt at this optimization.
- None of the routines in this class make any provision for double-byte character sets, so they will take some modification if you want to use them on Unicode / UTF-8 files, etc.
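To make the first and third tips concrete, here's a small, self-contained example (the path and sizes are made up for illustration):
// Seek past the 2GB mark via the writer's underlying stream, and set the
// file's length once, up front, before a batch of size-growing writes.
using (FileStream fs = new FileStream(@"C:\test\big.dat",
    FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read))
using (BinaryWriter bw = new BinaryWriter(fs))
{
    long target = 3L * 1024 * 1024 * 1024;   // 3GB: too big for an int offset
    fs.SetLength(target + 64 * 1024);        // grow the file once, up front
    // bw.Seek() takes an int offset; BaseStream.Seek() takes a long.
    bw.BaseStream.Seek(target, SeekOrigin.Begin);
    bw.Write(new byte[64 * 1024]);           // this write no longer extends the file
}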
There's a good deal more to be learned about optimizing disk operations. If this topic interests you, a good place to start is Microsoft's white paper, Sequential File Programming Patterns and Performance with .NET.
The Implementation
The main class for the methods described in this article is simply called FileHelper. All the methods are static, and there are a few simple supporting routines that I'll just briefly describe. First is a WriteBytes routine that just wraps the basic functionality of a BinaryWriter... it seeks to a desired position in the target file and writes an array of bytes there. Next, SetFileLen encapsulates the FileStream class's SetLength method.
The work that's of interest to us happens mostly in the routines InsertBytes, Transpose, transposeReverse, and transposeForward. (Public methods begin with initial capitals, while private ones are camelCase.)
InsertBytes forms the outer layer of this rig, while Transpose provides a convenient place for some minimal "smarts" while relying on the supporting routines above to do most of the work. Transpose gets a filename, an array of bytes to be inserted (simply named "bytes[]"), and an insertion position from InsertBytes. Armed with these, it compares the length of the RH segment with that of bytes[]. If the RH segment is shorter than bytes[], it skips buffering and uses the approach in Illustration 1. If the RH segment is longer, it calls on transposeForward to execute the buffer-swapping copy approach from Illustration 6.
public static void Transpose(string filename, long SourcePos, long DestPos,
    long Len, int bfrSz)
{
    // A range moving right that overlaps its own destination needs the
    // buffered forward-transposition routine (Illustration 6).
    if (DestPos > SourcePos && Len > (DestPos - SourcePos))
    {
        transposeForward(filename, SourcePos, DestPos, Len, bfrSz);
    }
    else
    {
        // Otherwise a plain forward, byte-at-a-time copy can't clobber
        // unread bytes: either the ranges don't overlap, or the move is
        // to the left.
        using (FileStream fsw = new FileStream(filename,
            FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read))
        using (FileStream fsr = new FileStream(filename,
            FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (StreamReader sr = new StreamReader(fsr))
        using (BinaryWriter bw = new BinaryWriter(fsw))
        {
            sr.BaseStream.Position = SourcePos;
            bw.BaseStream.Seek(DestPos, SeekOrigin.Begin);
            for (long i = 0; i < Len; i++)
            {
                // Read from the underlying stream: StreamReader.Read()
                // decodes characters and would corrupt raw bytes.
                bw.Write((byte)sr.BaseStream.ReadByte());
            }
        }
    }
}
I'm skipping over an explanation of transposeReverse since it's only called from demo code; it's not really useful in production anyway, given how much better transposeForward's performance is. transposeForward takes the same arguments, and copies the source byte range to the destination position a la Illustration 6:
private static void transposeForward(string filename, long SourcePos,
    long DestPos, long Length, int bfrSz)
{
    long distance = DestPos - SourcePos;
    if (distance < 1)
    {
        throw new ArgumentOutOfRangeException("DestPos",
            "DestPos is less than SourcePos, and this method can only " +
            "copy byte ranges to the right.\r\nUse the public Transpose() " +
            "method to copy a byte range to the left of itself.");
    }
    long readPos = SourcePos;
    long writePos = DestPos;
    bfrSz = bfrSz < 1 ? 32 * KB : (int)Math.Min(bfrSz, Length);
    bfrSz = (int)Math.Min(bfrSz,
        (My.ComputerInfo.AvailablePhysicalMemory * .4));
    // The writer trails the reader by exactly one buffer, so when the
    // ranges overlap, the buffer must be at least as large as the shift
    // distance or the writer would clobber bytes not yet read.
    if (distance < Length && bfrSz < distance)
    {
        bfrSz = (int)Math.Min(distance, int.MaxValue);
    }
    long numReads = Length / bfrSz;
    byte[] buff = new byte[bfrSz];
    byte[] buff2 = new byte[bfrSz];
    int remainingBytes = (int)(Length % bfrSz);
    using (FileStream fsw = new FileStream(filename,
        FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read))
    using (FileStream fsr = new FileStream(filename,
        FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (StreamReader sr = new StreamReader(fsr))
    using (BinaryWriter bw = new BinaryWriter(fsw))
    {
        sr.BaseStream.Seek(readPos, SeekOrigin.Begin);
        bw.BaseStream.Seek(writePos, SeekOrigin.Begin);
        // Prime the read-ahead buffer with the first chunk.
        sr.BaseStream.Read(buff2, 0, bfrSz);
        for (long i = 1; i < numReads; i++)
        {
            // Swap the read-ahead into the write buffer, read the next
            // chunk, then flush the write buffer to the file.
            buff2.CopyTo(buff, 0);
            sr.BaseStream.Read(buff2, 0, bfrSz);
            bw.Write(buff, 0, bfrSz);
        }
        buff2.CopyTo(buff, 0);
        if (remainingBytes > 0)
        {
            // Read the final partial chunk, flush the last full one, and
            // shrink the working buffer to the remainder's size.
            buff2 = new byte[remainingBytes];
            sr.BaseStream.Read(buff2, 0, remainingBytes);
            bw.Write(buff, 0, bfrSz);
            bfrSz = remainingBytes;
            buff = new byte[remainingBytes];
            buff2.CopyTo(buff, 0);
        }
        // Flush the final buffer: the last full chunk, or the remainder.
        bw.Write(buff, 0, bfrSz);
    }
    buff = null;
    buff2 = null;
    GC.Collect();
}
transposeForward and transposeReverse are both declared private, because the InsertBytes and Transpose methods form the public interface for these features, and those routines should conceal all this byte-shifting complexity from the caller. transposeForward will simply throw an ArgumentOutOfRangeException if you try to copy a byte range to the left with it.
These routines are well suited for finding and replacing text in files, since that's little more than a special case of transposition. You'll need to roll your own Replace method if you want one, but that's a cinch with the above ingredients. Using FileHelper's Find method, define the RH segment as starting at the end of the FindWhat occurrence in question. If the ReplaceWith text is shorter than the FindWhat text, you can use just steps 3, 4 and 5 from Illustration 3... simply write ReplaceWith starting at the position of FindWhat, then shift the RH segment left & truncate the file. When ReplaceWith is longer than FindWhat, then you'll have to move the RH segment just like when performing an insert, but accounting for the difference between the lengths of FindWhat and ReplaceWith when determining the RH segment's destination position.
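Here's one way that might look. Find's and WriteBytes's exact signatures aren't shown in this article, so the calls below are assumptions; check FileOps.cs for the real ones:
// A roll-your-own Replace sketch built from FileHelper's pieces. The Find
// and WriteBytes signatures are assumed, not quoted from the source.
public static void Replace(string filename, string findWhat, string replaceWith)
{
    long matchPos = FileHelper.Find(filename, findWhat);  // assumed: -1 if not found
    if (matchPos < 0) return;

    long fileLen = new FileInfo(filename).Length;
    long rhStart = matchPos + findWhat.Length;     // RH segment starts after the match
    long rhLen = fileLen - rhStart;
    long diff = replaceWith.Length - findWhat.Length;

    if (diff > 0)                                  // growing: set the length up front
        FileHelper.SetFileLen(filename, fileLen + diff);
    if (diff != 0)                                 // shift the RH segment left or right
        FileHelper.Transpose(filename, rhStart, rhStart + diff, rhLen, 0);

    FileHelper.WriteBytes(filename,                // overwrite FindWhat with ReplaceWith
        System.Text.Encoding.ASCII.GetBytes(replaceWith), matchPos);

    if (diff < 0)                                  // shrinking: truncate the leftover tail
        FileHelper.SetFileLen(filename, fileLen + diff);
}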
If you explore the FileHelper class, you'll see that I included a few other possibly useful routines. Among them are Head and Tail methods for inspecting the beginning or end of large files without having to try to open them in Notepad; an XML validation routine, some filesize-to-human-readable string conversions, and a method to check AvailablePhysicalMemory (borrowed from VB.NET's "My" namespace).
Using the Code
The code in FileOps.cs comprises everything described above, and also contains a TestCases class to demonstrate the routines on the included test files. You should only need minimal variations on the test routines to use this code in your project -- just include FileOps.cs in your project, pick the test routine most closely resembling your task, copy it -- editing a line or two to work with your own files -- and away you go. The combination of WriteBytes, InsertBytes, Transpose and SetFileLen is really all you need to accomplish most conceivable file manipulation tasks.
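For example, revisiting the "eggyung" file from the top of the article, an insert boils down to something like this. The InsertBytes signature here is my assumption, based on the parameters Transpose receives; verify it against FileOps.cs:
// Insert "foo" at position 3 of "eggyung" to get "eggfooyung".
// Assumed signature: InsertBytes(filename, bytesToInsert, insertPosition).
byte[] foo = System.Text.Encoding.ASCII.GetBytes("foo");
FileHelper.InsertBytes(@"C:\test\eggyung.txt", foo, 3L);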
For a really Quick Start, I recommend just downloading the solution and running it. It will jump into the FileOps.TestCases.RunDemo() routine right away, which will create its own test files to demonstrate the above routines in a console (as detailed in the preceding screenshots).
Next Steps
For most projects, the above setup is more than adequate. Still, there may be areas ripe for significant optimization, and others where more minor improvements are possible.
The first glaring area that's ripe for optimizing is reducing how much we're moving these file chunks around. In the majority of cases, everything to the right of the inserted text will be written to disk once. If a file is large and you're inserting bytes near its beginning, this could mean moving a lot of data just to insert a little. The first question one should ask then is, does all this data really need to be in the same file to begin with? If not, the time to break it up is now, before the files start growing out of control and you have to change production code to accommodate a new file format.
Sometimes though, you really do need to confine all the data to a single file. When that's the case, you might want to consider whether your inserts really need to be at a specific (early) position in the file. Again, if not, you'll gain considerable performance by moving them as close to the EOF as possible.
When the types of inserts we're discussing are unavoidable, though, two harebrained ideas involving black magic in the filesystem come to my mind. First, if one could simulate a FileStream's SetLength method, but prepending the added space instead of creating it at the end of the file, this could speed up inserts left of the file's center. Free space at the beginning of the file would allow us to move and/or buffer the (shorter) LH segment instead of the (longer) RH one as we've been doing. This wouldn't be easy, though... the Win32 API has a SetEndOfFile function, but you'll have to make your own SetBeginningOfFile.
An even more daring approach to minimizing disk writes could be to use low-level MFT manipulation to deliberately fragment the file into three pieces containing, respectively, the LH segment, the inserted text, and the RH segment. I don't know how feasible this is (I'm not Mark Russinovich, after all), but it would probably be extremely efficient, given that it negates the need to rewrite any existing data. Defragmenting would happen as a normal part of disk maintenance. One would need some wicked error handling to manage the MFT for oneself, though, and that thought keeps me from pursuing this idea much further at the moment.
One last area that might be useful to optimize, depending on the usage profile, is the briefly-mentioned Find function, which brute-force searches a file for some given text. In some cases, a "smart" search algorithm such as Boyer-Moore, Turbo Boyer-Moore, Knuth-Morris-Pratt or Horspool will outperform the naive, brute-force method. This would be most beneficial in situations where long strings with high character uniqueness are being sought within large files. Searching wasn't dealt with at length here, but in practice, time spent doing find/replace work can represent a not insignificant portion of your total processing time.
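As a starting point, here's a compact Boyer-Moore-Horspool sketch over an in-memory byte array; streaming it across a large file in chunks (overlapping each chunk by needle.Length - 1 bytes) is part of the exercise:
// Boyer-Moore-Horspool: precompute, for each possible byte value, how far
// the search window may shift when that byte is the window's last byte.
static int HorspoolIndexOf(byte[] haystack, byte[] needle)
{
    if (needle.Length == 0 || haystack.Length < needle.Length) return -1;
    int[] shift = new int[256];
    for (int i = 0; i < 256; i++) shift[i] = needle.Length;
    for (int i = 0; i < needle.Length - 1; i++)
        shift[needle[i]] = needle.Length - 1 - i;

    int pos = 0;
    while (pos <= haystack.Length - needle.Length)
    {
        int j = needle.Length - 1;               // compare right-to-left
        while (j >= 0 && haystack[pos + j] == needle[j]) j--;
        if (j < 0) return pos;                   // full match
        pos += shift[haystack[pos + needle.Length - 1]];
    }
    return -1;
}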
I will leave these ideas as exercises for the reader. I'm sure there are other approaches to the suggestions I've mentioned, as well as areas for optimization that I didn't even think of. If you happen to be struck by a flash of genius and feel like tackling one of these, I hope you'll share your efforts with the rest of us!
Conclusion
The methods described above accomplish the original objective, which was to provide convenient text insertion into potentially large files, without using memory buffers, temp files or swap files. For many applications, this is more than enough. However, the right optimizations could yield a genuinely high-performance disk reading and writing utility, useful in applications like ORM systems, embedded databases, and just simple file-based data management tools that are free from external dependencies.
There seems to be a healthy interest in these sorts of topics here on CodeProject, with a growing number of articles on XML databases and even one describing a transactional filesystem. A simple, easy-to-implement utility like this one, in pure C# no less, could provide respectable scalability for such systems. I hope to work more on this myself in the future, and I hope others find it worthwhile to contribute their time, ingenuity and expertise as well.
Good luck and happy coding!
Revision History
2007-03-05 :
- Several bugfixes and enhancements;
- Added 3rd, faster bufferless strategy:
- transposeForward method;
- Illustration 6
- Added long performance tests;
- More options in demo program;
- Graphed performance statistics for each method;
- Added console screenshots for demo program;
2007-02-21 : First posted to CodeProject;
2006-01-28 : Initial draft;