Introduction
This article describes a workaround that I developed for unexpected behavior of the Git clone command: when Git downloads a text file that contains Unix line breaks, it converts them to Windows line breaks, even though the application that consumes the file expects Unix line breaks.
Background
A Git repository may include an optional .gitattributes file that instructs Git clients about how to handle line breaks in text files. While this works well for transferring files upstream to an origin repository on GitHub, Bitbucket, or TFS, the Git client either ignores or misinterprets its instructions when the file is downloaded as part of a clone, pull, or merge to a local repository. I am unsure whether clone, merge, and pull ignore .gitattributes or misinterpret it.
Regardless, the outcome is the same: the text file cannot be assumed to contain Unix line breaks after a round trip through Git.
Today, there are three line break conventions in widespread use, which are summarized in Table 1 below. To some readers, the third convention, Legacy Macintosh, may be confusing; since macOS, the operating system on modern Macintosh computers, is derived from BSD Unix, modern Macintosh computers follow the Unix line break convention.
Table 1 summarizes the common line break conventions in use today.
| Convention | Character Codes | Comment |
|---|---|---|
| Windows | 0x0D 0x0A | Carriage Return followed by Line Feed |
| Unix | 0x0A | Line Feed |
| Legacy Macintosh | 0x0D | Carriage Return |
By default, when Git clients running on Windows transfer a text file to a remote repository, Windows line breaks are replaced with Unix line breaks. Conversely, when a text file that contains Unix line breaks is transferred from a remote Git repository to a local repository on a Windows host, the Unix line breaks are replaced with Windows line breaks.
In most cases, this transformation is transparent and desirable. The case that gave rise to this article is an exception: some of the substrings that must be replaced in the JSON string contain embedded line breaks, and those line breaks are critical to identifying the substrings accurately. Moreover, the replacement strings also contain line breaks. While the JSON parser that eventually processes the string may be immune to these issues, my preprocessor is not.
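To make the failure mode concrete, here is a minimal sketch, not taken from the demonstration program. The class name, the search key, and the sample text are all hypothetical; the point is only that an exact (ordinal) search for a key that embeds a Unix line break stops matching once the text has been converted to Windows line breaks.

```csharp
using System;

// Hypothetical illustration; none of these names come from the repository.
public static class LineBreakMismatchSketch
{
    // A search key that was authored with a Unix line break embedded in it.
    public const string Key = "{\n  \"id\":";

    // Ordinal search: true only when the exact characters of the key occur in the text.
    public static bool ContainsKey ( string text )
    {
        return text.IndexOf ( Key , StringComparison.Ordinal ) >= 0;
    }

    // Simulates what a checkout on Windows does to a file of bare line feeds.
    public static string AsWindowsCheckout ( string unixText )
    {
        return unixText.Replace ( "\n" , "\r\n" );
    }
}
```

With the sample text "{\n  \"id\": 1\n}", ContainsKey finds the key in the original but not in the result of AsWindowsCheckout, because a carriage return now sits between the brace and the line feed.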
The line break conversion described above explains why the repository includes Test_Data.zip.
The other archive, Binaries.zip, incorporates the compiled code and intermediate files that are, by convention, excluded from source code control.
- The bin directory contains the final products, including copies of the required DLLs.
- The obj directory contains the intermediate files generated by the build process; without them, the build engine will insist on rebuilding them the first time you press F5 to run the code in Visual Studio.
Both archives are structured so that they reproduce the intended directory structure when both are extracted “here,” as explained in README.md.
Though there are undoubtedly scores of libraries that I could have employed to resolve the issue, since I had a bit of time, I decided to roll my own. It is not as easy as it looks, but the task lends itself to the application of a simple state machine similar to the one that drives A Robust CSV Reader. (The latest source is on GitHub at https://github.com/txwizard/AnyCSV, and the corresponding NuGet package is at https://www.nuget.org/packages/WizardWrx.AnyCSV/.)
Using the Code
Since the behavior of the Git client is central to the issue that gave rise to this article, its demonstration program is available only as a Git repository, https://github.com/txwizard/LineBreakFixupsDemo. As with many significant open source repositories, preparing it for use requires a bit more than cloning the repository or downloading and extracting a ZIP file of its contents. Since step-by-step instructions are in the repository README.md file, which is prominently displayed on the repository home page, and available for direct viewing at https://github.com/txwizard/LineBreakFixupsDemo/blob/master/README.md, they are omitted from this article. I anticipate that the foregoing background information is enough to explain why these extra steps are necessary.
When LineBreakFixupsDemo.exe is executed in a command prompt window or by double-clicking its icon in the File Explorer, the processes (I usually call them exercises) listed in Table 2 execute in the order listed, producing output similar to that shown in LineBreakFixupDemo_Complete_Report_20190616_183828.TXT (found in the root directory of the repository). That file was created as follows.
- Double-click the program file in the File Explorer.
- Use the context menu in the upper left corner of its output to select everything in its output window and copy it onto the Windows Clipboard.
- Open a new file in a text editor. Since I wanted control over the type of line breaks that went into the saved file, I used my go-to text editor, UltraEdit Studio.
- Paste the contents of the Windows Clipboard into it.
- Save the file, taking care to specify Unix line endings.
The choice of Unix line endings was intentional, because it ensures that the file has Unix line breaks when it is compressed into the repository ZIP by the GitHub server. If you build your repository by cloning it, your local copy of the file sports Windows line breaks. This is easy to demonstrate with any programmer’s text editor or any hexadecimal file viewer or editor. I leave the demonstration as an exercise.
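For readers who would rather not reach for a hex viewer, the exercise can be approximated in a few lines of C#. The following is a sketch only; the class and method names are hypothetical, and it assumes that the file uses a single-byte or UTF-8 encoding, so that 0x0D and 0x0A appear as individual bytes.

```csharp
using System.IO;

// Hypothetical helper; not part of the demonstration program.
public static class LineBreakCensusSketch
{
    // Returns three counts: { CR+LF pairs , bare LFs , bare CRs }.
    public static int [ ] Count ( byte [ ] bytes )
    {
        const byte CR = 0x0D;
        const byte LF = 0x0A;
        int crlf = 0 , lf = 0 , cr = 0;

        for ( int i = 0 ; i < bytes.Length ; i++ )
        {
            if ( bytes [ i ] == CR )
            {
                if ( i + 1 < bytes.Length && bytes [ i + 1 ] == LF )
                {
                    crlf++;
                    i++;    // Skip the LF that completes the Windows pair.
                }
                else
                {
                    cr++;
                }
            }
            else if ( bytes [ i ] == LF )
            {
                lf++;
            }
        }

        return new [ ] { crlf , lf , cr };
    }

    // Convenience wrapper that reads the whole file into memory first.
    public static int [ ] CountInFile ( string path )
    {
        return Count ( File.ReadAllBytes ( path ) );
    }
}
```

Running CountInFile against the report file extracted from the ZIP and against the copy produced by a clone should show the counts moving from the bare LF column to the CR+LF column, if the behavior described above holds on your machine.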
You can also execute each of the program's four exercises independently by appending one of the four parameters listed in the Name column of Table 2.
Table 2 summarizes the command line options for executing individual test sets.
| Name | Description |
|---|---|
| LineBreaks | Exercising class StringExtensions (Line ending transformation methods) |
| AppSettingsList | Exercising class Program (method ListAppSettings, which sorts and lists the application settings) |
| StringResourceList | Exercising class Program (method ListEmbeddedResources, which sorts and lists embedded string resources) |
| TransformJSONString | Exercising class JSONFixups from files that contain transformed and raw Windows line breaks, and Unix line breaks (3 tests) |
The subject of this article is the first exercise, LineBreaks. The remaining three exercises will be the subject of forthcoming articles.
The string transformations are accomplished by three string extension methods.
- OldMacLineEndings accepts a string that contains any combination of line endings, returning a string in which every legitimate line ending is replaced by a legacy Macintosh line ending as described in Table 1.
- UnixLineEndings accepts a string that contains any combination of line endings, returning a string in which every legitimate line ending is replaced by a Unix line ending as described in Table 1.
- WindowsLineEndings accepts a string that contains any combination of line endings, returning a string in which every legitimate line ending is replaced by a Windows line ending as described in Table 1.
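For comparison, the same contract can be approximated with a regular expression. This is a stand-in sketch, not the implementation in WizardWrx.Core.dll, which uses a state machine; the class and method names are hypothetical, and I have not verified that the two approaches agree on every pathological mix of carriage returns and line feeds.

```csharp
using System.Text.RegularExpressions;

// Hypothetical stand-in; the library's own methods are OldMacLineEndings,
// UnixLineEndings, and WindowsLineEndings.
public static class RegexLineBreakSketch
{
    // The two-character alternatives come first, so that a CR/LF or LF/CR
    // pair is consumed as one line break rather than two.
    private static readonly Regex s_anyLineBreak = new Regex ( "\r\n|\n\r|\r|\n" );

    public static string ToUnix ( string source )
    {
        return s_anyLineBreak.Replace ( source , "\n" );
    }

    public static string ToWindows ( string source )
    {
        return s_anyLineBreak.Replace ( source , "\r\n" );
    }

    public static string ToOldMac ( string source )
    {
        return s_anyLineBreak.Replace ( source , "\r" );
    }
}
```

The regular expression is shorter, but it hides the pairing decisions inside the alternation order, whereas the state machine makes them explicit.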
OldMacLineEndings, UnixLineEndings, and WindowsLineEndings are exported by WizardWrx.Core.dll, a component of the WizardWrx .NET API, also available as a NuGet package.
Though Table 3 lists about a dozen NuGet packages, only three must be explicitly installed; the rest are dependencies that the NuGet package manager downloads when those three are selected for installation. They are dependencies of WizardWrx.ConsoleAppAids3, WizardWrx.EmbeddedTextFile, or both. Newtonsoft.Json has no package dependencies. Everything listed here is included in Binaries.zip, covered in the preparation instructions in README.md.
Table 3 lists all NuGet packages, indicating for each whether it must be installed.
| Package Name | Must Install |
|---|---|
| Newtonsoft.Json | Yes |
| WizardWrx.AnyCSV | No |
| WizardWrx.ASCIIInfo | No |
| WizardWrx.AssemblyUtils | No |
| WizardWrx.BitMath | No |
| WizardWrx.Common | No |
| WizardWrx.ConsoleAppAids3 | Yes |
| WizardWrx.ConsoleStreams | No |
| WizardWrx.Core | No |
| WizardWrx.DLLConfigurationManager | No |
| WizardWrx.EmbeddedTextFile | Yes |
Points of Interest
Apart from the code required to determine whether to skip or run it, the LineBreaks exercise is implemented by the static Exercise method on LineEndingFixupTests, which is also static. Since everything it needs comes from a static array of TestCase structures that is baked into the program, I saw no point in defining even a basic instance constructor, which would add empty bulk (like empty calories in your diet).
    struct TestCase
    {
        public string InputString;
        public int LineCount;

        public TestCase ( string pstrInputString , int pintLineCount )
        {
            InputString = pstrInputString;
            LineCount = pintLineCount;
        }
    };

    static readonly TestCase [ ] s_astrTestStrings = new TestCase [ ]
    {
        new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is also followed by a Unix newline.\nTest line 3 is followed by a Windows line break.\r\nTest line 4 is followed by the unusual line LF/CR line break.\n\rTest line 5 is followed by the old Macintosh line break, CR.\rTest line 6 is followed by a Unix newline.\nTest line 7 is followed by one last Unix newline.\n" , 7 ) ,
        new TestCase ( "Test line 1 is followed by Windows newline.\r\nTest line 2 is also followed by a Windows newline.\r\nTest line 3 is followed by a Windows line break.\r\nTest line 4 is followed by the unusual line LF/CR line break.\n\rTest line 5 is followed by the old Macintosh line break, CR.\rTest line 6 is followed by a Unix newline.\nTest line 7 is followed by one last Windows newline.\r\n" , 7 ) ,
        new TestCase ( "Test line 1 is followed by a Windows newline.\r\nTest line 2 is also followed by a Windows newline.\r\nTest line 3 is followed by yet another Windows neline.\r\n" , 3 ) ,
        new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is also followed by a Unix newline.\nTest line 3 is followed by yet another Unix neline.\n" , 3 ) ,
        new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is followed by 2 Unix newlines.\n\nTest line 4 is followed by one Unix newline.\nTest line 5 is unterminated." , 4 ) ,
        new TestCase ( "Test line 1 is followed by a Old Macintosh newline.\rTest line 2 is followed by 2 Old Macintosh newlines.\r\rTest line 4 is followed by one Old Macintosh newline.\rTest line 5 is unterminated." , 4 ) ,
        new TestCase ( "Test line 1 is followed by a Windows newline.\r\nTest line 2 is followed by 2 Windows newlines.\r\n\r\nTest line 4 is followed by one Windows newline.\r\nTest line 5 is unterminated." , 4 ) ,
    };
The flow of the Exercise method is as follows, and it is a tour de force of extension methods on the System.String class.
- Static method Utl.BeginTest increments the test number, which is explicitly initialized to zero, and displays the incremented test number as part of a one-line message that describes the test. The descriptions are shown in Table 2 above.
- The three integer counter variables are initialized. Though its value never changes at runtime, intTotalTestCases cannot be declared as const, due to the way it is initialized.
- Two WizardWrx.ConsoleStreams.MessageInColor objects are constructed and initialized, although only micSuccessfulOutcomeMessage is expected to see action. A MessageInColor object is analogous to the part of the static Console object that implements its WriteLine methods. The difference is that the WriteLine methods on a MessageInColor object render text in the foreground (text) and background colors specified in its constructor.
  - The colors assigned to the micSuccessfulOutcomeMessage object are read from string settings stored in the application settings file. This is accomplished by EnumFromString<ConsoleColor>, one of several custom extension methods on System.String defined in the WizardWrx namespace. Using this conversion method eliminates the need for a custom converter class.
  - The colors assigned to the micUnsuccessfulOutcomeMessage object are the FatalExceptionTextColor and FatalExceptionBackgroundColor properties on the WizardWrx.ConsoleStreams.ErrorMessagesInColor class. The values assigned to these colors are taken from WizardWrx.DLLConfigurationManager.dll.config, which travels with WizardWrx.DLLConfigurationManager.dll. This means that wherever WizardWrx.DLLConfigurationManager.dll goes, it looks for a like-named configuration file. If present, its settings are applied as if they had been specified in a regular App.config file. Everything has a default value, which happens to be the value specified in the configuration file that is distributed with the library. The goal is that nothing is hard-coded. WizardWrx.DLLConfigurationManager.dll is another component of the WizardWrx_NET_API and a NuGet package (https://www.nuget.org/packages/WizardWrx.DLLConfigurationManager/). An explanation of how this library works is beyond the scope of this article, though it might become the subject of another one.
- The heart of the Exercise method is the for loop that iterates over the s_astrTestStrings array. The formatting of this statement is deliberate, and it is intended to call attention to the fact that, like all for statements, it is three tightly coupled statements. This is true of C and every language that is derived from it. If you watch closely in a debugger, you will notice that the first of the three executes once only, on the first iteration, the second executes on every iteration, and the third executes on all but the first iteration.
- Each iteration transforms the input string into a new string that implements one of the three line break conventions listed in Table 1.
- Next, the ReportTestOutcome method is called once for each output string. Since two of the three line break conventions require but one character, while the third requires a two-character string, there are two overloads of ReportTestOutcome. ReportTestOutcome is also responsible for incrementing the intOverallCase counter, which it receives as an input and returns as its value. SpecialCharacters.CARRIAGE_RETURN, SpecialCharacters.LINEFEED, and SpecialStrings.STRING_SPLIT_NEWLINE are three of the many general-purpose constants defined by WizardWrx.Common.dll and exported into the root WizardWrx namespace. This library is another component of the WizardWrx .NET API and the WizardWrx.Common NuGet package, available from https://www.nuget.org/packages/WizardWrx.Common/.
- Finally, s_intBadOutcomes and s_intGoodOutcomes are a pair of static integer counters, of which the first, s_intBadOutcomes, is tested to determine which of two concluding messages is written through one of the two MessageInColor objects. Unless the program has a bug or the table of test data becomes corrupted, the first message, “A-OK!” is the expected outcome.
- The last interesting bit of the test program is ActualLineCount, which is called by both ReportTestOutcome overloads. Like its caller, and for the same reason, ActualLineCount is overloaded because its second parameter is passed straight through from ReportTestOutcome.
  - The first overload takes the output string and the Windows line break, another string, and returns the integer returned by CountSubstrings, another System.String extension method exported into the WizardWrx namespace.
  - The second overload takes the output string and a Unix or Macintosh line break, each of which is a character, and returns the integer returned by CountCharacterOccurrences, another extension method.
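The reason for the paired overloads may be easier to see in a stripped-down sketch. The class and method names below are hypothetical, and the bodies are my own stand-ins rather than the library's CountSubstrings and CountCharacterOccurrences; the sketch assumes that a Windows line break is always passed as a string and the other two as characters.

```csharp
using System;

// Hypothetical stand-ins for the two counting overloads.
public static class LineCountSketch
{
    // Single-character line breaks: Unix (LF) or legacy Macintosh (CR).
    public static int CountBreaks ( string text , char lineBreak )
    {
        int count = 0;

        foreach ( char c in text )
        {
            if ( c == lineBreak )
            {
                count++;
            }
        }

        return count;
    }

    // Multi-character line breaks: Windows (CR followed by LF).
    public static int CountBreaks ( string text , string lineBreak )
    {
        if ( string.IsNullOrEmpty ( lineBreak ) )
        {
            throw new ArgumentException ( "A line break must contain at least one character." , nameof ( lineBreak ) );
        }

        int count = 0;
        int pos = text.IndexOf ( lineBreak , StringComparison.Ordinal );

        while ( pos >= 0 )
        {
            count++;
            // Resume the search just past the match that was counted.
            pos = text.IndexOf ( lineBreak , pos + lineBreak.Length , StringComparison.Ordinal );
        }

        return count;
    }
}
```

Because the compiler selects the overload from the type of the second argument, the caller can pass its line break straight through without caring which convention is under test.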
Though the library's string extension methods are straightforward, they are worth a quick peek. The StringExtensions class properly belongs to another repository, but I deposited a copy of StringExtensions.cs into the root directory of this repository. This class is frequently updated as I discover and implement new string extension methods; to take advantage of them, import the WizardWrx.Core NuGet package into your projects.
Following the lead of the standard string comparison methods, there are two overloads of CountSubstrings, the first of which assumes CurrentCulture, the default StringComparison. The first overload calls the second, which allows the comparison algorithm to be overridden. Since its argument list is complete, the second method provides the implementation for both.
    public static int CountSubstrings (
        this string pstrSource ,
        string pstrToCount )
    {
        return pstrSource.CountSubstrings (
            pstrToCount ,
            StringComparison.CurrentCulture );
    }

    public static int CountSubstrings (
        this string pstrSource ,
        string pstrToCount ,
        StringComparison penmComparisonType )
    {
        // An empty source string cannot contain anything.
        if ( string.IsNullOrEmpty ( pstrSource ) )
        {
            return MagicNumbers.ZERO;
        }

        // An empty search string is reported as "not found" rather than as zero.
        if ( string.IsNullOrEmpty ( pstrToCount ) )
        {
            return MagicNumbers.STRING_INDEXOF_NOT_FOUND;
        }

        int rintCount = MagicNumbers.ZERO;
        int intPos = pstrSource.IndexOf (
            pstrToCount ,
            penmComparisonType );

        while ( intPos != MagicNumbers.STRING_INDEXOF_NOT_FOUND )
        {
            rintCount++;

            // Resume the search at the character that follows the start of the last match.
            intPos = pstrSource.IndexOf (
                pstrToCount ,
                ( intPos + ArrayInfo.NEXT_INDEX ) ,
                penmComparisonType );
        }

        return rintCount;
    }
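One consequence of resuming the search one character past the start of each match is that overlapping matches are counted. The following self-contained sketch makes that visible. It is a condensed copy with two loud assumptions: that MagicNumbers.STRING_INDEXOF_NOT_FOUND is -1 and that ArrayInfo.NEXT_INDEX is 1. It also defaults to an ordinal comparison, unlike the library, which defaults to CurrentCulture.

```csharp
using System;

// Condensed, hypothetical copy for experimentation; not the library source.
public static class CountSubstringsSketch
{
    private const int NotFound = -1;    // Assumed value of MagicNumbers.STRING_INDEXOF_NOT_FOUND.
    private const int NextIndex = 1;    // Assumed value of ArrayInfo.NEXT_INDEX.

    public static int CountSubstrings (
        this string source ,
        string toCount ,
        StringComparison comparison = StringComparison.Ordinal )
    {
        if ( string.IsNullOrEmpty ( source ) )
        {
            return 0;
        }

        if ( string.IsNullOrEmpty ( toCount ) )
        {
            return NotFound;
        }

        int count = 0;
        int pos = source.IndexOf ( toCount , comparison );

        while ( pos != NotFound )
        {
            count++;
            // Stepping by one character, not by the match length, admits overlaps.
            pos = source.IndexOf ( toCount , pos + NextIndex , comparison );
        }

        return count;
    }
}
```

Under these assumptions, counting "aa" in "aaa" yields 2, because the second match starts inside the first. A Windows line break cannot overlap itself, so counting "\r\n" is unaffected.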
Both CountSubstrings overloads are defined in StringExtensions.cs, which is part of the WizardWrx.Core.dll source code. At last, we turn our attention to the reason this article came into being, UnixLineEndings, WindowsLineEndings, and OldMacLineEndings, all three of which are also string extension methods, defined in the same source file.
    public static string OldMacLineEndings ( this string pstrSource )
    {
        return LineEndingFixup (
            pstrSource ,
            RequiredLineEndings.OldMacintosh );
    }

    public static string UnixLineEndings ( this string pstrSource )
    {
        return LineEndingFixup (
            pstrSource ,
            RequiredLineEndings.Unix );
    }

    public static string WindowsLineEndings ( this string pstrSource )
    {
        return LineEndingFixup (
            pstrSource ,
            RequiredLineEndings.Windows );
    }
All three call LineEndingFixup.
    private static string LineEndingFixup (
        string pstrSource ,
        RequiredLineEndings penmRequiredLineEndings )
    {
        LineEndingFixupState state = new LineEndingFixupState (
            penmRequiredLineEndings ,
            pstrSource );
        int intCharsInLine = state.InputCharacterCount;

        for ( int intCurrCharPos = ArrayInfo.ARRAY_FIRST_ELEMENT ;
                  intCurrCharPos < intCharsInLine ;
                  intCurrCharPos++ )
        {
            state.UpdateState ( intCurrCharPos );
        }

        return state.GetTransformedString ( );
    }
The real work is delegated to a private nested class, LineEndingFixupState, shown below in its entirety.
    #region Private nested class LineEndingFixupState
    private class LineEndingFixupState
    {
        #region Public Interface of nested LineEndingFixupState class
        public enum CharacterType
        {
            Indeterminate ,
            Other ,
            OldMacintosh ,
            Unix
        };

        private LineEndingFixupState ( )
        {
        }

        public LineEndingFixupState (
            RequiredLineEndings penmRequiredLineEndings ,
            string pstrInput )
        {
            NewLineEndings = penmRequiredLineEndings;
            DesiredLineEnding = SetDesiredLineEnding ( );

            // Worst case: every input character is a one-character break that becomes a two-character break.
            _sbWork = new StringBuilder ( pstrInput.Length * MagicNumbers.PLUS_TWO );
            _achrInputCharacters = pstrInput.ToCharArray ( );
            InputCharacterCount = _achrInputCharacters.Length;
        }

        public string GetTransformedString ( )
        {
            return _sbWork.ToString ( );
        }

        public void UpdateState ( int pintCurrCharPos )
        {
            char chrCurrent = GetCharacterAtOffset ( pintCurrCharPos );
            CharacterType enmCharacterType = ClassifyThisCharacter ( chrCurrent );

            if ( IsThisCharANewline ( chrCurrent ) )
            {
                if ( IsRunOfNelines ( ) )
                {
                    // Inside a run: emit a break only when this character starts a new one.
                    if ( AppendNewline ( pintCurrCharPos , enmCharacterType ) )
                    {
                        _sbWork.Append ( DesiredLineEnding );
                    }
                }
                else
                {
                    // First character of a run: record where it starts and emit a break.
                    _intPosNewlineRunStart = _intPosNewlineRunStart == ArrayInfo.ARRAY_INVALID_INDEX
                        ? pintCurrCharPos
                        : _intPosNewlineRunStart;
                    _sbWork.Append ( DesiredLineEnding );
                }
            }
            else
            {
                // Ordinary character: copy it and end any run in progress.
                _sbWork.Append ( chrCurrent );
                _intPosNewlineRunStart = ArrayInfo.ARRAY_INVALID_INDEX;
            }

            LastCharacter = enmCharacterType;
        }

        public string DesiredLineEnding { get; private set; } = null;
        public int InputCharacterCount { get; }
        public CharacterType LastCharacter { get; private set; } = CharacterType.Indeterminate;
        public RequiredLineEndings NewLineEndings { get; private set; }
        #endregion // Public Interface of nested LineEndingFixupState class

        #region Private nested class LineEndingFixupState code and data
        private bool AppendNewline (
            int pintCurrCharPos ,
            CharacterType penmCurrentCharacterType )
        {
            const int LONGEST_VALID_NEWLINE_SEQUENCE = MagicNumbers.PLUS_TWO;

            switch ( ( pintCurrCharPos - _intPosNewlineRunStart ) % LONGEST_VALID_NEWLINE_SEQUENCE )
            {
                case MagicNumbers.EVENLY_DIVISIBLE:
                    // An even offset from the start of the run begins a new break.
                    return true;
                default:
                    // An odd offset begins a new break only when the character repeats its predecessor.
                    return penmCurrentCharacterType == LastCharacter;
            }
        }

        private CharacterType ClassifyThisCharacter ( char pchrCurrent )
        {
            switch ( pchrCurrent )
            {
                case CHAR_SPLIT_OLD_MACINTOSH:
                    return CharacterType.OldMacintosh;
                case CHAR_SPLIT_UNIX:
                    return CharacterType.Unix;
                default:
                    return CharacterType.Other;
            }
        }

        private bool IsThisCharANewline ( char pchrThis )
        {
            switch ( pchrThis )
            {
                case CHAR_SPLIT_OLD_MACINTOSH:
                case CHAR_SPLIT_UNIX:
                    return true;
                default:
                    return false;
            }
        }

        private bool IsRunOfNelines ( )
        {
            return ( _intPosNewlineRunStart != ArrayInfo.ARRAY_INVALID_INDEX );
        }

        private char GetCharacterAtOffset ( int pintCurrCharPos )
        {
            return _achrInputCharacters [ pintCurrCharPos ];
        }

        private string SetDesiredLineEnding ( )
        {
            const string WINDOWS_LINE_BREAK = SpecialStrings.STRING_SPLIT_NEWLINE;

            switch ( NewLineEndings )
            {
                case RequiredLineEndings.OldMacintosh:
                    return CHAR_SPLIT_OLD_MACINTOSH.ToString ( );
                case RequiredLineEndings.Unix:
                    return CHAR_SPLIT_UNIX.ToString ( );
                case RequiredLineEndings.Windows:
                    return WINDOWS_LINE_BREAK;
                default:
                    throw new InvalidEnumArgumentException (
                        nameof ( NewLineEndings ) ,
                        ( int ) NewLineEndings ,
                        NewLineEndings.GetType ( ) );
            }
        }

        private readonly char [ ] _achrInputCharacters = null;
        private int _intPosNewlineRunStart = ArrayInfo.ARRAY_INVALID_INDEX;
        private StringBuilder _sbWork { get; }
        #endregion // Private nested class LineEndingFixupState code and data
    }
    #endregion // Private nested class LineEndingFixupState
All three methods call LineEndingFixup with the input string and a member of the RequiredLineEndings enumeration that identifies the type of line breaks required in its output string. Its first task is the construction of a new LineEndingFixupState instance from both. Next, the length of the input string is retrieved from the LineEndingFixupState object, which is the state machine, after which a loop calls UpdateState once for each character offset in the input string, updating the state as it goes and generating the desired type of line break when a new line break is encountered. Any character that isn’t part of a line break is appended to a StringBuilder that the LineEndingFixupState object maintains. After the last character is processed, GetTransformedString is called, and the string that it returns is passed back up the call stack.
- The UpdateState method begins by copying the character at the offset indicated by its integer argument into a local character variable. Next, ClassifyThisCharacter, a private method, is called upon to identify the character as one of the two valid line break characters, or something else. Since all three conventions use only these two characters, alone or in combination, they are the only two that need special treatment. Anything else is appended to the output string.
  - If the current character is a line break character, indicated by IsThisCharANewline returning true, instance method IsRunOfNelines determines whether the current character continues a run of consecutive line break characters, which it does by checking whether the start position of a run has been recorded.
  - Within a run, if the offset of the current character from the start of the run is evenly divisible by two (the length of the longest valid line break sequence) or the current character and the immediately previous character are identical, a new line break has been found, and a line break of the required type is appended to the output string. Otherwise, the character is treated as the second half of a two-character line break, and nothing is appended.
  - In any event, the first character of a run of line break characters causes a line break to be appended and records the start position of the run.
  - Finally, any character besides a line break causes the line break start position to be reset to -1, an invalid value for a character offset.
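The rules above can be condensed into a single loop, which may make them easier to trace by hand. This is a sketch that follows the logic just described, not the library code: the class and method names are hypothetical, the desired line break is passed as a plain string instead of a RequiredLineEndings member, and the previous character is compared directly rather than through the CharacterType enumeration.

```csharp
using System.Text;

// Hypothetical condensed version of the state machine described above.
public static class LineEndingStateSketch
{
    public static string Normalize ( string input , string desiredBreak )
    {
        StringBuilder output = new StringBuilder ( input.Length * 2 );
        int runStart = -1;      // Offset of the first character of the current run; -1 means no run.
        char previous = '\0';

        for ( int pos = 0 ; pos < input.Length ; pos++ )
        {
            char current = input [ pos ];
            bool isBreakChar = current == '\r' || current == '\n';

            if ( !isBreakChar )
            {
                // Ordinary character: copy it and end any run in progress.
                output.Append ( current );
                runStart = -1;
            }
            else if ( runStart < 0 )
            {
                // First character of a run always produces a line break.
                runStart = pos;
                output.Append ( desiredBreak );
            }
            else if ( ( pos - runStart ) % 2 == 0 || current == previous )
            {
                // Even offset, or a repeated character: a new line break begins here.
                output.Append ( desiredBreak );
            }
            // Otherwise this is the second half of a CR/LF or LF/CR pair; emit nothing.

            previous = current;
        }

        return output.ToString ( );
    }
}
```

Tracing "a\r\n\r\nb" with a desired break of "\n": the first CR starts a run and emits a break, the LF at offset 1 differs from its predecessor and is swallowed, the second CR at offset 2 emits a break, and the final LF at offset 3 is swallowed, giving "a\n\nb".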
There is a lot more cool stuff happening in this assembly and the libraries upon which it relies. Stay tuned for further articles about some of it, the next of which is Adapting JSON Strings for Deserializing into C# Objects.
History
Wednesday, 19 June 2019: Initial publication