Line Breaks: From Windows to Unix and Back

23 Jun 2019 | CPOL | 13 min read
This is the first of a two-part series about string conversion issues that recently arose in my practice.

Introduction

This article is about a workaround that I developed to compensate for unexpected behavior of the Git clone command when it downloads a file that contains Unix line breaks and converts them to Windows line breaks, while the application expects Unix line breaks.

Background

A Git repository may include an optional .gitattributes file that instructs Git clients about how to handle line breaks in text files. While this works well for transferring files upstream to an origin repository on GitHub, Bitbucket, or TFS, the Git client appears either to ignore or to misinterpret its instructions when the file is downloaded as part of a clone, pull, or merge to a local repository. I am unsure whether clone, merge, and pull ignore .gitattributes or misinterpret it.

Regardless, the outcome is the same; the text file cannot be assumed to contain Unix line breaks after a round trip through Git.

Today, there are three line break conventions in widespread use, which are summarized in Table 1 below. To some readers, the third convention, Legacy Macintosh, may be confusing; since the operating system on modern Macintosh computers is built on a BSD-derived Unix foundation, they follow the Unix line break conventions.

Table 1 summarizes the common line break conventions in use today.

Convention         Character Codes   Comment
Windows            0x0D 0x0A         Carriage Return followed by Line Feed
Unix               0x0A              Line Feed
Legacy Macintosh   0x0D              Carriage Return

By default, when Git clients running on Windows transfer a text file to a remote repository, Windows line breaks are replaced with Unix line breaks. Conversely, when a text file that contains Unix line breaks is transferred from a remote Git repository to a local repository on a Windows host, the Unix line breaks are replaced with Windows line breaks.

In most cases, this transformation is transparent and desirable. The case that gave rise to this article is an exception: some of the substrings that must be replaced in the JSON string contain embedded line breaks, and those line breaks are critical to identifying the substrings accurately. Moreover, the replacement strings also contain line breaks. While the JSON parser that eventually processes the string may be immune to these issues, my preprocessor is not.
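The failure mode is easy to reproduce in isolation. The following is a minimal sketch, not code from the demonstration program: the JSON fragment, the class name LineBreakMismatchDemo, and the method Finds are hypothetical, and the Replace call merely simulates the conversion that a checkout on Windows performs.

```csharp
using System;

public static class LineBreakMismatchDemo
{
    // Hypothetical helper: an ordinal (byte-for-byte) substring test.
    public static bool Finds(string haystack, string needle)
    {
        return haystack.IndexOf(needle, StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        // The text as committed, with Unix line breaks.
        string asCommitted = "{\n  \"name\": \"value\"\n}";

        // Simulate the LF to CRLF conversion applied on checkout.
        string asCheckedOut = asCommitted.Replace("\n", "\r\n");

        // A search pattern that spans a line break, written with Unix breaks.
        string needle = "{\n  \"name\"";

        Console.WriteLine(Finds(asCommitted, needle));   // True
        Console.WriteLine(Finds(asCheckedOut, needle));  // False
    }
}
```

The second search fails because a carriage return now sits between the brace and the line feed, which is exactly the kind of mismatch that a preprocessor relying on literal substrings cannot tolerate.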

The foregoing explains why the repository includes Test_Data.zip.

The other archive, Binaries.zip, incorporates the compiled code and intermediate files that are, by convention, excluded from source code control.

  1. The bin directory contains the final products, including copies of the required DLLs.
  2. The obj directory contains the intermediate files generated by the build process; without them, the build engine will insist on rebuilding them the first time you press F5 to run the code in Visual Studio.

Both archives are structured so that they reproduce the intended directory structure when both are extracted “here,” as explained in README.md.

Though there are undoubtedly scores of libraries that I could have employed to resolve the issue, since I had a bit of time, I decided to roll my own. It is not as easy as it looks, but the task lends itself to the application of a simple state machine similar to the one that drives A Robust CSV Reader. (The latest source is on GitHub at https://github.com/txwizard/AnyCSV, and the corresponding NuGet package is at https://www.nuget.org/packages/WizardWrx.AnyCSV/.)

Using the Code

Since the behavior of the Git client is central to the issue that gave rise to this article, its demonstration program is available only as a Git repository, https://github.com/txwizard/LineBreakFixupsDemo. Like many significant open source repositories, preparing it for use requires a bit more than cloning the repository or downloading and extracting a ZIP file of its contents. Since step-by-step instructions are in the repository README.md file, which is prominently displayed on the repository home page, and available for direct viewing at https://github.com/txwizard/LineBreakFixupsDemo/blob/master/README.md, they are omitted from this article. I anticipate that the foregoing background information is enough to explain why these extra steps are necessary.

When LineBreakFixupsDemo.exe is executed in a command prompt window or by double-clicking its icon in the File Explorer, the processes (I usually call them exercises) listed in Table 2 execute in the order listed, producing output similar to that shown in LineBreakFixupDemo_Complete_Report_20190616_183828.TXT (found in the root directory of the repository). That file was created as follows.

  1. Double-click the program file in the File Explorer.
  2. Use the context menu in the upper left corner of its output to select everything in its output window and copy it onto the Windows Clipboard.
  3. Open a new file in a text editor. Since I wanted control over the type of line breaks that went into the saved file, I used my go-to text editor, UltraEdit Studio.
  4. Paste the contents of the Windows Clipboard into it.
  5. Save the file, taking care to specify Unix line endings.

The choice of Unix line endings was intentional, because it ensures that the file has Unix line breaks when it is compressed into the repository ZIP by the GitHub server. If you build your repository by cloning it, your local copy of the file sports Windows line breaks. This is easy to demonstrate with any programmer’s text editor or any hexadecimal file viewer or editor. I leave the demonstration as an exercise.
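For readers who prefer code to a hex viewer, the check can be sketched as a small console program. This is an illustration only; LineBreakCensus and its Count method are hypothetical names, and the sketch assumes a single-byte or UTF-8 encoded file, in which CR and LF always appear as the bytes 0x0D and 0x0A.

```csharp
using System;
using System.IO;

public static class LineBreakCensus
{
    // Count each kind of line break in a buffer of ASCII or UTF-8 bytes.
    public static (int CrLf, int Lf, int Cr) Count(byte[] bytes)
    {
        int crlf = 0, lf = 0, cr = 0;

        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] == 0x0D)
            {
                if (i + 1 < bytes.Length && bytes[i + 1] == 0x0A)
                {
                    crlf++;
                    i++;        // Skip the LF that completes the pair.
                }
                else
                {
                    cr++;       // Bare CR: legacy Macintosh.
                }
            }
            else if (bytes[i] == 0x0A)
            {
                lf++;           // Bare LF: Unix.
            }
        }

        return (crlf, lf, cr);
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: LineBreakCensus <path-to-text-file>");
            return;
        }

        var counts = Count(File.ReadAllBytes(args[0]));
        Console.WriteLine("CRLF: {0}, LF: {1}, CR: {2}", counts.CrLf, counts.Lf, counts.Cr);
    }
}
```

Run against the report file from a cloned working copy and again against the copy extracted from the repository ZIP, the two sets of counts should show which convention each copy uses.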

You can also execute each of its four exercises independently by appending one of the four parameters listed in the Name column of Table 2.

Table 2 summarizes the command line options for executing individual test sets.

Name                  Description
LineBreaks            Exercising class StringExtensions (line ending transformation methods)
AppSettingsList       Exercising class Program (method ListAppSettings, which sorts and lists the application settings)
StringResourceList    Exercising class Program (method ListEmbeddedResources, which sorts and lists embedded string resources)
TransformJSONString   Exercising class JSONFixups from files that contain transformed and raw Windows line breaks, and Unix line breaks (3 tests)

The subject of this article is the first exercise, LineBreaks. The remaining three exercises will be the subject of forthcoming articles.

The string transformations are accomplished by three string extension methods.

  1. OldMacLineEndings accepts a string that contains any combination of line endings, returning a string in which every legitimate line ending is replaced by a legacy Macintosh line ending as described in Table 1.
  2. UnixLineEndings accepts a string that contains any combination of line endings, returning a string in which every legitimate line ending is replaced by a Unix line ending as described in Table 1.
  3. WindowsLineEndings accepts a string that contains any combination of line endings, returning a string in which every legitimate line ending is replaced by a Windows line ending as described in Table 1.
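To make the calling pattern concrete without requiring the NuGet package, here is a rough stand-in built on a regular expression. It is not the WizardWrx implementation, which uses the state machine described later; the names ToUnix, ToWindows, and ToOldMac are hypothetical, and the two approaches may disagree on ambiguous runs such as CR CR LF.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingStandIns
{
    // Two-character breaks are listed first so that they match as a unit.
    private static readonly Regex s_anyLineBreak = new Regex(@"\r\n|\n\r|\r|\n");

    public static string ToUnix(this string source)
    {
        return s_anyLineBreak.Replace(source, "\n");
    }

    public static string ToWindows(this string source)
    {
        return s_anyLineBreak.Replace(source, "\r\n");
    }

    public static string ToOldMac(this string source)
    {
        return s_anyLineBreak.Replace(source, "\r");
    }
}

public static class StandInDemo
{
    public static void Main()
    {
        string mixed = "one\r\ntwo\rthree\nfour";

        // Extension methods read as if they were instance methods on string.
        Console.WriteLine(mixed.ToUnix() == "one\ntwo\nthree\nfour");           // True
        Console.WriteLine(mixed.ToWindows() == "one\r\ntwo\r\nthree\r\nfour");  // True
        Console.WriteLine(mixed.ToOldMac() == "one\rtwo\rthree\rfour");         // True
    }
}
```

With the real package installed, the equivalent calls would be UnixLineEndings, WindowsLineEndings, and OldMacLineEndings on the same string.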

The string extension methods are exported by WizardWrx.Core.dll, a component of the WizardWrx .NET API, also available as a NuGet package.

Though the NuGet package list shown in Table 3 contains a dozen items, only three must be explicitly installed; the other nine are dependencies that the NuGet package manager downloads when those three are selected for installation. They are dependencies of WizardWrx.ConsoleAppAids3, WizardWrx.EmbeddedTextFile, or both. Newtonsoft.Json has no package dependencies. Everything listed here is included in Binaries.zip, covered in the preparation instructions in README.md.

Table 3 lists all NuGet packages, indicating for each whether it must be installed.

Package Name                        Must Install
Newtonsoft.Json                     Yes
WizardWrx.AnyCSV                    No
WizardWrx.ASCIIInfo                 No
WizardWrx.AssemblyUtils             No
WizardWrx.BitMath                   No
WizardWrx.Common                    No
WizardWrx.ConsoleAppAids3           Yes
WizardWrx.ConsoleStreams            No
WizardWrx.Core                      No
WizardWrx.DLLConfigurationManager   No
WizardWrx.EmbeddedTextFile          Yes

Points of Interest

Apart from the code required to determine whether to skip or run it, the LineBreaks exercise is implemented by the static Exercise method on LineEndingFixupTests, which is also static. Since everything it needs comes from a static array of TestCase structures that is baked into the program, I saw no point in defining even a basic instance constructor, which would add empty bulk (like empty calories in your diet).

C#
struct TestCase
{
    public string InputString;
    public int LineCount;

    public TestCase ( string pstrInputString , int pintLineCount )
    {
        InputString = pstrInputString;
        LineCount = pintLineCount;
    }
};  // struct TestCase
C#
static readonly TestCase [ ] s_astrTestStrings = new TestCase [ ]
{   // Each entry pairs an InputString with its expected LineCount.
    new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is also followed by a Unix newline.\nTest line 3 is followed by a Windows line break.\r\nTest line 4 is followed by the unusual line LF/CR line break.\n\rTest line 5 is followed by the old Macintosh line break, CR.\rTest line 6 is followed by a Unix newline.\nTest line 7 is followed by one last Unix newline.\n" , 7 ) ,
    new TestCase ( "Test line 1 is followed by Windows newline.\r\nTest line 2 is also followed by a Windows newline.\r\nTest line 3 is followed by a Windows line break.\r\nTest line 4 is followed by the unusual line LF/CR line break.\n\rTest line 5 is followed by the old Macintosh line break, CR.\rTest line 6 is followed by a Unix newline.\nTest line 7 is followed by one last Windows newline.\r\n" , 7 ) ,
    new TestCase ( "Test line 1 is followed by a Windows newline.\r\nTest line 2 is also followed by a Windows newline.\r\nTest line 3 is followed by yet another Windows neline.\r\n" , 3 ) ,
    new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is also followed by a Unix newline.\nTest line 3 is followed by yet another Unix neline.\n" , 3 ) ,
    new TestCase ( "Test line 1 is followed by a Unix newline.\nTest line 2 is followed by 2 Unix newlines.\n\nTest line 4 is followed by one Unix newline.\nTest line 5 is unterminated." , 4 ) ,
    new TestCase ( "Test line 1 is followed by a Old Macintosh newline.\rTest line 2 is followed by 2 Old Macintosh newlines.\r\rTest line 4 is followed by one Old Macintosh newline.\rTest line 5 is unterminated." , 4 ) ,
    new TestCase ( "Test line 1 is followed by a Windows newline.\r\nTest line 2 is followed by 2 Windows newlines.\r\n\r\nTest line 4 is followed by one Windows newline.\r\nTest line 5 is unterminated." , 4 ) ,
};  // static readonly TestCase [ ] s_astrTestStrings

The flow of the Exercise method is as follows, and it is a tour de force of extension methods on the System.String class.

  • Static method Utl.BeginTest increments the test number, which is explicitly initialized to zero, and displays the incremented test number as part of a one-line message that describes the test. The descriptions are shown in Table 2 above.
  • The three integer counter variables are initialized. Though its value never changes at runtime, intTotalTestCases cannot be declared as const, due to the way it is initialized.
  • Two WizardWrx.ConsoleStreams.MessageInColor objects are constructed and initialized, although only micSuccessfulOutcomeMessage is expected to see action. A MessageInColor object is analogous to the part of the static Console object that implements its WriteLine methods. The difference is that the WriteLine methods on a MessageInColor object render text in the foreground (text) and background colors specified in its constructor.
    • The colors assigned to the micSuccessfulOutcomeMessage object are read from string settings stored in the application settings file. This is accomplished by EnumFromString<ConsoleColor>, one of several custom extension methods on System.String defined in the WizardWrx namespace. Using this conversion method eliminates the need for a custom converter class.
    • The colors assigned to the micUnsuccessfulOutcomeMessage object are the FatalExceptionTextColor and FatalExceptionBackgroundColor properties on the WizardWrx.ConsoleStreams.ErrorMessagesInColor class. The values assigned to these colors are taken from WizardWrx.DLLConfigurationManager.dll.config, which travels with WizardWrx.DLLConfigurationManager.dll. This means that wherever WizardWrx.DLLConfigurationManager.dll goes, it looks for a like-named configuration file. If one is present, its settings are applied as if they had been specified in a regular App.config file. Every setting has a default value, which matches the value specified in the configuration file that is distributed with the library. The goal is that nothing is hard-coded.
    • WizardWrx.DLLConfigurationManager.dll is another component of the WizardWrx_NET_API and a NuGet package (https://www.nuget.org/packages/WizardWrx.DLLConfigurationManager/). An explanation of how this library works is beyond the scope of this article, though it might become the subject of another one.
  • The heart of the Exercise method is the for loop that iterates over the s_astrTestStrings array. The formatting of this statement is deliberate, and it is intended to call attention to the fact that, like all for statements, it is three tightly coupled statements. This is true of C and every language that is derived from it. If you watch closely in a debugger, you will notice that the first of the three executes once only, before the first iteration; the second executes before every iteration, including the final test that ends the loop; and the third executes after each pass through the loop body, so it never runs ahead of the first iteration.
  • Each iteration transforms the input string into a new string that implements one of the three line break conventions listed in Table 1.
  • Next, the ReportTestOutcome method is called once for each output string. Since two of the three line break conventions require but one character, while the third requires a two-character string, there are two overloads of ReportTestOutcome. ReportTestOutcome is also responsible for incrementing the intOverallCase counter, which it receives as an input, and returns as its value.
  • SpecialCharacters.CARRIAGE_RETURN, SpecialCharacters.LINEFEED, and SpecialStrings.STRING_SPLIT_NEWLINE are three of the many general-purpose constants defined by WizardWrx.Common.dll and exported into the root WizardWrx namespace. This library is another component of the WizardWrx .NET API and the WizardWrx.Common NuGet package, available from https://www.nuget.org/packages/WizardWrx.Common/.
  • Finally, s_intBadOutcomes and s_intGoodOutcomes are a pair of static integer counters, of which the first, s_intBadOutcomes, is tested to determine which of two concluding messages is written through one of the two MessageInColor objects. Unless the program has a bug or the table of test data becomes corrupted, the first message, “A-OK!” is the expected outcome.
  • The last interesting bit of the test program is ActualLineCount, which is called by both ReportTestOutcome overloads. Like its caller, and for the same reason, ActualLineCount is overloaded because its second parameter is passed straight through from ReportTestOutcome.
    • The first overload takes the output string and the Windows line break, another string, and returns the integer returned by CountSubstrings, another System.String extension method exported into the WizardWrx namespace.
    • The second overload takes the output string and a Unix or Macintosh line break, each of which is a character, and returns the integer returned by CountCharacterOccurrences, another extension method.
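The evaluation order of the three parts of a for statement, mentioned in the list above, can also be observed without a debugger. The following is a minimal sketch; ForLoopOrderDemo, Trace, and Log are hypothetical names that do not appear in the demonstration program.

```csharp
using System;
using System.Collections.Generic;

public static class ForLoopOrderDemo
{
    // Record a labelled value, then hand the value back unchanged.
    private static int Log(List<string> log, string label, int value)
    {
        log.Add(label + " " + value);
        return value;
    }

    // Run a for loop whose three parts each leave a trace entry.
    public static List<string> Trace(int limit)
    {
        var log = new List<string>();

        for (int i = Log(log, "init", 0);       // Runs once, before anything else.
             Log(log, "check", i) < limit;      // Runs before every pass, and once more to exit.
             i = Log(log, "step", i + 1))       // Runs after each pass through the body.
        {
            log.Add("body " + i);
        }

        return log;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Trace(2)));
        // init 0, check 0, body 0, step 1, check 1, body 1, step 2, check 2
    }
}
```

Note that the condition is evaluated one more time than the body executes; that last evaluation is the one that ends the loop.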

Though these string extension methods are straightforward, they are worth a quick peek. Though the StringExtensions class properly belongs to another repository, I deposited a copy of StringExtensions.cs into the root directory of this repository. This class is frequently updated as I discover and implement new string extension methods; to take advantage of them, import the WizardWrx.Core NuGet package into your projects.

Following the lead of the standard string comparison methods, there are two overloads of CountSubstrings, the first of which assumes CurrentCulture, the default StringComparison. The first overload calls the second, which allows the comparison algorithm to be overridden. Since its argument list is complete, the second method provides the implementation for both.

C#
public static int CountSubstrings (
    this string pstrSource ,
    string pstrToCount )
{
    return pstrSource.CountSubstrings (
        pstrToCount ,
        StringComparison.CurrentCulture );
}   // CountSubstrings (1 of 2)
C#
public static int CountSubstrings (
        this string pstrSource ,
        string pstrToCount ,
        StringComparison penmComparisonType )
{
    if ( string.IsNullOrEmpty ( pstrSource ) )
    {   // Treat null strings as empty, and treat both as a valid, but degenerate, case.
        return MagicNumbers.ZERO;
    }   // if ( string.IsNullOrEmpty ( pstrSource ) )

    if ( string.IsNullOrEmpty ( pstrToCount ) )
    {   // This is an error. String pstrToCount should never be null or empty.
        return MagicNumbers.STRING_INDEXOF_NOT_FOUND;
    }   // if ( string.IsNullOrEmpty ( pstrToCount ) )

    int rintCount = MagicNumbers.ZERO;

    // ----------------------------------------------------------------
    // Unless pstrSource contains at least one instance of pstrToCount,
    // this first IndexOf is the only one that executes.
    //
    // If there are no matches, intPos is STRING_INDEXOF_NOT_FOUND (-1)
    // and the WHILE loop is skipped. Hence, if control falls into the
    // loop, at least one item was found, and must be counted, and the
    // loop continues until intPos becomes STRING_INDEXOF_NOT_FOUND.
    // ----------------------------------------------------------------

    // Look for the first instance.
    int intPos = pstrSource.IndexOf (
        pstrToCount ,
        penmComparisonType );

    while ( intPos != MagicNumbers.STRING_INDEXOF_NOT_FOUND )
    {   // Found at least one.
        rintCount++;    // Count it.
        intPos = pstrSource.IndexOf (
            pstrToCount ,
            ( intPos + ArrayInfo.NEXT_INDEX ) ,
            penmComparisonType );    // Search for more.
    }  // while ( intPos != MagicNumbers.STRING_INDEXOF_NOT_FOUND )

    return rintCount;    // Report.
}   // CountSubstrings (2 of 2)
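Two properties of this loop are worth seeing in action. Because each search resumes one character past the previous hit, overlapping matches are counted; and the choice of comparison type matters when the substring is a line break. The sketch below is a self-contained approximation that substitutes literal values for the WizardWrx constants and always uses StringComparison.Ordinal; SubstringCounter and CountOccurrences are hypothetical names. Ordinal is chosen here because Microsoft documents that, on .NET 5 and later, culture-sensitive searches can treat CR LF as a single unit, which may change what a culture-sensitive search for a lone line feed finds.

```csharp
using System;

public static class SubstringCounter
{
    // Count occurrences of toCount in source, overlapping matches included.
    public static int CountOccurrences(string source, string toCount)
    {
        if (string.IsNullOrEmpty(source))
        {
            return 0;       // Degenerate but valid: nothing to search.
        }

        if (string.IsNullOrEmpty(toCount))
        {
            return -1;      // Mirrors the error return of the original method.
        }

        int count = 0;
        int pos = source.IndexOf(toCount, StringComparison.Ordinal);

        while (pos >= 0)
        {
            count++;
            // Resume one character past the hit, so overlaps are found too.
            pos = source.IndexOf(toCount, pos + 1, StringComparison.Ordinal);
        }

        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountOccurrences("one\r\ntwo\r\n", "\r\n"));  // 2
        Console.WriteLine(CountOccurrences("aaa", "aa"));               // 2 (overlapping)
    }
}
```

For line break counting, overlap is harmless, since a two-character Windows line break cannot overlap itself.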

Both methods are defined in StringExtensions.cs, which is part of the WizardWrx.Core.dll source code. At last, we turn our attention to the reason this article came into being, UnixLineEndings, WindowsLineEndings, and OldMacLineEndings, all three of which are also string extension methods, defined in the same source file.

C#
public static string OldMacLineEndings ( this string pstrSource )
{
    return LineEndingFixup (
        pstrSource ,                        // string                  pstrSource
        RequiredLineEndings.OldMacintosh ); // RequiredLineEndings     penmRequiredLineEndings
}   // OldMacLineEndings method

public static string UnixLineEndings ( this string pstrSource )
{
    return LineEndingFixup (
        pstrSource ,                        // string                  pstrSource
        RequiredLineEndings.Unix );         // RequiredLineEndings     penmRequiredLineEndings
}   // UnixLineEndings method

public static string WindowsLineEndings ( this string pstrSource )
{
    return LineEndingFixup (
        pstrSource ,                        // string                  pstrSource
        RequiredLineEndings.Windows );      // RequiredLineEndings     penmRequiredLineEndings
}   // WindowsLineEndings method

All three call LineEndingFixup.

C#
private static string LineEndingFixup (
    string pstrSource ,
    RequiredLineEndings penmRequiredLineEndings )
{
    //  ----------------------------------------------------------------
    //  Construct a StringBuilder with sufficient memory allocated to
    //  support a final string twice as long as the input string, which
    //  covers the worst-case scenario of an input string composed
    //  entirely of single-character newlines, expecting the returned
    //  string to have Windows line endings.
    //
    //  Copy the input string into an array of characters and initialize
    //  the state machine. Since both are easier to maintain as part of
    //  the state machine, LineEndingFixupState, the input string is fed
    //  into its constructor, since it can construct both from it.
    //  ----------------------------------------------------------------

    LineEndingFixupState state = new LineEndingFixupState (
         penmRequiredLineEndings ,
         pstrSource );

    //  ----------------------------------------------------------------
    //  Using the state machine, a single pass over the character array
    //  is sufficient.
    //  ----------------------------------------------------------------

    int intCharsInLine = state.InputCharacterCount;

    for ( int intCurrCharPos = ArrayInfo.ARRAY_FIRST_ELEMENT ;
              intCurrCharPos < intCharsInLine ;
              intCurrCharPos++ )
    {
        state.UpdateState ( intCurrCharPos );
    }   // for ( int intCurrCharPos = ArrayInfo.ARRAY_FIRST_ELEMENT ; intCurrCharPos < intCharsInLine ; intCurrCharPos++ )

    return state.GetTransformedString ( );
}   // private static string LineEndingFixup
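Before reading the state class, it may help to see the whole idea condensed into one method. The following is a sketch of the rule as I read it from UpdateState and AppendNewline, assuming that IsRunOfNelines simply reports whether a run of newline characters is in progress; LineEndingSketch and Normalize are hypothetical names, and the sketch is not a substitute for the library code.

```csharp
using System;
using System.Text;

public static class LineEndingSketch
{
    public static string Normalize(string input, string desiredLineEnding)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // Worst case: every input character is a newline that becomes CR LF.
        var output = new StringBuilder(input.Length * 2);
        int runStart = -1;          // Index of the first newline in the current run.
        char previous = '\0';

        for (int pos = 0; pos < input.Length; pos++)
        {
            char current = input[pos];

            if (current != '\r' && current != '\n')
            {
                output.Append(current);     // Ordinary text ends any run.
                runStart = -1;
            }
            else if (runStart < 0)
            {
                runStart = pos;             // First newline of a run always counts.
                output.Append(desiredLineEnding);
            }
            else if ((pos - runStart) % 2 == 0 || current == previous)
            {
                // Even offsets start a new break; a repeated character at an
                // odd offset is a second single-character break.
                output.Append(desiredLineEnding);
            }
            // Otherwise, this is the second half of a CR LF or LF CR pair.

            previous = current;
        }

        return output.ToString();
    }

    public static void Main()
    {
        string mixed = "a\r\nb\n\rc\rd\ne";
        Console.WriteLine(Normalize(mixed, "\n") == "a\nb\nc\nd\ne");   // True
    }
}
```

The design choice worth noting is that no look-ahead is needed: the position within the run and the previous character are enough to decide whether the current character begins a new line break or completes a pair.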

The real work is delegated to a private nested class, LineEndingFixupState, shown below in its entirety.

C#
#region Private nested class LineEndingFixupState
/// <summary>
/// On behalf of public static methods OldMacLineEndings,
/// UnixLineEndings, and WindowsLineEndings, private static method
/// LineEndingFixup uses an instance of this class to manage the
/// resources required to perform its work. Public method UpdateState
/// does most of the work required by LineEndingFixup.
/// </summary>
private class LineEndingFixupState
{
    #region Public Interface of nested LineEndingFixupState class
    /// <summary>
    /// This enumeration is used internally to indicate the state of the
    /// LineEndingFixup state machine.
    /// </summary>
    public enum CharacterType
    {
        /// <summary>
        /// The initial state of the machine is that the last character seen
        /// is unknown.
        /// </summary>
        Indeterminate ,

        /// <summary>
        /// The last character seen isn't a newline character.
        /// </summary>
        Other ,

        /// <summary>
        /// The last character seen was a bare CR character, which is either
        /// the old Macintosh line break character, or belongs to a
        /// two-character line break.
        /// </summary>
        OldMacintosh ,

        /// <summary>
        /// The last character seen was a bare LF character, which is either
        /// a Unix line break character, or belongs to a two-character line
        /// break.
        /// </summary>
        Unix
    };  // public enum CharacterType


    /// <summary>
    /// The constructor is kept private to guarantee that all instances
    /// are fully initialized.
    /// </summary>
    private LineEndingFixupState ( )
    {
    }   // private LineEndingFixupState constructor prohibits uninitialized instances.


    /// <summary>
    /// The only public constructor initializes an instance for use by
    /// LineEndingFixup.
    /// </summary>
    /// <param name="penmRequiredLineEndings">
    /// LineEndingFixup uses a member of the RequiredLineEndings
    /// enumeration to specify the type of line endings to be included
    /// in the new string that it generates from input string
    /// <paramref name="pstrInput"/>.
    /// </param>
    /// <param name="pstrInput">
    /// Existing line endings in this string are replaced as needed by
    /// the type of line endings specified by <paramref name="penmRequiredLineEndings"/>.
    /// </param>
    public LineEndingFixupState (
        RequiredLineEndings penmRequiredLineEndings ,
        string pstrInput )
    {
        NewLineEndings = penmRequiredLineEndings;
        DesiredLineEnding = SetDesiredLineEnding ( );
        _sbWork = new StringBuilder ( pstrInput.Length * MagicNumbers.PLUS_TWO );
        _achrInputCharacters = pstrInput.ToCharArray ( );
        InputCharacterCount = _achrInputCharacters.Length;
    }   // public LineEndingFixupState constructor guarantees initialized instances.


    /// <summary>
    /// Since the StringBuilder is a reference type, exposing it makes
    /// it vulnerable to attack. Therefore, it is kept private, and this
    /// instance method must be explicitly called to get its value as an
    /// immutable entity, a new string.
    /// </summary>
    /// <returns>
    /// The contents of the StringBuilder in which the transformed
    /// string is assembled are returned as a new string.
    /// </returns>
    public string GetTransformedString ( )
    {
        return _sbWork.ToString ( );
    }   // public string GetTransformedString


    /// <summary>
    /// LineEndingFixup calls this method once for each character in the
    /// string that was fed into the instance constructor. The algorithm
    /// that it implements completes all conversions in one pass.
    /// </summary>
    /// <param name="pintCurrCharPos">
    /// The index of the FOR loop within which this routine is called
    /// identifies the zero-based position within the internal array of
    /// characters that is constructed from the input string to process.
    /// </param>
    public void UpdateState ( int pintCurrCharPos )
    {
        //  ------------------------------------------------------------
        //  Processing a scalar is slightly more efficient than
        //  processing an array element.
        //  ------------------------------------------------------------

        char chrCurrent = GetCharacterAtOffset ( pintCurrCharPos );

        //  ------------------------------------------------------------
        //  Defer updating the instance property, which would otherwise
        //  break the test performed by IsRunOfNelines.
        //  ------------------------------------------------------------

        CharacterType enmCharacterType = ClassifyThisCharacter ( chrCurrent );

        if ( IsThisCharANewline ( chrCurrent ) )
        {
            if ( IsRunOfNelines ( ) )
            {   // Some newlines are pairs, of which the second is ignored.
                if ( AppendNewline ( pintCurrCharPos , enmCharacterType ) )
                {
                    _sbWork.Append ( DesiredLineEnding );
                }   // if ( AppendNewline ( pintCurrCharPos , enmCharacterType ) )
            }   // TRUE block, if ( IsRunOfNelines ( ) )
            else
            {   // Regardless, the first character elicits a newline, and set the run counter.
                _intPosNewlineRunStart = _intPosNewlineRunStart == ArrayInfo.ARRAY_INVALID_INDEX
                    ? pintCurrCharPos
                    : _intPosNewlineRunStart;
                _sbWork.Append ( DesiredLineEnding );
            }   // FALSE block, if ( IsRunOfNelines ( ) )
        }   // if ( IsThisCharANewline ( chrCurrent ) )
        else
        {   // It isn't a newline; append it, and reset the run counter.
            _sbWork.Append ( chrCurrent );
            _intPosNewlineRunStart = ArrayInfo.ARRAY_INVALID_INDEX;
        }   // if ( IsThisCharANewline ( chrCurrent ) )

        LastCharacter = enmCharacterType;
    }   // public void UpdateState


    /// <summary>
    /// Strictly speaking this string could be left private. Making it
    /// public as a read-only member is a debugging aid that preserves
    /// the integrity of the instance.
    /// </summary>
    public string DesiredLineEnding { get; private set; } = null;


    /// <summary>
    /// The FOR loop at the heart of LineEndingFixup initializes its
    /// limit value from this read-only property.
    /// </summary>
    public int InputCharacterCount { get; }


    /// <summary>
    /// Like DesiredLineEnding, this could be kept private, but is made
    /// public as a debugging aid.
    /// </summary>
    public CharacterType LastCharacter { get; private set; } = CharacterType.Indeterminate;


    /// <summary>
    /// Like DesiredLineEnding, this could be kept private, but is made
    /// public as a debugging aid.
    /// </summary>
    public RequiredLineEndings NewLineEndings { get; private set; }
    #endregion  // Public Interface of nested LineEndingFixupState class


    #region Private nested class LineEndingFixupState code and data
    /// <summary>
    /// Use the current character position relative to the beginning of
    /// the run of newline characters and the type of the current and
    /// immediately previous character in the run to determine whether a
    /// newline should be emitted.
    /// </summary>
    /// <param name="pintCurrCharPos">
    /// The position (offset) of the current character is compared with
    /// the position of the first character in the current run of
    /// newline characters to determine whether to append a newline.
    /// </param>
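    /// <param name="penmCurrentCharacterType">
    /// The type of the current character is compared with the type of
    /// the immediately previous character, LastCharacter, when the
    /// current character sits at an odd offset within the run.
    /// </param>
    /// <returns>
    /// Return TRUE if a newline should be appended for the current
    /// character. Otherwise, return FALSE because the character is the
    /// second half of a two-character line break.
    /// </returns>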
    private bool AppendNewline (
        int pintCurrCharPos ,
        CharacterType penmCurrentCharacterType )
    {
        const int LONGEST_VALID_NEWLINE_SEQUENCE = MagicNumbers.PLUS_TWO;

        switch ( ( pintCurrCharPos - _intPosNewlineRunStart ) % LONGEST_VALID_NEWLINE_SEQUENCE )
        {
            case MagicNumbers.EVENLY_DIVISIBLE:
                return true;
            default:
                return penmCurrentCharacterType == LastCharacter;
        }   // switch ( ( pintCurrCharPos - _intPosNewlineRunStart ) % LONGEST_VALID_NEWLINE_SEQUENCE )
    }   // private bool AppendNewline

   
    /// <summary>
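    /// Classify the specified character as a member of the CharacterType
    /// enumeration.
    /// </summary>
    /// <param name="pchrCurrent">
    /// Pass in the current character, which is about to become the last
    /// character processed.
    /// </param>
    /// <returns>
    /// Return CharacterType.OldMacintosh for a carriage return,
    /// CharacterType.Unix for a line feed, and CharacterType.Other for
    /// anything else.
    /// </returns>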
    private CharacterType ClassifyThisCharacter ( char pchrCurrent )
    {
        switch ( pchrCurrent )
        {
            case CHAR_SPLIT_OLD_MACINTOSH:
                return CharacterType.OldMacintosh;
            case CHAR_SPLIT_UNIX:
                return CharacterType.Unix;
            default:
                return CharacterType.Other;
        }   // switch ( pchrCurrent )
    }   // private CharacterType ClassifyThisCharacter


    /// <summary>
    /// Evaluate the character at a specified position in the input
    /// string, returning TRUE if it is a newline character (CR or LF).
    /// </summary>
    /// <param name="pchrThis">
    /// Specify the character to evaluate.
    /// </param>
    /// <returns>
    /// Return TRUE if the character at the position specified by
    /// <paramref name="pchrThis"/> is a newline (CR or LF)
    /// character. Otherwise, return FALSE.
    /// </returns>
    private bool IsThisCharANewline ( char pchrThis )
    {
        switch ( pchrThis )
        {
            case CHAR_SPLIT_OLD_MACINTOSH:
            case CHAR_SPLIT_UNIX:
                return true;
            default:
                return false;
        }   // switch ( pchrThis )
    }   // private bool IsThisCharANewline


    /// <summary>
    /// Determine whether the current newline character belongs to a run of them.
    /// </summary>
    /// <returns>
    /// Return TRUE unless _intPosNewlineRunStart is equal to
    /// ArrayInfo.ARRAY_INVALID_INDEX, in which case return FALSE. Though
    /// this method could go ahead and update _intPosNewlineRunStart, it
    /// leaves it unchanged, so that its execution is devoid of side
    /// effects.
    /// </returns>
    private bool IsRunOfNelines ( )
    {
        return ( _intPosNewlineRunStart != ArrayInfo.ARRAY_INVALID_INDEX );
    }   // private bool IsRunOfNelines


    /// <summary>
    /// Switch blocks in public instance method UpdateState use this
    /// routine to return the character at the position (offset)
    /// specified by <paramref name="pintCurrCharPos"/>.
    /// </summary>
    /// <param name="pintCurrCharPos">
    /// The zero-based offset that was fed into instance method
    /// UpdateState by its controller, LineEndingFixup.
    /// </param>
    /// <returns>
    /// The return value is the character at the specified position
    /// (offset) in the input string, a copy of which is maintained in
    /// private character array _achrInputCharacters. Returning this in
    /// a method exposes the actual character that determines which
    /// branch of the switch block executes. Otherwise, the debugger
    /// reports only the return value returned by the property getter.
    /// It is anticipated that this routine will be optimized away in a
    /// release build.
    /// </returns>
    private char GetCharacterAtOffset ( int pintCurrCharPos )
    {
        return _achrInputCharacters [ pintCurrCharPos ];
    }   // private char GetCharacterAtOffset


    /// <summary>
    /// The public constructor invokes this method once, during the
    /// initialization phase, to establish the value of the desired line
    /// ending, which may be a single character or a pair of them.
    /// </summary>
    /// <returns>
    /// The return value is always a string that contains one character or a
    /// pair of them.
    /// </returns>
    private string SetDesiredLineEnding ( )
    {
        const string WINDOWS_LINE_BREAK = SpecialStrings.STRING_SPLIT_NEWLINE;

        switch ( NewLineEndings )
        {
            case RequiredLineEndings.OldMacintosh:
                return CHAR_SPLIT_OLD_MACINTOSH.ToString ( );
            case RequiredLineEndings.Unix:
                return CHAR_SPLIT_UNIX.ToString ( );
            case RequiredLineEndings.Windows:
                return WINDOWS_LINE_BREAK;
            default:
                throw new InvalidEnumArgumentException (
                    nameof ( NewLineEndings ) ,
                    ( int ) NewLineEndings ,
                    NewLineEndings.GetType ( ) );
        }   // switch ( NewLineEndings )
    }   // private string SetDesiredLineEnding


    /// <summary>
    /// The constructor initializes this character array from the input
    /// string. Thereafter, public method UpdateState processes it one
    /// character at a time.
    /// </summary>
    private readonly char [ ] _achrInputCharacters = null;


    /// <summary>
    /// When two or more newline characters appear in a sequence, it is
    /// essential to determine whether they are a pair and, if so, treat
    /// them as such.
    /// </summary>
    private int _intPosNewlineRunStart = ArrayInfo.ARRAY_INVALID_INDEX;


    /// <summary>
    /// Since the StringBuilder is a reference type, exposing it makes
    /// it vulnerable to attack. Therefore, it is kept private, and a
    /// public instance method, GetTransformedString, must be explicitly
    /// called to get its value as a new string, an immutable entity.
    /// </summary>
    private StringBuilder _sbWork { get; }
    #endregion  // Private nested class LineEndingFixupState code and data
}   // private class LineEndingFixupState
#endregion  // Private nested class LineEndingFixupState

All three methods call LineEndingFixup with the input string and a member of the RequiredLineEndings enumeration that identifies the type of line breaks required in its output string. Its first task is to construct a new LineEndingFixupState instance, which is the state machine, from both. Next, it retrieves the length of the input string from the LineEndingFixupState object through its InputCharacterCount property, and a loop calls UpdateState once for each character offset. UpdateState evaluates the character at that offset, updates the state as it goes, and generates the desired type of line break when a new line break is encountered. Any character that isn't part of a line break is appended to a StringBuilder that the LineEndingFixupState instance maintains. After the last character is processed, GetTransformedString is called, and the string that it returns is passed back up the call stack.

  • The UpdateState method begins by copying the character at the offset indicated by its integer argument into a local character variable. Next, ClassifyThisCharacter, a private method, is called upon to identify the character as one of the two valid line break characters, or something else. Since all three conventions use only these two characters, alone or in combination, they are the only two that need special treatment. Anything else is appended to the output string.
  • If the current character is a line break character, indicated by IsThisCharANewline returning true, instance method IsRunOfNelines determines whether the current character belongs to a run of two or more consecutive line break characters, which it does by checking whether the start position of a run has already been recorded.
  • If the offset of the current character from the start of the run is evenly divisible by two (the length of the longest valid line break sequence), or the current character and the immediately previous character are identical, a new line break has been found, and a line break of the required type is appended to the output string.
  • In any event, the first character of a run of line break characters causes a line break to be appended and records the start position of the run.
  • Finally, any character besides a line break causes the line break start position to be reset to -1, an invalid value for a character offset.
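
The rules in the list above can be condensed into a short, self-contained sketch. This is not the code from the library; the class name LineBreakSketch and the method name Normalize are hypothetical, and the sketch assumes that folding the state machine into a single loop preserves the behavior described above. It takes the desired line ending as a string instead of a RequiredLineEndings member to keep the example free of dependencies.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the state machine; not part of the article's library.
public static class LineBreakSketch
{
    public static string Normalize ( string input , string desiredLineEnding )
    {
        const int NO_RUN = -1;      // Invalid offset: no run of newline characters is in progress.

        StringBuilder sb = new StringBuilder ( input.Length );
        int runStart = NO_RUN;      // Offset of the first character of the current run
        char last = '\0';           // Immediately previous character

        for ( int pos = 0 ; pos < input.Length ; pos++ )
        {
            char current = input [ pos ];

            if ( current == '\r' || current == '\n' )
            {
                if ( runStart == NO_RUN )
                {   // The first character of a run always yields a line break.
                    runStart = pos;
                    sb.Append ( desiredLineEnding );
                }
                else if ( ( pos - runStart ) % 2 == 0 || current == last )
                {   // Even offset within the run, or a repeated character: a new line break.
                    sb.Append ( desiredLineEnding );
                }
                // Otherwise, this is the second half of a pair that was already counted.
            }
            else
            {   // Any other character ends the run and is copied unchanged.
                runStart = NO_RUN;
                sb.Append ( current );
            }

            last = current;
        }

        return sb.ToString ( );
    }
}
```

For example, Normalize ( "a\r\nb\nc\rd" , "\n" ) returns "a\nb\nc\nd", because the Windows pair, the lone line feed, and the lone carriage return each count as one line break.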

There is a lot more cool stuff happening in this assembly and the libraries upon which it relies. Stay tuned for further articles about some of it, the next of which is Adapting JSON Strings for Deserializing into C# Objects.

History

Wednesday, 19 June 2019: Initial publication

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)