
CleanR - Text file string search and replace engine.

Kamal Shankar


1 Jul 2003

An engine to allow you to search for strings in text files and replace them.

Introduction

This idea entered my head after I received hundreds of e-mails from people asking me to help them clean systems infected with the HTML-based script virus Win32.Redlof.A, following my own article on it at http://neworder.box.sk titled "Paper on the Win32.Redlof.A virus" (or so).

I set out to write a small utility which would find the infected files and remove the offending code, and so CleanR was born. What this utility does is search for occurrences of a specific string inside TEXT (ASCII) files and replace them with another string, which may also be NULL (which amounts to deleting the search string).

Thus, to remove the virus string, I set the search string to the virus script and the replacement string to NULL. Thankfully this approach worked, and everybody lived happily ever after.

Soon after, people were using my little tool for everything from removing advertisement and popup script lines from saved HTML pages to correcting typos in their text files.

Last time I heard, someone was recommending my utility to remove the following script from saved Yahoo! mail pages:

<script>
<!-- 
    function Help(link)
    {
    window.open(link,"help",
           "width=400,height=500,scrollbars=yes,dependent=yes");
    }
    if (document.cookie.indexOf("o/Rl.A") == -1) {
    window.open("http://mail.yahoo.com", "_top");
    }
// -->

</script>

By specifying the above script as the search string and leaving the replacement string as NULL, the script was removed and they were able to view their saved mail again! (Otherwise, with the above script in place, only a window to http://mail.yahoo.com would have opened instead of the saved mail!) (If I am correct, Yahoo! has since stopped putting the above script into their pages!)

What you can now do with the options

"Check only words" : Ask the utility to replace only whole words instead of substrings. For example, if you want to replace "Friends" with "Foes" and the program comes across the string "WeAreAllFriends", it will NOT touch it, but "We are all Friends" would become "We are all Foes".
"Be strict on grammar" : Ask the utility to change ONLY those words which are used with correct grammar. For example, if the search string is "Friends" and the program comes across:
```
"What kind of Friends!Wish I died!"
"What kind of Friends;wish I died!"
"What kind of Friends!-Wish I died!"
```
etc., these occurrences are ignored. Examples of valid strings under the "Strict grammar" option are:
```
"What kind of Friends! Wish I died!"
"What kind of Friends, Wish I died!"
"What kind of Friends - Wish I died!"
"What kind of Friends wish I died!"
```
[Conditions such as a new sentence starting with a capital letter are NOT checked at present.]

As this option automatically includes the condition for checking ONLY words, the separate "Check only words" option is disabled once you select this option.
"Ignore last word punctuation criteria" : Available under the "Strict grammar" option.
With this option, the grammar validity of the last word in a text file is not checked if that word matches the search string.
For example, suppose "Bye" is the search string and "Bye" is the last word in the text file without the proper punctuation mark(s) following it (generally the period), as in
"With lotsa love - Bye" being the LAST line in the text file. With this option checked, the missing punctuation mark is ignored and "Bye" is treated as a match; without it, this occurrence would have been skipped.
"Overwrite original file" : Ask the program to OverWrite the original file with the modified file.
"Get search string from file" : On checking this box, a file dialog will open where you can select the text file containing your search string. This will also disable the search string edit box.
"Get replacement string from file" : On checking this box, a file dialog will open where you can select the text file containing the string to replace the search string in matching files.
This will also disable the replacement string edit box.
"Make case sensitive" : Checked by default, hence ALL text manipulations are case sensitive - "Friends" and "friends" are two DIFFERENT strings by default !
You just have to uncheck the "Case sensitive" checkbox to make a case insensitive search.

Note: A case-insensitive search consumes DOUBLE the memory of the same case-sensitive search.

With ALL options turned off, substrings will ALSO qualify as a word. Use the "Browse" button to select the file on which you want to operate. The sample input file is Test.txt (included in the package), and the processed file is TestOut.txt.
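To make the matching options above more concrete, here is a minimal sketch of how they could be implemented. It is NOT the CleanR source: `ReplaceOptions`, `replaceMatches` and the helper functions are hypothetical names, it uses `std::string` rather than MFC's `CString`, and the "strict grammar" rule (at most one punctuation mark after the word, followed by white space or the end of the text) is my reading of the examples above. The lower-cased copy used for case-insensitive matching is one plausible reason such a search would need about twice the memory, but that is an assumption too.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical option set for this sketch (not the real CCleanRboolSet).
struct ReplaceOptions {
    bool wholeWordsOnly = false;
    bool strictGrammar = false;              // also implies whole words only
    bool ignoreLastWordPunctuation = false;  // only used with strictGrammar
    bool caseSensitive = true;
};

static bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

static std::string toLower(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Assumed "strict grammar" rule: the match may be followed by at most one
// punctuation mark, and after that by white space or the end of the text.
static bool followsStrictGrammar(const std::string& s, std::size_t end,
                                 bool ignoreLastWordPunctuation) {
    if (end >= s.size()) return ignoreLastWordPunctuation;  // bare last word
    std::size_t i = end;
    if (std::ispunct(static_cast<unsigned char>(s[i])) != 0) {
        ++i;
        if (i == s.size()) return true;  // punctuated last word
    }
    return std::isspace(static_cast<unsigned char>(s[i])) != 0;
}

// Replaces accepted matches of `search` in `text` with `replacement` (which may
// be empty) and returns how many matches were replaced.
std::size_t replaceMatches(std::string& text, const std::string& search,
                           const std::string& replacement, const ReplaceOptions& opt) {
    if (search.empty()) return 0;

    // Case-insensitive matching runs on lower-cased copies; offsets still line
    // up with `text` for single-byte text such as ASCII.
    const std::string lowered = opt.caseSensitive ? std::string() : toLower(text);
    const std::string& original = text;
    const std::string& haystack = opt.caseSensitive ? original : lowered;
    const std::string needle = opt.caseSensitive ? search : toLower(search);

    std::string result;
    std::size_t count = 0;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = haystack.find(needle, pos);
        if (hit == std::string::npos) break;
        const std::size_t end = hit + needle.size();

        const bool wordBefore = hit > 0 && isWordChar(haystack[hit - 1]);
        const bool wordAfter = end < haystack.size() && isWordChar(haystack[end]);
        bool accept = true;
        if (opt.strictGrammar) {
            accept = !wordBefore &&
                     followsStrictGrammar(haystack, end, opt.ignoreLastWordPunctuation);
        } else if (opt.wholeWordsOnly) {
            accept = !wordBefore && !wordAfter;
        }

        if (accept) {
            result.append(text, pos, hit - pos);  // untouched text before the match
            result += replacement;
            pos = end;
            ++count;
        } else {
            result.append(text, pos, hit + 1 - pos);  // skip one character, search on
            pos = hit + 1;
        }
    }
    result.append(text, pos, std::string::npos);  // tail after the last match
    text.swap(result);
    return count;
}
```

With `wholeWordsOnly` set, "WeAreAllFriends" is left alone while "We are all Friends" becomes "We are all Foes"; with every option off, the substring inside "WeAreAllFriends" is replaced as well.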

Text files and the CR/LF (0x0D & 0x0A pair) set - impact on this program

Let me assert that what I am going to say is STRICTLY my own opinion - there is NO international standard putting this opinion as a requirement or standard.

What I want to say is :

"Line breaks in text files should be represented as a CR/LF pair."

Please be aware that you are free to choose if you want to represent line breaks as LF or CR or CR/LF.

Actually what we nowadays mean by Line breaks is :

A Line of text [line break here]
Next line continues from here..

Back in the DOS days, the sequence for [line break here] would be "\r\n" i.e. 0x0D & 0x0A.

Whereas, a lone LF ('\n' i.e. 0x0A) would be used to represent :

A Line of text [LF here]
           Next line continues from here..

Got it? An LF is NOT supposed to return the carriage to the start of the next line, but simply to move down a line and continue from the current x coordinate!

But nowadays, thanks to the individuality of CR and LF NOT being recognized, we have to simulate the second line indentation above with a few tabs and/or white spaces, whereas just one LF would have sufficed !

A lone CR was then often used to signify the END of transmission of a sequence of characters. However, with better methods of transmission signaling, this soon became just a formality, and even nowadays socket libraries recognize a "\r\n" as the end of transmission of data or commands.

Now the obvious question that you ask is : Why are you telling me this ? Well, good question ! (It shows that you are reading my article with enthusiasm ;) Here comes the reason :-

Text files such as HTML pages that were transferred in binary mode may NOT have the expected line-break sequences. You can spot such files by opening them in both Wordpad and Notepad. In Notepad, the HTML looks unformatted (no proper indentation etc.), but in Wordpad it looks well laid out. This is because the HTML file was transferred in binary mode over the internet, so the newlines and carriage returns were NOT corrected as they would have been in a text-mode transfer. This is why the HTML source looks ugly in Notepad. Wordpad, however, corrects the carriage-return/newline sequences, and that's why the HTML source looks so proper.

So, suppose you download a HTML page which had been displayed in your browser as :

First line.
Second line.

And you want my program to replace the above text with:

No ! I don't want these two lines.

So you enter

First line.
Second line.

as search string into my program and replacement string as

No ! I don't want these two lines.

and press OK.

Then you see that my program "misbehaves", and foolishly says "Sorry! .. search string NOT found !".

And so you send me an email saying how #%***@! I am and how you wish **** ! (Yeah ! The first time I released this utility, I really got 52 such mails about this VERY thing. That's why I updated my program to add the conversion routines !)

Important: This NOTE is NOT applicable to WebPages saved with Internet Explorer and derivative browsers. IE automatically does the required conversion before saving (but you can experiment anyway).

The reason why my program "misbehaves" is that during binary transfers, the data is NOT modified by the system in any way. Hence, if a new line was represented as '\n' on the server, it stays a lone '\n' and does NOT become the desired "\r\n".

So what displayed as

First line.
Second line.

in your browser was perhaps actually

First line.[LF]Second line.

while you entered

First line.[CR/LF]
Second line.

as the search string ! I have verified this by looking at such a file using a hexeditor. You can too.
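If you do not have a hex editor at hand, a few lines of code can tell you which kind of line break a file uses. The following is only a sketch, not part of CleanR; `LineEndingCounts` and `countLineEndings` are hypothetical names, and the data must have been read from the file in binary mode so that the runtime does not translate line breaks behind your back.

```cpp
#include <cstddef>
#include <string>

// Tallies of the three line-break styles found in a buffer.
struct LineEndingCounts {
    std::size_t crlf = 0;    // 0x0D 0x0A pairs
    std::size_t loneLf = 0;  // 0x0A not preceded by 0x0D
    std::size_t loneCr = 0;  // 0x0D not followed by 0x0A
};

// Counts the line-break styles in `data` (the raw bytes of a text file).
LineEndingCounts countLineEndings(const std::string& data) {
    LineEndingCounts counts;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == '\r') {
            if (i + 1 < data.size() && data[i + 1] == '\n') {
                ++counts.crlf;
                ++i;  // skip the LF that belongs to this pair
            } else {
                ++counts.loneCr;
            }
        } else if (data[i] == '\n') {
            ++counts.loneLf;
        }
    }
    return counts;
}
```

A file for which `loneLf` is non-zero and `crlf` is zero is exactly the kind of file that trips up a search string typed with CR/LF line breaks.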

So all that needs to be done is to change every lone '\n' (and NOT one that is already part of a "\r\n") to "\r\n", that is, change a lone 0x0A to 0x0D 0x0A. Then the file is ready for processing, because, you see, "First line.[LF]Second line." is NOT the same as "First line.[CR/LF]Second line."!

The current version can now operate on such files and allows you to do the following:

Convert search string CR/LF to LF : (Change the CR/LF pairs of the search string to lone LFs) This allows you to operate on the above mentioned files, where newlines are represented by lone LFs rather than the proper CR/LF combination.
Convert file LF -> CR/LF : (Change ALL lone LFs of the file you are operating on to CR/LF pairs) This reformats the above mentioned type of file so that every lone LF is substituted by the correct pair. The processed file will then be properly formatted and readable even in simple text editors like Notepad.
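The two conversions can be sketched as follows. This is an illustration under my own assumptions rather than the code shipped with CleanR: `lfToCrlf` and `crlfToLf` are hypothetical helper names and the buffers are plain `std::string` objects.

```cpp
#include <cstddef>
#include <string>

// Changes every lone LF to a CR/LF pair; existing CR/LF pairs are left as they are.
std::string lfToCrlf(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r')) {
            out += '\r';  // supply the missing CR
        }
        out += in[i];
    }
    return out;
}

// Changes every CR/LF pair to a lone LF; a CR that is not followed by LF is kept.
std::string crlfToLf(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r' && i + 1 < in.size() && in[i + 1] == '\n') {
            continue;  // drop the CR, the LF is copied on the next iteration
        }
        out += in[i];
    }
    return out;
}
```

`crlfToLf` corresponds to the first option (applied to the search string) and `lfToCrlf` to the second (applied to the file contents). Checking the previous byte in `lfToCrlf` is what keeps an existing "\r\n" from turning into "\r\r\n".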

Program performance and Limitations

Please note that in order to boost program performance, and reduce programming bugs, the WHOLE text file is read into RAM. This approach has been detested by some, but appreciated by many, so I decided to stick with it.

However, this approach puts a constraint on the program - the file size on which you can operate is limited NOT only by the RAM you have, but also by the maximum default dynamic memory allocation that your OS and the MFC architecture allow to a single program, AND by the program heap size allocation, which changes from system to system and from time to time.

But as far as I am concerned, I do NOT have 100MB long text files to check my program with - I tried a 5MB HTML file maximum, and it worked fine. I am trying to put GlobalAllocation functions to use, but this will take time - anybody who can do it will do me, if not others using this program, a GREAT favour.
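For readers who want to see what "read the WHOLE file into RAM" looks like in code, here is a minimal sketch using only the standard C++ library. It is an assumption-laden stand-in, not the MFC-based code of the engine; `readWholeFile` and `writeWholeFile` are hypothetical names.

```cpp
#include <fstream>
#include <ios>
#include <iterator>
#include <string>

// Reads the complete file into `contents`. Returns false if the file cannot be
// opened. Binary mode keeps CR and LF bytes exactly as they are on disk.
bool readWholeFile(const std::string& path, std::string& contents) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    contents.assign(std::istreambuf_iterator<char>(in),
                    std::istreambuf_iterator<char>());
    return true;
}

// Writes `contents` to the file, replacing whatever was there before.
bool writeWholeFile(const std::string& path, const std::string& contents) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
    return static_cast<bool>(out);
}
```

The whole file lives in one string, so the search and replace can work on it without touching the disk again. If the file is too large for the memory available to the process, the allocation inside `assign` fails with a `std::bad_alloc` exception, which is the limitation described above.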

I should also bring to your notice that though I have tested the program with text files having more complex text than any real life text file can have, I may still have missed something - if you come across a BUG, please NOTIFY ME immediately with anything that can help me reproduce it, and/or with suggestions for improvement.

Using the code

I would like to bring to your notice the ScanR utility, which is also at CodeProject, is produced by yours truly, and has put this engine to the test. In fact, ScanR is a unique text file string search and replacement utility, and as we are handling MANY files together there, the CleanR engine has been optimized in a few places - so you may want to have a look there!

IMHO, ScanR is currently the BEST example demonstrating the power of this class. You can get an idea of how fast this engine is, and with your help it will surely become even better (I know that there are bugs - but where? Gonna find out :) )

You can use the class straightaway without making ANY modifications (and that will be better for tracking down bugs in the code).

Using the class is as easy as this:

CCleanRboolSet theSet;

// ... set the member variables of theSet ...

CCleanR theR(theSet);  // call the parameterized constructor to send the options

theR.SetFileName(/* file path here */);
theR.SetReplacementString(/* replacement string here */);
theR.SetSearchString(/* search string here */);
theR.Process();

The steps involved are:

Declare an object of the CCleanRboolSet structure (say theSet), which encapsulates the program options. Initialize its member variables to reflect your requirements. For example, to control whether the search is case sensitive, do the following:
```
theSet.bCaseSensetive=m_bCaseSensetive;
```
Create an object of CCleanR, and pass the above object into the constructor - for this example it will be:
```
CCleanR theR(theSet);
```

Set the filename on which we will be operating, the search string and the replacement string, for example:

theR.SetFileName(/* file path here */);
theR.SetReplacementString(/* replacement string here */);
theR.SetSearchString(/* search string here */);

Then call the CCleanR::Process() function to do the processing; this function returns the number of matches found.
```
theR.Process();
```
You are done !

Points of Interest (mainly to me)

While coding this engine, I found out that CString really does a GOOD thing - it changes ALL occurrences of "\n" to "\r\n", which may take more space (one more byte for every '\n' replaced), but it's THE correct method for text files. Some simple text editors store the Enter key as a plain newline and NOT as a carriage return-newline pair, which is the recommended way of storing line breaks in text files.

As I have told you before, I am reading the WHOLE text file into memory which boosts program performance in a BIG, BIG way. I also assumed that the user is not so productive so as to create 100MB text files. But I am really curious to know what might happen then ? So far even a 15MB text file has caused NO problem. To be frank, a 15MB text file is way TOO much of reading for me ;)

Many of you may point out that using CreateFileMapping() might help - do send in your views. I think that we will mostly be using this tool on small text files, and file mapping would add way too much overhead. That's why I am hesitant about file mapping.

History

14th June - Public open release.
23rd June - Added code for CR/LF inter-conversions.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
