Introduction
The idea for this utility came to me after I received hundreds of e-mails from
people asking me to help them clean systems infected with the HTML based script
virus Win32.Redlof.A, following my own article on the virus at
http://neworder.box.sk, titled "Paper on the Win32.Redlof.A virus" (or so).
I set out to write a small utility which would find the infected files and
remove the offending code, and so CleanR was born. What this utility does is
search for every occurrence of a specific string inside TEXT (ASCII) files and
replace it with another string, which may also be NULL (an empty replacement
amounts to removing the search string completely).
Thus to remove the virus string, I set the search string to the virus script
and the replacement string to NULL. Thankfully this approach worked, and
everybody lived happily ever after.
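The idea can be sketched in a few lines of standard C++. This is only an illustration of "replace every occurrence, where an empty replacement means removal" - it is NOT the CleanR source, and the names replaceAll and demoRemoveScript are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of CleanR): replaces every occurrence of
// `search` in `text` with `replacement` and returns how many were replaced.
// An empty `replacement` simply removes each occurrence.
std::size_t replaceAll(std::string& text,
                       const std::string& search,
                       const std::string& replacement)
{
    if (search.empty()) return 0;          // nothing sensible to look for
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(search, pos)) != std::string::npos) {
        text.replace(pos, search.size(), replacement);
        pos += replacement.size();         // continue after the inserted text
        ++count;
    }
    return count;
}

// Usage example: removing a script block by using an empty replacement.
void demoRemoveScript()
{
    std::string page = "<p>Hi</p><script>bad()</script><p>Bye</p>";
    const std::size_t removed = replaceAll(page, "<script>bad()</script>", "");
    assert(removed == 1);
    assert(page == "<p>Hi</p><p>Bye</p>");
    (void)removed;                         // silence warnings when asserts are off
}
```

Advancing past the inserted text means a replacement string that itself contains the search string is never scanned again, so the loop always ends.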
Soon after, people were using my little tool for everything from removing
advertisement and popup script lines from saved HTML pages to correcting typos
in their text files.
Last time I heard, someone was recommending my utility to remove the
following script from saved Yahoo! mail pages:
<script>
<!--
function Help(link)
{
window.open(link,"help",
"width=400,height=500,scrollbars=yes,dependent=yes");
}
if (document.cookie.indexOf("o/Rl.A") == -1) {
window.open("http://mail.yahoo.com", "_top");
}
</script>
By specifying the above script as the search string and keeping the
replacement string NULL, the script was removed and they were able to view
their saved mail again! (Otherwise, the above script would only have opened a
window to http://mail.yahoo.com instead of the saved mail!) (If I am correct,
Yahoo! has since stopped putting the above script into their pages!)
What you can now do with the options
- "Check only words" : Ask the utility to replace only whole words
instead of substrings, i.e., if you want to replace "Friends" with "Foes" and
the program comes across the string "WeAreAllFriends", it will NOT touch this
string, but if the string is "We are all Friends", it will be changed to
"We are all Foes".
Please note that by default, ALL text manipulations are case sensitive -
"Friends" and "friends" are two DIFFERENT strings by default! You just have to
uncheck the "Make case sensitive" checkbox to make a case insensitive search.
Note: During a case insensitive search, this program will consume DOUBLE
the memory of the same case sensitive search.
- "Be strict on grammar" : Ask the utility to change ONLY those words
which have correct grammar - e.g., if the search string is "Friends", and the
program comes across:
"What kind of Friends!Wish I died!"
"What kind of Friends;wish I died!"
"What kind of Friends!-Wish I died!"
etc., they are ignored. Examples of valid strings under the "Strict grammar"
option are:
"What kind of Friends! Wish I died!"
"What kind of Friends, Wish I died!"
"What kind of Friends - Wish I died!"
"What kind of Friends wish I died!"
[Conditions such as a new sentence starting with a capital letter are NOT
checked for now]
As this option automatically includes the condition for checking ONLY words,
the separate option will be disabled once you select this option.
- "Ignore last word punctuation criteria" : Under the "Strict Grammar"
option, you have the choice "Ignore last word punctuation".
This option skips the grammar check for the last word in a text file if that
word matches the search string.
For example, if "Bye" is the search string, the above option is checked, and
"Bye" is the last word in the text file without the proper punctuation mark(s)
following it (generally the period), as in:
"With lotsa love - Bye" being the LAST line in the text file, then the fact
that "Bye" lacks the required punctuation mark (the period) will be ignored,
and "Bye" will be replaced instead of being skipped, as it would have been if
the option had NOT been enabled.
- "Overwrite original file" : Ask the program to OverWrite the original
file with the modified file.
- "Get search string from file" : On checking this box, a file dialog
will open where you can select the text file containing your search string. This
will also disable the search string edit box.
- "Get replacement string from file" : On checking this box, a file
dialog will open where you can select the text file containing the string to
replace the search string in matching files. This will also disable the
replacement string edit box.
- "Make case sensitive" : Checked by default, hence ALL text
manipulations are case sensitive - "Friends" and "friends" are two DIFFERENT
strings by default! You just have to uncheck this checkbox to make a case
insensitive search.
Note: During a case insensitive search, this program will consume
DOUBLE the memory of the same case sensitive search.
With ALL options turned off, substrings will ALSO be matched, not just whole
words. Use the "Browse" button to select the file on which you want to operate.
The sample input file is Test.txt (which is packaged with the program), and the
processed file is named TestOut.txt.
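To make the "Check only words" and "Make case sensitive" options concrete, here is a minimal sketch in standard C++. It is an assumption about how such options could work, NOT the actual CleanR implementation: the names replaceWords, toLowerCopy and isWordChar are hypothetical, a "word" is taken to be a run of ASCII letters and digits, and the "Be strict on grammar" rules are not covered.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Lower-cased copy of `s` (ASCII only). Searching in such a copy is one simple
// way to ignore case; it needs a second buffer as large as the text itself.
std::string toLowerCopy(const std::string& s)
{
    std::string out(s);
    for (char& c : out)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return out;
}

// Assumption for this sketch: letters and digits are "word" characters.
bool isWordChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Replaces `search` by `replacement` in `text`; returns the number of hits.
//   wholeWords    - skip matches glued to another letter or digit
//   caseSensitive - when false, "Friends" also matches "friends"
std::size_t replaceWords(std::string& text, const std::string& search,
                         const std::string& replacement,
                         bool wholeWords, bool caseSensitive)
{
    if (search.empty()) return 0;

    std::string loweredText;                        // only filled when needed
    if (!caseSensitive) loweredText = toLowerCopy(text);
    const std::string& haystack = caseSensitive ? text : loweredText;
    const std::string needle = caseSensitive ? search : toLowerCopy(search);

    std::string result;
    std::size_t count = 0;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = haystack.find(needle, pos);
        if (hit == std::string::npos) break;
        const std::size_t end = hit + needle.size();
        const bool leftOk  = (hit == 0) || !isWordChar(haystack[hit - 1]);
        const bool rightOk = (end == haystack.size()) || !isWordChar(haystack[end]);
        if (!wholeWords || (leftOk && rightOk)) {
            result.append(text, pos, hit - pos);    // copy the untouched part
            result += replacement;
            pos = end;
            ++count;
        } else {
            result.append(text, pos, hit + 1 - pos); // rejected: move on one char
            pos = hit + 1;
        }
    }
    result.append(text, pos, std::string::npos);    // copy the tail
    text.swap(result);
    return count;
}

// Usage example built on the "Friends"/"Foes" strings from the text above.
void demoWordOptions()
{
    std::string glued = "WeAreAllFriends";
    replaceWords(glued, "Friends", "Foes", true, true);
    assert(glued == "WeAreAllFriends");             // substring, left alone

    std::string spaced = "We are all Friends";
    replaceWords(spaced, "Friends", "Foes", true, true);
    assert(spaced == "We are all Foes");

    std::string lower = "We are all friends";
    replaceWords(lower, "Friends", "Foes", true, true);
    assert(lower == "We are all friends");          // case differs, no match
    replaceWords(lower, "Friends", "Foes", true, false);
    assert(lower == "We are all Foes");
}
```

The lower-cased copy in this sketch is one plausible reason why a case insensitive search would need roughly twice the memory; whether CleanR does it this way is not stated here.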
Text files and the CR/LF (0x0D & 0x0A pair) set - impact on this program
Let me assert that what I am going to say is STRICTLY my own opinion
- there is NO international standard putting this opinion as a requirement or
standard.
What I want to say is :
"Line breaks in text files should be represented as a CR/LF pair."
Please be aware that you are free to choose if you want to represent line
breaks as LF or CR or CR/LF.
Actually what we nowadays mean by Line breaks is :
A Line of text [line break here]
Next line continues from here..
Back in the DOS days, the sequence for [line break here] would be
"\r\n" i.e. 0x0D & 0x0A.
Whereas a lone LF ('\n' i.e. 0x0A) would be used to represent:
A Line of text [LF here]
                        Next line continues from here..
Got it? A LF is NOT supposed to return the carriage to the start of the next
line, but simply move down a line and continue from the current
x coordinate!
But nowadays, thanks to the individuality of CR and LF NOT being recognized,
we have to simulate the second line indentation above with a few tabs and/or
white spaces, whereas just one LF would have sufficed !
A lone CR was then often used to signify an END of transmission
of a sequence of characters. However, with better methods of transmission
signaling, this soon became just a formality, and even nowadays many text based
network protocols use a "\r\n" to mark the end of a line of data or a command.
Now the obvious question that you ask is : Why are you telling me this ?
Well, good question ! (It shows that you are reading my article with
enthusiasm ;) Here comes the reason :-
Text files like HTML pages transferred in binary mode may NOT have a
few character sequences correct. You can identify these files by opening them
separately in Wordpad and Notepad. In Notepad, the HTML will look unformatted
(without proper indentation etc), but in Wordpad, it looks well laid out. This
is because the HTML file has been transferred in binary mode over the internet,
and as a result, the newlines and carriage returns have NOT been corrected as
they would have been if it had been transferred in text mode. This is why the
HTML source looks ugly in Notepad. Wordpad, however, corrects the
carriage-return-newline sequences, and that's why the HTML source looks so
proper.
So, suppose you download an HTML page which was displayed in your browser
as:
First line.
Second line.
And you want my program to replace the above text with:
No ! I don't want these two lines.
So you enter
First line.
Second line.
as search string into my program and replacement string as
No ! I don't want these two lines.
and press OK.
Then you see that my program "misbehaves", and foolishly says "Sorry! ..
search string NOT found !".
And so you send me an email saying how #%***@! I am and how you wish **** !
(Yeah! The first time I released this utility, I really got 52 such mails about
this VERY thing. That's why I updated my program to add the conversion
routines!)
Important: This NOTE is NOT applicable to WebPages saved with Internet
Explorer and derivative browsers. IE automatically does the required conversion
before saving (but you can experiment anyway).
The reason why my program "misbehaves" is that during binary transfers, the
data is NOT modified by the system in any way. Hence if a new line was
represented as '\n' on the server, it stays a lone '\n' and does NOT become the
desired "\r\n".
So what displayed as
First line.
Second line.
in your browser was perhaps actually
First line.[LF]Second line.
while you entered
First line.[CR/LF]
Second line.
as the search string ! I have verified this by looking at such a file
using a hexeditor. You can too.
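If you do not have a hex editor at hand, a few lines of code can tell you which kind of line break a file contains. The sketch below is only an illustration; the names LineEndingCounts and countLineEndings are made up for this example, and the file contents are assumed to have been read in binary mode so that nothing was translated on the way in.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: how many line breaks of each kind were seen.
struct LineEndingCounts {
    std::size_t crlf = 0;    // "\r\n" pairs (0x0D 0x0A)
    std::size_t loneLf = 0;  // '\n' not preceded by '\r'
    std::size_t loneCr = 0;  // '\r' not followed by '\n'
};

// Counts the three kinds of line break in `data`.
LineEndingCounts countLineEndings(const std::string& data)
{
    LineEndingCounts n;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == '\r') {
            if (i + 1 < data.size() && data[i + 1] == '\n') {
                ++n.crlf;
                ++i;             // the LF of the pair is already accounted for
            } else {
                ++n.loneCr;
            }
        } else if (data[i] == '\n') {
            ++n.loneLf;
        }
    }
    return n;
}

// Usage example: the downloaded page versus what was typed as search string.
void demoCountLineEndings()
{
    const LineEndingCounts inFile = countLineEndings("First line.\nSecond line.");
    assert(inFile.loneLf == 1 && inFile.crlf == 0);

    const LineEndingCounts typed = countLineEndings("First line.\r\nSecond line.");
    assert(typed.crlf == 1 && typed.loneLf == 0);
}
```

A file that reports only lone LFs is the kind that produces the "search string NOT found" surprise described above.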
So all that needs to be done is change a lone '\n' (AND not an existing
"\r\n") to a "\r\n". Then it will be ready for processing, because you see
"First line.[LF]Second line." is NOT same as "First line.[CR/LF]Second line." !
That is, change lone "0x0A" to "0x0D 0x0A"
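A minimal sketch of this conversion in standard C++ follows. It is an illustration only, NOT the routine shipped with CleanR; the names lfToCrLf and demoLfToCrLf are made up, and the data is assumed to have been read in binary mode.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns a copy of `in` where every lone '\n' becomes
// "\r\n". An existing "\r\n" pair is left exactly as it is.
std::string lfToCrLf(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    char prev = '\0';
    for (char c : in) {
        if (c == '\n' && prev != '\r')
            out += '\r';         // supply the missing carriage return
        out += c;
        prev = c;
    }
    return out;
}

// Usage example with the two-line text from above.
void demoLfToCrLf()
{
    assert(lfToCrLf("First line.\nSecond line.") == "First line.\r\nSecond line.");
    assert(lfToCrLf("Already fine.\r\n") == "Already fine.\r\n");
}
```

Remembering the previous character is what keeps an existing "\r\n" from turning into "\r\r\n".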
The current version can now operate on such files and allows you to do the
following:
- Convert search string CR/LF to LF : (Change the CR/LF pairs of the search
string to lone LFs) This will allow you to operate on the above mentioned files
where newlines are represented by lone LFs rather than the proper CR/LF
combination.
- Convert file LF -> CR/LF : (Change ALL lone LFs of the file you are to
operate on to CR/LF pairs) This will reformat the above mentioned type of files
so that every lone LF is substituted by the correct pair. Then the processed
file will be properly formatted and readable even in simple text editors like
Notepad.
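The first option, converting the search string instead of the file, could look like the sketch below. Again this is only an assumption about how such a conversion can be written, NOT the CleanR code; crLfToLf and demoSearchStringConversion are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `in` where every "\r\n" pair is
// reduced to a lone '\n'. A '\r' that is not followed by '\n' is kept.
std::string crLfToLf(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r' && i + 1 < in.size() && in[i + 1] == '\n')
            continue;            // drop the CR; the LF is copied next time round
        out += in[i];
    }
    return out;
}

// Usage example: the typed search string only matches after conversion.
void demoSearchStringConversion()
{
    const std::string fileText = "<p>First line.\nSecond line.</p>";
    const std::string typed = "First line.\r\nSecond line.";

    assert(fileText.find(typed) == std::string::npos);            // NOT found
    assert(fileText.find(crLfToLf(typed)) != std::string::npos);  // found
}
```

Converting the search string leaves the file's own line breaks untouched, which may be preferable when the file has to go back to a system that expects lone LFs.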
Program performance and Limitations
Please note that in order to boost program performance and reduce
programming bugs, the WHOLE text file is read into RAM. This approach has been
detested by some, but appreciated by many, so I decided to stick with it.
However, this approach puts a constraint on the program - the file size on
which you can operate is limited NOT only by the RAM you have, but also by the
maximum default dynamic memory allocation that your OS and the MFC architecture
allow to a single program, AND by the program heap size allocation, which
changes from system to system and from time to time.
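For readers who want to try the "read everything into memory" approach outside MFC, here is a minimal sketch in standard C++. It is NOT how CleanR itself reads files; readWholeFile, writeWholeFile and the temporary file name are made up for this example. The files are opened in binary mode so that CR and LF bytes arrive untouched, and a file too large for the available memory would make the string allocation throw std::bad_alloc.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads the whole file at `path` into `out`.
// Returns false if the file cannot be opened.
bool readWholeFile(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}

// Hypothetical helper: writes `data` to `path`, replacing any existing file.
bool writeWholeFile(const std::string& path, const std::string& data)
{
    std::ofstream file(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!file) return false;
    file.write(data.data(), static_cast<std::streamsize>(data.size()));
    file.flush();
    return file.good();
}

// Usage example: round trip through a scratch file in the current directory.
void demoWholeFileRoundTrip()
{
    const std::string path = "cleanr_sketch_demo.tmp";   // made-up file name
    const std::string original = "First line.\r\nSecond line.\n";

    const bool written = writeWholeFile(path, original);
    std::string loaded;
    const bool read = readWholeFile(path, loaded);
    std::remove(path.c_str());

    assert(written && read);
    assert(loaded == original);   // bytes unchanged, line breaks included
    (void)written;
    (void)read;
}
```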
But as far as I am concerned, I do NOT have 100MB long text files to check my
program with - the largest I tried was a 5MB HTML file, and it worked fine. I am
trying to put GlobalAllocation functions to use, but this will take time -
anybody who can do it will do a GREAT favour to me, and to others using this
program.
I should also bring to your notice that though I have tested the program with
text files having more complex text than any real life text file can have,
I may still have missed something - if you come across a BUG, please NOTIFY ME
immediately with anything that can help me reproduce the BUG, and/or with
suggestions for improvement.
Using the code
I would like to bring to your notice the ScanR utility, which is also at
CodeProject, is produced by yours truly, and has put this engine to the test.
In fact, ScanR is a unique text file string search and replacement utility, and
as we handle MANY files together there, the CleanR engine has been optimized in
a few places - you may find it interesting to look there!
IMHO, ScanR is currently the BEST example demonstrating the power of this
class. You can get an idea of how fast this engine is, and with your help
it will surely get even better (I know that there are bugs - but where?
Gonna find out :) )
You can straightaway use the class without making ANY modifications (and that
will be better for tracking down bugs in the code)
Using the class is as easy as this:
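// The string values below are examples only; put your own in their place.
CCleanRboolSet theSet;
CCleanR theR(theSet);
theR.SetFileName(_T("Test.txt"));            // path of the file to process
theR.SetReplacementString(_T(""));           // replacement string (empty removes)
theR.SetSearchString(_T("text to find"));    // search string
theR.Process();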
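The steps involved are:
- Declare a CCleanRboolSet object (judging by its name, the set of boolean
options).
- Construct a CCleanR object, passing it that set.
- Set the file path, the replacement string and the search string.
- Call Process() to do the work.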
Points of Interest (mainly to me)
While coding this engine, I found out that CString really does a
GOOD thing - it changes ALL occurrences of "\n" to "\r\n", which may take more
space (2 more bytes for every '\n' replaced), but it's THE correct method for
text files. Some simple text editors store the Enter key as a plain
newline, and NOT as a carriage return - newline pair, which is the recommended
way of storing line breaks in text files.
As I have told you before, I am reading the WHOLE text file into memory, which
boosts program performance in a BIG, BIG way. I also assumed that the user is
not so productive as to create 100MB text files. But I am really curious to
know what might happen then. So far even a 15MB text file has caused NO
problem. To be frank, a 15MB text file is way TOO much reading for me ;)
Many of you may point out that using CreateFileMapping() might help - do send
in your views. I think that we will mostly be using this tool on small text
files, and file mapping will add way too much overhead. That's why I am
hesitant about file mapping.
History
- 14th June - Public open release.
- 23rd June - Added code for CR/LF inter-conversions.