Introduction
This article addresses the problem of determining when a document has been modified. In the particular case of word processors, it appears that most programs tend to have a 'dirty' flag that is set when a user types. This seem to be the case even when a letter is typed, and then backspaced over (essentially leaving the document unchanged).
This article will demonstrate an improved method using a deterministic algortihm. In addition, will demonstrate cryptographic programming using Wei Dai's Crypto++ library. The sample program is a Document/View SDI, with the View class derived from CEditView
.
Downloads
There are three downloads available with this project. They are presented at the end of the article.
Time-Space Considerations
O(n) is Computer Science notation for an upper bound on steps the computer will perform in the operation; T(n) is similar in that it is a time consideration. It allows one to analyze algorithms. For a discussion, see Wikipedia's entry on Big O Notation.
This article calculates a hash of the document (as required) to develop state when determining if a document has been modified. An alternate approach would be to hold a copy of the original or last saved document, and simply perform a deep compare.
The hash operation is O(n), where n is the document size. Depending on the hash, the time to hash the document T(n) is constant bound, or cT(n). c will generally be tightly coupled to the number of rounds in the transformation.
A deep compare (memcmp()
) of the previous and current document requires approximately twice the space, but only requires cT(n) time, where c = 1.
Implememtations which use a dirty flag are O(1) and T(1), but are not reliable.
CRC versus Hash
During recent reading, the author discovered an article by Dr. Newcomer entitled A Checksum Algortihm. Dr. Newcomer proposes the use of a checksum algorithm to determine the modified state of a document. The author feels this is not the correct application of a CRC since CRCs are generally constructed for data integrity in the domain of trasmission errors.
The following is taken from Wikipedia's Cyclic Redundancy Check, CRCs and Data Integrity:
While useful for error detection, CRCs cannot be safely relied upon to fully verify data integrity in the face of intelligent (rather than random) changes... This is because the CRC is a linear code, with the result that the set of bits that change in the CRC depends only on the set of bits that changed in the message, not on the values of those (or any other) bits.
CRC Error Detection
In general, CRCs suffer from the fact that since they are based on polynomial division, the CRC fails to correctly identify the following situations. For a more complete treatment, see Mathematics of CRCs:
- Leading 0's are omitted from the data stream
- Leading 0's are added to the data stream
- Trailing 0's are omitted from the data stream
In addition, Martin Stigge, Henryk Pl�tz, Wolf M�ller, and Jens-Peter Redlich have published Reversing CRC � Theory and Practice.
Collision Resistance
In crytopgraphy, collision resistance is defined as follows (taken from Handbook of Applied Cryptography by Menezes, van Oorschot, and Vanstone):
It is computationally infeasible to find any two distinct inputs x, x' which hash to the same output such that h(x) = h(x')
Alternate terminologys is offered as follows:
- preimage resistance = one way function
- 2nd preimage resistance = weak collision resistance
- collision resistance = strong collision resistance
Collision Resisitance in a hash is correlated to the Avalanche Effect. That is, if one changes one bit of the input, a hash should change approximately 50% of the bits of the output. Collision Resistance overcomes the short comings of CRC Error Detection.
The tables below display the resulting bit propogation based on Adler-32, CRC-32, and MD5 hashing of string data. Tables 1 through 3 evaluate the effects of changing 1 bit ('0' → '1'), Tables 4 to 6 evaluate 2 bits ('A' → 'B'). SHA hashes perform similar to MD5, demonstrating between 45% - 55% propogation. The tables were omitted for article clarity. For the puposes of bit propogation, 30% = 70%, 60% = 40%, etc - under propogating is the same deficiency as over propogating.
Below the reader will find an XOR table for convenience (based on Adler-32, AAAA and AAAB digest).
One Bit Changes
Adler-32
|
Digest
|
XOR Difference |
Change
|
0 1
|
00310031 00320032
|
00030003
|
4/32 (12%)
|
AAA0 AAA1
|
027D00F4 027E00F5
|
00030001
|
3/32 (9%)
|
AAAA AAA0 AAAA AAA1
|
0AC00218 0AC10219
|
00010001
|
2/32 (6%)
|
ABCD...WXYZ0 ABCD...WXYZ1
|
6CB60810 6CB70811
|
00010001
|
2/32 (6%)
|
|
Table 1: Adler-32, One Bit Change
|
CRC-32
|
Digest
|
XOR Difference
|
Change
|
0 1
|
21DFDBF4 B7EFDC83
|
96300777
|
8/32 (25%)
|
AAA0 AAA1
|
5B490FBC CD7908CB
|
96300777
|
8/32 (25%)
|
AAAA AAA0 AAAA AAA1
|
DEDF25B0 48EF22C7
|
96300777
|
8/32 (25%)
|
ABCD...WXYZ0 ABCD...WXYZ1
|
BD691021 2B591756
|
96300777
|
8/32 (25%)
|
|
Table 2: CRC32, One Bit Change
|
MD5
|
Digest
|
XOR Difference
|
Change
|
0 1
|
CFCD208495D565EF66E7DFF9F98764DA C4CA4238A0B923820DCC509A6F75849B
|
0B0762BC356C466D6B2B8F6396F2E041
|
62/128 (48%)
|
AAA0 AAA1
|
2FD6E97DEB1D167E3EB4E7C5C1237AC6 D43D15F17EB3ADA99F95EEBCE8578761
|
FBEBFC8C95AEBBD7A12109792974FDA7
|
74/128 (57%)
|
AAAA AAA0 AAAA AAA1
|
8B0DFB3C7DA7C12B2E87852CD62EAE6D 1C5E801C64137B76049838F8A0E55481
|
97537B2019B4BA5D2A1FBDD476CBFAEC
|
72/128 (56%)
|
ABCD...WXYZ0 ABCD...WXYZ1
|
323E24F03822EB3FBEC887A24B809CC7 837A1A56A7FD502174E4FC16191BFD7F
|
B1443EA69FDFBB1ECA2C7BB4529B61B8
|
70/128 (54%)
|
|
Table 3: MD5, One Bit Change
|
Two Bit Changes
Adler-32
|
Digest
|
XOR Difference |
Change
|
A B
|
00420042 00430043
|
00010001
|
2/32 (6%)
|
AAAA AAAB
|
028E0105 028F0106
|
00010003
|
3/32 (9%)
|
AAAA AAAA AAAA AAAB
|
0AD10229 0AD2022A
|
00030003
|
4/32 (12%)
|
ABCD...WXYZA ABCD...WXYZB
|
6CC70821 6CC80822
|
000F0003
|
6/32 (18%)
|
|
Table 4: Adler-32, Two Bit Changes
|
CRC-32
|
Digest
|
XOR Difference
|
Change
|
A B
|
8B9ED9D3 31CFD04A
|
BA510999
|
7/32 (21%)
|
AAAA AAAB
|
F1080D9B 4B590402
|
BA510999
|
7/32 (21%)
|
AAAA AAAA AAAA AAAB
|
749E2797 CECF2E0E
|
BA510999
|
7/32 (21%)
|
ABCD...WXYZA ABCD...WXYZB
|
17281206 AD791B9F
|
BA510999
|
7/32 (21%)
|
|
Table 5: CRC32, Two Bit Changes
|
MD5
|
Digest
|
XOR Difference
|
Change
|
A B
|
7FC56270E7A70FA81A5935B72EACBE29 9D5ED678FE57BCCA610140957AFAB571
|
E29BB40819F0B3627B58752254560B58
|
58/128 (45%)
|
AAAA AAAB
|
098890DDE069E9ABAD63F19A0D9E1F32 7D578BB564C511CA301F558528BDE644
|
74DF1B6884ACF8619D7CA41F2523F976
|
67/128 (52%)
|
AAAA AAAA AAAA AAAB
|
DE3865B13293D2C13DB8ABD82E919489 CCA2BE100C5AD50E7246048420D1ACEF
|
129ADBA13EC907CF4FFEAF5C0E403866
|
66/128 (51%)
|
ABCD...WXYZA ABCD...WXYZB
|
09F0A85821A6A879FB695AA5C0BA62C0 9489B09767A1E12FE501D05BA0AE2C50
|
9D7918CF460749561E688AFE60144E90
|
58/128 (45%)
|
|
Table 6: MD5, Two Bit Changes
|
The notation 'ABCD...WXYZ0' is actually 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0'. Similar is true for 'ABCD...WXYZ1', 'ABCD...WXYZA', and 'ABCD...WXYZB'.
SHA
The method chosen for this article is SHA or the Secure Hash Algorithm. SHA is a NIST-approved, collision resistant hash developed by the NSA for NIST. Other hashes exist, such as MD4, MD5, and RIPE. Some popular hashes, such as MD4 and MD5, are no longer considered to be cryptographically secure. However, any collision resistant hash should suit our needs as cryptographic security is not a requirement.
Compiling and Integrating Crypto++
Please see the related article, Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment. This article is based upon basic assumptions presented in the previosly mentioned article. It also addresses most problems encountered from Command Line to MFC Projects (Errors C1083, C1189, LNK2001, LNK2004, and LNK2005).
The DirtyPad Program
MSDN has quite a few text editor examples � Superpad, Multipad, and WordPad to name a few. This article presents DirtyPad � a CEditView-derived class. The benefits of using a CEditView class is that the CEdit base class function GetWindowText(...). One has easy and convenient access to the underlying string data for hashing.
Create a new MFC SDI application.
At Step 5, choose to use MFC as a statically linked library.
At the final screen of the Wizard (Step 6), choose to derive the View from CEditView.
In stdafx.h
, add the following:
#ifdef UNICODE
# pragma comment( linker, "/ENTRY:wWinMainCRTStartup" )
#endif
#ifdef _DEBUG
# pragma comment( lib, "cryptlibd" )
#else
# pragma comment( lib, "cryptlib" )
#endif
In dirtypadDoc.h
, add the following protected members:
BYTE m_pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
UINT m_ulLength;
DIGESTSIZE
is a constant defined in the SHA
class, which is part of the CryptoPP
namespace. In the case of SHA256, it is 256/8 = 32 bytes. If one choses a different hash, the digest may be a different size. m_pcbDigest
will be updated as outlined below.
As a side note, the author tries not to pollute the global names space with symbols. So, instead of issuing a 'using namespace CryptoPP', objects are brought in as required with the scope operator. See Migrating to Namespaces by Herb Sutter in the October 2000 issue of Dr. Dobbs Journal for an excellent treatment of the subject.
DirtyPad will create snapshots of the document by means of a hash at strategic times during the document's life by Overriding MFC Functions (i.e., responding to Messages sent by the Operating System).
OnOpenDocument()
Though a new document is empty, the program needs to hash the document. An empty document does not equate to a NULL hash.
BOOL CDirtypadDoc::OnNewDocument()
{
if (!CDocument::OnNewDocument())
return FALSE;
((CEditView*)m_viewList.GetHead())->SetWindowText(NULL);
CString szText;
GetText( szText );
CalculateDigest( szText, szText.GetLength(), m_pcbDigest );
m_ulLength = szText.GetLength();
return TRUE;
}
OnSaveDocument(�)
OnSaveDocument(�)
is a fairly trivial implementation: it calculates the hash of the document (saving it m_pcbDigest
), and then allows the CDocument
base class to perform its work
BOOL CDirtypadDoc::OnSaveDocument(LPCTSTR lpszPathName)
{
CString szText;
GetText( szText );
CalculateDigest( szText, szText.GetLength(), m_pcbDigest );
m_ulLength = szText.GetLength();
return CDocument::OnSaveDocument(lpszPathName);
}
SaveModified()
SaveModified()
is the fuction that will prove to be most interesting for this article. The function calculates a hash of the current document, and compares it with a previously saved hash.
BOOL CDirtypadDoc::SaveModified()
{
BYTE pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
CString szText;
if( FALSE == GetText( szText ) )
{ return CDocument::SaveModified(); }
if( m_ulLength != szText.GetLength() )
{ return CDocument::SaveModified(); }
if( FALSE == CalculateDigest( szText, szText.GetLength(), pcbDigest ) )
{ return CDocument::SaveModified(); }
if( 0 != memcmp( m_pcbDigest, pcbDigest, CryptoPP::SHA256::DIGESTSIZE ) )
{ return CDocument::SaveModified(); }
return TRUE;
}
DirtyPad will always defer to CDocument::SaveModified()
under the following two conditions:
- The underlying CEdit string data is not available (
GetText(�)
returned FALSE
)
- The previous and current document lengths are different
- The program fails at calculating the current document's hash
Should the above conditions not fail, the previous document's hash is compared to the current document's hash. If the hashes are the same, TRUE is returned to indicate DirtyPad handled the message. If the hashes are different, base class CDocument::SaveModified()
is called. The later case presents the dialog below.
Add the following protected member function to calculate the hash of the document:
BOOL CDirtypadDoc::CalculateHash( LPCTSTR pszText, UINT nLength,
BYTE *pcbDigest )
{ try {
CryptoPP::SHA256 hash;
hash.Update( (BYTE*)(LPCTSTR)pszText, nLength * sizeof(TCHAR) );
hash.Final( pcbDigest );
}
catch( ... )
{ bReturn = FALSE; }
return TRUE;
}
When hashing, if one finds the document has other data members (such as an INT) that should be included in the document's state for hashing, perform the following:
CryptoPP::SHA256 hash;
INT i = 1;
hash.Update( (UCHAR*)&i, sizeof(INT) );
One may be able to actually call CEdit::SetModify(). However, be aware that MSDN states the following (from CEditView::GetEditCtrl()
):
WARNING! Using the CEdit
object can change the state of the underlying Windows edit control. For example, you should not change the tab settings using the CEdit::SetTabStops
function because CEditView
caches these settings for use both in the edit control and in printing. Instead, use CEditView::SetTabStops
.
Document Sizes
The author's sampling of common word processor documents in a Corporate environment of 600 users revealed most were less than 50 Kb in size (though a 100 Kb file was not uncommon). In the case of Word documents, the statistic was 35 Kb with N ≈ 2800 and σ = 15 Kb. With that said, the following is offered for argumentative readers who like to offer 100 MB or 1.0 Gb files as an argument. Programming Applications for Microsoft Windows by Jeffrey Richter, Chapter 17, Memory Mapped Files should prove invaluable for working with a small subset of a large file. If one is modifying many parts of a large file, forgo hashing a write the entire file. At this point, the reader should consider redisigning his or her work to accomodate saving smaller file subsets.
Downloads
Revisions
- 11.07.2007 Added Second Preimage Hash Information
- 07.07.2007 Updated Collision Resistance Section
- 06.07.2007 Added reference to Reversing CRC � Theory and Practice
- 06.07.2007 Added CRC Error Detection
- 05.28.2007 Updated CRC vs Hash Section
- 01.01.2007 Visual Studio 8.0 Port
- 01.01.2007 Added m_lLength member
- 12.23.2006 Added Bit Propogation Tables
- 12.23.2006 Uploaded BitProp Source Code
- 12.19.2006 Added CRC vs Hash Section
- 12.19.2006 Added Collision Resistance Section
- 12.19.2006 Added File Checksums
- 11.16.2006 Grammar and Spelling Corrections
- 11.14.2006 Modified Crypto++ Content
- 11.14.2006 Added Reference to Richter's Work
- 09.19.2005 Initial Release
Checksums
- BitProp.zip
MD5: 9DB820350F29B730B12492629B490DD4
SHA-1: C907D5AC1013C3A9C2934D5FF5235CB8121AC461
- DirtyPad_demo.zip
MD5: F5AFEB014AE8A4A6EB023CEA95A29C6E
SHA-1: 770C45E7FE41924DDAE57369EF156B6D0BE6A1D8
- DirtyPad_src.zip
MD5: 62881C0BC7587D47BD80198A857CA140
SHA-1: B9577FB44C7AEA2026BF4FD900B658437843D3A4