(untagged)

A Deterministic Method of Determining a Document's Modified State

Jeffrey Walton

0.00/5 (No votes)

8 Nov 2007

Compare a Document's Previous and Current States using Hashing and Crypto++

Introduction

This article addresses the problem of determining when a document has been modified. In the particular case of word processors, it appears that most programs tend to have a 'dirty' flag that is set when a user types. This seem to be the case even when a letter is typed, and then backspaced over (essentially leaving the document unchanged).

This article will demonstrate an improved method using a deterministic algortihm. In addition, will demonstrate cryptographic programming using Wei Dai's Crypto++ library. The sample program is a Document/View SDI, with the View class derived from CEditView.

Downloads

There are three downloads available with this project. They are presented at the end of the article.

Time-Space Considerations

O(n) is Computer Science notation for an upper bound on steps the computer will perform in the operation; T(n) is similar in that it is a time consideration. It allows one to analyze algorithms. For a discussion, see Wikipedia's entry on Big O Notation.

This article calculates a hash of the document (as required) to develop state when determining if a document has been modified. An alternate approach would be to hold a copy of the original or last saved document, and simply perform a deep compare.

The hash operation is O(n), where n is the document size. Depending on the hash, the time to hash the document T(n) is constant bound, or cT(n). c will generally be tightly coupled to the number of rounds in the transformation.

A deep compare (memcmp()) of the previous and current document requires approximately twice the space, but only requires cT(n) time, where c = 1.

Implememtations which use a dirty flag are O(1) and T(1), but are not reliable.

CRC versus Hash

During recent reading, the author discovered an article by Dr. Newcomer entitled A Checksum Algortihm. Dr. Newcomer proposes the use of a checksum algorithm to determine the modified state of a document. The author feels this is not the correct application of a CRC since CRCs are generally constructed for data integrity in the domain of trasmission errors.

The following is taken from Wikipedia's Cyclic Redundancy Check, CRCs and Data Integrity:

While useful for error detection, CRCs cannot be safely relied upon to fully verify data integrity in the face of intelligent (rather than random) changes... This is because the CRC is a linear code, with the result that the set of bits that change in the CRC depends only on the set of bits that changed in the message, not on the values of those (or any other) bits.

CRC Error Detection

In general, CRCs suffer from the fact that since they are based on polynomial division, the CRC fails to correctly identify the following situations. For a more complete treatment, see Mathematics of CRCs:

Leading 0's are omitted from the data stream
Leading 0's are added to the data stream
Trailing 0's are omitted from the data stream

In addition, Martin Stigge, Henryk Pl�tz, Wolf M�ller, and Jens-Peter Redlich have published Reversing CRC � Theory and Practice.

Collision Resistance

In crytopgraphy, collision resistance is defined as follows (taken from Handbook of Applied Cryptography by Menezes, van Oorschot, and Vanstone):

It is computationally infeasible to find any two distinct inputs x, x' which hash to the same output such that h(x) = h(x')

Alternate terminologys is offered as follows:

preimage resistance = one way function
2^nd preimage resistance = weak collision resistance
collision resistance = strong collision resistance

Collision Resisitance in a hash is correlated to the Avalanche Effect. That is, if one changes one bit of the input, a hash should change approximately 50% of the bits of the output. Collision Resistance overcomes the short comings of CRC Error Detection.

The tables below display the resulting bit propogation based on Adler-32, CRC-32, and MD5 hashing of string data. Tables 1 through 3 evaluate the effects of changing 1 bit ('0' → '1'), Tables 4 to 6 evaluate 2 bits ('A' → 'B'). SHA hashes perform similar to MD5, demonstrating between 45% - 55% propogation. The tables were omitted for article clarity. For the puposes of bit propogation, 30% = 70%, 60% = 40%, etc - under propogating is the same deficiency as over propogating.

Below the reader will find an XOR table for convenience (based on Adler-32, AAAA and AAAB digest).

XOR Difference Table of Adler-32 Digest

One Bit Changes

Adler-32	Digest	XOR Difference	Change
0 1	00310031 00320032	00030003	4/32 (12%)
AAA0 AAA1	027D00F4 027E00F5	00030001	3/32 (9%)
AAAA AAA0 AAAA AAA1	0AC00218 0AC10219	00010001	2/32 (6%)
ABCD...WXYZ0 ABCD...WXYZ1	6CB60810 6CB70811	00010001	2/32 (6%)

Table 1: Adler-32, One Bit Change

CRC-32	Digest	XOR Difference	Change
0 1	21DFDBF4 B7EFDC83	96300777	8/32 (25%)
AAA0 AAA1	5B490FBC CD7908CB	96300777	8/32 (25%)
AAAA AAA0 AAAA AAA1	DEDF25B0 48EF22C7	96300777	8/32 (25%)
ABCD...WXYZ0 ABCD...WXYZ1	BD691021 2B591756	96300777	8/32 (25%)

Table 2: CRC32, One Bit Change

MD5	Digest	XOR Difference	Change
0 1	CFCD208495D565EF66E7DFF9F98764DA C4CA4238A0B923820DCC509A6F75849B	0B0762BC356C466D6B2B8F6396F2E041	62/128 (48%)
AAA0 AAA1	2FD6E97DEB1D167E3EB4E7C5C1237AC6 D43D15F17EB3ADA99F95EEBCE8578761	FBEBFC8C95AEBBD7A12109792974FDA7	74/128 (57%)
AAAA AAA0 AAAA AAA1	8B0DFB3C7DA7C12B2E87852CD62EAE6D 1C5E801C64137B76049838F8A0E55481	97537B2019B4BA5D2A1FBDD476CBFAEC	72/128 (56%)
ABCD...WXYZ0 ABCD...WXYZ1	323E24F03822EB3FBEC887A24B809CC7 837A1A56A7FD502174E4FC16191BFD7F	B1443EA69FDFBB1ECA2C7BB4529B61B8	70/128 (54%)

Table 3: MD5, One Bit Change

Two Bit Changes

Adler-32	Digest	XOR Difference	Change
A B	00420042 00430043	00010001	2/32 (6%)
AAAA AAAB	028E0105 028F0106	00010003	3/32 (9%)
AAAA AAAA AAAA AAAB	0AD10229 0AD2022A	00030003	4/32 (12%)
ABCD...WXYZA ABCD...WXYZB	6CC70821 6CC80822	000F0003	6/32 (18%)

Table 4: Adler-32, Two Bit Changes

CRC-32	Digest	XOR Difference	Change
A B	8B9ED9D3 31CFD04A	BA510999	7/32 (21%)
AAAA AAAB	F1080D9B 4B590402	BA510999	7/32 (21%)
AAAA AAAA AAAA AAAB	749E2797 CECF2E0E	BA510999	7/32 (21%)
ABCD...WXYZA ABCD...WXYZB	17281206 AD791B9F	BA510999	7/32 (21%)

Table 5: CRC32, Two Bit Changes

MD5	Digest	XOR Difference	Change
A B	7FC56270E7A70FA81A5935B72EACBE29 9D5ED678FE57BCCA610140957AFAB571	E29BB40819F0B3627B58752254560B58	58/128 (45%)
AAAA AAAB	098890DDE069E9ABAD63F19A0D9E1F32 7D578BB564C511CA301F558528BDE644	74DF1B6884ACF8619D7CA41F2523F976	67/128 (52%)
AAAA AAAA AAAA AAAB	DE3865B13293D2C13DB8ABD82E919489 CCA2BE100C5AD50E7246048420D1ACEF	129ADBA13EC907CF4FFEAF5C0E403866	66/128 (51%)
ABCD...WXYZA ABCD...WXYZB	09F0A85821A6A879FB695AA5C0BA62C0 9489B09767A1E12FE501D05BA0AE2C50	9D7918CF460749561E688AFE60144E90	58/128 (45%)

Table 6: MD5, Two Bit Changes

The notation 'ABCD...WXYZ0' is actually 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0'. Similar is true for 'ABCD...WXYZ1', 'ABCD...WXYZA', and 'ABCD...WXYZB'.

SHA

The method chosen for this article is SHA or the Secure Hash Algorithm. SHA is a NIST-approved, collision resistant hash developed by the NSA for NIST. Other hashes exist, such as MD4, MD5, and RIPE. Some popular hashes, such as MD4 and MD5, are no longer considered to be cryptographically secure. However, any collision resistant hash should suit our needs as cryptographic security is not a requirement.

Compiling and Integrating Crypto++

Please see the related article, Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment. This article is based upon basic assumptions presented in the previosly mentioned article. It also addresses most problems encountered from Command Line to MFC Projects (Errors C1083, C1189, LNK2001, LNK2004, and LNK2005).

The DirtyPad Program

MSDN has quite a few text editor examples � Superpad, Multipad, and WordPad to name a few. This article presents DirtyPad � a CEditView-derived class. The benefits of using a CEditView class is that the CEdit base class function GetWindowText(...). One has easy and convenient access to the underlying string data for hashing.

Create a new MFC SDI application.

At Step 5, choose to use MFC as a statically linked library.

At the final screen of the Wizard (Step 6), choose to derive the View from CEditView.

In stdafx.h, add the following:

#ifdef UNICODE
#  pragma comment( linker, "/ENTRY:wWinMainCRTStartup" )
#endif

#ifdef _DEBUG
#  pragma comment( lib, "cryptlibd" )
#else
#  pragma comment( lib, "cryptlib" )
#endif

In dirtypadDoc.h, add the following protected members:

BYTE m_pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
UINT m_ulLength;

DIGESTSIZE is a constant defined in the SHA class, which is part of the CryptoPP namespace. In the case of SHA256, it is 256/8 = 32 bytes. If one choses a different hash, the digest may be a different size. m_pcbDigest will be updated as outlined below.

As a side note, the author tries not to pollute the global names space with symbols. So, instead of issuing a 'using namespace CryptoPP', objects are brought in as required with the scope operator. See Migrating to Namespaces by Herb Sutter in the October 2000 issue of Dr. Dobbs Journal for an excellent treatment of the subject.

DirtyPad will create snapshots of the document by means of a hash at strategic times during the document's life by Overriding MFC Functions (i.e., responding to Messages sent by the Operating System).

OnOpenDocument()

Though a new document is empty, the program needs to hash the document. An empty document does not equate to a NULL hash.

BOOL CDirtypadDoc::OnNewDocument()
{
   if (!CDocument::OnNewDocument())
      return FALSE;

   ((CEditView*)m_viewList.GetHead())->SetWindowText(NULL);

   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );
   m_ulLength = szText.GetLength();

   return TRUE;
}

OnSaveDocument(�)

OnSaveDocument(�) is a fairly trivial implementation: it calculates the hash of the document (saving it m_pcbDigest), and then allows the CDocument base class to perform its work

BOOL CDirtypadDoc::OnSaveDocument(LPCTSTR lpszPathName)
{
   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );
   m_ulLength = szText.GetLength();

   return CDocument::OnSaveDocument(lpszPathName);
}

SaveModified()

SaveModified() is the fuction that will prove to be most interesting for this article. The function calculates a hash of the current document, and compares it with a previously saved hash.

BOOL CDirtypadDoc::SaveModified()
{  
   BYTE pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
   CString szText;

   if( FALSE == GetText( szText ) )
      { return CDocument::SaveModified(); }

   // The document must have changed if

   //   the lengths are different

   if( m_ulLength != szText.GetLength() )
       { return CDocument::SaveModified(); }

   // Calculate the current hash

   if( FALSE == CalculateDigest( szText, szText.GetLength(), pcbDigest ) )
      { return CDocument::SaveModified(); }

   // Compare the current hash with the previous hash

   if( 0 != memcmp( m_pcbDigest, pcbDigest, CryptoPP::SHA256::DIGESTSIZE ) )
      { return CDocument::SaveModified(); }

   // Any non-zero will do

   return TRUE;
}

DirtyPad will always defer to CDocument::SaveModified() under the following two conditions:

The underlying CEdit string data is not available (GetText(�) returned FALSE)
The previous and current document lengths are different
The program fails at calculating the current document's hash

Should the above conditions not fail, the previous document's hash is compared to the current document's hash. If the hashes are the same, TRUE is returned to indicate DirtyPad handled the message. If the hashes are different, base class CDocument::SaveModified() is called. The later case presents the dialog below.

Add the following protected member function to calculate the hash of the document:

BOOL CDirtypadDoc::CalculateHash( LPCTSTR pszText, UINT nLength,
                                  BYTE *pcbDigest )
{    try {

      // nLength is Character Count (not BYTE count)

      CryptoPP::SHA256 hash;
      hash.Update( (BYTE*)(LPCTSTR)pszText, nLength * sizeof(TCHAR) );
      hash.Final( pcbDigest );
   }

   catch( ... )
      { bReturn = FALSE; }

   return TRUE;
}

When hashing, if one finds the document has other data members (such as an INT) that should be included in the document's state for hashing, perform the following:

CryptoPP::SHA256 hash;
INT i = 1;
hash.Update( (UCHAR*)&i, sizeof(INT) );

One may be able to actually call CEdit::SetModify(). However, be aware that MSDN states the following (from CEditView::GetEditCtrl() ):

WARNING! Using the CEdit object can change the state of the underlying Windows edit control. For example, you should not change the tab settings using the CEdit::SetTabStops function because CEditView caches these settings for use both in the edit control and in printing. Instead, use CEditView::SetTabStops.

Document Sizes

The author's sampling of common word processor documents in a Corporate environment of 600 users revealed most were less than 50 Kb in size (though a 100 Kb file was not uncommon). In the case of Word documents, the statistic was 35 Kb with N ≈ 2800 and σ = 15 Kb. With that said, the following is offered for argumentative readers who like to offer 100 MB or 1.0 Gb files as an argument. Programming Applications for Microsoft Windows by Jeffrey Richter, Chapter 17, Memory Mapped Files should prove invaluable for working with a small subset of a large file. If one is modifying many parts of a large file, forgo hashing a write the entire file. At this point, the reader should consider redisigning his or her work to accomodate saving smaller file subsets.

Downloads

Revisions

11.07.2007 Added Second Preimage Hash Information
07.07.2007 Updated Collision Resistance Section
06.07.2007 Added reference to Reversing CRC � Theory and Practice
06.07.2007 Added CRC Error Detection
05.28.2007 Updated CRC vs Hash Section
01.01.2007 Visual Studio 8.0 Port
01.01.2007 Added m_lLength member
12.23.2006 Added Bit Propogation Tables
12.23.2006 Uploaded BitProp Source Code
12.19.2006 Added CRC vs Hash Section
12.19.2006 Added Collision Resistance Section
12.19.2006 Added File Checksums
11.16.2006 Grammar and Spelling Corrections
11.14.2006 Modified Crypto++ Content
11.14.2006 Added Reference to Richter's Work
09.19.2005 Initial Release

Checksums

BitProp.zip
MD5: 9DB820350F29B730B12492629B490DD4
SHA-1: C907D5AC1013C3A9C2934D5FF5235CB8121AC461
DirtyPad_demo.zip
MD5: F5AFEB014AE8A4A6EB023CEA95A29C6E
SHA-1: 770C45E7FE41924DDAE57369EF156B6D0BE6A1D8
DirtyPad_src.zip
MD5: 62881C0BC7587D47BD80198A857CA140
SHA-1: B9577FB44C7AEA2026BF4FD900B658437843D3A4

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here