Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

A Deterministic Method of Determining a Document's Modified State

0.00/5 (No votes)
8 Nov 2007 1  
Compare a Document's Previous and Current States using Hashing and Crypto++

Introduction

This article addresses the problem of determining when a document has been modified. In the particular case of word processors, it appears that most programs tend to have a 'dirty' flag that is set when a user types. This seem to be the case even when a letter is typed, and then backspaced over (essentially leaving the document unchanged).

This article will demonstrate an improved method using a deterministic algortihm. In addition, will demonstrate cryptographic programming using Wei Dai's Crypto++ library. The sample program is a Document/View SDI, with the View class derived from CEditView.

Downloads

There are three downloads available with this project. They are presented at the end of the article.

Time-Space Considerations

O(n) is Computer Science notation for an upper bound on steps the computer will perform in the operation; T(n) is similar in that it is a time consideration. It allows one to analyze algorithms. For a discussion, see Wikipedia's entry on Big O Notation.

This article calculates a hash of the document (as required) to develop state when determining if a document has been modified. An alternate approach would be to hold a copy of the original or last saved document, and simply perform a deep compare.

The hash operation is O(n), where n is the document size. Depending on the hash, the time to hash the document T(n) is constant bound, or cT(n). c will generally be tightly coupled to the number of rounds in the transformation.

A deep compare (memcmp()) of the previous and current document requires approximately twice the space, but only requires cT(n) time, where c = 1.

Implememtations which use a dirty flag are O(1) and T(1), but are not reliable.

CRC versus Hash

During recent reading, the author discovered an article by Dr. Newcomer entitled A Checksum Algortihm. Dr. Newcomer proposes the use of a checksum algorithm to determine the modified state of a document. The author feels this is not the correct application of a CRC since CRCs are generally constructed for data integrity in the domain of trasmission errors.

The following is taken from Wikipedia's Cyclic Redundancy Check, CRCs and Data Integrity:

While useful for error detection, CRCs cannot be safely relied upon to fully verify data integrity in the face of intelligent (rather than random) changes... This is because the CRC is a linear code, with the result that the set of bits that change in the CRC depends only on the set of bits that changed in the message, not on the values of those (or any other) bits.

CRC Error Detection

In general, CRCs suffer from the fact that since they are based on polynomial division, the CRC fails to correctly identify the following situations. For a more complete treatment, see Mathematics of CRCs:

  • Leading 0's are omitted from the data stream
  • Leading 0's are added to the data stream
  • Trailing 0's are omitted from the data stream

In addition, Martin Stigge, Henryk Pl�tz, Wolf M�ller, and Jens-Peter Redlich have published Reversing CRC � Theory and Practice.

Collision Resistance

In crytopgraphy, collision resistance is defined as follows (taken from Handbook of Applied Cryptography by Menezes, van Oorschot, and Vanstone):

It is computationally infeasible to find any two distinct inputs x, x' which hash to the same output such that h(x) = h(x')

Alternate terminologys is offered as follows:

  • preimage resistance = one way function
  • 2nd preimage resistance = weak collision resistance
  • collision resistance = strong collision resistance

Collision Resisitance in a hash is correlated to the Avalanche Effect. That is, if one changes one bit of the input, a hash should change approximately 50% of the bits of the output. Collision Resistance overcomes the short comings of CRC Error Detection.

The tables below display the resulting bit propogation based on Adler-32, CRC-32, and MD5 hashing of string data. Tables 1 through 3 evaluate the effects of changing 1 bit ('0' → '1'), Tables 4 to 6 evaluate 2 bits ('A' → 'B'). SHA hashes perform similar to MD5, demonstrating between 45% - 55% propogation. The tables were omitted for article clarity. For the puposes of bit propogation, 30% = 70%, 60% = 40%, etc - under propogating is the same deficiency as over propogating.

Below the reader will find an XOR table for convenience (based on Adler-32, AAAA and AAAB digest).

XOR Difference Table of Adler-32 Digest

One Bit Changes

Adler-32

Digest

XOR Difference

Change

0
1

00310031
00320032

00030003

4/32 (12%)

AAA0
AAA1

027D00F4
027E00F5

00030001

3/32 (9%)

AAAA AAA0
AAAA AAA1

0AC00218
0AC10219

00010001

2/32 (6%)

ABCD...WXYZ0
ABCD...WXYZ1

6CB60810
6CB70811

00010001

2/32 (6%)

Table 1: Adler-32, One Bit Change

CRC-32

Digest

XOR Difference

Change

0
1

21DFDBF4
B7EFDC83

96300777

8/32 (25%)

AAA0
AAA1

5B490FBC
CD7908CB

96300777

8/32 (25%)

AAAA AAA0
AAAA AAA1

DEDF25B0
48EF22C7

96300777

8/32 (25%)

ABCD...WXYZ0
ABCD...WXYZ1

BD691021
2B591756

96300777

8/32 (25%)

Table 2: CRC32, One Bit Change

MD5

Digest

XOR Difference

Change

0
1

CFCD208495D565EF66E7DFF9F98764DA
C4CA4238A0B923820DCC509A6F75849B

0B0762BC356C466D6B2B8F6396F2E041

62/128 (48%)

AAA0
AAA1

2FD6E97DEB1D167E3EB4E7C5C1237AC6
D43D15F17EB3ADA99F95EEBCE8578761

FBEBFC8C95AEBBD7A12109792974FDA7

74/128 (57%)

AAAA AAA0
AAAA AAA1

8B0DFB3C7DA7C12B2E87852CD62EAE6D
1C5E801C64137B76049838F8A0E55481

97537B2019B4BA5D2A1FBDD476CBFAEC

72/128 (56%)

ABCD...WXYZ0
ABCD...WXYZ1

323E24F03822EB3FBEC887A24B809CC7
837A1A56A7FD502174E4FC16191BFD7F

B1443EA69FDFBB1ECA2C7BB4529B61B8

70/128 (54%)

Table 3: MD5, One Bit Change

Two Bit Changes

Adler-32

Digest

XOR Difference

Change

A
B

00420042
00430043

00010001

2/32 (6%)

AAAA
AAAB

028E0105
028F0106

00010003

3/32 (9%)

AAAA AAAA
AAAA AAAB

0AD10229
0AD2022A

00030003

4/32 (12%)

ABCD...WXYZA
ABCD...WXYZB

6CC70821
6CC80822

000F0003

6/32 (18%)

Table 4: Adler-32, Two Bit Changes

CRC-32

Digest

XOR Difference

Change

A
B

8B9ED9D3
31CFD04A

BA510999

7/32 (21%)

AAAA
AAAB

F1080D9B
4B590402

BA510999

7/32 (21%)

AAAA AAAA
AAAA AAAB

749E2797
CECF2E0E

BA510999

7/32 (21%)

ABCD...WXYZA
ABCD...WXYZB

17281206
AD791B9F

BA510999

7/32 (21%)

Table 5: CRC32, Two Bit Changes

MD5

Digest

XOR Difference

Change

A
B

7FC56270E7A70FA81A5935B72EACBE29
9D5ED678FE57BCCA610140957AFAB571

E29BB40819F0B3627B58752254560B58

58/128 (45%)

AAAA
AAAB

098890DDE069E9ABAD63F19A0D9E1F32
7D578BB564C511CA301F558528BDE644

74DF1B6884ACF8619D7CA41F2523F976

67/128 (52%)

AAAA AAAA
AAAA AAAB

DE3865B13293D2C13DB8ABD82E919489
CCA2BE100C5AD50E7246048420D1ACEF

129ADBA13EC907CF4FFEAF5C0E403866

66/128 (51%)

ABCD...WXYZA
ABCD...WXYZB

09F0A85821A6A879FB695AA5C0BA62C0
9489B09767A1E12FE501D05BA0AE2C50

9D7918CF460749561E688AFE60144E90

58/128 (45%)

Table 6: MD5, Two Bit Changes

The notation 'ABCD...WXYZ0' is actually 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0'. Similar is true for 'ABCD...WXYZ1', 'ABCD...WXYZA', and 'ABCD...WXYZB'.

SHA

The method chosen for this article is SHA or the Secure Hash Algorithm. SHA is a NIST-approved, collision resistant hash developed by the NSA for NIST. Other hashes exist, such as MD4, MD5, and RIPE. Some popular hashes, such as MD4 and MD5, are no longer considered to be cryptographically secure. However, any collision resistant hash should suit our needs as cryptographic security is not a requirement.

Compiling and Integrating Crypto++

Please see the related article, Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment. This article is based upon basic assumptions presented in the previosly mentioned article. It also addresses most problems encountered from Command Line to MFC Projects (Errors C1083, C1189, LNK2001, LNK2004, and LNK2005).

The DirtyPad Program

MSDN has quite a few text editor examples � Superpad, Multipad, and WordPad to name a few. This article presents DirtyPad � a CEditView-derived class. The benefits of using a CEditView class is that the CEdit base class function GetWindowText(...). One has easy and convenient access to the underlying string data for hashing.

Create a new MFC SDI application.

At Step 5, choose to use MFC as a statically linked library.

At the final screen of the Wizard (Step 6), choose to derive the View from CEditView.

In stdafx.h, add the following:

#ifdef UNICODE
#  pragma comment( linker, "/ENTRY:wWinMainCRTStartup" )
#endif

#ifdef _DEBUG
#  pragma comment( lib, "cryptlibd" )
#else
#  pragma comment( lib, "cryptlib" )
#endif

In dirtypadDoc.h, add the following protected members:

BYTE m_pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
UINT m_ulLength;

DIGESTSIZE is a constant defined in the SHA class, which is part of the CryptoPP namespace. In the case of SHA256, it is 256/8 = 32 bytes. If one choses a different hash, the digest may be a different size. m_pcbDigest will be updated as outlined below.

As a side note, the author tries not to pollute the global names space with symbols. So, instead of issuing a 'using namespace CryptoPP', objects are brought in as required with the scope operator. See Migrating to Namespaces by Herb Sutter in the October 2000 issue of Dr. Dobbs Journal for an excellent treatment of the subject.

DirtyPad will create snapshots of the document by means of a hash at strategic times during the document's life by Overriding MFC Functions (i.e., responding to Messages sent by the Operating System).

OnOpenDocument()

Though a new document is empty, the program needs to hash the document. An empty document does not equate to a NULL hash.

BOOL CDirtypadDoc::OnNewDocument()
{
   if (!CDocument::OnNewDocument())
      return FALSE;

   ((CEditView*)m_viewList.GetHead())->SetWindowText(NULL);

   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );
   m_ulLength = szText.GetLength();

   return TRUE;
}

OnSaveDocument(�)

OnSaveDocument(�) is a fairly trivial implementation: it calculates the hash of the document (saving it m_pcbDigest), and then allows the CDocument base class to perform its work

BOOL CDirtypadDoc::OnSaveDocument(LPCTSTR lpszPathName)
{
   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );
   m_ulLength = szText.GetLength();

   return CDocument::OnSaveDocument(lpszPathName);
}

SaveModified()

SaveModified() is the fuction that will prove to be most interesting for this article. The function calculates a hash of the current document, and compares it with a previously saved hash.

BOOL CDirtypadDoc::SaveModified()
{  
   BYTE pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
   CString szText;

   if( FALSE == GetText( szText ) )
      { return CDocument::SaveModified(); }

   // The document must have changed if

   //   the lengths are different

   if( m_ulLength != szText.GetLength() )
       { return CDocument::SaveModified(); }

   // Calculate the current hash

   if( FALSE == CalculateDigest( szText, szText.GetLength(), pcbDigest ) )
      { return CDocument::SaveModified(); }

   // Compare the current hash with the previous hash

   if( 0 != memcmp( m_pcbDigest, pcbDigest, CryptoPP::SHA256::DIGESTSIZE ) )
      { return CDocument::SaveModified(); }

   // Any non-zero will do

   return TRUE;
}

DirtyPad will always defer to CDocument::SaveModified() under the following two conditions:

  • The underlying CEdit string data is not available (GetText(�) returned FALSE)
  • The previous and current document lengths are different
  • The program fails at calculating the current document's hash

Should the above conditions not fail, the previous document's hash is compared to the current document's hash. If the hashes are the same, TRUE is returned to indicate DirtyPad handled the message. If the hashes are different, base class CDocument::SaveModified() is called. The later case presents the dialog below.

Add the following protected member function to calculate the hash of the document:

BOOL CDirtypadDoc::CalculateHash( LPCTSTR pszText, UINT nLength,
                                  BYTE *pcbDigest )
{    try {

      // nLength is Character Count (not BYTE count)

      CryptoPP::SHA256 hash;
      hash.Update( (BYTE*)(LPCTSTR)pszText, nLength * sizeof(TCHAR) );
      hash.Final( pcbDigest );
   }

   catch( ... )
      { bReturn = FALSE; }

   return TRUE;
}

When hashing, if one finds the document has other data members (such as an INT) that should be included in the document's state for hashing, perform the following:

CryptoPP::SHA256 hash;
INT i = 1;
hash.Update( (UCHAR*)&i, sizeof(INT) );

One may be able to actually call CEdit::SetModify(). However, be aware that MSDN states the following (from CEditView::GetEditCtrl() ):

WARNING! Using the CEdit object can change the state of the underlying Windows edit control. For example, you should not change the tab settings using the CEdit::SetTabStops function because CEditView caches these settings for use both in the edit control and in printing. Instead, use CEditView::SetTabStops.

Document Sizes

The author's sampling of common word processor documents in a Corporate environment of 600 users revealed most were less than 50 Kb in size (though a 100 Kb file was not uncommon). In the case of Word documents, the statistic was 35 Kb with N ≈ 2800 and σ = 15 Kb. With that said, the following is offered for argumentative readers who like to offer 100 MB or 1.0 Gb files as an argument. Programming Applications for Microsoft Windows by Jeffrey Richter, Chapter 17, Memory Mapped Files should prove invaluable for working with a small subset of a large file. If one is modifying many parts of a large file, forgo hashing a write the entire file. At this point, the reader should consider redisigning his or her work to accomodate saving smaller file subsets.

Downloads

Revisions

  • 11.07.2007 Added Second Preimage Hash Information
  • 07.07.2007 Updated Collision Resistance Section
  • 06.07.2007 Added reference to Reversing CRC � Theory and Practice
  • 06.07.2007 Added CRC Error Detection
  • 05.28.2007 Updated CRC vs Hash Section
  • 01.01.2007 Visual Studio 8.0 Port
  • 01.01.2007 Added m_lLength member
  • 12.23.2006 Added Bit Propogation Tables
  • 12.23.2006 Uploaded BitProp Source Code
  • 12.19.2006 Added CRC vs Hash Section
  • 12.19.2006 Added Collision Resistance Section
  • 12.19.2006 Added File Checksums
  • 11.16.2006 Grammar and Spelling Corrections
  • 11.14.2006 Modified Crypto++ Content
  • 11.14.2006 Added Reference to Richter's Work
  • 09.19.2005 Initial Release

Checksums

  • BitProp.zip
    MD5: 9DB820350F29B730B12492629B490DD4
    SHA-1: C907D5AC1013C3A9C2934D5FF5235CB8121AC461
  • DirtyPad_demo.zip
    MD5: F5AFEB014AE8A4A6EB023CEA95A29C6E
    SHA-1: 770C45E7FE41924DDAE57369EF156B6D0BE6A1D8
  • DirtyPad_src.zip
    MD5: 62881C0BC7587D47BD80198A857CA140
    SHA-1: B9577FB44C7AEA2026BF4FD900B658437843D3A4

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here