An NTFS Parser Lib

30 May 2010, BSD License
A C++ library to help in parsing an NTFS volume, file record and attributes.

Introduction

This is a library to help in parsing an NTFS volume, as well as file records and attributes. The readers are assumed to have deep knowledge about NTFS and C++ programming.

I will not introduce NTFS concepts here, as any introduction would be either enormous or nothing at all. Search for the best available documentation about NTFS instead.

Being an OS fan, I was ashamed to know so little about the file system. Every time I read an OS related book, I was at a loss in the chapter "File System". The contents were either too concise for a deep understanding, or too tedious to keep reading. So I decided to write some short programs to find out what was going on in my hard disk. I picked NTFS as it's the file system in my box, and almost everyone says it's a good design, or at least not a bad one.

At first, it was quite painful as there was very little documentation available. Microsoft didn't make its so called "New Technology File System" public. Only pieces of information could be found over the web. After studying the collected documents for some days, the cloud over my head scattered gradually. After some successful testing, I thought it was okay to write a library to facilitate NTFS parsing, also to deepen my knowledge.

Windows NT tries to construct an object oriented Operating System. At the very beginning, I hesitated in choosing whether to use C++ classes or traditional C procedures to fulfill the task. As an important part of the OS, it should be efficient and compact, as well as have scalability and manageability. The OS kernel must be written in C. But I'm writing a user land library, and after studying NTFS data structures thoroughly and carefully, I decided to use C++ classes to encapsulate them.

NTFS is an advanced journaling file system which fits the needs from home PCs to data servers. I haven't implemented all of its features. The following parts are not supported yet:

  1. Journaling
  2. Security
  3. Encryption and compression
  4. Some other advanced features

Demo projects

1. ntfsundel

Its purpose is to search and recover deleted files.

It seems a hard job, but it took me less than an hour to implement it by using this library, and much of that time was spent adjusting the dialog interface. Of course, this is a simple test program rather than a commercial product. I didn't check whether the freed clusters had been overwritten by another file (this is one reason why commercial tools take such a long time when analyzing a big volume).

NTFSParseLib/ntfsundel.JPG

2. ntfsdump

Dumps the first 16K of a file. As this library reads data directly from disk sectors, we can bypass the OS protection and peek at normally inaccessible files, such as those located in "Windows\System32\config".

NTFSParseLib/ntfsdump.JPG

3. ntfsdir

List sub files and directories.

NTFSParseLib/ntfsdir.JPG

4. ntfsattr

List attributes of a file or a directory.

NTFSParseLib/ntfsattr.JPG

Source code

1. Source files

The source contains five .h files. I prefer coding directly in include files when programming C++ because it eases the deployment a lot, and looks cool too. Just include the .h file and everything is done, without the need to add .cpp files to the project. The library is part of your own source, and an unreferenced library source code is silently discarded by the compiler. Of course, it will be difficult to implement a large system this way, when classes reference each other. I don't know how Microsoft ATL achieves this goal.

1. NTFS.h

Include this file in your source. No other includes are needed.

2. NTFS_DataType.h

NTFS common data structures and data type definitions. No classes, only structures.

3. NTFS_Common.h

NTFS data structures and data type definitions specific to this library. It also contains a single list implementation, CSList, to help in managing objects of the same type.

4. NTFS_FileRecord.h

NTFS volume and file record classes definition and implementation.

5. NTFS_Attribute.h

NTFS attributes classes and helper classes definition and implementation.

2. Coding

Having been an embedded system designer for about ten years, I am accustomed to limited system resources and to squeezing the full capacity out of hardware (think about implementing an IP stack on an 8 bit CPU running at 2MIPS with only 256 bytes of RAM). On a PC nowadays, RAM and CPU speed are not problems anymore, but I still keep the habit of writing compact code which runs as efficiently and fast as possible.

To achieve this goal, many data buffers are shared between different objects in this library. To fulfill the different tasks, playing tricks with a pointer is a must, though dangerous. C++ helps us in memory management by introducing a constructor and a destructor, as well as a copy constructor, but that's not enough. Otherwise, there won't be the so called "Smart Pointer" which is just a C++ style trick about a pointer (of course, if you are not "smart" enough, it will lead to "smart" errors that are hard to discover).

I am trying to make this library more useful than a simple test. The source code and demo projects are developed in VC6.0 SP6, and can also be compiled in VC10.0. The binaries are tested in Windows XP SP3 and Windows 7. I have put in many tracing messages, which will be shown in the Output window of Visual Studio to help debugging. The library is Unicode compatible, and can be compiled into ANSI or Unicode binaries. Define _UNICODE to make a Unicode build. Like the NT kernel, NTFS uses Unicode to store file names, so a Unicode build avoids name conversions and will run faster than an ANSI one. All passed or returned pointers and references which should not be modified by the target are declared "const". The compiler will warn us if we try to modify these buffers or objects (but I break my own rule time and again by typecasting them to non-constant pointers). I have also added validation code to guard against bad parameters and incorrect data. You cannot be too careful when handling disk volumes.

This library reads disk sectors frequently, so it maintains some buffers to speed up data access. Though the OS already helps us with the disk cache, a user land buffer is a plus.

As it directly accesses the disk sectors, you must have administrator privileges to run the demo projects. In Windows 7, being a member of the administrators group is not enough; an elevated privilege is required. You should be the user "Administrator" or get the elevated privilege to successfully open a volume. This library accesses the disk in read-only mode; it should be safe and will not harm your disk volume. Use it at your own risk.

NTFS volume and file record classes

1. CNTFSVolume

This class encapsulates a single NTFS volume.

  1. CNTFSVolume(_TCHAR volume)

    volume is the volume name, e.g. 'C' or 'D'. This is the only constructor. It does the following:

    1. Opens the volume in read-only mode, and gets a handle to directly access the disk's physical sectors.
    2. Reads the BPB, does some verification, and stores the needed information.
    3. Parses the NTFS metafile $Volume, reads and verifies the NTFS version.
    4. Parses the NTFS metafile $MFT, gets its $DATA attribute to locate other file records in a fragmented $MFT. NTFS tries to keep the file records contiguous by reserving some buffer after $MFT. But on my eight-year-old notebook, $MFT is fragmented into three parts in the system volume.

  2. BOOL IsVolumeOK() const

    The user should call this function immediately after the constructor to verify everything is OK. If this function returns FALSE, no other processing should be done.

  3. ULONGLONG GetRecordsCount() const

    Returns the count of file records in this volume. It's not the sum of all the current files and directories, as deleted files may still occupy record slots.

  4. DWORD GetSectorSize() const

    Size of the disk's physical sector in bytes. Normally 512. Read from the BPB.

  5. DWORD GetFileRecordSize() const

    Size of a single file record in bytes. Normally 1024. Read from the BPB.

  6. DWORD GetIndexBlockSize() const

    Size of an index block in bytes. Normally 4096. Read from the BPB.

  7. ULONGLONG GetMFTAddr() const

    Relative start address of the $MFT metafile. Read from the BPB.

  8. BOOL InstallAttrRawCB(DWORD attrType, ATTR_RAW_CALLBACK cb)
    • attrType: Attribute type.
    • cb: Callback function.

    Return value: TRUE on success. FALSE when attrType is not a valid attribute type.

    Installs a volume scope callback function to be called once a specific attribute is found. Can be used to peek at the raw attribute stream before it is processed.

  9. void ClearAttrRawCB()

    Removes all volume scope callback functions.
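
The "Read from the BPB" values come from fixed offsets in the volume's boot sector. The following is only a minimal sketch of that decoding, assuming the commonly documented NTFS boot sector layout (bytes per sector at 0x0B, sectors per cluster at 0x0D, $MFT cluster number at 0x30, file record size at 0x40, index block size at 0x44) and a plain sectors-per-cluster count. BpbInfo, ReadLE, DecodeSize and ParseBpb are hypothetical names, not part of this library, and mftAddr is expressed here as a byte offset by assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical holder for the kind of values CNTFSVolume exposes.
struct BpbInfo {
    uint32_t sectorSize;
    uint32_t clusterSize;
    uint32_t fileRecordSize;
    uint32_t indexBlockSize;
    uint64_t mftAddr;   // assumed: byte offset of $MFT from the volume start
};

// Little-endian read of an n-byte unsigned integer.
static uint64_t ReadLE(const uint8_t *p, int n) {
    uint64_t v = 0;
    for (int i = n - 1; i >= 0; --i) v = (v << 8) | p[i];
    return v;
}

// A positive value counts clusters; a negative value -n means 2^n bytes.
static uint32_t DecodeSize(int8_t raw, uint32_t clusterSize) {
    if (raw > 0) return static_cast<uint32_t>(raw) * clusterSize;
    const int shift = -raw;
    return shift < 32 ? (1u << shift) : 0;
}

// boot must point to at least 512 bytes read from sector 0 of the volume.
bool ParseBpb(const uint8_t *boot, BpbInfo &out) {
    assert(boot != nullptr);
    if (std::memcmp(boot + 3, "NTFS    ", 8) != 0) return false;  // OEM ID check
    out.sectorSize = static_cast<uint32_t>(ReadLE(boot + 0x0B, 2));
    out.clusterSize = out.sectorSize * boot[0x0D];
    out.mftAddr = ReadLE(boot + 0x30, 8) * out.clusterSize;
    out.fileRecordSize = DecodeSize(static_cast<int8_t>(boot[0x40]), out.clusterSize);
    out.indexBlockSize = DecodeSize(static_cast<int8_t>(boot[0x44]), out.clusterSize);
    return true;
}
```

With 512-byte sectors and 8 sectors per cluster, a raw file record size byte of 0xF6 (-10) decodes to 1024 bytes, and a raw index block size of 1 decodes to one cluster, which matches the "normal" values listed above.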

2. CFileRecord

Parses a single file record. It's the most important class. NTFS treats almost everything as files, even the boot sector.

  1. CFileRecord(const CNTFSVolume *volume)

    volume represents the volume this file record belongs to.

  2. BOOL ParseFileRecord(ULONGLONG fileRef)

    fileRef is the file reference of the file to be parsed.

    Return value: TRUE on success, otherwise FALSE. When this function fails, no further processing should be done.

    This function reads the file record from the disk, then verifies and patches the update sequence numbers. The user can parse as many files as needed, one by one. The previously parsed data will be freed.

  3. BOOL ParseAttrs()

    Parses the selected attributes (chosen by the SetAttrMask() routine) of a file record. It is the biggest and most time consuming routine in the lib. All selected attributes are parsed into the corresponding C++ objects and inserted into a separate list by their type.

  4. BOOL InstallAttrRawCB(DWORD attrType, ATTR_RAW_CALLBACK cb)
    • attrType: Attribute type.
    • cb: Callback function.

    Return value: TRUE on success. FALSE when attrType is not valid.

    Installs a file record scope callback function to be called once a specific attribute is found. Can be used to peek at the raw attribute stream before it is processed.

    When ParseAttrs() finds an attribute, it first looks in CFileRecord for an installed callback function and calls it. If nothing is found, it continues searching the callback functions installed in the CNTFSVolume object this file record belongs to.

  5. void ClearAttrRawCB()

    Removes all file record scope callback functions.

  6. void SetAttrMask(DWORD mask)

    mask has the attributes to parse. Defined in NTFS_Common.h as MASK_???.

    The user can pick the attributes to parse and discard the unwanted ones to save time and RAM. For example, you needn't waste time parsing the $DATA attribute if you only want to get the file's size and timestamp. $STANDARD_INFORMATION and $ATTRIBUTE_LIST will always be parsed whether they are picked or not, but unwanted attributes in $ATTRIBUTE_LIST will be discarded.

    This function should be called before ParseAttrs().

  7. void TraverseAttrs(ATTRS_CALLBACK attrCallBack, void *context)
    • attrCallBack: User defined callback function
    • context: context to pass to the callback function

    This routine traverses all the parsed attributes of a file record, synchronously calls the user defined callback function, and provides the user with the parsed C++ object of each attribute.

    This routine should be called after ParseAttrs().

  8. const CAttrBase* FindFirstAttr(DWORD attrType) const

    Finds the first attribute with type "attrType" contained in this file record. If no attribute of "attrType" is found, NULL is returned. Once called, the internal index moves to the first element.

    This routine should be called after ParseAttrs().

  9. const CAttrBase* FindNextAttr(DWORD attrType) const

    Finds the next attribute with type "attrType" contained in this file record. If no more attributes of "attrType" are found, NULL is returned. Once called, the internal index is moved to the next element.

    This routine should be called after FindFirstAttr().

C++
// Called on a CFileRecord whose attributes have already been parsed.
const CAttrBase *ab = FindFirstAttr(ATTR_TYPE_FILENAME);
while (ab)
{
    // process ab here
    ab = FindNextAttr(ATTR_TYPE_FILENAME);
}

    The MFC CFileFind class is really a bad design and error prone, so I didn't follow its style.

  10. int GetFileName(_TCHAR *buf, DWORD bufLen) const
    • buf: Name buffer to hold the returned file name.
    • bufLen: Name buffer size in characters (not bytes!)

    Return value:
    • > 0: Name length in characters.
    • = 0: This file is unnamed.
    • < 0: Buffer size is less than the file name size; the negative value is the wanted buffer size. For example, a return value of -20 means you need a buffer of at least 20 characters.

    A single file record may have several file names ($FILE_NAME attribute). The first Win32 name will be returned.

  11. ULONGLONG GetFileSize() const

    Gets the file size in bytes. Read from the $FILE_NAME attribute.

  12. void GetFileTime(FILETIME *writeTm, FILETIME *createTm = NULL, FILETIME *accessTm = NULL) const

    Gets the file's last alteration time, creation time, and last access time. The time is already converted to the time zone set in the system. Read from the $STANDARD_INFORMATION attribute.

  13. void TraverseSubEntries(SUBENTRY_CALLBACK seCallBack) const

    Traverses all the subentries located in a file record (a directory file), synchronously calls the user defined callback function, and provides the user with all the subentries encapsulated by the CIndexEntry class. Useful in enumerating sub files and directories. The $INDEX_ROOT and $INDEX_ALLOCATION attributes must have been parsed already (see SetAttrMask()).

  14. const BOOL FindSubEntry(const _TCHAR *fileName, CIndexEntry &ieFound) const
    • fileName: Sub file name to find
    • ieFound: CIndexEntry object found

    Return value: TRUE when found, otherwise FALSE.

    It is used to find a sub file or directory. The $INDEX_ROOT and $INDEX_ALLOCATION attributes must have been parsed already (see SetAttrMask()).

  15. const CAttrBase* FindStream(_TCHAR *name = NULL)

    name is the file data stream name. NULL for the unnamed stream.

    Finds the specific data stream by name. NTFS files may have several data streams ($DATA attribute). File content is always located in an unnamed stream. The $DATA attribute must have been parsed already (see SetAttrMask()).

  16. BOOL IsDeleted() const

    Checks if this file record is deleted.

  17. BOOL IsDirectory() const

    Checks if this file record is a directory.

  18. BOOL IsReadOnly() const

    Checks if it's a read-only file. Read from the $STANDARD_INFORMATION attribute.

  19. BOOL IsHidden() const

    Checks if it's a hidden file. Read from the $STANDARD_INFORMATION attribute.

  20. BOOL IsSystem() const

    Checks if it's a system file. Read from the $STANDARD_INFORMATION attribute.

  21. BOOL IsCompressed() const

    Checks if it's a compressed file. Read from the $STANDARD_INFORMATION attribute.

  22. BOOL IsEncrypted() const

    Checks if it's an encrypted file. Read from the $STANDARD_INFORMATION attribute.

  23. BOOL IsSparse() const

    Checks if it's a sparse file. Read from the $STANDARD_INFORMATION attribute.
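
ParseFileRecord() is described as verifying and patching the update sequence numbers. As a rough illustration of what that step involves, here is a minimal sketch assuming the commonly documented record layout: a 16-bit offset to the update sequence array at byte 4, a 16-bit entry count at byte 6, the first array entry holding the update sequence number, and the remaining entries holding the original last two bytes of each sector. ApplyFixups and ReadU16 are hypothetical names, the stride is simply the sector size passed in, and this is not the library's own code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static uint16_t ReadU16(const uint8_t *p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

// Verifies and patches a file record (or index block) in place.
// Returns false if the header is inconsistent or a sector tail does not match.
bool ApplyFixups(uint8_t *rec, size_t recSize, size_t sectorSize) {
    assert(rec != nullptr && sectorSize >= 2);
    if (recSize < 8 || recSize % sectorSize != 0) return false;

    const size_t usaOffset = ReadU16(rec + 4);
    const size_t usaCount = ReadU16(rec + 6);   // sequence number + one entry per sector
    const size_t sectors = recSize / sectorSize;
    if (usaCount != sectors + 1) return false;
    if (usaOffset + usaCount * 2 > recSize) return false;

    const uint8_t *usa = rec + usaOffset;
    for (size_t i = 0; i < sectors; ++i) {
        uint8_t *tail = rec + (i + 1) * sectorSize - 2;
        // Each sector must end with the update sequence number...
        if (tail[0] != usa[0] || tail[1] != usa[1]) return false;
        // ...which is then replaced by the original bytes saved in the array.
        tail[0] = usa[2 * (i + 1)];
        tail[1] = usa[2 * (i + 1) + 1];
    }
    return true;
}
```

A mismatch in any sector tail usually means the record was only partly written to disk, which is why a failed check should stop further parsing of that record.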

NTFS attributes classes

Attributes               Class

$STANDARD_INFORMATION    CAttr_StdInfo
$ATTRIBUTE_LIST          CAttr_AttrList<TYPE_RESIDENT>
$FILE_NAME               CAttr_FileName
$VOLUME_NAME             CAttr_VolName
$VOLUME_INFORMATION      CAttr_VolInfo
$DATA                    CAttr_Data<TYPE_RESIDENT>
$INDEX_ROOT              CAttr_IndexRoot
$INDEX_ALLOCATION        CAttr_IndexAlloc
$BITMAP                  CAttr_Bitmap<TYPE_RESIDENT>

NTFS attributes are classified into resident (CAttrResident) and nonresident (CAttrNonResident). Resident and nonresident attributes share a common header (CAttrBase). All attribute classes are derived from CAttrResident or CAttrNonResident, which are derived from CAttrBase. Some attributes, such as $DATA and $ATTRIBUTE_LIST can be resident or nonresident; these classes use a template parameter as their base class.
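
To make the "template parameter as base class" idea concrete, here is a minimal sketch with simplified stand-in classes. AttrBase, AttrResident, AttrNonResident, AttrData and SizeOf are hypothetical names that only mirror the inheritance shape described above; the constructors and members are assumptions and do not match the library's real classes.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins; only the inheritance shape matters here.
class AttrBase {
public:
    virtual ~AttrBase() {}
    virtual uint64_t GetDataSize() const = 0;
    virtual bool IsNonResident() const = 0;
};

class AttrResident : public AttrBase {
public:
    explicit AttrResident(uint64_t size) : size_(size) {}
    uint64_t GetDataSize() const override { return size_; }
    bool IsNonResident() const override { return false; }
private:
    uint64_t size_;
};

class AttrNonResident : public AttrBase {
public:
    explicit AttrNonResident(uint64_t size) : size_(size) {}
    uint64_t GetDataSize() const override { return size_; }
    bool IsNonResident() const override { return true; }
private:
    uint64_t size_;
};

// The base class is chosen by the template argument, so one class body
// serves both the resident and the non-resident form of an attribute.
template <class TYPE_RESIDENT>
class AttrData : public TYPE_RESIDENT {
public:
    explicit AttrData(uint64_t size) : TYPE_RESIDENT(size) {}
};

// Callers only need the common interface.
uint64_t SizeOf(const AttrBase &attr) { return attr.GetDataSize(); }
```

A parser following this pattern would check the non-resident flag in the attribute header and then instantiate either AttrData<AttrResident> or AttrData<AttrNonResident>.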

1. CAttrBase

Base class of all the attribute classes.

  1. CAttrBase(const ATTR_HEADER_COMMON *ahc, const CFileRecord *fr)
    • ahc: Points to the attribute header buffer.
    • fr: The file record which owns this attribute.

  2. virtual __inline ULONGLONG GetDataSize(ULONGLONG *allocSize = NULL) const = 0

    allocSize receives the allocated size of the data in bytes. Just leave this parameter out if you don't want it.

    Return value: Actual size of the data in bytes.

    Gets the size of this attribute's data in bytes. It's declared as a pure virtual function. The derived classes CAttrResident and CAttrNonResident actually implement this function. Thanks to C++ polymorphism, with this function and the following function ReadData(), resident and non-resident attributes can access their data through the same interface, though they differ so much.

  3. virtual BOOL ReadData(const ULONGLONG &offset, void *bufv, DWORD bufLen, DWORD *actural) const = 0
    • offset: Start address of the read pointer relative to the beginning, in bytes.
    • bufv: User provided buffer to receive the data.
    • bufLen: User provided buffer size in bytes.
    • actural: The actual size of data read. Sorry for the misspelling. I only noticed it when Microsoft Word told me, but I'm too lazy to find and replace all the errors in my source code. I suggest Microsoft add spell checking to Visual Studio to help us non-English speaking guys, he he.

    Return value: TRUE on success, otherwise FALSE.

    Reads attribute data into a buffer.

  4. Other exported routines:

C++
__inline const ATTR_HEADER_COMMON* GetAttrHeader() const
__inline DWORD GetAttrType() const
__inline DWORD GetAttrTotalSize() const
__inline BOOL IsNonResident() const
__inline WORD GetAttrFlags() const
int GetAttrName(char *buf, DWORD bufLen) const
int GetAttrName(wchar_t *buf, DWORD bufLen) const

Get attribute name. The return value obeys the same rule as CFileRecord::GetFileName().

__inline BOOL IsUnNamed() const

Check if this attribute is unnamed.

2. CAttrResident

Base class of all resident attribute classes.

Implements the virtual functions GetDataSize() and ReadData() specific to resident attributes.

3. CAttrNonResident

Base class of all non-resident attribute classes. Implements the virtual functions GetDataSize() and ReadData() specific to non-resident attributes. It's much more complicated than CAttrResident's implementation, as it should parse data runs and build a list to hold the information. I don't think the NTFS data run is a good design, because the saved disk space cannot compensate for the wasted parsing time.
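
For readers who have not met data runs before, the following minimal sketch decodes a run list, assuming the commonly documented encoding: each run starts with a header byte whose low nibble is the size of the length field and whose high nibble is the size of the offset field, the offset is signed and relative to the previous run, an offset size of zero marks a sparse run, and a zero header byte ends the list. DataRun and DecodeDataRuns are hypothetical names; this is not how CAttrNonResident is actually written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One extent of a non-resident attribute. Names are illustrative.
struct DataRun {
    int64_t lcn;        // starting cluster on the volume, or -1 for a sparse run
    uint64_t clusters;  // run length in clusters
};

// Decodes a run list. Returns false on truncated or malformed input.
bool DecodeDataRuns(const uint8_t *p, size_t len, std::vector<DataRun> &runs) {
    assert(p != nullptr);
    int64_t prevLcn = 0;
    size_t i = 0;
    while (i < len && p[i] != 0) {                // a zero header byte ends the list
        const unsigned lenBytes = p[i] & 0x0F;    // low nibble: size of the length field
        const unsigned offBytes = p[i] >> 4;      // high nibble: size of the offset field
        ++i;
        if (lenBytes == 0 || lenBytes > 8 || offBytes > 8) return false;
        if (len - i < lenBytes + offBytes) return false;

        uint64_t count = 0;                       // unsigned, little-endian
        for (unsigned b = 0; b < lenBytes; ++b)
            count |= static_cast<uint64_t>(p[i + b]) << (8 * b);
        i += lenBytes;

        DataRun run;
        run.clusters = count;
        if (offBytes == 0) {
            run.lcn = -1;                         // sparse: no clusters allocated
        } else {
            uint64_t raw = 0;
            for (unsigned b = 0; b < offBytes; ++b)
                raw |= static_cast<uint64_t>(p[i + b]) << (8 * b);
            // Sign-extend: the offset is signed and relative to the previous run.
            if (offBytes < 8 && (p[i + offBytes - 1] & 0x80))
                raw |= ~uint64_t(0) << (8 * offBytes);
            prevLcn += static_cast<int64_t>(raw);
            run.lcn = prevLcn;
            i += offBytes;
        }
        runs.push_back(run);
    }
    return i < len;  // true only if the terminating zero byte was seen
}
```

The variable-width fields are what make the parsing comparatively costly: every read of a fragmented attribute first needs this list decoded and then searched for the run covering the requested offset.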

4. CAttr_StdInfo

NTFSParseLib/stdinfo.JPG

Implements the $STANDARD_INFORMATION attribute. Derived from CAttrResident. Exported functions:

C++
void GetFileTime(FILETIME *writeTm, 
     FILETIME *createTm = NULL, FILETIME *accessTm = NULL) const
__inline DWORD GetFilePermission() const
__inline BOOL IsReadOnly() const
__inline BOOL IsHidden() const
__inline BOOL IsSystem() const
__inline BOOL IsCompressed() const
__inline BOOL IsEncrypted() const
__inline BOOL IsSparse() const
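
The Is...() queries above presumably test individual bits of the value returned by GetFilePermission(). A minimal sketch, assuming the flag bits equal the Win32 FILE_ATTRIBUTE_* constants of the same meaning; the constant names and HasFlag are hypothetical and not taken from the library headers:

```cpp
#include <cassert>
#include <cstdint>

// Assumed to match the Win32 FILE_ATTRIBUTE_* values.
const uint32_t kReadOnly   = 0x0001;
const uint32_t kHidden     = 0x0002;
const uint32_t kSystem     = 0x0004;
const uint32_t kSparse     = 0x0200;
const uint32_t kCompressed = 0x0800;
const uint32_t kEncrypted  = 0x4000;

// True when the given bit is set in the permission field.
inline bool HasFlag(uint32_t permission, uint32_t flag) {
    return (permission & flag) != 0;
}
```

For example, a permission value of 0x0806 would describe a hidden, system, compressed file that is not read-only.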

5. CAttr_FileName

NTFSParseLib/filename.JPG

Implements the $FILE_NAME attribute. Derived from CAttrResident and the CFileName helper class.

All useful functions are located in the CFileName base class which will be introduced later. File permissions and times located in a $FILE_NAME attribute will only be updated when the file name is changed, so related functions derived from CFileName are declared again as "private" in CAttr_FileName to prevent user from getting the wrong information. $STANDARD_INFORMATION and index entry keep the updated file permission and timestamp.

6. CAttr_VolInfo

NTFSParseLib/volinfo.JPG

Implements the $VOLUME_INFORMATION attribute. Derived from CAttrResident. Exported functions:

__inline WORD GetVersion()

Returns the NTFS volume version. The high byte holds the major version, the low byte the minor. In Windows XP and Windows 7, the NTFS version is 3.1; in Windows 2000 it is 3.0, and in Windows NT 1.2. NTFS volumes with a version less than 3.0 are not supported by this library.
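
Splitting the returned WORD is a one-liner per field. The sketch below follows the byte layout described above; NtfsVersion, SplitVersion and IsSupportedVersion are hypothetical helper names, not library functions.

```cpp
#include <cassert>
#include <cstdint>

struct NtfsVersion {
    unsigned major;
    unsigned minor;
};

// High byte is the major version, low byte the minor version.
inline NtfsVersion SplitVersion(uint16_t v) {
    NtfsVersion out;
    out.major = static_cast<unsigned>(v >> 8);
    out.minor = static_cast<unsigned>(v & 0xFF);
    return out;
}

// Mirrors the stated limit: volumes older than 3.0 are not handled.
inline bool IsSupportedVersion(uint16_t v) {
    return (v >> 8) >= 3;
}
```

A value of 0x0301 therefore reads as version 3.1, while 0x0102 (version 1.2) falls below the supported range.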

7. CAttr_VolName

NTFSParseLib/volname.JPG

Implements the $VOLUME_NAME attribute. Derived from CAttrResident.

Exported functions:

C++
__inline int GetName(wchar_t *buf, DWORD len) const
__inline int GetName(char *buf, DWORD len) const

Get the Unicode or ANSI volume name. The return value obeys the same rule as CFileRecord::GetFileName().
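
The shared return-value rule lends itself to a "try, then retry with the wanted size" calling pattern. The sketch below imitates the rule with a hypothetical stand-in, CopyName, and shows the retry in FetchName. Whether the wanted size reported by the real functions includes the terminating null is not stated in this article; the stand-in assumes it does.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical stand-in that follows the documented rule:
// > 0 name length, = 0 unnamed, < 0 negated wanted buffer size.
int CopyName(const wchar_t *name, wchar_t *buf, unsigned bufLen) {
    assert(buf != nullptr);
    const size_t len = name ? std::wcslen(name) : 0;
    if (len == 0) return 0;                      // unnamed
    if (bufLen < len + 1)                        // assumption: size counts the null
        return -static_cast<int>(len + 1);
    std::wmemcpy(buf, name, len + 1);
    return static_cast<int>(len);
}

// Tries a small stack buffer first and retries with the size asked for.
std::wstring FetchName(const wchar_t *stored) {
    wchar_t small[8];
    int n = CopyName(stored, small, 8);
    if (n > 0) return std::wstring(small, static_cast<size_t>(n));
    if (n == 0) return std::wstring();
    std::vector<wchar_t> big(static_cast<size_t>(-n));
    n = CopyName(stored, big.data(), static_cast<unsigned>(big.size()));
    return std::wstring(big.data(), static_cast<size_t>(n));
}
```

The same pattern would apply to GetFileName(), GetAttrName() and GetName(), with the real call in place of CopyName.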

8. CAttr_Data

NTFSParseLib/data.JPG

Implements the $DATA attribute. Derived from a template class which is CAttrResident or CAttrNonResident.

GetDataSize() and ReadData() are derived from the template base class. We only need these two functions when handling the $DATA attribute.

9. CAttr_IndexRoot

NTFSParseLib/indexroot.JPG

Implements the $INDEX_ROOT attribute. Derived from the CAttrResident and CIndexEntryList helper classes. All useful functions are located in the CIndexEntry object held in CIndexEntryList which will be introduced later.

10. CAttr_IndexAlloc

NTFSParseLib/indexalloc.JPG

Implements the $INDEX_ALLOCATION attribute. Derived from CAttrNonResident.

11. CAttr_Bitmap

NTFSParseLib/bitmap.JPG

Implements the $BITMAP attribute. Derived from a template class which is CAttrResident or CAttrNonResident.

12. CAttr_AttrList

NTFSParseLib/attrlist.JPG

Implements the $ATTRIBUTE_LIST attribute. Derived from a template class which is CAttrResident or CAttrNonResident.

This is the most complicated attribute to process because it deals with a file record and all other attributes. But the implementation is concise, and the code is short.

User needn't care about this attribute; all parsed sub attributes will be inserted into the parent file record's attribute list, just as they are directly contained in the same file record.

Helper classes

1. CFileName

This class helps CAttr_FileName and CIndexEntry to process file name related information.

Exported functions:

C++
int Compare(const wchar_t *fn) const
int Compare(const char *fn) const

Compare the file name with the input string. Return 0 if they match, negative if the file name is smaller than the input string, and positive otherwise. This routine is used to search a specific file in the B+ tree constructed by the index root and index allocation.

C++
__inline ULONGLONG GetFileSize() const
__inline DWORD GetFilePermission() const
__inline BOOL IsReadOnly() const
__inline BOOL IsHidden() const
__inline BOOL IsSystem() const
__inline BOOL IsDirectory() const
__inline BOOL IsCompressed() const
__inline BOOL IsEncrypted() const
__inline BOOL IsSparse() const
int GetFileName(char *buf, DWORD bufLen) const
int GetFileName(wchar_t *buf, DWORD bufLen) const

Get the Unicode or ANSI file name. The return value obeys the same rule as CFileRecord::GetFileName().

C++
__inline BOOL HasName() const

Check if it contains a file name or is unnamed.

C++
__inline BOOL IsWin32Name() const

File names which cannot fit into the DOS 8.3 format will have a DOS alias name. For example, the Win32 name "C:\Program files" will have a DOS compatible file name "C:\Progra~1". Use this function to check if it contains a legal Win32 name.

C++
void GetFileTime(FILETIME *writeTm, FILETIME *createTm = NULL, 
                 FILETIME *accessTm = NULL) const
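
A FILETIME counts 100-nanosecond intervals since 1 January 1601. When the values need to leave the Win32 world, a conversion to Unix seconds is simple arithmetic. The sketch below takes the two 32-bit halves of a FILETIME as plain integers so that it does not depend on windows.h; FileTimeToUnixSeconds is a hypothetical helper, and it does not apply any time zone adjustment.

```cpp
#include <cassert>
#include <cstdint>

// 100-ns ticks between 1601-01-01 and 1970-01-01.
const uint64_t kEpochDiffTicks = 116444736000000000ULL;
const int64_t kTicksPerSecond = 10000000;

// low and high correspond to dwLowDateTime and dwHighDateTime.
inline int64_t FileTimeToUnixSeconds(uint32_t low, uint32_t high) {
    const uint64_t ticks = (static_cast<uint64_t>(high) << 32) | low;
    return (static_cast<int64_t>(ticks) - static_cast<int64_t>(kEpochDiffTicks))
           / kTicksPerSecond;
}
```

Since GetFileTime() is described as returning times already converted to the system time zone, the result of such a conversion would be in local time as well.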

2. CIndexEntry

This class encapsulates a single index entry of the file name. It is derived from CFileName, and all CFileName exported functions can be used directly.

Exported functions:

C++
__inline ULONGLONG GetFileReference() const

Get the file reference of this index entry.

C++
__inline BOOL IsSubNodePtr() const

Check if the index entry points to sub nodes. These entries link different index blocks into a B+ tree.

C++
__inline ULONGLONG GetSubNodeVCN() const

Use this function to locate the sub-node index block.
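
To show how Compare(), IsSubNodePtr() and GetSubNodeVCN() fit together during a lookup, here is a minimal in-memory sketch of a directory index search. Entry, Node, IndexBlocks, CompareNames and FindEntry are hypothetical simplifications, not the library's classes. The sketch assumes each node's entries are sorted and end with an unnamed marker entry, and it compares names with towupper, whereas a real NTFS volume defines its collation through the $UpCase table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cwctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model of an index entry.
struct Entry {
    std::wstring name;     // empty for the end-of-node marker
    uint64_t fileRef;      // meaningful only when name is not empty
    bool hasSubNode;       // plays the role of IsSubNodePtr()
    uint64_t subNodeVcn;   // plays the role of GetSubNodeVCN()
};
typedef std::vector<Entry> Node;               // sorted entries, end marker last
typedef std::map<uint64_t, Node> IndexBlocks;  // index blocks keyed by VCN

// Case-insensitive compare: negative, zero or positive like Compare().
int CompareNames(const std::wstring &a, const std::wstring &b) {
    const size_t n = a.size() < b.size() ? a.size() : b.size();
    for (size_t i = 0; i < n; ++i) {
        const wint_t ca = std::towupper(a[i]);
        const wint_t cb = std::towupper(b[i]);
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}

// Starts at the root node and descends through sub-node pointers.
bool FindEntry(const Node &root, const IndexBlocks &blocks,
               const std::wstring &target, uint64_t &fileRef) {
    const Node *node = &root;
    while (node) {
        const Node *next = nullptr;
        for (const Entry &e : *node) {
            // The end marker sorts after every real name.
            const int c = e.name.empty() ? 1 : CompareNames(e.name, target);
            if (c == 0) { fileRef = e.fileRef; return true; }
            if (c > 0) {   // target sorts before this entry: go down or give up
                if (e.hasSubNode) {
                    IndexBlocks::const_iterator it = blocks.find(e.subNodeVcn);
                    if (it != blocks.end()) next = &it->second;
                }
                break;
            }
        }
        node = next;
    }
    return false;
}
```

In this model the root node stands for the entries of $INDEX_ROOT and the map for the index blocks of $INDEX_ALLOCATION. Only one node per tree level is visited, which is what keeps lookups in large directories fast.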

3. CIndexBlock

This class helps in parsing a single index block into a list of CIndexEntry.

License

This article, along with any associated source code and files, is licensed under The BSD License