Introduction
In a previous life I worked with a team of people developing server software. Our products consisted of
a set of frameworks and a bunch of plugin DLLs. During the QA debugging loop it was never easy to know if the thread of
control was in the framework exe file or if it had passed to a DLL. It was important to know where it was because that
would tell us who to go and talk to.
Our processing took a job in some form and applied one of many possible transforms to it. Each transform was the
responsibility of one developer, and the decision on exactly which transform to apply was the responsibility of another.
It made sense to try and log the progress of a job through the entire process, capturing, along the way, the reasons why a
particular path was taken. That way we would be able to isolate a transformation problem to a specific module.
I took upon myself the task of writing a logging system that would achieve the following goals.
- Survive a crash of the application(s) being logged.
- Minimise performance impacts on the application(s) being logged. This was especially important because the logging system
wasn't going to be a 'debug only' deal. Release versions of our software would also do logging.
- Log, on a per module basis, the things each developer thought were relevant.
- Provide the ability to filter the log file so that the stuff you want to see stands out from the crowd.
- Be able to 'age' a logfile.
- Be an easy retrofit to existing code.
It didn't take long to realise that a service would be the easiest way to achieve goals 1 and 2. The problem was
that guaranteeing a message would be logged without impacting the performance of the application being logged meant that
using guaranteed 'flush to disk' was out of the question if the logging code was part of the application. You either let
the file system cache buffer your output, in which case you risk losing output if your application crashes, or you wait
around while the operating system flushes your output to the log file. Running the logging side of the application as a
separate process neatly sidesteps that conflict. After all, you don't really care if it takes a couple of seconds for a
log record to get to the logfile, as long as it gets there even if your application has crashed.
Goals 3 and 4 were achieved by standardising the log file format. It's amazing how log file formats vary. I chose a CSV formatted
text file. Each entry in the log file looks like this:
timestamp,sequence,numericcode,threadID,modulename,servername,messagestring
Commas in the messagestring portion don't separate fields. I think it's important to stress that the logfile format is
a function of the code doing the logging. The service itself merely buffers log strings and writes them to permanent storage. It places
no interpretation upon the data it handles.
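A single (entirely hypothetical) entry might look like this:
2003-11-16 10:23:45.123,42,3001,0xFA4,MyProductModule1,SERVER01,job 17 routed to transform B
Everything after the sixth comma, embedded commas included, belongs to the messagestring field.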
You can open the logfile using notepad and expect to be able to read it. Part 2 of this series, however, will present a viewer
capable of displaying and filtering, in real time, the logfile. The viewer is smart enough to cope with embedded linefeeds/newlines
in the output. This makes it easy to include error messages from, for example, ODBC drivers, without having to do any massaging of the data,
an important consideration if you don't want to slow down your application.
Goal 5 was achieved by adding a thread to the service that performs automatic rollover of logfile names and cleanup of older log files.
Goal 6? I think I achieved it. You be the judge.
Why not use the Windows Event Log?
I decided against using the Windows Event Log primarily because (as pointed out by John M. Drescher below) a debugging session using the
log system can potentially generate hundreds of megabytes of log data, far in excess of the amounts of data the Event Log was designed to handle.
Additionally, it's important to remember that the Windows Event Log is a shared resource, shared by all software including the Operating System itself.
I think it's important to segregate system log data from application log data, particularly when a large part of that data may be debugging data.
System structure
The logging system consists of three parts. The first part is the logging service itself, covered in this article.
The second part is the client side code that gets compiled into the application(s) being logged, also covered in this article.
The third part is the log file viewer which is covered in part 2 of this series.
The logging service
When the logger service starts it immediately spawns a listener thread which opens a named pipe called \\.\pipe\logger.
A client connects to the pipe and sends a command string, for example createlog,myproduct. The service responds by checking to
see if it already has a logfile of this name under its control; if so it creates a new named pipe instance called
\\.\pipe\myproduct.log and connects the pipe to the logfile. If, however, the service doesn't yet have a
logfile of this name it creates a CLogInstance object, creates a named pipe and connects the two. The CLogInstance object
in turn creates a disk file called myproduct.log and copies anything written to the named pipe to the file.
The listener thread manages all the connections and log files through the CPipeData object. Newly created CLogInstance
objects are maintained in a map keyed by the logfile name and newly created CServerPipeInstance objects are stored in an array.
Every CServerPipeInstance object is connected to an existing CLogInstance object. There is only ever one CPipeData instance
within the service.
The CPipeData object
class CPipeData : public CObject
{
    DECLARE_DYNAMIC(CPipeData);
public:
    CPipeData();
    virtual ~CPipeData();

    void AddPipe(CServerPipeInstance *pPipe);
    void RemovePipe(DWORD wIndex);
    void RotatePipes(DWORD wIndex);

    CServerPipeInstance *GetPipe(DWORD wIndex)
        { return (CServerPipeInstance *) m_pipeArray.GetAt(wIndex); }
    DWORD GetCount() const
        { return DWORD(m_handleArray.GetSize()); }
    DWORD Wait()
        { return WaitForMultipleObjects(GetCount(), m_handleArray.GetData(),
                                        FALSE, INFINITE); }

private:
    CObArray               m_pipeArray;    // CServerPipeInstance pointers
    CArray<HANDLE, HANDLE> m_handleArray;  // parallel array of event handles
};
The important things here are the m_pipeArray and m_handleArray member variables and the Wait() function.
The thread starts out by creating a named pipe, using overlapped I/O, that listens for connection requests. As connection requests arrive
they are parsed into the request verb (currently only createlog is supported) and the logfile name. The thread then searches
the m_logMap looking for a matching logfile name. If it doesn't find a match it creates a new CLogInstance object and
adds it to the log map. Then it creates a matching CServerPipeInstance object and connects it to the CLogInstance object
(either preexisting or newly created).
As each CServerPipeInstance object is created the event handle associated with the named pipe (m_op.hEvent in the
CServerPipeInstance object) is added to an array of event handles which is used in a WaitForMultipleObjects() call.
The CPipeData object also maintains a parallel array containing a pointer to each CServerPipeInstance object. There's a
1 to 1 correspondence between the handle array and the pipe array.
When the listener thread enters the WaitForMultipleObjects() call the thread waits until data is written to one of the pipes it's
monitoring. When that happens the event handle is signalled by the system and the WaitForMultipleObjects() call returns
a DWORD that specifies the array index of the event handle that was signalled. We know which handle was signalled, therefore we know
which pipe to operate on. The listener thread reads the data from that pipe and appends it to the log text message queue managed
by the CLogInstance object pointed at by the pipe.
There are actually two more handles in the handle array than there are log pipes. The handle at index 0 is an event handle used to signal the thread
when the service wants to shut down. See my article at One use for overlapped I/O[^] for a discussion of this. The handle at index 1 is an
event handle used by the command named pipe \\.\pipe\logger. Handles 2 to n correspond to named pipes opened in response to a createlog
command sent down the command pipe.
Thus the handle array being monitored by the listening thread will always contain at least two handles, up to a limit of MAXIMUM_WAIT_OBJECTS
which, at least according to my copy of the Platform SDK, is defined as 64. This means that up to 62 modules can be logging through the same server instance.
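Putting the pieces together, the heart of the listener thread looks something like this sketch (bRunning, pipeData, HandleCommandConnection() and ServicePipe() are hypothetical names standing in for the real code):
// Simplified sketch of the listener thread's dispatch loop.
// Index 0 = shutdown event, index 1 = command pipe, 2..n = log pipes.
BOOL bRunning = TRUE;
while (bRunning)
{
    DWORD dwIndex = pipeData.Wait();           // blocks in WaitForMultipleObjects()
    if (dwIndex == WAIT_OBJECT_0)
        bRunning = FALSE;                      // index 0: service shutting down
    else if (dwIndex == WAIT_OBJECT_0 + 1)
        HandleCommandConnection();             // index 1: 'createlog,...' request
    else
        ServicePipe(dwIndex - WAIT_OBJECT_0);  // 2..n: read the pipe, queue the text
}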
The CLogInstance object
class CLogInstance : public CObject
{
public:
    CLogInstance();
    ~CLogInstance();
    BOOL Create(LPCTSTR szBaseName);
    void QueueMessage(LPCTSTR szLogEntry);

private:
    static void LogProcessorThread(LPVOID data);  // drains the queue to disk
    static void LogAgingThread(LPVOID data);      // midnight rollover
    CStringList m_messageQueue;                   // pending log text
    CMutex      m_mutex;                          // guards m_messageQueue
};
Each CLogInstance object runs a thread to monitor the log text queue and write the contents to the logfile on permanent storage.
There's one CLogInstance object per log file and anywhere from zero to many CServerPipeInstance objects pointing to the one
CLogInstance object.
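In sketch form the queueing side of this arrangement might look as follows (an assumed implementation; the m_queueEvent member that wakes the processor thread is hypothetical and doesn't appear in the declaration above):
void CLogInstance::QueueMessage(LPCTSTR szLogEntry)
{
    CSingleLock lock(&m_mutex, TRUE);    // pipe thread and processor thread contend here
    m_messageQueue.AddTail(szLogEntry);  // the processor thread removes from the head
    lock.Unlock();
    m_queueEvent.SetEvent();             // hypothetical CEvent: wake LogProcessorThread
}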
The logfile is opened in a share mode that allows any other process to read or write to the file. For maximum performance we want to keep the file open
all the time but we need to provide read access to other processes so that we can actually view the contents of the log file while the service is running.
Write access? Sometimes it's useful to be able to delete the contents of the file without stopping the service.
Be aware that, because of log file aging, once we create a CLogInstance object we never delete it, even if all pipes connected to the
object are closed. This has just one implication: as long as the service is running you won't be able to delete the log file until log file aging has
kicked in.
The CServerPipeInstance object
class CServerPipeInstance : public CObject
{
public:
    CServerPipeInstance();
    ~CServerPipeInstance();

protected:
    CLogInstance *m_pOwner;      // the log this pipe feeds

private:
    HANDLE     m_hPipe;          // the named pipe instance
    OVERLAPPED m_op;             // m_op.hEvent goes into CPipeData's handle array
    DWORD      m_dwRead;
    CString    m_csBaseName;
};
The important stuff here is the pointer to a CLogInstance object, which is the owner of this pipe, and the OVERLAPPED structure.
Each OVERLAPPED instance contains an event handle which is used to monitor activity on that named pipe. The event handle is added to the
m_handleArray member of the CPipeData object.
Each named pipe runs in message mode, which guarantees that a read returns an entire message corresponding to a single write,
or nothing at all. If the pipes were running in byte mode the results written to the log file would be a string of single bytes chosen
seemingly at random from all the pipes being monitored, which is not the desired result.
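Creating a pipe instance with those characteristics amounts to something like this (a sketch; the buffer sizes are illustrative and the real code passes the SECURITY_ATTRIBUTES built by the CSecurityToken class described later):
// Sketch: one log pipe instance, message mode, overlapped I/O.
HANDLE hPipe = ::CreateNamedPipe(
    _T("\\\\.\\pipe\\myproduct.log"),
    PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED,              // waitable via m_op.hEvent
    PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,  // whole messages only
    PIPE_UNLIMITED_INSTANCES,
    4096, 4096,     // illustrative output/input buffer sizes
    0,              // default timeout
    NULL);          // SECURITY_ATTRIBUTES (from CSecurityToken in the real code)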
Once the data has been added to the queue the queue monitoring thread is signalled. It takes messages off the head of the queue and writes them to
permanent storage. The file on disk has the same basename, myproduct.log, stored in whichever directory you want. The source code
presented here for the service places the files in a directory called logger which lives under your %systemroot% directory.
When the client application closes its end of the pipe the same pipe event handle is signalled, causing the wait to terminate. The code checks the
result for the signalled handle using GetOverlappedResult() and, if the operation has failed, it uses GetLastError() to
determine what action to take. If the pipe was broken it is removed from the pipe array, its handle is removed from the handle array and the pipe
itself is deleted.
The application goes its merry way doing whatever it has to do and shoving log messages down the pipe. The other end of the pipe takes messages off
as fast as it can and queues them up so that another thread can take care of actually writing them to disk whenever it gets the chance. You want to
clear the pipe as fast as possible, otherwise the client will stall until there's space on the pipe for a new message. You also want to give the
application priority to the disk wherever possible. To an extent the use of file flushing on each write militates against this goal, but we're
balancing performance impacts on the application being logged against ensuring that messages do get logged!
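The write step in the processor thread amounts to something like this (a sketch; hLogFile and pszMsg are assumed names, and the seek to end of file reflects the fix noted in the revision history):
// Sketch: append one queued message and force it to permanent storage.
// The flush happens in the service, so the client never waits on the disk.
DWORD dwWritten;
::SetFilePointer(hLogFile, 0, NULL, FILE_END);   // cope with external truncation of the file
::WriteFile(hLogFile, pszMsg, lstrlen(pszMsg) * sizeof(TCHAR), &dwWritten, NULL);
::FlushFileBuffers(hLogFile);                    // guaranteed flush, in the service not the app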
A gotcha in WaitForMultipleObjects()
There's a hidden gotcha in the way WaitForMultipleObjects() works. As part of my testing I wrote a quick and dirty client which opened
four pipes to the same logfile and then wrote 1000 strings to each pipe in turn, i.e., a string to pipe1, a string to pipe2, and so on, wrapping
around to pipe1 when pipe4 has been written. I expected to see the strings in the logfile in the same order but no, I found a block of dozens
of strings for pipe1 followed by a block of dozens of strings for pipe2 and so on. Upon investigation I found that MSDN does indeed document
that if multiple event handles are signalled simultaneously the WaitForMultipleObjects() call returns the index of the lowest numbered signalled
event handle. Actually that makes perfect sense. So I added a function, RotatePipes(), to take a handle which has just been serviced out of its
place in the handle array and move it to the end, with the same change in ordering applying to the CServerPipeInstance array, thus making pipe
servicing act in a round-robin fashion. Only pipes created by the command pipe (pipes 2 to n) are shuffled in this fashion.
In a real world scenario the round robin code might not be necessary (after all, the test program was doing nothing but write messages as fast as
possible), but that's the point of stress testing - to flush out situations like these.
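RotatePipes() itself needn't be complicated; something along these lines would do (an assumed sketch that glosses over the two fixed handles at the front of the handle array, which are never rotated):
void CPipeData::RotatePipes(DWORD wIndex)
{
    // Move the just-serviced entry to the end of both parallel arrays
    // so the next wait favours the pipes that haven't been serviced yet.
    CServerPipeInstance *pPipe  = GetPipe(wIndex);
    HANDLE               hEvent = m_handleArray.GetAt(wIndex);
    m_pipeArray.RemoveAt(wIndex);
    m_handleArray.RemoveAt(wIndex);
    m_pipeArray.Add(pPipe);
    m_handleArray.Add(hEvent);
}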
The CLogMap class
class CLogMap : public CMapStringToOb
{
public:
    CLogMap();
    virtual ~CLogMap();
    CLogInstance *GetLog(LPCTSTR szLogFileName);

private:
    static void LogAgerThreadProc(LPVOID data);  // midnight rollover thread
    void AgeLogFiles();
    HANDLE m_agingThreadHandle;
};
This class is used to maintain a map of CLogInstance objects keyed on the logfile name. It also contains the thread (LogAgerThreadProc())
used to age logfiles. The GetLog() method returns a CLogInstance object pointer which either corresponds to a pre-existing object or to a
new one if we didn't already have this logfile under our control.
Log file aging
When the aging thread starts it calculates how many seconds remain until local midnight and then sleeps for that length of time.
When it awakens it blocks the queue monitor thread, closes the logfile, renames it to myproduct.yyyymmdd.log and creates a new file
for continued logging. It also enumerates existing aged log files and deletes files older than 7 days.
The myproduct.yyyymmdd.log format was chosen so that aged logfiles will sort in directory listings in age order.
I hardcoded 7 days into the source code presented here. In a real world use of the logging system you would probably want to be able to specify
how long to keep log files but I think the presentation of how you do that is outside the scope of this article.
Log file aging gotchas
I ran into two gotchas. One is a direct result of my decision to calculate the time from now to local midnight and wait that period of time. The code
looks something like this.
BOOL bStop = FALSE;

while (bStop == FALSE)
{
    switch (WaitForSingleObject(hStopEvent, CalculateTimeTillMidnight()))
    {
    case WAIT_OBJECT_0:    // told to stop
        bStop = TRUE;
        break;
    case WAIT_TIMEOUT:     // midnight has arrived; age the logfiles
        break;
    }
}
This is the most efficient way to code the waiting period. The thread simply doesn't run unless it's signalled to stop or until the timeout occurs. This is
a good thing. Code that doesn't run consumes no CPU cycles and indeed might even be paged out of physical memory into the pagefile after a reasonable period
of time. But there's a small problem here. If you want to test log file aging by moving the system clock forward to just before midnight
you're not going to see aged log files suddenly appear. The timeout is calculated at the start of the wait and, once started, makes no reference to the
system time. If you want to test log file aging this way it's necessary to change the system time, then stop and restart the service.
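For completeness, CalculateTimeTillMidnight() amounts to something like this (an assumed implementation returning milliseconds, which is what WaitForSingleObject() expects):
// Assumed sketch: milliseconds from now until the next local midnight.
DWORD CalculateTimeTillMidnight()
{
    SYSTEMTIME st;
    ::GetLocalTime(&st);
    DWORD dwSinceMidnight = ((st.wHour * 60 + st.wMinute) * 60 + st.wSecond) * 1000
                            + st.wMilliseconds;
    return (24UL * 60 * 60 * 1000) - dwSinceMidnight;
}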
The other gotcha is in the way Windows handles renaming a file and creating a new file with the name the renamed file had, i.e., renaming
myproduct.log to myproduct.yyyymmdd.log and immediately creating a new file called myproduct.log. Windows will
pull file attribute information, INCLUDING the creation date, from the file cache. Thus the newly created log file will have the creation date
of the original file even though it was created up to 24 hours later. I have to admit this behaviour puzzles me but there it is... This is
documented in MSDN as one of the notes to the CreateFile() API.
Thus we do some trickery to set the creation date/time for the newly created logfile.
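The trickery boils down to stamping the new file explicitly, along these lines (a sketch; hLogFile is the handle to the freshly created file):
// Sketch: explicitly set the creation time on the new logfile so it doesn't
// inherit the cached creation date of the file we just renamed.
SYSTEMTIME st;
FILETIME   ft;
::GetSystemTime(&st);                      // file times are stored as UTC
::SystemTimeToFileTime(&st, &ft);
::SetFileTime(hLogFile, &ft, NULL, NULL);  // set the creation time only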
Server side filtering
Doesn't happen. Period. The service will log anything written to the log pipe. It copies verbatim anything written to a pipe it created to the logfile
attached to that pipe.
The client side code
This consists of a single class, CLogger. You instantiate an instance of the class as a global object in each application
or module which is to perform logging. Then, somewhere early in your code, you initialise the object with the module name, the name of the logfile and possibly
the server name (the name of the machine on which the log service is running). Then each time you
want to log something you call the Log() method. The Log() method takes an error code, a printf() style formatting string and
a variable number of arguments.
CLogger gblLogger;

gblLogger.Initialise(3, _T("MyProductModule1"), _T("MyProduct.log"));
gblLogger.Log(3001, _T("format string %d %s"), arg1, arg2);
If you prefer, you can write a bunch of operator overrides on the CLogger object to allow
gblLogger << data1 << data2;
but personally I don't like that way of writing to objects so I don't do it that way.
The CLogger object instance needs to be initialised. My design envisages (but does not require) the object being global, so the constructor
takes no parameters and merely initialises member variables to safe values.
Initialising your CLogger instance
void Initialise(
DWORD dwLogLevel,
LPCTSTR szModuleName,
LPCTSTR szLogFileName,
LPCTSTR szServerName = _T(".")
);
The first parameter tells the object the logging level desired. Typically you'd pull this out of the registry but to do that you need a way to put the value
in the registry in the first place. As with setting the retention period for aged log files I consider this to be outside the scope of this article.
The second parameter identifies the module that's producing messages using this particular CLogger object instance. If your application
consists of an exe file plus a bunch of DLLs it's really helpful to log exactly which module created the log message.
The third parameter specifies the name of the logfile. It's a matter of personal taste whether each module logs to its own file or all parts of the
application log to the one file. Personal experience is that it's better to log to the one file, provided you have a way to filter messages. The LogFileViewer
I present in the second part of this series provides that functionality.
An optional fourth parameter specifies a machine name to which log messages should be directed. This allows you to log multiple instances of your application(s)
running on different machines to a central logging server. If not specified this fourth parameter defaults to . (dot) which is an alias for the same machine
the application is running on.
Each message logged includes the name of the machine on which the module was running and the threadID for the thread calling the Log()
function, which makes it easy to distinguish otherwise apparently identical messages.
If you examine the code for the Initialise() function you'll find a 20 ms sleep (two timeslices on a single processor NT machine and at least one
timeslice on a multi CPU machine). Don't be tempted to remove it. Each time you call Initialise() you're writing to a named pipe in another process. That
other process needs time to do the things you've asked it to do. Thus, your process needs to suspend itself to give the service a chance at running before you
attempt to write data onto the log. If your application doesn't suspend itself you run the risk that you'll attempt to write to a log named pipe instance that
hasn't yet been created.
Why use a named pipe?
I chose to use named pipes as the communication mechanism between client code and the service because I wanted to be able to write to the same logfile from
more than one module in my applications. But I also didn't want to have to change the API between modules to include a log method. Would you want to extend the
parameter list of every function exposed from a DLL so that you could pass a pointer to a logging object? I thought not. By using named pipes I could add
logging functionality to any module without changing the external interfaces to that module.
Log levels
As noted earlier, the CLogger class expects a log level parameter. This provides the application with a way of pre-filtering which messages
will be logged. Opinions vary about this. On the one hand, it's good to log everything, particularly if there's a filtering method available when reviewing
the logfile. On the other hand, it's easy for developers to get carried away and log everything, which will impact application/system performance
and create awesomely large log files. (We developers at the cutting edge have hundreds of gigabytes of storage available but in my experience
production systems lag behind developer systems by orders of magnitude.)
I don't want to have to create special builds with extra logging. Hence the logging level.
- Level 0 - error codes in the range of 0 to 999
- Level 1 - warning codes in the range of 1000 to 1999
- Level 2 - success codes in the range of 2000 to 2999
- Level 3 - information codes in the range of 3000 to 3999
- and so on with ever rising levels of information codes
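To make the numbering concrete, here's how a few (hypothetical) calls behave when the log level has been initialised to 1:
// With a log level of 1, errors (0-999) and warnings (1000-1999) are logged;
// anything with a code of 2000 or above is silently discarded.
gblLogger.Log(42,   _T("open of %s failed"), szPath);  // level 0 (42/1000 == 0): always logged
gblLogger.Log(1003, _T("retrying %s"), szPath);        // level 1: logged at level 1 and above
gblLogger.Log(3001, _T("entering %s"), szFunc);        // level 3: discarded at level 1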
I chose this set of definitions due to a philosophy about what should be logged. I always want to be able to see error codes, so the CLogger class
will always report level 0 messages. I probably want to see warnings unless I'm really tight for disk space. Success codes? Some applications need them (for
example, a system which creates auditable documents would probably want to log successes as well as failures). Information? That's there for us developers,
Q/A people and those occasions when product support need more detailed information. Your customers may not like being asked to turn on level 17 logging and send
you a CD containing a 500 meg log file, but if it gets their support issue resolved in days instead of weeks they'll appreciate it.
So the philosophy is to write code to log everything and then decide, at run time, not to log certain things. Compiling the calls (and the format strings) into
your end product makes the binary deliverables larger, which is probably forgivable. What won't be forgiven is if all of this halves the performance of your
application. The Log() function is written to do the minimal amount of work necessary to determine if the message is even going to be logged.
void CLogger::Log(DWORD dwErrorCode, LPCTSTR szFormatString, ...)
{
    if (dwErrorCode / 1000 <= m_dwLogLevel)
    {
        ASSERT(szFormatString);
        ASSERT(AfxIsValidString(szFormatString));

        va_list args;
        va_start(args, szFormatString);
        Log(dwErrorCode, szFormatString, args);
        va_end(args);
    }
}
That's a pretty minimal amount of work for cases where the message level is greater than the log level, i.e., when the message is going to be discarded.
What happens if the service isn't running?
The client side code waits 100 ms (defined as the constant PIPE_WAIT_TIME) for a pipe to become available. If a pipe doesn't become available in
time the code retries up to 10 times (defined as the constant PIPE_RETRY_COUNT) and eventually gives up. This means that you will incur
an overhead of up to 1 second per module being logged, incurred on the first Log() call. That doesn't sound like much, but if you're using 62
modules that's a minute elapsed. I considered adding code to start the service if it wasn't running but decided not to. Such code would only be useful
if the application was running under an account with administration privileges.
These numbers, 100 ms and 10 times respectively, are based on gut feel. If you're logging to a very heavily loaded machine or across a heavily loaded
network you may need to increase the numbers.
We try 10 times with a timeout of 100ms per try rather than try once with a timeout of 1 second because each retry tries to establish if the pipe even exists. If
the pipe doesn't exist the code sends a pipe creation request to the command pipe and tries again. If our service is being overwhelmed with messages from
various parts of our application this gives it a chance to catch up and create a new communications conduit.
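In sketch form the connect logic looks like this (Connect() and RequestLogCreation() are hypothetical stand-ins for the real client code):
// Sketch: retry the connection PIPE_RETRY_COUNT times, waiting PIPE_WAIT_TIME
// milliseconds each try, asking the command pipe to create the log pipe if
// it doesn't exist yet.
for (int i = 0; i < PIPE_RETRY_COUNT; ++i)
{
    if (::WaitNamedPipe(szPipeName, PIPE_WAIT_TIME))
        return Connect(szPipeName);   // an instance exists and is free: connect to it
    RequestLogCreation();             // send 'createlog,...' down \\.\pipe\logger and retry
}
return FALSE;                         // give up; logging is disabled for this module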
If you don't intend to include the logging service as part of your product deliverable but do want to use it for development, be sure to include some
mechanism (such as #ifdef _DEBUG) to exclude the logging code from release builds.
Other classes
There are three utility classes included in this project. The first is the CSleeper class. Its sole purpose in life is to create an
event handle that will never be signalled and wait a specified amount of time on that handle. I wrote this class to avoid possible deadlocks. I can't anticipate
the kinds of applications that the CLogger class will find itself in. If it's compiled into an application that uses DDE the Sleep()
API becomes potentially unsafe. See the MSDN documentation on Sleep().
The second utility class is the CSecurityToken class. This simply creates a SECURITY_ATTRIBUTES structure initialised so that any
system object created using it grants access to anyone. Dangerous? Maybe, if your code does not otherwise check permissions or if your code can do dangerous
things.
The final utility class is the CClientPipeInstance class which represents the client side of the named pipe.
Tested environments
I wrote this code on a 2 CPU Windows 2000 machine (dual Athlon 2000+) and tested it (obviously) on the development machine, and on a single CPU Windows 2000
machine. Since it's multi-threaded it's important that it be tested on a machine with at least 2 physical CPUs. I've also tested logging from one machine to another.
Using the code in your applications
The service itself was written using the VC6 service wizard. Thus it supports self registration if you launch it once using
logger /service
This registers it as a service in the Windows Service Control Manager.
Client code needs to include the following files.
sleeper.h
sleeper.cpp
SecurityToken.h
SecurityToken.cpp
clientpipeinstance.h
clientpipeinstance.cpp
logger.h
logger.cpp
These files appear in the loggerclient_src.zip download.
Installing the service
If you build the service on the same machine you're running it on there's no need to install the service; the build process automatically registers it with the service control manager.
On other machines you need to do this from a command prompt:
logger /service
which will register the service as a manual start service. Since the client code doesn't attempt to start the service if it isn't running, you need to either open the services control panel applet and make the service automatic (and reboot, or start it manually), or modify this line
SC_HANDLE hService = ::CreateService(
hSCM, m_szServiceName, m_szServiceName,
SERVICE_ALL_ACCESS, SERVICE_WIN32_OWN_PROCESS,
SERVICE_DEMAND_START, SERVICE_ERROR_NORMAL,
szFilePath, NULL, NULL, _T("RPCSS\0"), NULL, NULL);
in the CServiceModule::Install() function to this
SC_HANDLE hService = ::CreateService(
hSCM, m_szServiceName, m_szServiceName,
SERVICE_ALL_ACCESS, SERVICE_WIN32_OWN_PROCESS,
SERVICE_AUTO_START, SERVICE_ERROR_NORMAL,
szFilePath, NULL, NULL, _T("RPCSS\0"), NULL, NULL);
Uninstalling the service
To uninstall the service do this from a command prompt.
logger /unregserver
Revision History
- November 16, 2003, Initial version.
- November 18, 2003. Incorporated some suggestions by Forogar[^] (see topic thread below).
Added some missing code in the logfile aging method. As initially presented the creation date for each log file was not being set, thus log file aging would not in fact
delete old log files.
Refactored some of the code and created a new class, CLogMap, to host the CLogInstance objects and perform log file aging.
- November 24, 2003. Found and fixed a bug in the CLogInstance::LogProcessorThread() function. As previously presented it didn't correctly
handle cases where some other process (i.e., the Log File Viewer in progress) deleted the contents of the file. In the new version, just prior to writing a message
to the log file, we do a seek to the current end of the log file.
I also changed the way I generate a timestamp, to use the GetLocalTime() API rather than CTime. I have no idea how I managed to miss this API.
Fortunately CodeProject's Feature Article showed this[^] article at just the right time. Thanks to RK_2000[^].
The new timestamp at least pretends to present millisecond timing.
- November 29, 2003. Added some explanatory remarks about why I used named pipes as the communications mechanism between the service and client code. Also added
an explanation of why the client code tries 10 times to establish a connection before giving up in disgust.
- December 9, 2003. Updated the downloads.
- January 7, 2004. Added sections describing how to make the service auto start, and how to uninstall it.