Introduction
This function is designed to take XML and make it easy to read by adding appropriate line breaks and tab indentation.
I wrote this a few nights ago because I got sick of the XML files my work transmits (about 40 MB), all on one line to save bandwidth. Notepad++ has a lovely plugin that does this tidy-up, but it was taking about 45 minutes to do a file. This does it in about 5 seconds.
Since I only had my own XML files to test on, there are probably some things I've overlooked. If you find XML that is not getting its line breaks or indentation right, please send me a sample.
Using the code
This little function takes two arguments, an input stream and an output stream. It reads through the input stream, formats it, and writes the result to the output stream.
I've written a very crude Win32 API interface for selecting the input and output files, for simple usage. The original version just involved chucking the EXE in a directory with some XML files and double-clicking it, and it would process every XML file in that directory.
While the interface is obviously Windows-only, the function itself should be platform independent.
To call this function, use:
tidyXML(inputFile, outputFile);
Here inputFile and outputFile are references to an ifstream and an ofstream, respectively.
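As a rough sketch of how the call might be wrapped, the helper below opens both files, checks that they opened, and then hands the streams to a formatter. The names formatFile and StreamFormatter are made up for this example and are not part of the download; the only assumption about tidyXML is the signature shown in this article.

```cpp
#include <fstream>
#include <string>

// Pointer to any function with the same signature as the article's tidyXML.
typedef void (*StreamFormatter)(std::ifstream&, std::ofstream&);

// Hypothetical helper (not part of the article's code): opens the two files,
// runs the formatter on them, and reports whether everything succeeded.
bool formatFile(const std::string& inPath,
                const std::string& outPath,
                StreamFormatter formatter) {
    std::ifstream input(inPath.c_str());
    if (!input.is_open())
        return false;   // input file missing or unreadable

    std::ofstream output(outPath.c_str());
    if (!output.is_open())
        return false;   // output file could not be created

    formatter(input, output);
    return output.good();   // false if any write failed
}
```

With the function from this article in the same project, a call would look like formatFile("in.xml", "out.xml", tidyXML). The file names here are placeholders.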
Here is the function itself. I'm a bit lazy with comments, so if you don't understand why a certain bit is done a certain way, or just want specific places commented, let me know in the comments below and I'll try adding some more. It should be pretty self-explanatory, though.
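#include <fstream>
#include <string>
using namespace std;

void tidyXML(ifstream &input, ofstream &output) {
    char currChar;
    char nextChar;
    int indent = -1;                // starts at -1 so the root element ends up at depth 0
    string currKeyStore = "";       // tag currently being read
    string lastKeyStore = "";       // previous tag, held back until the current one is classified
    string valueOrJunkStore = "";   // text seen between tags
    bool inKey = false;             // true while between '<' and '>'
    bool skipNextIndent = false;    // true when the next tag continues the current line
    enum keyType {
        unset,
        infoLine,     // <?...?> or <!...>
        entryKey,     // <tag>
        exitKey,      // </tag>
        emptyValue,   // <tag/>
    };
    keyType lastKeyType = unset, currKeyType = unset;

    while (true) {
        currChar = input.get();
        if (!input.good()) {
            // End of input: write out the tag that was still being held.
            output << currKeyStore;
            break;
        }
        nextChar = input.peek();
        if (!input.good())
            nextChar = '\0';

        if (currChar == '<') {
            // A new tag starts, so the current tag becomes the held-back "last" tag.
            inKey = true;
            lastKeyType = currKeyType;
            lastKeyStore = currKeyStore;
            currKeyType = unset;
            currKeyStore = "";
            if (nextChar == '/')
                currKeyType = exitKey;
            else if (nextChar == '?' || nextChar == '!')
                currKeyType = infoLine;
        }

        // Still unclassified just before '>': "/>" is an empty element, anything else an opening tag.
        if (currKeyType == unset && nextChar == '>') {
            if (currChar == '/')
                currKeyType = emptyValue;
            else
                currKeyType = entryKey;
        }

        if (inKey)
            currKeyStore += currChar;
        else
            valueOrJunkStore += currChar;

        if (currChar == '>') {
            // The current tag is complete, so the held-back tag can now be written.
            inKey = false;
            if (!skipNextIndent)
                for (int i = 0; i < indent; ++i)
                    output << '\t';
            output << lastKeyStore;
            skipNextIndent = false;
            if (lastKeyType == entryKey && currKeyType == exitKey) {
                // <tag>value</tag>: keep the value and the closing tag on the same line.
                skipNextIndent = true;
                output << valueOrJunkStore;
            } else if (lastKeyType != unset) {
                output << endl;
            }
            valueOrJunkStore = "";
            if (lastKeyType == exitKey || lastKeyType == emptyValue)
                --indent;
            if (currKeyType == entryKey || currKeyType == emptyValue)
                ++indent;
        }
    }
}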
I compiled the sample version (with the crude Win32 interface) using MinGW, with the following command:
mingw32-g++ --std=c++0x -Wall -fno-builtin -O3 -static *.cpp -lcomdlg32 -o tidyxml.exe
I've statically linked it so that it should run as-is, without the dependency hell you can otherwise get.
Also provided is the double-click build.bat file I use. It also strips out the debugging symbols.
Points of Interest
The main pain in writing this is that you always have to read the next tag before you know what to do with the current one. Or, better put: save the current tag, then read the next tag, so you can decide what to do with the saved one.
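To show the "hold one tag, look at the next" idea on its own, here is a stripped-down sketch. It is not the function above: it assumes the input has already been split into tags plus the text that followed each one, and the names Token, isOpening, isClosing and layoutWithLookahead are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical pre-split token: a tag and the text that followed it.
struct Token {
    std::string tag;    // e.g. "<b>" or "</b>"
    std::string text;   // character data between this tag and the next one
};

// True for a closing tag such as </b>.
static bool isClosing(const std::string& t) {
    return t.size() > 2 && t[1] == '/';
}

// True for a plain opening tag such as <b>, but not <b/>, <?...?> or <!...>.
static bool isOpening(const std::string& t) {
    return t.size() > 2 && t[1] != '/' && t[1] != '?' && t[1] != '!'
        && t[t.size() - 2] != '/';
}

// Writes each tag only after peeking at the one that follows it.
std::string layoutWithLookahead(const std::vector<Token>& tokens) {
    std::string out;
    int depth = 0;
    bool continuingLine = false;   // the previous tag left the line open

    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const Token& held = tokens[i];
        const Token* next = (i + 1 < tokens.size()) ? &tokens[i + 1] : 0;

        if (isClosing(held.tag) && depth > 0)
            --depth;
        if (!continuingLine)
            out.append(static_cast<std::size_t>(depth), '\t');
        out += held.tag;
        continuingLine = false;

        if (next && isOpening(held.tag) && isClosing(next->tag)) {
            out += held.text;        // <b>value</b> stays on one line
            continuingLine = true;
        } else if (next) {
            out += '\n';
        }
        if (isOpening(held.tag))
            ++depth;
    }
    return out;
}
```

The decision about a line break after a tag can only be made once the following tag is known, which is why the real function keeps lastKeyStore around while it is still reading currKeyStore.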
The original version was full of bools to plan it all out and track everything. These were replaced by the enum, which makes the code a lot cleaner and easier to follow. Well, I think so anyway.
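For illustration, this is roughly what the enum buys you. The sketch below reuses the enum values from the function, but classifyTag and keepOnOneLine are helper names made up here, and they work on a complete tag string rather than character by character as the real function does.

```cpp
#include <string>

// Same categories as the enum inside tidyXML.
enum keyType { unset, infoLine, entryKey, exitKey, emptyValue };

// Hypothetical helper: classifies one complete tag such as "<a>" or "</a>".
keyType classifyTag(const std::string& tag) {
    if (tag.size() < 3 || tag[0] != '<' || tag[tag.size() - 1] != '>')
        return unset;                   // not a complete tag
    if (tag[1] == '/')
        return exitKey;                 // </tag>
    if (tag[1] == '?' || tag[1] == '!')
        return infoLine;                // <?xml ...?> or <!-- ... -->
    if (tag[tag.size() - 2] == '/')
        return emptyValue;              // <tag/>
    return entryKey;                    // <tag>
}

// One comparison of two enum values replaces a handful of bool flags.
bool keepOnOneLine(keyType last, keyType curr) {
    return last == entryKey && curr == exitKey;
}
```

Each formatting rule then reads as a comparison between the last and current tag types, instead of a combination of separate flags that all have to be kept in sync.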
History
- v1 - Original release.
- Renamed the main function from formatXML to tidyXML. Split the demo project out of the source file that contains this function.