|
Hi there - sorry I didn't reply earlier, but I didn't get a notification of your posting for some reason.
The easiest thing is for you to send me the file you're having trouble with. Can you do that?
|
|
|
|
|
Perhaps I'm blind, but where do I find your email address?
|
|
|
|
|
I've just submitted 1.6 to CodeProject. I didn't have time to include heuristics for detecting Unicode, or even to implement all the tests I wanted, but this has certainly been more thoroughly tested than previous releases. I know such comments invite disaster, so I now expect the worst.
|
|
|
|
|
I've got an odd problem. While this class works fine on small files, once I get over ~1 million lines, I run into a problem.
For the purposes of timing, I have a file that is just:
stoptime = stoptime + 1
.
.
.
stoptime = stoptime + 1
stoptime = -1
for about a million lines. When I load with your loader, it loads it fine, except that the end looks like this:
stoptime = stoptime + 1
stoptime = stoptime + 1
stoptime = stoptimÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ
It doesn't lose too many of the lines, but it does lose some. I save a backup of the file before I go through this process, and it looks fine in Notepad, so saving is fine (again, if I load that backup, it looks hashed at the end too). This leads me to believe the problem is in my load code. Using ReadString is entirely too slow, so I was trying to use GetCharCount, and I think that's where I may have gone wrong. Here's my code, which is definitely fast! It's nearly instantaneous, and I've deleted the hashed text to see how many lines I lose, and it really isn't many at all:
m_REditControl.LimitText(file.GetCharCount());

if (file.IsFileUnicodeText())
{
    _TCHAR *csBuffer = new _TCHAR[file.GetCharCount()];
    long lMask = m_REditControl.GetEventMask();
    m_REditControl.SetEventMask(lMask & ~ENM_CHANGE);
    file.Read(csBuffer, file.GetCharCount());
    m_progbarcontrol.SetPos(100);
    m_REditControl.SetWindowText(csBuffer);
    m_REditControl.SetEventMask(lMask | ENM_CHANGE);
    m_REditControl.SetModify(false);
    delete[] csBuffer;
}
else
{
    char *csBuffer = new char[file.GetCharCount()];
    long lMask = m_REditControl.GetEventMask();
    m_REditControl.SetEventMask(lMask & ~ENM_CHANGE);
    file.Read(csBuffer, file.GetCharCount());
    m_progbarcontrol.SetPos(100);
    CString csTemp;
    mbstowcs(csTemp.GetBuffer(file.GetCharCount()), csBuffer, file.GetCharCount());
    m_REditControl.SetWindowText(csTemp);
    m_REditControl.SetEventMask(lMask | ENM_CHANGE);
    m_REditControl.SetModify(false);
    delete[] csBuffer;
}
The reason I post this is that the loss seems independent of the exact size of the file beyond a particular point. Is this imprecision in GetCharCount()?
(Note: as may be obvious, I haven't worked with MFC long.)
|
|
|
|
|
Hi there,
I'm not sure how accurate GetCharCount() is, except that I'm pretty sure it's inaccurate for multibyte files. The thing is, if you're calling Read() then you're not using my class at all. It's just doing a straight read of all the bytes out of the file into a buffer (Read is actually a CFile member, so it's not string-specific at all). In this case, you'd be best using GetLength (which should be more accurate). However, if you need Unicode handling and all the rest (and that's why we're here, after all), there's no way around ReadString. If it's any consolation, I've done some work to improve the performance of CStdioFileEx, and I should be releasing the new version, well, now actually, but I'm kind of busy and never get around to wrapping it up. Stay tuned.
|
|
|
|
|
Yeah, I know Read() isn't your class, but I use your CStdioFileEx in many other portions of the code, and I was hoping to bum off your character count. It's close: on short files I pick up about four bogus characters at the end (which makes no sense at all, as we should hit an EOF), while a 2.1-million-line file loses about 100 lines, regardless of Unicode or ANSI (though the exact point at which it hashes moves a bit). I've looked over your GetCharCount() and it looks gold.
I'll try out the new version, I have a few other ideas to make that portion of the loader faster. I'm really hoping to get the above working though, as it makes the difference between a 40 second load and a 2 second one.
Thanks for working on the new version!
|
|
|
|
|
Hi,
I haven't tried the new version yet, as it just went up, but I hacked at the problem a bit more. (NOTE: if you are using MFC 7.0 or above, you can use a CStringA on the non-Unicode side and avoid having to dynamically cast the char array. I'm on 6.0, so, well, there you go.)
The problems were actually pretty obvious once I looked closely (well, except for one). First, I wasn't null-terminating, and second, the second parameter of CFile::Read takes the number of bytes, not characters (if they are multi-byte). I missed that distinction, and it was giving me fits.
It isn't giving me fits any more, and it is VERY fast (and no longer hashes). I can open a 130 MB file in ~1 second, which is much faster than reading a line at a time. I know that Read isn't part of this class, but I think it might be worth your while to overload ::Read to provide the same Unicode/ANSI ability the rest of it shares. There are benefits to both Read and ReadString.
I have tried a lot of different ways to avoid that dynamic cast on the non-Unicode side, and it's just not working. I tried reading in as a CString, and the conversion was still failing (giving me the hash on the last ~100 lines or so), so I'll just put this up for now, as it's still blistering fast, even if it will chew up your memory on an ANSI file.
That said, I'll go tinker with your new version now.
(where file is a CStdioFileEx)
iChars = file.GetCharCount();

if (file.IsFileUnicodeText())
{
    CString csInput;
    unsigned long iNumRead = file.Read(csInput.GetBuffer(iChars), iChars * sizeof(_TCHAR));
    csInput.ReleaseBuffer(iNumRead / sizeof(_TCHAR));  // set the length explicitly; the raw read doesn't null-terminate
    if (ferror(file.m_pStream))
    {
        err_msg(CString("Unexpected read error!"));
        return false;
    }
    // Drop the BOM, which a raw block read leaves in place
    CString csTotal = csInput.Right(csInput.GetLength() - 1);
}
else
{
    char *szBuffer = new char[iChars + 1];
    long lMask = m_REditControl.GetEventMask();
    m_REditControl.SetEventMask(lMask & ~ENM_CHANGE);
    unsigned long iNumRead = file.Read(szBuffer, iChars);
    if (ferror(file.m_pStream))
    {
        err_msg(CString("Unexpected read error!"));
        return false;
    }
    m_progbarcontrol.SetPos(100);
    szBuffer[iNumRead] = '\0';
    CString csTotal = szBuffer;
    delete[] szBuffer;
}
|
|
|
|
|
I forgot to mention that I've compiled this in Visual Studio 2005. I should mention that in the article. The old DSP is still there, but probably missing stuff. If you have v6 still, and you fix up the v6.0 DSP, please pass it to me and I'll include it in the demo project zip. Let me know of any problems.
As regards ::Read, you're right, I suppose I could include such a function, although I think it would probably be better to give it a different name to avoid messing with people's concept of what ::Read does. Perhaps just ReadChars, or even a new override of ReadString, with the number of characters specified (not bytes). I'll give it some thought. I'm sure nothing I've done would compare to the speed of reading the whole file as a single block!
Thanks!
|
|
|
|
|
I hadn't actually looked at the new version yet, as I was working on a custom OLE drag-and-drop system, which was fun. But I just loaded it up, and it looks like the old version is still up. I ran it through SourceSafe, and there was no difference between them.
That's true; the basic functionality of ::Read is byte reading. I think an easy block read like I've done would help a lot, particularly if you are just reading data without doing anything to it. I/O is relatively slow, and not having to seek the hard drive for each line saves a lot of time. That said, thanks for doing ReadString the way you do, as I use it in my parser.
In the version I posted, I left in my rich edit control manipulation (oops), but if you take that out, I'm sure you'd be able to clean it up even more. It IS fast; just make sure you take out the BOM on Unicode files, as using CFile::Read for block reading will leave it in.
Cheers, I'll be back next week, I hope the new version is up by then.
|
|
|
|
|
Sorry for yet more delays. I'm currently looking at improving the performance of the file operations, because testing has shown it to be somewhat slower than CStdioFile, but it shouldn't delay the new version too long.
|
|
|
|
|
Sorry for the interminable delay, everyone, but I changed computers, I had problems with the new version of Visual Studio and now I'm looking for a bit of test data I need. That on top of the usual stuff that takes up my time.
I hope to have the new version ready this week.
|
|
|
|
|
Hello,
I suggest setting nFileCodePage = GetACP() instead of -1.
This improves performance when GetACP() equals the current locale code page, which is always the case when the class is used to read an ANSI text file in a multibyte application (just as CStdioFile does).
My suggestion:
CStdioFileEx::CStdioFileEx() : CStdioFile()
{
    m_bIsUnicodeText = false;
    m_nFileCodePage = GetACP();
}
I hope this helps,
Manuel
|
|
|
|
|
Sounds totally reasonable. I'll look at it and probably include it in the next release.
I'm currently in the process of moving to VS2005. I'll try to get on and release a new version as quickly as I can.
|
|
|
|
|
Hi David,
Do you have the new version ready for us to test? I am definitely looking forward to it, as I use this class a lot!
Regards,
Dan
|
|
|
|
|
I'll try and get it ready soon. Sorry for the delay... you know, life keeps getting in the way.
|
|
|
|
|
Although a BOM for UTF-8 is optional per its specification, a number of programs expect to see the BOM to function correctly. The following modification to WriteString() addresses this need:
void CStdioFileEx::WriteString(LPCTSTR lpsz)
{
    const char UTF8_BOM[] = { char(0xEF), char(0xBB), char(0xBF) };
    wchar_t* pszUnicodeString = NULL;
    char*    pszMultiByteString = NULL;

    try
    {
        // If writing Unicode and at the start of the file, need to write the byte-order mark
        if (m_nFlags & CStdioFileEx::modeWriteUnicode)
        {
            // If at position 0, write byte-order mark before writing anything else
            if (!m_pStream || GetPosition() == 0)
            {
                wchar_t cBOM = (wchar_t)nUNICODE_BOM;
                CFile::Write(&cBOM, sizeof(wchar_t));
            }
        }
        // Otherwise, if writing UTF-8 and at the start of the file, need to write the UTF-8 byte mark
        else if (m_nFileCodePage == CP_UTF8)
        {
            // If at position 0, write byte-order mark before writing anything else
            if (!m_pStream || GetPosition() == 0)
            {
                CFile::Write(UTF8_BOM, sizeof(UTF8_BOM));
            }
        }
        // ... (remainder of WriteString unchanged)
|
|
|
|
|
There are two problems with ReadString that I noticed:
1) the ReadString variant which takes a buffer and a length still reverts to the old CStdioFile behavior which is not Unicode aware.
2) the CStdioFileEx::ReadString (the one which takes a CString&) may read different strings depending on whether you compile with _UNICODE defined or not. In the cases when ReadString doesn't delegate to CStdioFile::ReadString, it will only read a full line of text if its length is less than 4096 characters. So if you read an ANSI file from an ANSI build, you will read a full line even if it is very long, but if you read the same file from a Unicode build, you will only read the first 4096 characters.
|
|
|
|
|
1) I'll look into this. The problem is, I'm not sure I can retain compatibility with the CStdioFile version, because it leaves \n characters in place, whereas with the line-reading code I use I probably won't be able to.
2) You're right. The length limit problem should be solved by the new release.
-- modified at 14:58 Sunday 24th June, 2007
|
|
|
|
|
As you can see, I want to support multiple languages. Some of my interfaces for communication and file I/O cannot change, but the user interface has to change to support multiple languages.
It is easy to add multi-language support by using CString (with the UNICODE macro defined) and the CStdioFileEx you provide, but it is difficult for those interfaces that cannot change.
I have found a class named CStringA that supports ANSI strings, but I don't know how to support ANSI files when the UNICODE macro is defined.
Can you give me some advice?
|
|
|
|
|
This doesn't really have an easy answer. If you want to use another string class to avoid depending on CString, you're going to have to end up replicating what CString does. There's a fair amount of spadework involved in supporting Unicode, which is why quite a lot of us are happy to leave it to Microsoft to supply us with CString.
STL does provide an alternative of sorts, but its multilingual support has always lagged MFC's, quite apart from the stress of looking at all those pointy brackets. However, I believe STL does now give you enough to handle multibyte and Unicode transparently, although how and in what version I'm not sure.
|
|
|
|
|
Hello,
Why does WriteString use the CFile::Write method instead of CStdioFile::WriteString since version 1.3 of your project?
Why haven't you implemented CStdioFileEx::Write and CStdioFileEx::WriteString methods?
Thanks for your answers.
Regards,
Cédric
|
|
|
|
|
Sorry Cedric, if you're still out there. I was checking through messages and came across yours, and realised I hadn't answered.
The reason I use ::Write is because it's a byte-mode write that doesn't impose any "interpretation" on line feeds and such like. It's easier and clearer to sort out what I'm going to write in my code, and then just say "write this", without wondering whether it's going to get mangled.
This sort of answers the second question. ::Write is a CFile member which deals with binary data. It doesn't know about text, and doesn't care. Overriding it would make no sense, because the purpose of StdioFileEx is to implement high-level handling of multilingual data. ::Write does what it needs to do, there's no override needed.
|
|
|
|
|
I added this code to CStdioFileEx::IsFileUnicode to detect a Unicode file that does not have a BOM, after the code that looks just for the BOM ...
//JLW. The above test is not adequate for C:\WINDOWS\Debug\UserMode\userenv.log and userenv.bak.
//They are UNICODE but typically don't have the UNICODE BOM.
if (!bIsUnicode)
{
    // NULL in the 3rd arg tells the API to use all available tests to determine
    // whether the data in the buffer is likely to be Unicode text.
    bIsUnicode = IsTextUnicode(sFilePath, sizeof(sFilePath), NULL);
}
|
|
|
|
|
Correction:
bIsUnicode = IsTextUnicode(sFilePath, file.GetLength(), NULL); // Need the size of the file, not the file name.
Though I'm still getting bIsUnicode == 0 when examining a Unicode file that has no BOM.
|
|
|
|
|
Correction #2 -- it's important to examine the contents of the file, not the file name (doh!). This works ...
//Examine a subset of the file's contents.
file.SeekToBegin(); // Just in case.
wchar_t pBuf[200];
UINT uiLen = file.Read(pBuf, 200);
bIsUnicode = IsTextUnicode(pBuf, uiLen, NULL);
|
|
|
|
|