C / C++ / MFC

_findfirst and fopen very slow

cristiapi 31-Jan-19 1:13

I have about 1800 text files in a directory, and their number is slowly growing (several new files per day). Each file is about 40 kB.
I need to enumerate all the files with the ".states" extension, open each file, check whether it has a line that starts with a particular sequence of 4 chars and, if the line exists, save the line in a std::vector.

I use the following code, but it takes a very long time on the first run (the subsequent runs are very fast):

C++

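#include <io.h>      // _findfirst, _findnext, _findclose
#include <stdio.h>   // fopen, fgets, fclose
#include <string.h>  // _strnicmp
#include <string>
#include <vector>

// LoadStates and 'lines' are illustrative names; the original post shows only the loop.
void LoadStates(std::vector<std::string>& lines)
{
    char buf[1024]; // assumed size; the original post does not show how buf is declared
    struct _finddata_t fd;
    intptr_t hFile = _findfirst("*.states", &fd); // the handle is intptr_t, not long (matters on 64-bit)
    if (hFile == -1) return; // File not found
    do {
        FILE *f = fopen(fd.name, "rS"); // "S": MSVC hint that access is mostly sequential
        if (!f) continue; // skip files that cannot be opened
        while (fgets(buf, sizeof(buf), f)) {
            if (0 == _strnicmp(buf, "ABCD", 4)) {
                lines.push_back(buf); // save the matching line
                break;
            }
        }
        fclose(f);
    } while (_findnext(hFile, &fd) == 0);
    _findclose(hFile);
}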

Is there any way to speed up the code?

If I merge all the files into a single file, I solve the problem, but I prefer to keep all the original files.

Re: _findfirst and fopen very slow

David Crow 31-Jan-19 3:50

Are the files that were in that folder on Monday (for example) still there on Tuesday? In other words, are you processing every file in that folder, or just the new ones?

Keep in mind that file I/O is arguably one of the slowest operations on a computer. It can be sped up to a marginal degree, but you are ultimately at the mercy of the disk.

"One man's wage rise is another man's price increase." - Harold Wilson

"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

"You can easily judge the character of a man by how he treats those who can do nothing for him." - James D. Miles

modified 31-Jan-19 12:11pm.

Re: _findfirst and fopen very slow

cristiapi 31-Jan-19 3:58

David Crow wrote:
Are the files that were in that folder on Monday still there on Tuesday?

Yes.

In other words, are you processing every file in that folder, or just the new ones?

Every file in that folder.

Re: _findfirst and fopen very slow

David Crow 31-Jan-19 4:23

So do the files that were processed on a Monday need to be processed again the next day?

Re: _findfirst and fopen very slow

cristiapi 31-Jan-19 5:34

David Crow wrote:
So do the files that were processed on a Monday need to be processed again the next day?

Yes. All those files can be thought of as a database that I need to read every time I start the program.

Re: _findfirst and fopen very slow

David Crow 31-Jan-19 6:01

Have you considered doing the processing in a (background) worker thread?

Re: _findfirst and fopen very slow

cristiapi 31-Jan-19 6:07

No, because when I start the program, I need to first process the files.

Re: _findfirst and fopen very slow

David Crow 31-Jan-19 6:14

That fact does not negate the need/use of a worker thread. Part of the program's slowness may be that of perception. By having a responsive UI (not saying that yours is), the perception that the program is running slow is minimized.
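
A minimal sketch of the worker-thread idea, assuming standard C++ `std::async` rather than an MFC worker thread. The names `scanFolder`, `startScan` and `scanFinished` are hypothetical, and `scanFolder` is only a stand-in for the real enumeration and reading loop. It does not make the disk any faster; it only keeps the calling thread free while the scan runs.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the slow part: enumerate the folder, open each
// file, and collect the wanted lines. Here it just returns fixed data.
std::vector<std::string> scanFolder()
{
    return { "ABCD first", "ABCD second" };
}

// Start the scan on another thread so the caller (for example the UI thread)
// is not blocked while the disk is busy.
std::future<std::vector<std::string>> startScan()
{
    return std::async(std::launch::async, scanFolder);
}

// Non-blocking check: true once the results can be fetched with get().
bool scanFinished(std::future<std::vector<std::string>>& f)
{
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

The UI could call `scanFinished` from a timer or idle handler and fetch the lines with `get()` once it returns true.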

Re: _findfirst and fopen very slow

jeron1 31-Jan-19 4:58

If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

"the debugger doesn't tell me anything because this code compiles just fine" - random QA comment
"Facebook is where you tell lies to your friends. Twitter is where you tell the truth to strangers." - chriselst
"I don't drink any more... then again, I don't drink any less." - Mike Mullikins uncle

Re: _findfirst and fopen very slow

cristiapi 31-Jan-19 5:20

jeron1 wrote:
If you comment out the 'Save buf in a std::vector' code, is it still unacceptably slow?

Yes; that code takes a totally negligible amount of time.

If I only do the search, without opening the file, it's very fast, but if I also open the file, the code is terribly slow.

Re: _findfirst and fopen very slow

David Crow 31-Jan-19 6:04

Member 3648633 wrote:

If I only do the search, without opening the file

How can you search the contents of a file without first opening it?

modified 31-Jan-19 12:14pm.

Re: _findfirst and fopen very slow

cristiapi 31-Jan-19 6:10

By "search" I meant the file enumeration with _findnext().

Re: _findfirst and fopen very slow

jeron1 31-Jan-19 8:28

Maybe read the whole file at once as opposed to many fgets() calls, then do your string search in RAM?
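
Something along these lines, as a rough sketch in portable C++. The function names are made up for illustration, and unlike `fgets()` the returned line has its line terminator stripped. It replaces the many `fgets()` calls with one read of the whole file followed by a search in memory; it cannot remove the cost of opening each file.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Case-insensitive check that `text` has `prefix` at position `pos`.
static bool startsWithNoCase(const std::string& text, std::size_t pos,
                             const std::string& prefix)
{
    if (text.size() - pos < prefix.size()) return false;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (std::toupper(static_cast<unsigned char>(text[pos + i])) !=
            std::toupper(static_cast<unsigned char>(prefix[i])))
            return false;
    }
    return true;
}

// Return the first line in `text` that starts with `prefix`
// (without its line terminator), or an empty string if there is none.
std::string findLine(const std::string& text, const std::string& prefix)
{
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t end = text.find('\n', pos);
        if (end == std::string::npos) end = text.size();
        if (startsWithNoCase(text, pos, prefix)) {
            std::string line = text.substr(pos, end - pos);
            if (!line.empty() && line.back() == '\r') line.pop_back();
            return line;
        }
        pos = end + 1; // continue with the next line
    }
    return std::string();
}

// Read the whole file in one go, then search it in memory.
std::string findLineInFile(const std::string& path, const std::string& prefix)
{
    std::ifstream in(path, std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return findLine(text, prefix);
}
```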

Re: _findfirst and fopen very slow

cristiapi 1-Feb-19 5:57

I tried that method, but nothing changes.

Any kind of file opening (including MapViewOfFile) terribly slows the process down.

Re: _findfirst and fopen very slow

Randor 31-Jan-19 14:23

Hi,

Some comments:

1.) Reading files may be slightly faster if you read using a multiple of the drive sector size or read it all at once.

2.) If you've ever wondered why file-backed cache implementations save files into a hierarchical folder structure it's because enumerating 10,000 files in a single folder may cause a small performance hit. If you plan on storing many thousands of files you may want to design a folder structure. Maybe something simple such as alphabetical A-C, D-F ... or something based on timestamp. This is not much of an issue on a modern SSD but old spindle drives take a performance hit.

3.) The code you have shown above is reading the file contents into a local buffer. You would get a huge performance boost by using the MapViewOfFile function to map the file directly into your process space.

Have a look at the Creating a View Within a File sample. This sample demonstrates how to take a large file and map only 1 KB at a time into your process. Don't do that.

You stated that your files are around 40 kB, so I'd recommend mapping the entire file into your process address space. I'd also recommend using two file mappings. While FileA is being processed you can have the operating system map FileB into your process. This would mitigate any latency caused by the I/O subsystem.

The majority of your latency is between opening files. I highly recommend the second file mapping.
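
A rough sketch of the overlap idea, with the caveat that it uses `std::async` and ordinary stream reads as a portable stand-in for a second `MapViewOfFile` mapping, so it illustrates the pipelining and not the mapping itself. `readAll` and `processWithPrefetch` are hypothetical names. Whether it helps depends on where the time goes; if the disk is busy seeking between files, the gain may be small.

```cpp
#include <cstddef>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Read a whole file into memory (binary mode, no newline translation).
std::string readAll(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Call process(contents) for every path, loading file i+1 on another thread
// while file i is being processed.
template <typename Fn>
void processWithPrefetch(const std::vector<std::string>& paths, Fn process)
{
    if (paths.empty()) return;
    std::future<std::string> next =
        std::async(std::launch::async, readAll, paths[0]);
    for (std::size_t i = 0; i < paths.size(); ++i) {
        std::string current = next.get();      // wait for file i
        if (i + 1 < paths.size())              // start loading file i+1
            next = std::async(std::launch::async, readAll, paths[i + 1]);
        process(current);                      // runs while file i+1 loads
    }
}
```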

Best Wishes,
-David Delaune

Re: _findfirst and fopen very slow

cristiapi 1-Feb-19 0:29

I tried with one file mapping only; it works, but the speed is exactly the same.

Probably the only way is to merge all the files into one big file. That way I can also optimize the file format for my needs.

Thank you all.

Re: _findfirst and fopen very slow

leon de boer 31-Jan-19 14:54

The more obvious answer is to have whatever saves the files write the ones that contain your string ABCD under a special name. Then you don't have to search inside the files at all to find the ones you want. Another obvious choice is to keep the files on a RAM disk, as there isn't much data.

The whole process seems a bit backward to me: you are working on the reading code, not the writing code.

In vino veritas

modified 31-Jan-19 21:00pm.

Re: _findfirst and fopen very slow

Stefan_Lang 31-Jan-19 21:36

That's a good approach, but as I understand it the main task is not to find files that contain the string, but find all lines within these files containing it. A specific kind of filename wouldn't be enough.

Your suggestion to include the writing into the problem solution is a good idea. However, if we do that, we might as well write all the data to a database. Retrieving the correct lines would then only require a simple SQL query.

GOTOs are a bit like wire coat hangers: they tend to breed in the darkness, such that where there once were few, eventually there are many, and the program's architecture collapses beneath them. (Fran Poretto)

Re: _findfirst and fopen very slow

leon de boer 1-Feb-19 4:42

You only need one instance to flag the file as special; who cares how many times the special sequence is written after that? The idea is to eliminate the mass of files that aren't of any interest by using the name.

I am not changing anything other than the name of the file. That is hardly complex or rocket science, and it is much easier and much faster than a database connection :)

Re: _findfirst and fopen very slow

Stefan_Lang 1-Feb-19 4:59

I did not see the OP mention how many of the files do have that symbol. If a significant fraction of the files are affected, your solution would not help a lot.

leon de boer wrote:
much faster than a database connection

.. to implement, sure. But certainly not to execute. ;)

Re: _findfirst and fopen very slow

cristiapi 1-Feb-19 5:44

Stefan_Lang wrote:
as I understand it the main task is not to find files that contain the string, but find all lines within these files containing it.

Each file usually contains 190 to 220 lines, and a file may or may not contain the wanted line (but almost all the files do). If the wanted line is in the file, it occurs only once.

Re: _findfirst and fopen very slow

cristiapi 1-Feb-19 6:24

leon de boer wrote:
The more obvious answer is get whatever is saving the files to put the ones which contain your string ABCD out under a special name string. Then you don't have to search inside the file at all to find the files you want.

What happens if I need to find "BCDE" or "S1 " or "01FA" or ...?

Another obvious choice is have the files on a ramdisk as there isn't much data.

If I put the folder on the SSD, the process is much faster: 60.9 s for the HD and 2.9 s for the SSD. The SSD is an "unusual" location for that folder because all the other files are on the HD, but it's the easiest solution.
Thank you

Re: _findfirst and fopen very slow

leon de boer 1-Feb-19 21:40

Quote:
What happens if I need to find "BCDE" or "S1 " or "01FA" or ...?

Label them differently with a special name, obviously; all you are doing is coming up with a file-naming convention :)

Hell, use the file extension you already have (*.states) and use its numeric suffix as a bit mask of which special strings are in the file:

*.states = file with no special tags
*.states1 = file with special tag 1 in it
*.states2 = file with special tag 2 in it
*.states3 = file with special tag 1 & 2 in it
*.states4 = file with special tag 3 in it
*.states5 = file with special tag 1 & 3 in it
*.states6 = file with special tag 2 & 3 in it
*.states7 = file with special tag 1, 2 & 3 in it

You can tell which tags are in a file without ever opening it; all you need is the filename.
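
Decoding such a name could look roughly like this, treating the digit after ".states" as a bit mask (bit 0 = tag 1, bit 1 = tag 2, bit 2 = tag 3). `tagMaskFromName` and `hasTag` are hypothetical helper names, and the comparison is case-sensitive for brevity even though Windows file names are not.

```cpp
#include <cstddef>
#include <string>

// Returns the tag bit mask encoded after ".states" in a file name,
// 0 for a plain "*.states" file, or -1 if the name does not follow
// the convention.
int tagMaskFromName(const std::string& name)
{
    const std::string ext = ".states";
    std::size_t dot = name.rfind('.');
    if (dot == std::string::npos) return -1;
    if (name.compare(dot, ext.size(), ext) != 0) return -1;
    std::string digits = name.substr(dot + ext.size());
    if (digits.empty()) return 0;              // no special tags
    if (digits.size() != 1 || digits[0] < '1' || digits[0] > '7') return -1;
    return digits[0] - '0';
}

// True if the given tag (1, 2 or 3) is set in a mask from tagMaskFromName.
bool hasTag(int mask, int tag)
{
    return mask >= 0 && (mask & (1 << (tag - 1))) != 0;
}
```

With this, a name ending in ".states6" reports tags 2 and 3 but not tag 1, and the enumeration loop can skip a file without opening it.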

This is also obviously a Windows program, so why aren't you using the Windows API to open and read the file?

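HANDLE Handle = CreateFileA(fd.name, GENERIC_READ, FILE_SHARE_READ,
		NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); // Open the file (ANSI variant, since fd.name is a char array)
if (Handle != INVALID_HANDLE_VALUE)
{
    DWORD Actual = 0;
    // Read up to sizeof(buf)-1 bytes, leaving 1 byte for the terminating NUL
    if (ReadFile(Handle, buf, sizeof(buf) - 1, &Actual, NULL) && Actual > 0)
    {
        buf[Actual] = 0;   // NUL-terminate for the string functions below
        // This compares only the start of the buffer; testing every line
        // would need a scan for line breaks within buf.
        if (0 == _strnicmp(buf, "ABCD", 4)) {
            lines.push_back(buf); // 'lines' is assumed to be a std::vector<std::string>
        }
    }
    CloseHandle(Handle);
}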

modified 2-Feb-19 3:59am.

Re: _findfirst and fopen very slow

cristiapi 1-Feb-19 23:35

Since almost all the files contain the wanted string, I would need to open almost all the files anyway, so the speed-up would be negligible.

I don't use the Win32 API because, AFAIK, there is no fgets() equivalent and there is no speed-up if I read the whole file at once.

Re: _findfirst and fopen very slow

leon de boer 2-Feb-19 0:33

I gave you the fgets equivalent above (it's only a couple of lines of code). I am not convinced it isn't faster, because you will be using the standard console file handler for opening and reading through the standard library.

Anyhow I will leave you to it
