Introduction
Whilst still a beginner-level task, loading an entire file into memory in C is not entirely easy. One problem is that you have to grow the buffer dynamically: some streams are not seekable, so you cannot determine in advance how much data the file contains. It is also a situation where memory allocation failure has to be treated as a real rather than a theoretical possibility, because some files can get quite large.
Using the code
Copy and paste these two functions (they need <stdio.h>, <stdlib.h>, and <limits.h>):
static char *fslurp(FILE *fp)
{
    char *answer;
    char *temp;
    int buffsize = 1024;
    int i = 0;
    int ch;

    answer = malloc(buffsize);
    if (!answer)
        return 0;
    while ((ch = fgetc(fp)) != EOF)
    {
        if (i == buffsize - 2)
        {
            /* grow by 10% plus a little, guarding against integer overflow */
            if (buffsize > INT_MAX - 100 - buffsize / 10)
            {
                free(answer);
                return 0;
            }
            buffsize = buffsize + 100 + buffsize / 10;
            temp = realloc(answer, buffsize);
            if (temp == 0)
            {
                free(answer);
                return 0;
            }
            answer = temp;
        }
        answer[i++] = (char) ch;
    }
    answer[i++] = 0;
    /* shrink to fit; if the shrink fails, the original block is still valid */
    temp = realloc(answer, i);
    if (temp)
        return temp;
    else
        return answer;
}
We have the canonical fgetc() loop, and we grow the buffer when we run out of room. We use a temporary pointer for the realloc() call so that an allocation failure does not leak the buffer. We also sanity-test the input, limiting it to slightly under INT_MAX. This means that the string can be indexed by an int, which avoids a lot of subtle bugs.
Of course you can disable this test if the inputs are encyclopedias or other massive data, but you will know when you are dealing with such data; this is the function for general use. We write the function to take a FILE * rather than a path. The most likely error is failure to open the stream with fopen(), and user code is best equipped to handle that. A wrapper that takes a filename is easy to write, and some data arrives on streams that have no filename at all, so taking a FILE * is the better approach.
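As noted, the filename wrapper is trivial. Here is a minimal sketch (the wrapper name slurp_file is mine, not part of the code above; fslurp is repeated so the sketch compiles on its own):

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* fslurp as defined above (repeated here so this sketch is self-contained) */
static char *fslurp(FILE *fp)
{
    char *answer, *temp;
    int buffsize = 1024;
    int i = 0;
    int ch;

    answer = malloc(buffsize);
    if (!answer)
        return 0;
    while ((ch = fgetc(fp)) != EOF)
    {
        if (i == buffsize - 2)
        {
            if (buffsize > INT_MAX - 100 - buffsize / 10)
            {
                free(answer);
                return 0;
            }
            buffsize = buffsize + 100 + buffsize / 10;
            temp = realloc(answer, buffsize);
            if (!temp)
            {
                free(answer);
                return 0;
            }
            answer = temp;
        }
        answer[i++] = (char) ch;
    }
    answer[i++] = 0;
    temp = realloc(answer, i);
    return temp ? temp : answer;
}

/* Hypothetical wrapper: open the file, slurp it, close it.
   Returns NULL on open failure or out of memory. */
static char *slurp_file(const char *path)
{
    FILE *fp = fopen(path, "r");
    char *answer;

    if (!fp)
        return 0;
    answer = fslurp(fp);
    fclose(fp);
    return answer;
}
```

For binary data you would open with "rb" and call fslurpb (below) instead.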
static unsigned char *fslurpb(FILE *fp, int *len)
{
    unsigned char *answer = 0;
    unsigned char *temp;
    int capacity = 1024;
    int N = 0;
    int ch;

    answer = malloc(capacity);
    if (!answer)
        goto out_of_memory;
    while ((ch = fgetc(fp)) != EOF)
    {
        answer[N++] = ch;
        if (N >= capacity)
        {
            /* grow by 50%, guarding against integer overflow */
            if (capacity > INT_MAX / 2)
                goto out_of_memory;
            temp = realloc(answer, capacity + capacity / 2);
            if (!temp)
                goto out_of_memory;
            answer = temp;
            capacity = capacity + capacity / 2;
        }
    }
    *len = N;
    return answer;
out_of_memory:
    *len = -1;
    free(answer);
    return 0;
}
This is very similar to the previous function, but it uses a rather more aggressive approach to growing the buffer, and because you cannot obtain the length by looking for a nul byte, it has to take a length return parameter. It's hard to know which growth strategy is best: often, after the first reallocation, realloc() puts the block in a location where it can expand in place, so further reallocations are very cheap.
Don't hesitate to use goto as an out-of-memory handler. Technically the function is no longer structured as a result, but it is neater than using lots of if statements to thread out-of-memory conditions through the normal control flow.
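A short usage sketch for fslurpb(): check both the returned pointer and the length (len is set to -1 on failure). The helper count_zero_bytes is mine, purely for illustration, and fslurpb is repeated so the sketch compiles on its own:

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* fslurpb as defined above (repeated here so this sketch is self-contained) */
static unsigned char *fslurpb(FILE *fp, int *len)
{
    unsigned char *answer = 0;
    unsigned char *temp;
    int capacity = 1024;
    int N = 0;
    int ch;

    answer = malloc(capacity);
    if (!answer)
        goto out_of_memory;
    while ((ch = fgetc(fp)) != EOF)
    {
        answer[N++] = ch;
        if (N >= capacity)
        {
            if (capacity > INT_MAX / 2)
                goto out_of_memory;
            temp = realloc(answer, capacity + capacity / 2);
            if (!temp)
                goto out_of_memory;
            answer = temp;
            capacity = capacity + capacity / 2;
        }
    }
    *len = N;
    return answer;
out_of_memory:
    *len = -1;
    free(answer);
    return 0;
}

/* Illustrative caller: slurp a binary stream, which may contain embedded
   nul bytes, and count the zero bytes. Returns -1 on failure. */
static int count_zero_bytes(FILE *fp)
{
    int len;
    int i, zeros = 0;
    unsigned char *data = fslurpb(fp, &len);

    if (!data)
        return -1;       /* out of memory; len has been set to -1 */
    for (i = 0; i < len; i++)
        if (data[i] == 0)
            zeros++;
    free(data);
    return zeros;
}
```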
Points of Interest
You can actually put the reallocation strategy on a firmer mathematical footing by looking at the statistical distribution of file sizes, which is approximately log-normal. The means and medians differ between PC, Linux, and Mac, however.
History
I try to keep these functions handy. You never know when you will reach for them.