From your Regex pattern (yes, I converted it to C#):

    const string BLAST_QUERY_HIT_SECTION = @"Query=.+?Effective search space used: \d+";

Note the @ prefix: without a verbatim string literal, \d is an invalid escape sequence in C# and the constant won't compile.
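Presumably you apply that pattern along these lines (a sketch on my part; logFile is a placeholder, and RegexOptions.Singleline is needed so "." spans newlines), which is exactly what forces the whole file into memory:

    using System.IO;
    using System.Text.RegularExpressions;

    // Sketch of the Regex approach: the entire log has to be read
    // into one big string before the pattern can be matched.
    string text = File.ReadAllText(logFile);
    foreach (Match m in Regex.Matches(text, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline))
    {
        string queryHit = m.Value; // one complete Query= ... Effective search space used: NNN chunk
        // ... process queryHit ...
    }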
I'd guess that your file format is something like:

    (possibly some stuff on the lines ahead of)
    Query=<something query-identifier-ish>
    ... lots of (multiple?) lines of query result stuff ...
    (possibly some stuff on the lines ahead of)
    Effective search space used: <some digits>
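If that guess is right, a single chunk might look roughly like this (made-up filler on my part; only the two marker lines come from your pattern):

    Query= some-query-identifier, maybe with a description
    ... many lines of alignments, scores, statistics ...
    Effective search space used: 123456789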
So why not write a fairly simple loop that reads the BLAST log file a line at a time and assembles the query results as it goes?
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static IEnumerable<string> BlastQueryHits(string logFile)
    {
        bool inHit = false;
        StringBuilder sb = new StringBuilder();
        // File.ReadLines streams the file; it never loads the whole thing.
        foreach (string line in File.ReadLines(logFile))
        {
            if (inHit)
            {
                sb.AppendLine(line); // AppendLine preserves the line breaks
                if (line.StartsWith("Effective search space used:"))
                {
                    yield return sb.ToString();
                    inHit = false;
                }
            }
            else if (line.StartsWith("Query="))
            {
                sb.Clear().AppendLine(line); // start assembling a new hit
                inHit = true;
            }
        }
        // Emit a final hit that reached end-of-file without the closing marker.
        if (inHit)
            yield return sb.ToString();
    }
This will return each of the query hits as a string. Because the method is an iterator (yield return), the hits are assembled lazily, only as the caller asks for them, so there's no need to have everything in memory at the same time. Work with them one at a time!
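For instance, a plain sequential consumer (assuming logFile holds your path) only ever holds the current hit:

    using System;

    foreach (string queryHit in BlastQueryHits(logFile))
    {
        // Only this one hit is in memory at this point.
        Console.WriteLine("Hit of {0} characters", queryHit.Length);
    }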
Caveat: This code is "off the top of my head"... I didn't compile or execute it. It might have an "issue" or two, but should be pretty close! ;-)
Edit: revised based on comments below.
The way this could be used in a parallel-processing scenario would be with something like:
    using System.Threading.Tasks;

    string logFile = "path to blast log file";
    Parallel.ForEach(BlastQueryHits(logFile), DoSomethingWithOneQuery);

    private void DoSomethingWithOneQuery(string queryHit)
    {
        // process a single query hit here
    }
This will still produce the query hits only as needed: if it parallelizes over 3 cores, it produces the first 3 right away and then the next ones as the previous ones complete. It never builds them all at once and leaves them sitting in memory until the processing gets to them, so it should have a substantially lower memory footprint.
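One caveat I'll add (this goes beyond the code above): Parallel.ForEach's default partitioner for an IEnumerable buffers items in chunks per worker, so a few hits may be materialized ahead of time. If you want a strict one-hit-at-a-time hand-off, wrap the sequence in a non-buffering partitioner:

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // NoBuffering hands each worker exactly one item at a time,
    // instead of the default chunked buffering.
    var oneAtATime = Partitioner.Create(BlastQueryHits(logFile),
                                        EnumerablePartitionerOptions.NoBuffering);
    Parallel.ForEach(oneAtATime, DoSomethingWithOneQuery);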
The BlastQueryHits() method is equivalent to your Regex.Match, but it is much more efficient and will work with an arbitrarily large file (as long as each individual chunk stays under the ~2GB limit for .NET strings).