From your Regex pattern (yes, I converted it to C#):

    const string BLAST_QUERY_HIT_SECTION = @"Query=.+?Effective search space used: \d+";

Note the @ prefix: without a verbatim string literal, \d is an invalid escape sequence in C# and the constant won't compile.
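Presumably you apply that pattern along these lines (a sketch on my part; logFile is a placeholder, and RegexOptions.Singleline is needed so "." spans newlines), which is exactly what forces the whole file into memory:

    using System.IO;
    using System.Text.RegularExpressions;

    // Sketch of the Regex approach: the entire log has to be read
    // into one big string before the pattern can be matched.
    string text = File.ReadAllText(logFile);
    foreach (Match m in Regex.Matches(text, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline))
    {
        string queryHit = m.Value; // one complete Query= ... Effective search space used: NNN chunk
        // ... process queryHit ...
    }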
I'd guess that your file format is something like:

    (possibly some stuff on the lines ahead of)
    Query=<something query-identifier-ish>
    ... lots of (multiple?) lines of query result stuff ...
    (possibly some stuff on the lines ahead of)
    Effective search space used: <some digits>
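If that guess is right, a single chunk might look roughly like this (made-up filler on my part; only the two marker lines come from your pattern):

    Query= some-query-identifier, maybe with a description
    ... many lines of alignments, scores, statistics ...
    Effective search space used: 123456789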
So why not write a fairly simple loop that reads the BLAST log file a line at a time and assembles the query results as it goes?
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static IEnumerable<string> BlastQueryHits(string logFile)
    {
        bool inHit = false;
        StringBuilder sb = new StringBuilder();
        // File.ReadLines streams the file; it never loads the whole thing.
        foreach (string line in File.ReadLines(logFile))
        {
            if (inHit)
            {
                sb.AppendLine(line); // AppendLine preserves the line breaks
                if (line.StartsWith("Effective search space used:"))
                {
                    yield return sb.ToString();
                    inHit = false;
                }
            }
            else if (line.StartsWith("Query="))
            {
                sb.Clear().AppendLine(line); // start assembling a new hit
                inHit = true;
            }
        }
        // Emit a final hit that reached end-of-file without the closing marker.
        if (inHit)
            yield return sb.ToString();
    }
This will return each of the query hits as a string. Because the method is an iterator (yield return), the hits are assembled lazily, only as the caller asks for them, so there's no need to have everything in memory at the same time. Work with them one at a time!
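For instance, a plain sequential consumer (assuming logFile holds your path) only ever holds the current hit:

    using System;

    foreach (string queryHit in BlastQueryHits(logFile))
    {
        // Only this one hit is in memory at this point.
        Console.WriteLine("Hit of {0} characters", queryHit.Length);
    }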
Caveat: This code is "off the top of my head"... I didn't compile or execute it. It might have an "issue" or two, but should be pretty close! ;-)
Edit: revised based on comments below.
The way this could be used in a parallel-processing scenario would be with something like:
    using System.Threading.Tasks;

    string logFile = "path to blast log file";
    Parallel.ForEach(BlastQueryHits(logFile), DoSomethingWithOneQuery);

    private void DoSomethingWithOneQuery(string queryHit)
    {
        // process a single query hit here
    }
This will still produce the query hits only as needed: if it parallelizes over 3 cores, it produces the first 3 right away and then the next ones as the previous ones complete. It never builds them all at once and leaves them sitting in memory until the processing gets to them, so it should have a substantially lower memory footprint.
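One caveat I'll add (this goes beyond the code above): Parallel.ForEach's default partitioner for an IEnumerable buffers items in chunks per worker, so a few hits may be materialized ahead of time. If you want a strict one-hit-at-a-time hand-off, wrap the sequence in a non-buffering partitioner:

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // NoBuffering hands each worker exactly one item at a time,
    // instead of the default chunked buffering.
    var oneAtATime = Partitioner.Create(BlastQueryHits(logFile),
                                        EnumerablePartitionerOptions.NoBuffering);
    Parallel.ForEach(oneAtATime, DoSomethingWithOneQuery);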
The BlastQueryHits() method is equivalent to your Regex.Match, but it is much more efficient and will work with an arbitrarily large file (as long as each individual chunk stays under the ~2GB limit for .NET strings).