Comments by Mr. xieguigang 谢桂纲 (Top 15 by date)
Mr. xieguigang 谢桂纲
6-May-15 11:32am
Control.BeginInvoke(MethodInvoker) is still not working... the code is still stuck at this point.
Mr. xieguigang 谢桂纲
15-Nov-14 2:33am
Hey guy, I have worked out how to deal with this ultra-large text file parsing job. It takes three steps:
1. Load the data into memory, splitting it into 786 MB chunks (it seems UTF8.GetString cannot handle anything larger than about 1 GB), and cache each chunk in a list.
2. Use the regular expression to parse out the sections; since the regex matching runs on only one thread, Parallel LINQ can speed this job up.
3. Do the per-section text parsing job as I did before.
Here is my code:
''' <summary>
''' 786 MB appears to be the upper bound of what UTF8.GetString can handle.
''' </summary>
Const CHUNK_SIZE As Long = 1024L * 1024 * 786
Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

''' <summary>
''' Deals with files larger than 2 GB.
''' </summary>
''' <param name="LogFile">Path of the blast plain-text output log.</param>
Public Shared Function TryParseUltraLarge(LogFile As String, Optional Encoding As System.Text.Encoding = Nothing) As v228
    Call Console.WriteLine("Regular Expression parsing blast output...")

    'The default text encoding of the blast log is UTF-8.
    If Encoding Is Nothing Then Encoding = System.Text.Encoding.UTF8

    Using p As New Microsoft.VisualBasic.ConsoleProcessBar
        Call p.Start()

        Dim TextReader As New IO.FileStream(LogFile, IO.FileMode.Open)
        Dim ChunkBuffer As Byte() = New Byte(CHUNK_SIZE - 1) {}
        Dim LastIndex As String = ""
        Dim SectionChunkBuffer As New List(Of String)

        Do While TextReader.Position < TextReader.Length
            Dim Delta As Long = TextReader.Length - TextReader.Position
            If Delta < CHUNK_SIZE Then ChunkBuffer = New Byte(CInt(Delta) - 1) {}

            'Fill the whole buffer (the original passed ChunkBuffer.Count - 1,
            'which silently dropped the last byte of every chunk).
            Call TextReader.Read(ChunkBuffer, 0, ChunkBuffer.Length)

            Dim SourceText As String = Encoding.GetString(ChunkBuffer)
            If Not String.IsNullOrEmpty(LastIndex) Then
                SourceText = LastIndex & SourceText
            End If

            Dim i_LastIndex As Integer = InStrRev(SourceText, "Effective search space used:")
            If i_LastIndex = 0 Then
                'InStrRev returns 0 (not -1) when no complete section is in the current chunk.
                LastIndex &= SourceText
                Continue Do
            Else
                i_LastIndex += 42 'Skip past the terminator phrase and its number.
                If Not i_LastIndex >= Len(SourceText) Then
                    'The tail text of this chunk belongs to a section that continues in the next chunk.
                    LastIndex = Mid(SourceText, i_LastIndex)
                Else
                    LastIndex = ""
                End If
                Call SectionChunkBuffer.Add(SourceText)
            End If
        Loop
        '(a commented-out, non-parallel variant followed here; the rest of the
        'comment was truncated by the site)
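Step 2 of the plan above (parallel regex matching over the cached chunks) could be sketched with PLINQ roughly like this; `ParseSections` and its argument names are illustrative, not part of the original program:

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module ParallelSectionParser
    'Same pattern as in the code above.
    Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

    ''' <summary>
    ''' Runs the section regex over every cached chunk in parallel and
    ''' flattens the matched section strings into one array.
    ''' </summary>
    Public Function ParseSections(chunks As IEnumerable(Of String)) As String()
        Return chunks.AsParallel() _
                     .SelectMany(Function(chunk) Regex.Matches(chunk, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline Or RegexOptions.IgnoreCase) _
                                                      .Cast(Of Match)() _
                                                      .Select(Function(m) m.Value)) _
                     .ToArray()
    End Function
End Module
```

Each chunk is matched independently, so the work scales across cores; order of the resulting sections is not guaranteed unless `.AsOrdered()` is added.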
Mr. xieguigang 谢桂纲
14-Nov-14 4:06am
I am trying your advice of splitting the big file into chunks and then searching within each chunk; maybe I can find a solution tonight. The difficulty of this job is how to make the loading process parallel, or it may take a whole day just for the loading...
Mr. xieguigang 谢桂纲
14-Nov-14 3:57am
Here is an example of a section: each section starts with "Query=" and ends with "Effective search space used:"
blablablabla.........
Query= XC_0118 transcriptional regulator
Length=1113
Score E
Sequences producing significant alignments: (Bits) Value
lcl5167|ana:all4503 two-component response regulator; K07657 tw... 57.4 2e-009
lcl2658|cyp:PCC8801_3460 winged helix family two component tran... 56.6 4e-009
lcl8962|tnp:Tnap_0756 two component transcriptional regulator, ... 55.1 7e-009
lcl9057|kol:Kole_0706 two component transcriptional regulator, ... 55.1 1e-008
lcl9114|ter:Tery_2902 two component transcriptional regulator; ... 55.1 1e-008
lcl9051|trq:TRQ2_0821 two component transcriptional regulator (... 54.7 1e-008
lcl9023|tpt:Tpet_0798 two component transcriptional regulator (... 54.7 1e-008
lcl8929|tma:TM0126 response regulator; K02483 two-component sys... 54.3 1e-008
lcl8992|tna:CTN_0563 Response regulator (A)|[Regulog=ZnuR - The... 52.4 6e-008
blablablabla.........
> lcl5167|ana:all4503 two-component response regulator; K07657
two-component system, OmpR family, phosphate regulon response
regulator PhoB (A)|[Regulog=SphR - Cyanobacteria] [tfbs=all3651:-119;alr5291:-229;all4021:-136;alr5259:-329;all1758:-49;all0129:-101;all3822:-149;alr4975:-10;alr2234:-275;all0207:-98;all0911:-105;all4575:-324]
Length=253
Score = 57.4 bits (137), Expect = 2e-009, Method: Compositional matrix adjust.
Identities = 36/109 (33%), Positives = 52/109 (48%), Gaps = 1/109 (1%)
Query 3 LRSERVTQLGSVPRFRLGPLLVEPERLMLIGDGERITLEPRMMEVLVALAERAGEVISAE 62
LR +R+ L +P + + + P+ ++ G+ + L P+ +L A V S E
Sbjct 144 LRRQRLITLPQLPVLKFKDVTLNPQECRVLVRGQEVNLSPKEFRLLELFMSYARRVWSRE 203
Query 63 QLLIDVWHGSFYGDNP-VHKTIAQLRRKLGDDSRQPRFIETIRKRGYRL 110
QLL VW F GD+ V I LR KL D P +I T+R GYR
Sbjct 204 QLLDQVWGPDFVGDSKTVDVHIRWLREKLEQDPSHPEYIVTVRGFGYRF 252
blablablabla.........
Lambda K H a alpha
0.321 0.133 0.395 0.792 4.96
Gapped
Lambda K H a alpha sigma
0.267 0.0410 0.140 1.90 42.6 43.6
Effective search space used: 1655396995
blablablabla.........
And yes, this may be a solution, but it cannot be parallelized; if we use a For Each loop the program only utilizes one CPU core, and dealing with a 100 GB text file that way is impossible...
Mr. xieguigang 谢桂纲
14-Nov-14 3:50am
Yes, IO.File.ReadAllText handles files below 2 GB perfectly and cleanly, but it crashes on very large files. I think Microsoft should improve the .NET classes for ultra-large text file processing.
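A streaming alternative that avoids the ReadAllText 2 GB limit entirely is a minimal sketch like the following, under the assumption that every section ends with an "Effective search space used:" line; `ReadSections` is a hypothetical helper, not part of the original program:

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module StreamingSectionReader
    ''' <summary>
    ''' Yields one blast section at a time without ever holding the
    ''' whole file in memory. A section is considered finished at the
    ''' line that starts with "Effective search space used:".
    ''' </summary>
    Public Iterator Function ReadSections(path As String) As IEnumerable(Of String)
        Using reader As New StreamReader(path, Encoding.UTF8)
            Dim sb As New StringBuilder()
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                sb.AppendLine(line)
                If line.StartsWith("Effective search space used:") Then
                    Yield sb.ToString()
                    sb.Clear()
                End If
                line = reader.ReadLine()
            End While
        End Using
    End Function
End Module
```

The resulting enumerable can then be fed into PLINQ (`ReadSections(path).AsParallel()...`) so that section parsing, if not the sequential disk read itself, runs on all cores.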
Mr. xieguigang 谢桂纲
14-Nov-14 3:47am
I am not sure what the size of a section is: when the protein has hits, the section text may be around 1 GB, and when the protein has no hits, the section is only about 1 KB... So a chunk may contain many sections, or only part of one section...
Mr. xieguigang 谢桂纲
14-Nov-14 3:45am
The blastp program can output HTML, plain text, and XML formats, and the plain-text format generates the smallest files, so we chose plain text for parsing the result. I'm unable to change the file format because blastp has no parameter for tweaking the output format any further...
Mr. xieguigang 谢桂纲
13-Nov-14 16:34pm
Oh yeah, I am trying to write an alternative version that works on smaller chunks:
1. Read a 1 GB chunk.
2. Use the regular expression to parse out the sections.
3. Find the last (incomplete) section and split off its trailing string.
4. Read the next chunk, prepend the leftover string, and repeat step 2.
But this can't be parallelized any more; facing a 100 GB or larger file, it will take a very, very long time to parse...
Mr. xieguigang 谢桂纲
13-Nov-14 16:29pm
Yeah, I am trying to write an alternative version that works on smaller chunks, following your advice.
Mr. xieguigang 谢桂纲
13-Nov-14 16:26pm
The blast output log is formatted plain text. Perl has a package called BioPerl for handling blast output logs and I have used it before, but it cannot be parallelized and it is too slow, so I wrote a blast parser using Visual Basic and Parallel LINQ. The parsing job is quite fast on the Linux server, but the .NET Framework limits string length to below 2 GB, and I have no idea how to deal with 100 GB this time.
Mr. xieguigang 谢桂纲
13-Nov-14 15:15pm
The blastp output file consists of result sections of various lengths, so it is difficult to decide on a fixed chunk length; I think the best method is to use a regular expression...
Mr. xieguigang 谢桂纲
13-Nov-14 15:13pm
The text file was read using the IO.File.ReadAllText method and split using the Regex.Matches method.
Mr. xieguigang 谢桂纲
16-Oct-13 1:10am
No, I think you have misunderstood my code. The process is an object instance, as I declared:
Dim Process As System.Diagnostics.Process = New System.Diagnostics.Process()
and the static function generates a ProcessStartInfo object instance:
Process.StartInfo = Global.c2.LocalBLAST.CommandLines.FormatDb(FASTA, FileType)
There is no coding error in the program; the problem is just that the shell used to call an external command differs between Linux and Windows.
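One way to paper over that shell difference is sketched below; `ShellStartInfo` is a hypothetical helper for illustration, not part of the original program:

```vbnet
Imports System.Diagnostics

Module ShellHelper
    ''' <summary>
    ''' Builds a ProcessStartInfo that routes a command line through the
    ''' platform shell: cmd.exe on Windows, /bin/sh on Linux and Mono.
    ''' </summary>
    Public Function ShellStartInfo(commandLine As String) As ProcessStartInfo
        If Environment.OSVersion.Platform = PlatformID.Win32NT Then
            Return New ProcessStartInfo("cmd.exe", "/c " & commandLine)
        Else
            Return New ProcessStartInfo("/bin/sh", "-c """ & commandLine & """")
        End If
    End Function
End Module
```

The formatdb command line produced by `CommandLines.FormatDb` could then be passed through this helper before being assigned to `Process.StartInfo`.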
Mr. xieguigang 谢桂纲
16-Oct-13 1:02am
Yes, Mono 2.1 on Ubuntu 13.04.
Mr. xieguigang 谢桂纲
10-Oct-13 7:14am
Just define a class following the official instruction PDF:
Namespace MetaCyc.File.SBML

    <System.Xml.Serialization.XmlRoot("sbml")>
    Public Class Xml

        <System.Xml.Serialization.XmlAttribute> Public Property version As Integer
        <System.Xml.Serialization.XmlAttribute> Public Property level As Integer
        <System.Xml.Serialization.XmlElement> Public Model As ModelF

        Public Shared Widening Operator CType(Path As String) As LANS.SystemsBiology.Assembly.MetaCyc.File.SBML.Xml
            'Deserialize from the file. (FileMode.Open, not OpenOrCreate:
            'deserializing a freshly created empty file would throw.)
            Using fs As New System.IO.FileStream(Path, System.IO.FileMode.Open)
                Return DirectCast(New System.Xml.Serialization.XmlSerializer(GetType(Xml)).Deserialize(fs), Xml)
            End Using
        End Operator
    End Class
End Namespace
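With the Widening CType operator above, loading a model reduces to an implicit conversion from a file path; the file name below is only a placeholder:

```vbnet
'Implicit conversion invokes the Widening CType operator,
'which deserializes the SBML document from disk.
Dim sbml As MetaCyc.File.SBML.Xml = "./model.sbml.xml"
Console.WriteLine(sbml.level)
```
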