
GENOME-IN-CODE Project

1 Jun 2015, GPL3, 17 min read
An introduction to the GENOME-IN-CODE project, which models a virtual cell of the bacterium Xcc 8004. Visit GCModeller.org for the latest news about GCModeller.

A new fashion in molecular biology research

A new fashion has emerged in molecular research in recent years: many molecular researchers are willing to combine a computational model with their research article to explain their experimental data. One example is the growing number of articles that use cell system modelling to explain high-throughput experimental data, with the flux balance analysis (FBA) model being the most popular model for representing the metabolic system. This trend comes from the development of computer technology: more and more biology researchers receive programming training in Perl, R, MATLAB, Linux shell and VisualBasic 6, and use it to run existing program utilities or to code new tools for explaining their experimental data.

There is a second phenomenon. Genetic engineering is a way to reprogram a bacterial genome, but that is not the point. The point is: which gene must be modified to actually change the cell phenotype, or to create the phenotype we want? Generally, creating a mutant takes several weeks of laboratory work, and the mutant may turn out not to be the one we want. So we want a tool that predicts, before we create a mutant, what changes it will make to the cell's function.

Here is the personal view of Markus W. Covert, who created a platform for simulating a cell using MATLAB:

“Understanding how complex phenotypes arise from individual molecules and their interactions is a primary challenge in biology that computational approaches are poised to tackle.” [^]

This means that biology now faces a big challenge: we need a powerful tool to explain what life is. One of the best solutions to this challenge is virtual cell technology.

[^] Karr, J. R., et al. (2012). "A whole-cell computational model predicts phenotype from genotype." Cell 150(2): 389-401.

 

Can we reprogram the genome?

GCModeller currently provides tools for modelling biological components and for simulation. In future development we will provide system design tools for synthetic biology, so that we can really perform bacterial genome reprogramming.

Image 1

Picture1. What does a cell system really look like?

If we compare the cell process with a .NET program assembly and the way it runs, we find that:
Gene expression is like creating an object instance from specific type information in programming: the gene is the type, and a protein enzyme is a class instance of that gene. The expressed proteins then implement some phenotype function by catalyzing metabolic pathways, which is just like a method invocation.
So if the cell system architecture can be treated as a program assembly, and the cell components correspond to object instances in a .NET program, then we can modify the cell's functional processes by modifying the genome information. In fact, traditional genetic engineering is already a method of genome reprogramming: modify the genome to create a mutant, and the cell function changes. This is just like modifying the source code and compiling a new assembly. The DNA sequence is like the binary sequence of the compiled program assembly, and the molecular experiments in the laboratory are the work of disassembly.
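The analogy can be made concrete with a small sketch. It is written in Python only to keep it short and runnable; it is not GCModeller code, and the names Gene, Protein, express, mutate, geneA and demoReaction are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gene:
    """Type information: plays the role of a class definition in the analogy."""
    name: str
    reaction: str            # hypothetical label of the reaction the product catalyzes
    functional: bool = True  # False models a loss-of-function mutant


class Protein:
    """Object instance created from the type information of a gene."""

    def __init__(self, gene):
        self.gene = gene

    def catalyze(self, substrate):
        # "Method invocation": the expressed protein acts on a substrate.
        if not self.gene.functional:
            return substrate  # a broken enzyme leaves the substrate unchanged
        return f"{self.gene.reaction}({substrate})"


def express(gene):
    """Gene expression: instantiate a protein from a gene."""
    return Protein(gene)


def mutate(gene):
    """Genome reprogramming: edit the 'source' and get a new gene definition."""
    return replace(gene, functional=False)


WILD_TYPE = Gene(name="geneA", reaction="demoReaction")
```

Expressing WILD_TYPE and calling catalyze("S") returns "demoReaction(S)", while the protein expressed from mutate(WILD_TYPE) returns "S" unchanged. That is the "modify the source, rebuild, observe a new behavior" cycle described above.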

One conclusion about the cell system: compared with the way a program runs, the processes of a cell system are most like the threads in a program.

So can we solve a biological problem from the viewpoint of programming? For example, I am currently studying the signal transduction network in the laboratory, and I am trying to answer three questions from my research work:
  • 1. A disassembly-like problem: how do the HrpG/HrpX and DSF/Rpf systems in Xanthomonas interact with each other to produce the pathogenicity phenotype? The molecular mechanism is still unclear, and even if it were clear, we could not describe it easily because both are complex networks.
  • 2. A program-debugging-like problem: does a mutation in the target gene really affect the bacterial phenotype? And if it does, how does it affect the phenotype?
  • 3. A system-architecture-like problem: there are many two-component systems (TCS) in the signal transduction network, so why was TCS chosen during evolution? This is interesting because the genome annotation of Xcc8004 contains only one STK protein (Gene Id: XC_3631) but many HKs. Eukaryotes chose STK and prokaryotes chose HK, which is an interesting evolutionary question!
 
 
Image 2

Picture2. Typical system architecture of a TCS

These questions can be answered both by traditional molecular experiments and by the newly arising computational methods of bioinformatics. I think I may be able to answer them much better in the computational way, so I am trying to build a new tool for my scientific research: GCModeller.
 
Image 3
Picture3. Xcc 8004 genome circle diagram
 
Image 4
Picture4. Real-time visualization of the Xcc 8004 virtual cell pathway network, drawn by GCModeller

How to apply your VB/C# skills to bioinformatics

My first bioinformatics class covered basic protein sequence alignment using the local BLAST tools from NCBI. My programming skills came into play because I could write code that parses the alignment results out of the BLAST output text file. There are already plenty of tools for analyzing such data, like BioPerl and Biopython, but I still insist on building my own tools for my research. This is not only because I want an in-depth understanding of how the tools work, but also because I want to build something big using my favorite language.

Image 5

Admittedly, VB/C# is not well suited to the high-performance numerical algorithms needed to handle bioinformatics data sets of 100 GB or more; C++ is the right choice for such huge numerical jobs. But VB/C# still suits bioinformatics, because it can call C++ APIs and use built-in .NET tools to analyze data in parallel. VB/C# also has a clearer language structure than C++, which lets us build feature-rich programs easily.

One example is the analysis of blastp results: the .NET regular expression tools and GDI+ are well suited to analyzing the data and then visualizing it.
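As a rough illustration of the regular expression part, here is a minimal sketch in Python (not the author's VB code). It assumes a report in the style of the pairwise text output of blastp; the excerpt below is hand-written, and its subject names and numbers are made up. The function name parse_hits is also hypothetical.

```python
import re

# Hand-written excerpt in the style of a blastp pairwise text report.
# Subject names and all numbers are invented for illustration.
SAMPLE_REPORT = """\
> subjectA hypothetical protein
Length=155

 Score =  110 bits (274),  Expect = 2e-28
 Identities = 59/155 (38%), Positives = 91/155 (59%), Gaps = 5/155 (3%)

> subjectB hypothetical protein
Length=90

 Score = 35.4 bits (80),  Expect = 0.003
 Identities = 20/60 (33%), Positives = 31/60 (52%), Gaps = 2/60 (3%)
"""

HEADER = re.compile(r"^>\s*(\S+)")
SCORE = re.compile(r"Score\s*=\s*([\d.]+)\s*bits.*?Expect\s*=\s*([\d.eE+-]+)")
IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def parse_hits(report):
    """Return one dict per subject with bit score, E-value and identity."""
    hits = []
    current = None
    for line in report.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"subject": header.group(1)}
            hits.append(current)
            continue
        if current is None:
            continue  # skip anything before the first subject header
        score = SCORE.search(line)
        if score:
            current["bits"] = float(score.group(1))
            current["evalue"] = float(score.group(2))
            continue
        identities = IDENTITIES.search(line)
        if identities:
            matched, aligned = map(int, identities.groups())
            current["identity"] = matched / aligned
    return hits
```

Calling parse_hits(SAMPLE_REPORT) yields one record per subject. A real report can contain several alignments per subject; this sketch keeps only the last one it sees, which a production parser would need to handle.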

Image 6

How to apply our programming skills to bioinformatics is a big and difficult question to answer, because much of the analysis in bioinformatics is specific in its details and not very generic. But I will still try to give a generic method. If you are also interested, you can try following my working steps:

 

 

  • When I become interested in a problem in bioinformatics, I first collect information about it from published papers, using Microsoft Academic Search (http://academic.research.microsoft.com/) or NCBI PMC (http://www.ncbi.nlm.nih.gov/pmc/).
  • Once I have enough information, I can start programming in VisualBasic. Most bioinformatics analysis programs work on data from the common biological databases (such as GenBank, KEGG, MetaCyc, Reactome, PDB) or on experimental data files (such as fasta, fastq, sam). All of these data files are well-formatted text files, so text parsing and regular expressions are the most important of your VB/C# programming skills.

  • When we have the data from the common databases or from lab experiments, we can analyze it with some algorithm.

  • Finally, not every image that represents your analysis result can be produced by a common visualization tool, so you may have to write your own VB/C# program that uses GDI+ to present the result to other people.
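The text parsing step can be sketched with the simplest of these formats, fasta. The example is in Python for brevity and is not part of the GCModeller libraries; the function name read_fasta and the sample records (names and sequences) are made up for illustration.

```python
# Hand-written sample; record names and sequences are invented.
SAMPLE_FASTA = """\
>seq1 demo protein
MSTNPKPQRK
TKRNTNRRPQ
>seq2
MKV
"""


def read_fasta(text):
    """Map each record name (first word after '>') to its joined sequence."""
    chunks = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            # Sequence lines of one record are wrapped; collect them all.
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}
```

read_fasta(SAMPLE_FASTA) returns a dictionary with the two records, with the wrapped lines of seq1 joined into one 20-residue string. The same "read line, recognize header, accumulate body" pattern carries over to most of the text formats listed above.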

Here are some common libraries that I developed in my lab to facilitate the analysis work. They only handle reading and loading the common biological databases and are not closely related to the GCModeller core logic. Until our scientific manuscript is published, the full GCModeller source code cannot be released to the public.

  1. <a href="/KB/Tools-IDE/737439/Common_Libraries-noexe.zip">Download Common_Libraries-noexe.zip</a>
  2. <a href="/KB/Tools-IDE/737439/tutorials.pptx.zip">Download tutorials.pptx.zip</a>

The source code release of GCModeller will be hosted on SourceForge:

http://sourceforge.net/projects/gcmodeller/

The shoal language in GCModeller has already been released on SourceForge:

http://sourceforge.net/projects/shoal/ 

 

Introducing GCModeller

With the limits of today's technology, answering the questions mentioned above with traditional molecular biology procedures may be too difficult and take a lot of time. Computational technology may be a better choice for questions like these, which concern a huge interaction network: we simply simulate the network and read the answer from the calculation result.

I have created a platform for my whole career

An article was published in Nature Reviews Microbiology in 2011: “Pathogenomics of Xanthomonas: understanding bacterium-plant interactions”. I started my further study after graduating from university in 2012, and when I went back to university an idea came to my mind: shall we develop a simulation platform to apply to these research problems? I then spent nearly one year coding, from 2013 to 2014, to build the GCModeller platform.

I mean that I am willing to devote my whole career to research on the bacterium Xcc8004 and the computational analysis of its interaction with its plant host. GCModeller is the first step in this scientific career.

Xcc8004 is a plant pathogen, and the whole course of events in its interaction with a plant host such as Arabidopsis or radish is what I want to represent on the simulation platform I am building. Currently I am only able to simulate a prokaryote on GCModeller, because a eukaryotic cell model is difficult to build: eukaryotic cell structure and genetic code are more complex than those of prokaryotes. Maybe I can finish this job for my doctoral degree in future years.
Such a huge project will take me years to finish, but it is worth the effort. We all want an easy way to model a cell through programming, and a cell system is like a .NET program assembly. So maybe in the future GCModeller could introduce a programming language with VisualBasic-like syntax to genetics researchers for building cell models. This is awesome!!

 

Currently available virtual cell simulation platforms

This table lists the virtual cell simulation platforms that I know of:

Virtual Cell Platforms

Platform | Bacteria | Language | Home Page
--- | --- | --- | ---
vcell | | C++ & Java | http://vcell.org/
simtk | Mycoplasma genitalium | Matlab | http://simtk.org/home/wholecell
E-Cell | E. coli | C++ | http://www.e-cell.org/
GCModeller | Xanthomonas campestris pv. campestris str. 8004 | VisualBasic | (Unpublished) http://GCModeller.org/

 

  • vcell:   Moraru, II, et al. (2002). "The virtual cell: an integrated modeling environment for experimental and computational cell biology." Ann N Y Acad Sci 971(1): 595-596.

  • E-Cell:   Tomita, M., et al. (1999). "E-CELL: software environment for whole-cell simulation." Bioinformatics 15(1): 72-84.

  • simtk:   Karr, J. R., et al. (2012). "A whole-cell computational model predicts phenotype from genotype." Cell 150(2): 389-401.

 

“GCModeller” is short for genome-in-code modeller or genetic clock modeller. The goal of the “genome-in-code” project is to create a virtual cell simulation platform that runs on your desktop or server. GCModeller currently supports only bacterial simulation.

All of the components in GCModeller are developed in Visual Studio 2013 using the VisualBasic.NET language, but all of the component source code can easily be converted into C# using SharpDevelop.
Here are some of the reasons why I chose the VisualBasic language:

 

  1. With its English-like syntax and keywords, VisualBasic and its development IDE are friendlier to biological researchers. And I am a big fan of the BASIC family of languages: QBASIC, VisualBasic 6, and VisualBasic.NET. I have nearly 7 years of experience programming in Basic.
  2. It fully supports object-oriented programming, so we are able to model the whole cell.
  3. The LINQ syntax in VisualBasic makes object queries easier, instead of bloating our code with so many For loops.
  4. VisualBasic supports the free MySQL database server.
  5. VisualBasic.NET is cross-platform: it runs on Windows, Linux and Mac using the Mono runtime, not Wine.
  6. Creating a GUI in VisualBasic is easy and the result looks nice.
  7. Reflection makes it easy to extend the program and to write dynamic code.
  8. Parallel computing is easy in VisualBasic.
  9. Development in VisualBasic is smart and fast.

 

GCModeller Virtual Cell System

Our organization home page <a href="http://GCModeller.org">GCModeller.org</a> is now online, but it is still under construction; I have not finished it yet.

Image 7

Cover Picture -  Screenshot of GCModeller official site: GCModeller.org

The network image shown on GCModeller.org visualizes the metabolic pathways of the plant pathogenic bacterium Xanthomonas oryzae pv. oryzicola BLS256. The raw data for the pathway visualization can be downloaded from the MetaCyc database (http://www.metacyc.org/) or assembled from the KEGG database (http://www.genome.jp/kegg/pathway.html). Once we have downloaded the Xoc BLS256 MetaCyc database, we can use GCModeller to export the interaction relationships between component elements from the pathways.dat database table file, and GCModeller can then interact with the Cytoscape software (http://cytoscape.org/) to draw the network image.
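The export step can be sketched roughly as follows. This is a Python illustration under stated assumptions, not GCModeller's exporter: it assumes pathways.dat uses the attribute-value flat-file layout (one "ATTRIBUTE - value" pair per line, records ended by a "//" line, "#" comment lines), that pathway records carry UNIQUE-ID and REACTION-LIST attributes, and that a simple three-column SIF-style line (source, relation, target) is an acceptable input for Cytoscape. The sample records and the relation name "contains" are invented.

```python
# Hand-written sample in the assumed flat-file layout; identifiers are invented.
SAMPLE_DAT = """\
# demo comment line
UNIQUE-ID - PWY-DEMO-1
COMMON-NAME - demo pathway one
REACTION-LIST - RXN-A
REACTION-LIST - RXN-B
//
UNIQUE-ID - PWY-DEMO-2
REACTION-LIST - RXN-B
//
"""


def read_records(text):
    """Yield one dict (attribute -> list of values) per '//'-terminated record."""
    record = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        if line.strip() == "//":
            if record:
                yield record
            record = {}
            continue
        if line.startswith("/"):
            continue  # continuation of a long value; ignored in this sketch
        key, separator, value = line.partition(" - ")
        if separator:
            record.setdefault(key, []).append(value.strip())
    if record:
        yield record


def to_sif(records, relation="contains"):
    """Build tab-separated 'pathway relation reaction' lines."""
    lines = []
    for record in records:
        for pathway in record.get("UNIQUE-ID", []):
            for reaction in record.get("REACTION-LIST", []):
                lines.append(f"{pathway}\t{relation}\t{reaction}")
    return lines
```

to_sif(read_records(SAMPLE_DAT)) produces three edges, with RXN-B shared by both demo pathways, which is exactly the kind of shared node that makes the drawn network connected.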

 

Download the full-size network visualization image shown on GCModeller.org:

  • <a href="/KB/Tools-IDE/737439/Xor.zip">Download Xor.zip</a>

 

Image 8

Bacterial genome visualization drawn by VisualBasic code.

All of the core logic components in GCModeller are written in VisualBasic. GCModeller mainly consists of two parts:

1. A set of bioinformatics/systems biology data analysis libraries

These libraries include:

  • High-throughput RNA-seq data analysis (mainly sequence assembly and analysis of genetic structure based on RNA-seq data);
  • Bacterial genome annotation tools (the genome annotation is based on local blastp analysis (https://www.ncbi.nlm.nih.gov/) and many biological databases such as COG, Pfam, GO, KEGG, RegPrecise and CDD);
  • A sequence pattern search tool and some sequence feature search tools, for example for TF regulatory motif sites and CRISPR sites;
  • A simulation library for whole-cell computational analysis, or for analysis of just part of the cellular network, such as the gene expression regulation network or the metabolic network (once you have finished the bacterial genome annotation steps, you are able to assemble the annotated components into a network model or a virtual cell computational model);
  • A data visualization library for genome annotation and cellular network simulation data (large-scale blast result visualization based on the GenBank database, sequence logo visualization of sequence motifs found by the MEME program (http://meme-suite.org/), a 3D protein structure visualization engine based on the GDI+ library, and cellular network visualization by programmatic interaction with Cytoscape).

2. A new kind of scripting language runtime library

GCModeller has three kinds of user interface: the ShoalShell API for bioinformatics researchers, the .NET class library for GCModeller developers, and the GCModeller IDE for biological researchers. Currently the ShoalShell API and the .NET class API are nearly finished, while the IDE GUI is unstable on Linux and still being worked on. Users are encouraged to run GCModeller analyses with the shoal shell language. Because it was originally developed in .NET, has a mixed R/VisualBasic/CMD syntax, and has data types fully compatible with any .NET language, it can easily be embedded into any .NET program. The shoal shell script language was recently applied successfully in the administrator console of the <a href="http://mipaimai.com/">Mipaimai(http://mipaimai.com/)</a> real-time domain instance auction server. The shoal shell language powers the whole business logic on the auction server, providing services to the PC client and to the mobile client on Android phones.

 

We want to develop the "genome-in-code" project as a worldwide collaborative project for computational biology research, just as the Linux system is. So we chose to make the whole of GCModeller fully open source. The development of GCModeller has benefited from <a href="http://codeproject.com">codeproject.com</a> and codeplex.com, two of the most active .NET open source communities. Most of the code for the system architecture and the design patterns came from ideas in the excellent articles and projects in these two communities. Many thanks to you all. I am also trying to post articles on codeproject about building the components of GCModeller, to help other people who have the same idea as I do.

GCModeller Component List

Here is a list of articles published on codeproject that describe the implementation details of the GCModeller components. The list will be updated continuously as GCModeller develops:

 

  • MySQL database server adapter wrapper

This project implements reading and writing of model data on a MySQL database server, which makes development of the model compiler easier.

http://www.codeproject.com/Articles/638976/Visual-Basic-Using-Reflection-to-Map-DataTable-in

  • LINQ Script for querying the biological database

This query script originally derives from the LINQ syntax in the VisualBasic language. You can use it to query the local biological database from the GCModeller IDE.

http://www.codeproject.com/Articles/721827/LINQ-Script-A-Universal-Object-Oriented-Database-Q

 

Image 9

 
  • PLAS metabolism simulation system core

In the end we use a modified simulation algorithm based on the popular FBA model, not the PLAS model. But I still want to introduce the PLAS model to you, because the book “Computational Analysis of Biochemical Systems” by E. O. Voit, which describes it, was my first introduction to scientific research. All of the ideas in the genome-in-code project come from this book; it is my favorite book.

http://www.codeproject.com/Articles/664153/Modeling-the-Biochemical-System-Using-VB
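To give a feel for the kind of model PLAS works with, here is a minimal sketch of a power-law S-system, the form used in biochemical systems theory: dXi/dt = alpha_i * prod_j Xj^g_ij - beta_i * prod_j Xj^h_ij. The sketch is in Python and is not the linked VB implementation. It integrates with a plain forward Euler step, which is a simplification chosen for brevity, and the one-variable toy parameters are made up.

```python
def s_system_rates(x, alpha, g, beta, h):
    """Return dX/dt for every variable of a power-law S-system."""
    rates = []
    for i in range(len(x)):
        production = alpha[i]
        degradation = beta[i]
        for j, xj in enumerate(x):
            production *= xj ** g[i][j]   # kinetic orders of the influx term
            degradation *= xj ** h[i][j]  # kinetic orders of the efflux term
        rates.append(production - degradation)
    return rates


def simulate(x0, alpha, g, beta, h, dt=0.01, steps=2000):
    """Integrate the system with forward Euler and return the final state."""
    x = list(x0)
    for _ in range(steps):
        rates = s_system_rates(x, alpha, g, beta, h)
        x = [xi + dt * ri for xi, ri in zip(x, rates)]
    return x


# Toy system with one variable: dX/dt = 2 - X, steady state at X = 2.
ALPHA = [2.0]
G = [[0.0]]
BETA = [1.0]
H = [[1.0]]
```

Starting from simulate([0.5], ALPHA, G, BETA, H), the single variable relaxes toward its steady state of 2. A realistic model would have many coupled variables and would normally use a more robust integrator than forward Euler.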

 

[^] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2291792/

 

 

 

  • Mathematics calculation engine for the PLAS script

This module supports the evaluation of complex mathematical expressions in the PLAS model.

http://www.codeproject.com/Articles/646391/A-complex-Mathematics-expression-evaluation-module

 

  • GCModeller IDE plugin system

This is a plugin system for the GCModeller IDE. It uses reflection to dynamically load commands into the IDE menu.

http://www.codeproject.com/Articles/703590/Develop-a-Plugin-extension-for-your-VisualBasic-ap
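The idea of discovering commands by reflection can be sketched as follows. This is a Python analogue using the standard inspect module, not the .NET implementation from the linked article; the decorator menu_command, the class DemoPlugin and the menu paths are all hypothetical.

```python
import inspect


def menu_command(path):
    """Mark a method as a menu command (plays the role of a custom attribute)."""
    def mark(func):
        func.menu_path = path
        return func
    return mark


class DemoPlugin:
    @menu_command("Tools/Count genes")
    def count_genes(self):
        return "counting genes"

    @menu_command("Tools/Export network")
    def export_network(self):
        return "exporting network"

    def helper(self):
        return "not a command"  # unmarked, so it never reaches the menu


def discover_commands(plugin):
    """Inspect a plugin object and collect its marked methods by menu path."""
    commands = {}
    for _, member in inspect.getmembers(plugin, predicate=inspect.ismethod):
        path = getattr(member, "menu_path", None)
        if path is not None:
            commands[path] = member
    return commands
```

discover_commands(DemoPlugin()) returns the two marked methods keyed by their menu path, so a host application can build its menu without knowing the plugin's methods in advance.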

  • The data exchange library

The csv file format is the most commonly used format in bioinformatics programming in R. Here is the csv file format wrapper from my project, which is used to exchange data between the genome-in-code program and the R server. The extension methods in this library make my coding job much easier!

http://www.codeproject.com/Articles/788006/A-powerful-CSV-document-wrapper-library
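A minimal sketch of such an object-to-csv round trip, written in Python with the standard csv and dataclasses modules rather than the VB wrapper from the linked article. The record type GeneExpression, its fields and the sample values are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class GeneExpression:
    locus_tag: str
    level: float


def write_csv(rows, cls):
    """Serialize dataclass instances to csv text, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(cls)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def read_csv(text, cls):
    """Parse csv text back into instances, converting by declared field type."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        cls(**{f.name: f.type(record[f.name]) for f in fields(cls)})
        for record in reader
    ]


SAMPLE_ROWS = [GeneExpression("demo_0001", 1.5), GeneExpression("demo_0002", 0.25)]
```

Writing SAMPLE_ROWS with write_csv and reading the text back with read_csv gives equal objects. The resulting text has a plain header row, which is the layout R's read.csv expects by default.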

As I have said, VisualBasic is not good at mathematical algorithms, so most of the large calculations and those that require high precision are performed in the R language. The ability of GCModeller to do hybrid programming between VB and R is therefore important. These two libraries provide that hybrid programming capability:

http://www.codeproject.com/Articles/832975/Guide-line-of-integrated-ShellScript-with-R-Hybrid

http://www.codeproject.com/Articles/890099/R-language-S-Object-Serialization-to-NET-Object

 

  • VisualBasic ShellScript for systems biology

Here is a new kind of script language, originally developed in VisualBasic.NET and used for systems biology research. Many functions are included in this "genome-in-code" project API library, such as experimental data analysis and data visualization:

http://www.codeproject.com/Articles/820854/Powerful-ShellScript-for-bioinformatics-researcher

You can now download the source code of this new scripting language from SourceForge (http://sourceforge.net/projects/shoal/).

 

Xcc8004 genetic clock diagram

Image 10

The genetic clock is the most interesting phenomenon in bacterial genome-wide regulation. GCModeller is short for Genetic Clock Modeller or Genome-in-Code Modeller, and this virtual cell platform was originally aimed at studying this phenomenon.

 

 

 

An article about genetic clock research in Nature:

[^] http://www.nature.com/nature/journal/v463/n7279/abs/nature08753.html

GCModeller IDE

We have finished developing the GCModeller simulation engine kernel, and the engine is now under laboratory testing. The compiled assemblies of the whole project will be released on SourceForge and codeproject before our scientific research article is published. GCModeller gets a great GUI for genetics and molecular biology researchers; I finished developing the GUI framework in February this year and have uploaded it here to share with you.

Image 11

Picture6. GCModeller IDE screenshot

The IDE has a Visual Studio 2010-like GUI. It is based on WinForms rather than WPF so that it can run on the Linux platform: in my testing, WinForms is more compatible than WPF with the Linux desktop environment under the Mono runtime. Although the IDE can run on Linux, Mono's WinForms implementation has many bugs, so the IDE is currently unstable there and you will not get the best experience on Linux. Maybe these problems will be solved in the next version of the Mono runtime.

Special thanks

The form base of the GCModeller IDE originally comes from here:

http://www.codeproject.com/Articles/138661/Metro-UI-Zune-like-Interface-form

The interaction between the genome-in-code program and the R server is handled by RDotNET:

https://rdotnet.codeplex.com/

Mathematics library for the data analysis in GCModeller:

http://www.alglib.net/

The RNA-seq data is analyzed by the high-performance library in GCModeller, which is wrapped from the BOW project:

http://bow.codeplex.com/

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)