Parsing And Visualization Of GenBank

From Protein Prediction 2 Winter Semester 2014

About the project

GenBank contains many annotated sequences. These sequences can be visualized, and the features that occur in them can be displayed, selected, and exported. Although GenBank is very popular in academia, people in industry tend not to publish their annotated sequences; instead, these proprietary sequences are maintained in in-house databases. To visualize such a sequence, the bioinformaticians working on it again depend on proprietary software that is developed as desktop applications rather than web applications. The major problem is that lab technicians lose a lot of time because they cannot visualize the sequence immediately.

The main task of our project is not only to parse and visualize a given GenBank file, but also to build the software in such a way that it can be easily included in other projects. Beyond these primary goals, there are a few more functional requirements (which can be seen here) and some features that need to be built into the project (explained here).


This section describes all the requirements that we have identified after our meeting with our mentors.

GUI mockups

User experience:

  • Quoting the Mentor (Dr. El Mazouari):

“Practical use case: a team is developing a web app that implements in-house algorithms for annotated in-house proprietary sequences. The web app screens the company sequence database for specific set of features. Sequence hits are then annotated and a Genbank output is generated for annotated sequences. At this time, wet-lab users download the annotated sequence in Genbank format and then open it in VectorNTi or MacVector in order to view the annotation map and features. These extra steps are time consuming… If they can view the sequence directly in their browser from the web, they will be more productive and “happy scientists;)” Something that will help them to view the annotated sequence, select the features they want and export them will be very welcome”


  • Select features from annotated input sequences
  • Parse and Visualize the input sequence in the genbank format
  • Export the selected features so that they can be worked with later


Due to the complex nature and scope of our project, we decided to build it in a very modular way, so that the functionality we build is available to others. Each module (feature) deals with one specific problem and can be reused in any project.

Class Diagram

Seen below is the Class Diagram that we drew.

Di Domenico Classdiagram.jpeg

Design Pattern

As the Class Diagram shows, the design pattern used is Model View Controller (MVC). This decision was made after the requirements elicitation. The need to build a web application that can Parse, Visualize and Export the features inside a sequence in GenBank format led the team to define 3 Macro Areas, each closely tied to the features that have to be developed.


Each sequence sent as input has to be parsed in order to obtain its Features. Since no parsing method was provided, the team did some research on the internet and found a tutorial on how to build a parser in JavaScript. The tutorial was used to create a basic version of the parser; since that was not sufficient, it was then changed and adapted to the specifics of this project.
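To make the parsing step more concrete, here is a minimal sketch of how the FEATURES table of a GenBank record could be turned into JavaScript objects. It is not the code the team published: the names `parseGenBank`, `stripQuotes` and `sampleRecord` are hypothetical, and the sketch assumes the usual flat-file layout (feature keys indented by five spaces, qualifiers starting with `/`, the sequence following `ORIGIN`, and `//` ending the record).

```javascript
// Hypothetical sketch of a GenBank feature-table parser, not the project's parser.
function stripQuotes(value) {
  return value.replace(/^"|"$/g, '');
}

function parseGenBank(text) {
  const features = [];
  let sequence = '';
  let section = 'other';     // 'other' | 'features' | 'origin'
  let current = null;        // feature currently being filled
  let lastQualifier = null;  // qualifier that continuation lines extend

  for (const line of text.split(/\r?\n/)) {
    if (line.startsWith('//')) break;                       // end of record
    if (line.startsWith('ORIGIN')) { section = 'origin'; continue; }
    if (section === 'origin') {                             // drop numbering and blanks
      sequence += line.replace(/[^A-Za-z]/g, '');
      continue;
    }
    if (/^\S/.test(line)) {                                 // top-level keyword line
      section = line.startsWith('FEATURES') ? 'features' : 'other';
      continue;
    }
    if (section !== 'features') continue;

    const keyLine = /^ {5}(\S+)\s+(\S+)\s*$/.exec(line);    // e.g. "     CDS   3..11"
    if (keyLine) {
      current = { key: keyLine[1], location: keyLine[2], qualifiers: {} };
      features.push(current);
      lastQualifier = null;
      continue;
    }
    const body = line.trim();
    if (!current || body === '') continue;

    const qualifier = /^\/([^=]+)(?:=(.*))?$/.exec(body);   // e.g. /gene="abc"
    if (qualifier) {
      lastQualifier = qualifier[1];
      current.qualifiers[lastQualifier] = stripQuotes(qualifier[2] || '');
    } else if (lastQualifier) {
      // Continuation of a wrapped qualifier value, joined with a space
      // (a simplification: wrapped /translation values would need no space).
      current.qualifiers[lastQualifier] += ' ' + stripQuotes(body);
    } else {
      current.location += body;                             // wrapped location
    }
  }
  return { features, sequence };
}

// Made-up record used only for illustration.
const sampleRecord = [
  'LOCUS       TEST                      12 bp    DNA     linear   SYN 01-JAN-2015',
  'FEATURES             Location/Qualifiers',
  '     source          1..12',
  '                     /organism="synthetic construct"',
  '     CDS             complement(3..11)',
  '                     /gene="abc"',
  '                     /product="hypothetical',
  '                     protein"',
  'ORIGIN',
  '        1 atgcatgcat gc',
  '//'
].join('\n');

const parsed = parseGenBank(sampleRecord);
console.log(parsed.features.map((f) => f.key + ' ' + f.location));
```

The function takes a plain string and returns plain objects, so it does not depend on where the record comes from or on how it is displayed afterwards.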

Biojs npm

As mentioned above, no JavaScript code was available for the parser, so, together with the tutor, the code for the parser has been uploaded to the BioJS npm. This has been done to make the parser available to other users in the future. Unfortunately, due to the team's inexperience with the BioJS npm, all the changes the team is still making to the parser are being done in a GitHub Repository.

BIOJS Repository

GitHub Repository

npm Repository

User Friendly Feature

The Client has expressly asked for a very intuitive GUI. This requirement will be fulfilled later on; higher priority will be given to the other features.

Highlighting Feature

Each feature parsed from a sequence will be highlighted so that the user can easily find it. The team's intention is to:

  • First, develop an intuitive way to highlight the features
  • Then, find a prettier way to directly visualize the feature itself, possibly using D3 functions.

In the team's PP2_DiDomenico_Gigantiello_Krishnamurthy GitHub repository, a feature branch has been created to work on it.
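One possible way to approach the highlighting, sketched here under our own assumptions, is to cut the sequence into segments at every feature boundary and to record which features cover each segment. Overlapping features then simply share a segment. The names `segmentSequence` and `renderHighlighted`, the `name`/`start`/`end` fields and the `feature` CSS class are hypothetical; positions are taken to be 1-based and inclusive, as in GenBank locations.

```javascript
// Hypothetical sketch: split a sequence into segments labelled with the
// features that cover them, so that overlapping features can be highlighted.
function segmentSequence(sequence, features) {
  const clamp = (n) => Math.min(Math.max(n, 0), sequence.length);
  // Every position where the set of covering features may change.
  const cuts = new Set([0, sequence.length]);
  for (const f of features) {
    cuts.add(clamp(f.start - 1)); // 1-based inclusive start -> 0-based index
    cuts.add(clamp(f.end));       // inclusive end -> exclusive index
  }
  const points = [...cuts].sort((a, b) => a - b);

  const segments = [];
  for (let i = 0; i < points.length - 1; i++) {
    const from = points[i];
    const to = points[i + 1];
    const names = features
      .filter((f) => f.start - 1 <= from && f.end >= to)
      .map((f) => f.name);
    segments.push({ text: sequence.slice(from, to), features: names });
  }
  return segments;
}

// Render the segments as an HTML string. Assumes that sequence text and
// feature names contain no characters that would need HTML escaping.
function renderHighlighted(segments) {
  return segments
    .map((s) =>
      s.features.length === 0
        ? s.text
        : `<span class="feature" data-features="${s.features.join(' ')}">${s.text}</span>`
    )
    .join('');
}

// Made-up example with two overlapping features.
const exampleSegments = segmentSequence('ATGCATGCAT', [
  { name: 'geneA', start: 2, end: 5 },
  { name: 'siteB', start: 4, end: 7 }
]);
const exampleHtml = renderHighlighted(exampleSegments);
console.log(exampleHtml);
```

Because the segmentation is separate from the rendering, the same segments could later be passed to a different renderer, for example one built on D3, without changing the first function.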

Extract Feature

Each feature parsed from a sequence has to be exportable. The team is still researching how to design and develop this feature. The main idea they have come up with is:

  • Creating a check box list containing all the features. The selected ones will be exported.

In the team's PP2_DiDomenico_Gigantiello_Krishnamurthy GitHub repository, a feature branch has been created to work on it.
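The check box idea can be sketched as follows. The export layout is not fixed yet, so the FASTA-like layout used here (a `>` header line followed by the bases of the feature) is purely our assumption, as are the function name `exportFeatures` and the fields `id`, `key`, `label`, `start` and `end`. The array `selectedIds` stands in for the check boxes the user has ticked.

```javascript
// Hypothetical sketch: export the selected features as one plain-text string.
// Positions are 1-based and inclusive, as in GenBank locations.
function exportFeatures(features, selectedIds, sequence) {
  return features
    .filter((f) => selectedIds.includes(f.id))
    .map((f) => {
      const bases = sequence.slice(f.start - 1, f.end);
      return `>${f.key} ${f.start}..${f.end} ${f.label}\n${bases}`;
    })
    .join('\n');
}

// Made-up data for illustration.
const exampleFeatures = [
  { id: 'f1', key: 'gene', label: 'abc', start: 1, end: 4 },
  { id: 'f2', key: 'CDS', label: 'xyz', start: 3, end: 8 },
  { id: 'f3', key: 'misc_feature', label: 'tail', start: 7, end: 10 }
];
const exported = exportFeatures(exampleFeatures, ['f1', 'f3'], 'ATGCATGCAT');
console.log(exported);
```

In a browser, the returned string could then be offered as a download or shown in a text area; that part depends on the GUI and is left out of the sketch.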

Integration Feature

The whole application has to be easy to integrate into other web applications. To achieve this, the entire project has been divided into Macro Areas. The team's intention is to keep the code and the classes as decoupled as possible, so that integrating the full project, or just a single part of it, will be possible.


MockUps done before contacting the client:

  • GenBank input prior to the Mentor's directive
  • GenBank output prior to the Mentor's directive

Refactored MockUps resulting from the Mentor's updates:

  • GenBank input after the Mentor's directive
  • GenBank output after the Mentor's directive

Application design

Expected technical difficulties

  • Implementing the parser
  • Selecting and exporting features dynamically
  • Highlighting multiple features

Fancy libraries you plan to use

  • jQuery
  • D3 (if necessary)
  • BioJava
  • BioJS(?)

Your data

Remarks about your input format

  • The input will be an annotated sequence, and it should be in the GenBank format


Before Meeting the Mentor

  • 1. Understand the application domain and the logic that we are supposed to implement
  • 2. Understand the input we need to work with (how to convert a sequence in GenBank format to a GenBank file)
  • 3. Parse the GenBank file and extract the features dynamically into JavaScript objects.
  • 4. Visualize the results in a user-friendly way in the browser
  • 5. Make the features exportable!
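For step 3 in the list above, one detail worth sketching is the location of a feature, which GenBank writes as a string such as `3..14`, `complement(3..14)` or `join(1..5,10..20)`. The following is a minimal sketch of how such a string could be turned into a JavaScript object. The function name `parseLocation` and the `{ strand, ranges }` result shape are hypothetical, and only the simple forms named in the comments are handled.

```javascript
// Hypothetical sketch: parse a GenBank location string into start/end ranges.
// Handles: "42", "3..14", "<1..>200", "join(...)", "order(...)" and an outer
// "complement(...)". Anything else (for example a complement nested inside a
// join, remote references or "^" sites) is rejected with an error.
function parseLocation(location) {
  let text = location.trim();
  let strand = 1;

  const complement = /^complement\((.*)\)$/.exec(text);
  if (complement) {
    strand = -1;            // feature lies on the reverse strand
    text = complement[1];
  }

  const joined = /^(?:join|order)\((.*)\)$/.exec(text);
  if (joined) text = joined[1];

  const ranges = text.split(',').map((part) => {
    // Optional "<" / ">" mark partial ends; they are ignored in this sketch.
    const m = /^[<>]?(\d+)(?:\.\.[<>]?(\d+))?$/.exec(part.trim());
    if (!m) throw new Error('Unsupported location: ' + part);
    const start = Number(m[1]);
    const end = m[2] === undefined ? start : Number(m[2]);
    return { start, end };
  });

  return { strand, ranges };
}

const simpleLocation = parseLocation('3..14');
const joinedLocation = parseLocation('complement(join(1..5,10..20))');
console.log(simpleLocation, joinedLocation);
```

Throwing on unsupported forms, rather than guessing, keeps the sketch honest about what it covers; a fuller parser would extend the grammar instead.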

After Meeting the Mentor

  • 1st Meeting (email):

The developer team wrote a first email to the Mentor, Dr. El Mazouari, so that they could get to know both the Project and the Client.

    • The discussion with the client centered on the problem statement. The team prepared for the Mentor, Dr. El Mazouari, a list of questions covering all the doubts they had. The first clarification concerned the nature of the data that has to be handled and the application's end user, who, as a bioinformatician, is interested in Sequence Annotation regardless of which sequence they are working on. The main focus for this data has to be its Visualization and Presentation. Data must be presented in a user-friendly way and be easy to understand.
    • The second point of the discussion was the reason behind the decision to create a new sequence parser when GenBank already exists. The problem here is that GenBank is public, so most companies will not use it and will not upload their sequences to a public DB by default. A huge number of industry sequences are in-house sequences that must be processed in-house.
    • The third point was the need for a Web-Application. Many Desktop Applications already read annotated sequences (mostly in GenBank format), so the team has to develop a Web-Application instead.
    • Fourth, Dr. El Mazouari clarified some doubts about the Bio-Libraries. He introduced two of them: BioJava and BioPerl. The problem with these is that, as their names suggest, they are not written in JavaScript; the team will therefore have to choose between bridging the Java/Perl code to the JavaScript application and writing their own parser.
    • The fifth point in the discussion was the Mockups presented. The Mentor told the team that the only input type they have to accept is the sequence, so they don't need to implement the Search for id feature. Since the final users will be Bioinformaticians and the main focus is to make a user-friendly application for them, Dr. El Mazouari asked the team to remove the Customization Feature so as not to confuse them.
    • Finally, the team was asked to implement an additional feature that makes selected features extracted from an annotated sequence Exportable.
  • 2nd Meeting (FaceTime video-chat):

After the first contact with the Mentor, the team started brainstorming around the problem statement, and the result was a refinement of the functional requirements that Dr. El Mazouari had requested. We then asked for another exchange with the Mentor. Since one of the topics of the discussion concerned biology and the team did not have enough expertise in this field, Tutor Tatyana Goldberg offered to join the FaceTime talk.

Before the discussion we had three main questions. The following section explains our questions and the information given to us by our mentor.

    • First: our first question was about the input, more specifically whether the input was a sequence or a GenBank file. After our conversation with the mentor it was clear that the system should be built in such a way that it only accepts a string. Whether this string comes from a database, a file, or plain text is not important.
    • Second: the second question was about the parser, namely whether any library already provided this functionality. We found that the only available alternatives were written in Java (part of the BioJava project) or in Perl (part of the BioPerl project); nothing written in JavaScript could be reused, so the parser would have to be built from scratch. However, the ideas or the strategy from the Bio projects could be reused.
    • Third: the last question was about the export feature, as it was not clear how (in which format) the features were supposed to be exported. The mentor instructed us that the features should be exportable as plain text and, furthermore, that multiple features should be exportable at once; the application should also provide an interface for selecting which features to export.

As one of the advanced features, the mentor suggested that it would be interesting if multiple sequences could be parsed at once.
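Since every GenBank record ends with a line containing only `//`, parsing multiple sequences at once could start by splitting the input string on that terminator and handing each record to a single-record parser. The sketch below shows this idea; `splitRecords`, `parseAll` and `locusName` are hypothetical names, and `locusName` is only a stand-in for a real record parser.

```javascript
// Hypothetical sketch: split a string holding several GenBank records.
// Each record ends with a line that contains only "//".
function splitRecords(text) {
  return text
    .split(/^\/\/[ \t]*$/m)       // the terminator line itself is dropped
    .map((record) => record.trim())
    .filter((record) => record.length > 0);
}

// Apply any single-record parser to every record in the input.
function parseAll(text, parseRecord) {
  return splitRecords(text).map(parseRecord);
}

// Stand-in parser for the example: read the name from the LOCUS line.
function locusName(record) {
  const m = /^LOCUS\s+(\S+)/m.exec(record);
  return m ? m[1] : null;
}

// Made-up input with abbreviated LOCUS lines.
const twoRecords = [
  'LOCUS       SEQ1   8 bp    DNA',
  'ORIGIN',
  '        1 atgcatgc',
  '//',
  'LOCUS       SEQ2   4 bp    DNA',
  'ORIGIN',
  '        1 ttaa',
  '//',
  ''
].join('\n');

const recordNames = parseAll(twoRecords, locusName);
console.log(recordNames);
```

Passing the record parser in as an argument keeps the splitting logic independent of the parser, which fits the modular design the project aims for.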

One more requirement from the mentor was to make the code as easily reusable as possible. As he develops in Java, the project that we develop should be easy to import into a Java EE project and to use as part of a web application.

Project Links

GenBank Parser

In this section you'll find all the links needed to have access to the 'GenBank-parser':

Visualization Component

In this section you'll find all the links needed to have access to the 'Visualization Component':