Parsing And Visualization Of GenBank

From Protein Prediction 2 Winter Semester 2014
Revision as of 15:24, 17 December 2014 by Ppwikiuser

About the project

GenBank contains many annotated sequences. These sequences can be visualized, and the features that occur in them can be displayed, selected, and exported. Although GenBank is very popular in academia, people in industry tend not to publish annotated sequences; instead, these proprietary sequences are maintained in in-house databases. To visualize such sequences, bioinformaticians depend on proprietary software that is developed as desktop applications rather than web applications. The major problem is that lab technicians lose a lot of time because they cannot visualize a sequence immediately.

The main task of our project is not only to parse and visualize a given GenBank file, but also to build the tool in such a way that it can easily be included in other projects. Beyond these primary goals, there are a few more functional requirements and some features that need to be built into the project; both are described below.


This section describes all the requirements that we have identified after our meeting with our mentors.

GUI mockups

User experience:

  • Quoting the Mentor (Dr. El Mazouari):

“Practical use case: a team is developing a web app that implements in-house algorithms for annotated in-house proprietary sequences. The web app screens the company sequence database for specific set of features. Sequence hits are then annotated and a Genbank output is generated for annotated sequences. At this time, wet-lab users download the annotated sequence in Genbank format and then open it in VectorNTi or MacVector in order to view the annotation map and features. These extra steps are time consuming… If they can view the sequence directly in their browser from the web, they will be more productive and “happy scientists;)” Something that will help them to view the annotated sequence, select the features they want and export them will be very welcome”


  • Select features from annotated input sequences
  • Parse and visualize the input sequence in the GenBank format
  • Export selected features so that they can be worked with later


Due to the nature of the project, the team spent some time designing what they thought would be the most suitable structure for the project. The Class Diagram that the team drew is shown below.

Di Domenico Classdiagram.jpeg

  • Easy to use for the end users
  • Highlight and export features in a user-friendly way
  • Should be easy to integrate into other web applications


MockUps done before contacting the client:

  • GenBank input prior to Mentor's directive
  • GenBank output prior to Mentor's directive

Refactored MockUps resulting from the Mentor's updates:

  • GenBank input after Mentor's directive
  • GenBank output after Mentor's directive

Application design

Expected technical difficulties

  • Implementing the parser
  • Selecting and exporting features dynamically
  • Highlighting multiple features

Fancy libraries you plan to use

  • jQuery
  • D3 (if necessary)
  • BioJava
  • BioJS(?)

Your data

Remarks about your input format

  • The input will be an annotated sequence in the GenBank format


Before Meeting the Mentor

  • 1. Understand the application domain and the logic that we are supposed to implement
  • 2. Understand the input we need to work with (how to convert a sequence in GenBank format to a GenBank file)
  • 3. Parse the GenBank file and extract the features dynamically into JavaScript objects
  • 4. Visualize the results in a user-friendly way in the browser
  • 5. Make the features exportable
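
Step 3 above could look roughly like the following. This is only a minimal sketch, not the project's parser: the function name `parseGenBankFeatures`, the shape of the returned objects, and the sample record are assumptions made for illustration. It relies on the GenBank flat-file layout, in which feature keys are indented by five spaces and qualifiers start with `/`, and it joins wrapped lines naively.

```javascript
// Minimal sketch (hypothetical helper, not the project's actual code):
// parse the FEATURES table of one GenBank record into plain objects.
function parseGenBankFeatures(text) {
  const features = [];
  let inFeatures = false;
  let current = null;       // feature currently being filled
  let lastQualifier = null; // name of the qualifier a wrapped line belongs to

  for (const line of text.split(/\r?\n/)) {
    if (line.startsWith("FEATURES")) {
      inFeatures = true;
      continue;
    }
    if (!inFeatures) continue;
    // A new top-level keyword (e.g. ORIGIN) or the "//" terminator ends the table.
    if (/^\S/.test(line)) break;

    // Feature key lines: five spaces, the key, then the location.
    const keyMatch = /^ {5}(\S+)\s+(\S.*)$/.exec(line);
    if (keyMatch) {
      current = { key: keyMatch[1], location: keyMatch[2].trim(), qualifiers: {} };
      features.push(current);
      lastQualifier = null;
      continue;
    }
    if (!current) continue;

    const content = line.trim();
    if (content === "") continue;
    const qualMatch = /^\/([^=\s]+)(?:=(.*))?$/.exec(content);
    if (qualMatch) {
      lastQualifier = qualMatch[1];
      // Qualifiers without a value (flags) are stored as "true".
      const raw = qualMatch[2] === undefined ? "true" : qualMatch[2];
      current.qualifiers[lastQualifier] = raw.replace(/^"|"$/g, "");
    } else if (lastQualifier) {
      // Naive handling of a wrapped qualifier value: join with a space.
      current.qualifiers[lastQualifier] += " " + content.replace(/"$/, "");
    } else {
      // Wrapped location line.
      current.location += content;
    }
  }
  return features;
}

// Made-up sample record, for illustration only.
const sampleRecord = [
  "LOCUS       DEMO                      30 bp    DNA     linear   SYN 01-JAN-2014",
  "FEATURES             Location/Qualifiers",
  "     source          1..30",
  '                     /organism="synthetic construct"',
  "     CDS             4..21",
  '                     /gene="demo"',
  "ORIGIN",
  "        1 atgaaacgca ttagcaccac cattaccacc",
  "//",
].join("\n");

console.log(parseGenBankFeatures(sampleRecord));
```

Because the function takes a string and returns plain objects, the visualization and export steps can work on the result without knowing where the record came from.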

After Meeting the Mentor

  • 1st Meeting (email):

The developer team wrote a first email to the Mentor, Dr. El Mazouari, so that they could get to know both the project and the client.

    • The discussion with the client centered on the problem statement. The team prepared a list of questions for the Mentor, Dr. El Mazouari, covering all the doubts they had. The first clarification concerned the nature of the data to be handled and the application's end user, who, as a bioinformatician, is interested in sequence annotation regardless of which sequence he is working on. The main focus for this data must be its visualization and presentation: the data must be presented in a user-friendly way that is easy to understand.
    • The second point of the discussion was the reason for creating a new sequence parser when GenBank already exists. The problem is that GenBank is public, so most companies will not use it and will not upload their sequences to a public database by default. A huge number of industry sequences are in-house sequences that must be processed in-house.
    • The third point was the need for a web application. There are already many desktop applications that read annotated sequences (mostly in GenBank format), so the team has to develop a web application instead.
    • Fourth, Dr. El Mazouari clarified some doubts about the bio-libraries. He introduced two of them: BioJava and BioPerl. As their names suggest, they are not written in JavaScript, so the team will have to choose between bridging the Java/Perl code with the JavaScript application and writing their own parser.
    • The fifth point of the discussion was the mockups presented. The Mentor told the team that the only input type they have to accept is the sequence, so they do not need to implement the search-by-ID feature. Since the final users will be bioinformaticians and the main focus is a user-friendly application for them, Dr. El Mazouari asked the team to remove the customization feature so as not to confuse them.
    • Finally, the team was asked to implement an additional feature: making selected features extracted from an annotated sequence exportable.
  • 2nd Meeting (FaceTime video-chat):

After the first contact with the Mentor, the team started brainstorming around the problem statement, which resulted in a refinement of the functional requirements that Dr. El Mazouari had requested. We then asked for another exchange with the Mentor. Since one of the discussion topics concerned biology and the team did not have enough expertise in this field, Tutor Tatyana Goldberg offered to join the FaceTime talk.

Before the discussion we had three main questions. The following list explains our questions and the information our mentor gave us.

    • First: Our first question concerned the input, specifically whether the input would be a sequence or a GenBank file. After our conversation with the mentor it was clear that the system should be built so that it only accepts a string. Whether this string comes from a database, a file, or plain text is not important.
    • Second: Our second question was about the parser, namely whether any library already provided this functionality. We found that the only available alternatives were written in Java (part of the BioJava project) or in Perl (part of the BioPerl project); nothing written in JavaScript could be reused, so the parser has to be built from scratch. However, the ideas or strategy from the Bio* projects can be reused.
    • Third: Our last question was about the export feature, since it was not clear in which format the features were supposed to be exported. The mentor instructed us that the features should be exportable as plain text and, furthermore, that multiple features should be exportable at once. The application should also provide an interface for selecting which features to export.
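
The export requirement from the third question could be sketched as follows. The mentor only specified plain text, so the text layout used here, the function name `exportFeatures`, and the feature object shape (`key`, `location`, `qualifiers`) are assumptions for illustration. The subsequence is only attached for simple `start..end` locations; compound locations such as `join(...)` or `complement(...)` are exported without it.

```javascript
// Minimal sketch (hypothetical helper): export several selected features
// at once as plain text. `selectedIndexes` would come from the selection UI.
function exportFeatures(features, selectedIndexes, sequence) {
  return selectedIndexes
    .map((i) => features[i])
    .filter((f) => f !== undefined) // ignore indexes that do not exist
    .map((f) => {
      const lines = [`${f.key} ${f.location}`];
      for (const [name, value] of Object.entries(f.qualifiers || {})) {
        lines.push(`  /${name}=${value}`);
      }
      const simple = /^(\d+)\.\.(\d+)$/.exec(f.location);
      if (simple && sequence) {
        // GenBank locations are 1-based and inclusive on both ends.
        const start = Number(simple[1]) - 1;
        const end = Number(simple[2]);
        lines.push(`  sequence=${sequence.slice(start, end)}`);
      }
      return lines.join("\n");
    })
    .join("\n\n"); // blank line between exported features
}

// Made-up data, for illustration only.
const demoFeatures = [
  { key: "source", location: "1..30", qualifiers: { organism: "synthetic construct" } },
  { key: "CDS", location: "4..21", qualifiers: { gene: "demo" } },
];
const demoSequence = "atgaaacgcattagcaccaccattaccacc";

console.log(exportFeatures(demoFeatures, [0, 1], demoSequence));
```

Returning a single string keeps the function independent of how the text is delivered to the user, for example as a download or in a text area.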

As one of the advanced features, the mentor suggested that it would be interesting if multiple sequences could be parsed at once.
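
One possible way to support several sequences in one input string is to split it into records first, since each GenBank record ends with a line containing only `//`. The sketch below is an assumption about how this could be done; the function name `splitGenBankRecords` is hypothetical, and each returned string would then be handed to the single-record parser.

```javascript
// Minimal sketch (hypothetical helper): split a string that holds several
// GenBank records into one string per record, using the "//" terminator line.
function splitGenBankRecords(text) {
  const records = [];
  let current = [];
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "//") {
      if (current.length > 0) records.push(current.join("\n") + "\n//");
      current = [];
    } else if (line.trim() !== "" || current.length > 0) {
      // Skip blank lines between records, keep them inside a record.
      current.push(line);
    }
  }
  // Keep a trailing record even if its terminator is missing.
  if (current.some((l) => l.trim() !== "")) records.push(current.join("\n"));
  return records;
}

// Made-up, heavily shortened input, for illustration only.
const twoRecords = "LOCUS A\nORIGIN\n//\nLOCUS B\nORIGIN\n//\n";
console.log(splitGenBankRecords(twoRecords).length);
```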

One more requirement from the mentor was to make the code as easily reusable as possible. As he develops in Java, the project that we develop should be easy to import into a Java EE project and to use as part of a web application.