Difference between revisions of "IAnnSolrMigration"

From Protein Prediction 2 Winter Semester 2014

Revision as of 22:09, 1 January 2015


Introduction


iAnn is a platform that collects and distributes data about events such as symposiums, presentations, workshops and conferences.

The aims of the iAnn platform are:

  • To save scientific organizations the time, effort and cost of manual announcement curation.
  • To facilitate the dissemination, sharing and promotion of scientific announcements.
  • To avoid redundancy in the manual curation of announcement data.

Problem description

When fetching data from different sources, one needs to standardize it and name every property so that the name itself is descriptive. Naming examples: submission_date, keywords, host_institution.

However, naming is part of the agreement in the SASI document [sasi link] and is straightforward. The challenge is to migrate all imported data from the old schema to the new one. Solr does not provide an automated way to do this, so custom scripts have to be written. While working on the scripts we considered several requirements:

  • scripts must be easy to modify
  • they must handle type conversions while migrating
  • they must handle different naming structures
  • they must provide some feedback on progress while migrating
  • scripts must be configured on the Linode virtual server
  • Solr must be migrated from 4.1.0 to 4.10.2
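
The first, third and fourth requirements can be illustrated with a minimal Python sketch. It is not taken from the project's scripts: the names `FIELD_MAP`, `rename_fields` and `migrate` are hypothetical, and the old-to-new name pairs are illustrative guesses rather than the mapping the project actually used. Keeping the mapping in one dictionary is one way to keep a script easy to modify, and the `report` callback gives progress feedback.

```python
# Hypothetical mapping from old field names to new (SASI) field names.
# These pairs are illustrative guesses, not the project's actual mapping.
FIELD_MAP = {
    "id": "id",
    "title": "title",
    "start": "starts",
    "end": "ends",
    "link": "url",
    "keyword": "topic",
}


def rename_fields(old_doc, field_map=FIELD_MAP):
    """Return a new document holding only the mapped fields, under their new names."""
    return {new: old_doc[old] for old, new in field_map.items() if old in old_doc}


def migrate(docs, field_map=FIELD_MAP, report=print, every=100):
    """Rename fields in every document and report progress as the run advances."""
    docs = list(docs)
    migrated = []
    for count, doc in enumerate(docs, start=1):
        migrated.append(rename_fields(doc, field_map))
        # Report every `every` documents and once more at the very end.
        if count % every == 0 or count == len(docs):
            report(f"migrated {count}/{len(docs)} documents")
    return migrated
```

Fields that are missing from the mapping are dropped silently in this sketch; a real script would probably log them instead.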

Solutions required

We decided to divide our work into three separate tasks:

  • Creating/configuring the SASI Solr schema and upgrading the Solr service to the new version
  • Writing scripts for data reindexing
  • Creating a Web tool for easier control during migration

Project 1: Creating/configuring the SASI Solr schema and upgrading the Solr service to the new version

We took the SASI document (sasi03.xml) as the agreement on the naming structure of our data API. The old schema has to be updated and a new collection has to be created in the database. In addition, the Solr service has to be reconfigured and migrated from version 4.1.0 to 4.10.2.

Project 2: Writing scripts for data reindexing

Once the new schema and the upgraded Solr are running, the old data has to be fetched, modified and saved to the new collection. For this purpose we focused on writing scripts for Solr migration.
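
The fetch and save steps could look roughly like the following Python sketch, which uses only the standard library. It assumes the standard Solr HTTP endpoints (`select` with `wt=json` for reading, `update` with a JSON body for writing). The base URL, the collection names and all function names are hypothetical and do not come from the project's repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def select_url(base_url, collection, start, rows):
    """Build a Solr select URL that returns one page of documents as JSON."""
    query = urlencode({"q": "*:*", "wt": "json", "start": start, "rows": rows})
    return f"{base_url}/{collection}/select?{query}"


def http_get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(base_url, collection, rows=100, get_json=http_get_json):
    """Yield every document of a collection, one page at a time."""
    start = 0
    while True:
        page = get_json(select_url(base_url, collection, start, rows))["response"]
        yield from page["docs"]
        start += rows
        # Stop once the last page has been read (or nothing came back).
        if start >= page["numFound"] or not page["docs"]:
            break


def index_request(base_url, collection, docs):
    """Build the POST request that sends documents to the new collection."""
    return Request(
        f"{base_url}/{collection}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```

A run would iterate over `fetch_all("http://localhost:8983/solr", "old_collection")`, modify each document, and pass batches to `urlopen(index_request(...))`. The `_version_` field should probably be dropped before re-indexing, since Solr interprets a supplied `_version_` value as an optimistic-concurrency check.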

Project 3: Creating Web tool for easier control during migration

To make the process more stable, we focused on building a Web user interface where the user can trigger the migration from a browser. The tool should have a RESTful architecture, communicate with the web server and guide the user through the process. It should also let the user link old and new field names from the schemas, and it should support automatic type detection and conversion. An example of automatic detection and conversion is migrating the array field keywords to a string: the tool should detect the array type, convert the value to a string and then copy it to the new collection.
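
The array-to-string case described above can be sketched in Python as follows. This is a hedged illustration, not the tool's actual code: the function names, the coarse type names and the `", "` separator are all assumptions.

```python
def detect_type(value):
    """Classify a stored value into one of a few coarse type names."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (list, tuple)):
        return "array"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def convert(value, target):
    """Convert a value to the target type, joining arrays when a string is wanted."""
    source = detect_type(value)
    if source == target:
        return value
    if target == "string":
        if source == "array":
            # Assumed separator; the real tool may join values differently.
            return ", ".join(str(item) for item in value)
        return str(value)
    if target == "array":
        return [value]
    raise ValueError(f"no conversion from {source} to {target}")
```

For instance, `convert(["solr", "migration"], "string")` joins the two keywords into a single comma-separated string before the value is copied to the new collection.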

Project progress


Week 1 (status: completed)
  • Understanding requirements.
  • Installing Solr and configuring the virtual environment.
  • Running Solr 4.1.0 with the old schema and understanding it.
  • Meetings.
  • Creating a GitHub repository.

Week 2 (status: completed)
  • Splitting the idea into multiple tasks and writing the proposal.
  • Presenting the proposal to Rafael and discussing changes.
  • Creating the new schema.
  • Installing and configuring Solr 4.10.2.
  • Meeting with Rafael.

Week 3 (status: completed)
  • Getting feedback on the schema.
  • Updating the new schema.
  • Creating a new repository and writing README instructions.
  • Meeting with Rafael.

Week 4 (status: completed)
  • Script for reading Solr data.
  • Script for writing temporary JSON files.
  • Script for indexing JSON files.
  • Correcting the new schema.
  • Starting to build the user interface.
  • Meeting with Rafael.

Week 5 (status: completed)
  • Dividing the UI into three parts: configuration, fetching and migration.
  • Adding an option to upload the old and new schemas, with automatic XML parsing in the browser.
  • Adding functionality to select related parsed names from the old and new schemas.
  • Adding support for automatic detection and conversion of the most common types.
  • Configuring the Linode virtual machine and running the scripts and the Web user interface on it.
  • Testing.
  • Sending a status email to Rafael.

Week 6 (status: completed)
  • Updating wiki pages.
  • Communicating with Tatyana regarding my absence during the final presentation.
  • Preparing the presentation.
  • Handing in documents for the final presentation and communicating with Rafael, Tatyana and the other two team members.

Proposal for Schema changes


The current version of the iAnn website delivers its service using a custom schema for naming data attributes. However, that schema is not standardized and has no default field naming scheme. Therefore, the community agreed on a default schema layout, named SASI. In this document we point out the differences and describe the new schema fields. As the current iAnn collection runs on Apache Solr 4.1 and the latest release, which includes some major bug fixes, is 4.10.2, we decided to migrate and update our service as well.

Accomplished tasks:

  • Understanding the existing schema
  • Proposing structure changes
  • Highlighting differences


Old schema:

  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="subtitle" type="text" indexed="true" stored="true"/>
  <field name="description" type="string" indexed="false" stored="true"/>
  <field name="provider" type="text_lowercase" indexed="true" stored="true"/>
  <field name="link" type="string" indexed="false" stored="true"/>
  <field name="start" type="tdate" indexed="true" stored="true"/>
  <field name="end" type="tdate" indexed="true" stored="true"/>
  <field name="venue" type="text_lowercase" indexed="true" stored="true"/> 
  <field name="city" type="text_lowercase" indexed="true" stored="true"/>
  <field name="county" type="string" indexed="false" stored="true"/>
  <field name="country" type="text_lowercase" indexed="true" stored="true"/>
  <field name="postcode" type="string" indexed="false" stored="true"/>
  <field name="attachment" type="string" indexed="false" stored="true" multiValued="true"/>
  <field name="image" type="string" indexed="false" stored="true" multiValued="true"/>
  <field name="keyword" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
  <field name="category" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
  <field name="field" type="text_lowercase" indexed="true" stored="true" multiValued="true"/> 
  <field name="submission_name" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
  <field name="submission_email" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="submission_date" type="tdate" indexed="true" stored="true" multiValued="true"/>
  <field name="submission_comment" type="text" indexed="false" stored="true" multiValued="true"/>
  <field name="submission_organization" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="latitude" type="double" indexed="true" stored="true"/>
  <field name="longitude" type="double" indexed="true" stored="true"/>
  <field name="sponsor" type="text_lowercase" indexed="true" stored="true"/>
  <field name="public" type="boolean" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>


New schema:

  <field name="id" type="text" indexed="true" stored="true" required="true" />
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="subtitle" type="text" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="false" stored="true"/>
  <field name="prerequisites" type="text" indexed="false" stored="true"/>
  <field name="programme" type="programme" indexed="false" stored="true"/>
  <field name="comments" type="text" indexed="false" stored="true"/>
  <field name="fees" type="fees" indexed="false" stored="true" multiValued="true"/>
  <field name="discount" type="discount" indexed="false" stored="true"/>
  <field name="accreditation" type="text" indexed="false" stored="true"/>
  <field name="status" type="text" indexed="false" stored="true"/>
  <field name="eligibility" type="text" indexed="false" stored="true" multiValued="true"/>
  <field name="capacity" type="int" indexed="false" stored="true"/>
  <field name="contact" type="person" indexed="false" stored="true" multiValued="true"/>
  <field name="submitter" type="person" indexed="false" stored="true" multiValued="true"/>
  <field name="organizers" type="person" indexed="false" stored="true" multiValued="true"/>
  <field name="speakers" type="person" indexed="false" stored="true" multiValued="true"/>
  <field name="host_institution" type="organization" indexed="false" stored="true" multiValued="true"/>
  <field name="sponsor" type="organization" indexed="false" stored="true" multiValued="true"/>
  <field name="venue" type="text" indexed="false" stored="true"/>
  <field name="street_address" type="text" indexed="false" stored="true"/>
  <field name="city" type="text" indexed="false" stored="true"/>
  <field name="province" type="text" indexed="false" stored="true"/>
  <field name="country" type="text" indexed="false" stored="true"/>
  <field name="postcode" type="text" indexed="false" stored="true"/>
  <field name="post_office_box" type="text" indexed="false" stored="true"/>
  <field name="url" type="link" indexed="false" stored="true"/>
  <field name="attachment" type="link" indexed="false" stored="true" multiValued="true"/>
  <field name="social_media" type="link" indexed="false" stored="true" multiValued="true"/>
  <field name="starts" type="date" indexed="false" stored="true"/>
  <field name="ends" type="date" indexed="false" stored="true"/>
  <field name="time_zone" type="text" indexed="false" stored="true"/>
  <field name="last_update" type="date" indexed="false" stored="true" multiValued="true"/>
  <field name="deadlines" type="date" indexed="false" stored="true" multiValued="true"/>
  <field name="registration_opens_date" type="date" indexed="false" stored="true" multiValued="true"/>
  <field name="acceptance_notification_date" type="date" indexed="false" stored="true" multiValued="true"/>
  <field name="type" type="text" indexed="false" stored="true" multiValued="true"/>
  <field name="topic" type="text" indexed="false" stored="true" multiValued="true"/>
  <field name="public" type="boolean" indexed="false" stored="true"/>
  <field name="target_audience" type="text" indexed="false" stored="true" multiValued="true"/>
  <field name="spotlight" type="boolean" indexed="false" stored="true"/>
  <field name="latitude" type="double" indexed="true" stored="true"/>
  <field name="longitude" type="double" indexed="true" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
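
The differences between the two listings can also be found programmatically. The sketch below is a Python illustration under stated assumptions: it parses two schema fragments and reports which fields were removed, added or changed type. The sample fragments are shortened copies of a few lines from the listings above, wrapped in a `<fields>` element so that they parse as XML. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def field_types(schema_xml):
    """Map each field name to its declared type in a schema fragment."""
    root = ET.fromstring(schema_xml)
    return {f.get("name"): f.get("type") for f in root.iter("field")}


def compare(old_xml, new_xml):
    """List fields that were removed, added, or kept under a different type."""
    old, new = field_types(old_xml), field_types(new_xml)
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "retyped": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


# Shortened fragments taken from the old and new listings above.
OLD = """<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="start" type="tdate" indexed="true" stored="true"/>
  <field name="latitude" type="double" indexed="true" stored="true"/>
</fields>"""

NEW = """<fields>
  <field name="id" type="text" indexed="true" stored="true" required="true"/>
  <field name="starts" type="date" indexed="false" stored="true"/>
  <field name="latitude" type="double" indexed="true" stored="true"/>
</fields>"""

print(compare(OLD, NEW))
```

On these fragments, `start` is reported as removed, `starts` as added and `id` as retyped. Deciding that `start` and `starts` are the same field under a new name remains a manual step.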


Conclusion:

With the newly proposed schema, iAnn will be able to serve more diverse data. This will serve as a basis for new services on top of the Solr search engine. In addition, migrating to a newer version of Solr, with its bug fixes, should make the service safer.

GitHub Repository

GitHub : iANN_Solr