IAnnSolrMigration
Introduction
iAnn is a platform that collects and distributes data about scientific events such as symposia, presentations, workshops and conferences.
The aims of the iAnn platform are:
- To save time, effort and cost of manual announcement curation for scientific organizations.
- To facilitate dissemination, sharing and promotion of scientific announcements.
- To avoid redundancy in manual curation of announcement data.
Problem description
When fetching data from different sources, the data has to be standardized and every property given a descriptive, self-explanatory name. Naming examples: submission_date, keywords, host_institution.
The naming itself is straightforward, because it is fixed by the agreement in the SASI document [sasi link]. The challenge is to migrate all imported data from the old schema to the new one. Solr does not provide an automated way to do this, so custom scripts have to be written. While working on these scripts we considered several requirements:
- the scripts must be easy to modify
- they must handle type conversions during migration
- they must handle different naming structures
- they must provide progress feedback during migration
- they must be configured on the Linode virtual server
- the migration from Solr 4.1.0 to 4.10.2 must be carried out
Solutions required
We decided to divide our work into three separate tasks:
- Creating/configuring the SASI Solr schema and upgrading the Solr service to the new version
- Writing scripts for data reindexing
- Creating a Web tool for easier control during migration
Project 1: Creating/configuring the SASI Solr schema and upgrading the Solr service to the new version
We treated the SASI document (sasi03.xml) as the agreement on the naming structure of our data API. The old schema has to be updated and a new collection has to be created in the database. In addition, the Solr service has to be reconfigured and migrated from version 4.1.0 to 4.10.2.
Project 2: Writing scripts for data reindexing
Once the new schema is in place and the upgraded Solr is running, the old data has to be fetched, modified and saved to the new collection. For this purpose we focused on creating scripts for Solr migration.
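The fetching step described above can be sketched as paging through the old collection with Solr's standard select handler and its `q`, `start`, `rows` and `wt` parameters. This is a minimal Python illustration, not the project's PHP code; the base URL, the collection name `iann_old` and the page size are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Hypothetical location of the old collection; adjust to the real installation.
OLD_SELECT_URL = "http://localhost:8983/solr/iann_old/select"


def page_urls(total_docs, rows=100, base_url=OLD_SELECT_URL):
    """Yield one select URL per page, using Solr's start/rows paging."""
    for start in range(0, total_docs, rows):
        params = {"q": "*:*", "wt": "json", "start": start, "rows": rows}
        yield base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # 250 documents at 100 per page need three requests.
    for url in page_urls(250):
        print(url)
```

Each URL can then be requested in turn, and the `response.docs` array of every JSON reply handed to the modification step.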
Project 3: Creating a Web tool for easier control during migration
To make the process more robust, we focused on building a Web user interface in which the user can trigger the migration from a browser. The tool should have a RESTful architecture, communicate with the web server and guide the user through the process. It should also let the user map old schema field names to new ones, and it should detect and convert types automatically. For example, when migrating the array field keywords to a string field, the tool should detect the array type, convert the value to a string and then copy it to the new collection.
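The array-to-string case can be illustrated with a small Python sketch. It is not the tool's actual code (which runs in the browser); the function names `detect_type` and `convert`, the type labels and the `", "` separator are assumptions made for this example.

```python
def detect_type(value):
    """Return a coarse type label for a stored value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, list):
        return "array"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def convert(value, target, separator=", "):
    """Convert value to the target type label; raise ValueError if unsupported."""
    source = detect_type(value)
    if source == target:
        return value
    if target == "string":
        if source == "array":
            # Join the array items into a single string.
            return separator.join(str(item) for item in value)
        return str(value)
    if target == "array":
        # Wrap a single value so it fits a multi-valued field.
        return [value]
    raise ValueError(f"cannot convert {source} to {target}")


if __name__ == "__main__":
    print(convert(["genomics", "proteomics"], "string"))
```

Raising an error for unsupported conversions, rather than guessing, lets the calling code report the failure to the user instead of writing a wrong value to the new collection.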
Project progress
Week number | Completed tasks | Status |
---|---|---|
Week 1 | | |
Week 2 | | |
Week 3 | | |
Week 4 | | |
Week 5 | | |
Week 6 | | |
Deliverable 1: Proposal for Schema changes and updating solr to 4.10.2
The current version of the iAnn website delivers its service using a custom schema for naming data attributes. That schema is not standardized and does not follow a default field naming scheme. The community therefore agreed on a default schema layout, named SASI. This document points out the differences and describes the new schema fields. Because the current iAnn collection runs on Apache Solr 4.1 and the latest release, which includes some major bug fixes, is 4.10.2, we decided to migrate and update our service as well.
Accomplished tasks:
- Understanding the existing schema,
- Proposing structure changes and
- Highlighting differences
Old Schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text" indexed="true" stored="true"/>
<field name="subtitle" type="text" indexed="true" stored="true"/>
<field name="description" type="string" indexed="false" stored="true"/>
<field name="provider" type="text_lowercase" indexed="true" stored="true"/>
<field name="link" type="string" indexed="false" stored="true"/>
<field name="start" type="tdate" indexed="true" stored="true"/>
<field name="end" type="tdate" indexed="true" stored="true"/>
<field name="venue" type="text_lowercase" indexed="true" stored="true"/>
<field name="city" type="text_lowercase" indexed="true" stored="true"/>
<field name="county" type="string" indexed="false" stored="true"/>
<field name="country" type="text_lowercase" indexed="true" stored="true"/>
<field name="postcode" type="string" indexed="false" stored="true"/>
<field name="attachment" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="image" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="keyword" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
<field name="category" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
<field name="field" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
<field name="submission_name" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
<field name="submission_email" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="submission_date" type="tdate" indexed="true" stored="true" multiValued="true"/>
<field name="submission_comment" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="submission_organization" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="latitude" type="double" indexed="true" stored="true"/>
<field name="longitude" type="double" indexed="true" stored="true"/>
<field name="sponsor" type="text_lowercase" indexed="true" stored="true"/>
<field name="public" type="boolean" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
New Schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text" indexed="true" stored="true"/>
<field name="subtitle" type="text" indexed="true" stored="true"/>
<field name="description" type="string" indexed="false" stored="true"/>
<field name="prerequisites" type="text" indexed="false" stored="true"/>
<field name="programme" type="text" indexed="false" stored="true"/>
<field name="comments" type="text" indexed="false" stored="true"/>
<field name="fees" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="discount" type="text_lowercase" indexed="false" stored="true"/>
<field name="accreditation" type="text_lowercase" indexed="false" stored="true"/>
<field name="status" type="text_lowercase" indexed="false" stored="true"/>
<field name="eligibility" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="capacity" type="long" indexed="false" stored="true"/>
<field name="contact_name" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="contact_email" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="submitter_name" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="submitter_email" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="submitter_date" type="tdate" indexed="false" stored="true" multiValued="true"/>
<field name="submitter_comment" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="submitter_organization" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="organizers_name" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="organizers_email" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="speakers_name" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="speakers_email" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="host_institution_name" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="host_institution_description" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="host_institution_url" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="sponsor_name" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="sponsor_description" type="text" indexed="false" stored="true" multiValued="true"/>
<field name="sponsor_url" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="venue" type="text_lowercase" indexed="false" stored="true"/>
<field name="street_address" type="text_lowercase" indexed="false" stored="true"/>
<field name="city" type="text_lowercase" indexed="false" stored="true"/>
<field name="province" type="text_lowercase" indexed="false" stored="true"/>
<field name="country" type="string" indexed="false" stored="true"/>
<field name="postcode" type="string" indexed="false" stored="true"/>
<field name="post_office_box" type="string" indexed="false" stored="true"/>
<field name="url" type="string" indexed="false" stored="true"/>
<field name="attachment" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="social_media" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="starts" type="tdate" indexed="false" stored="true"/>
<field name="ends" type="tdate" indexed="false" stored="true"/>
<field name="time_zone" type="text" indexed="false" stored="true"/>
<field name="last_update" type="tdate" indexed="false" stored="true" multiValued="true"/>
<field name="deadlines" type="tdate" indexed="false" stored="true" multiValued="true"/>
<field name="registration_opens_date" type="tdate" indexed="false" stored="true" multiValued="true"/>
<field name="acceptance_notification_date" type="tdate" indexed="false" stored="true" multiValued="true"/>
<field name="type" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="topic" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="keyword" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="target_audience" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
<field name="spotlight" type="boolean" indexed="false" stored="true"/>
<field name="latitude" type="double" indexed="true" stored="true"/>
<field name="longitude" type="double" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
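The differences between the two field listings above can also be computed rather than read off by eye. The following Python sketch parses a run of `<field .../>` elements and reports which names were removed, added or kept; the helper names `field_names` and `schema_diff` are made up for this example, and the excerpts in the demo are shortened versions of the listings.

```python
import xml.etree.ElementTree as ET


def field_names(schema_fragment):
    """Return the set of field names declared in a run of <field .../> elements."""
    # Wrap the fragment so that it parses as a single XML document.
    root = ET.fromstring("<fields>" + schema_fragment + "</fields>")
    return {field.get("name") for field in root.iter("field")}


def schema_diff(old_fragment, new_fragment):
    """Return (removed, added, kept) field names, each as a sorted list."""
    old, new = field_names(old_fragment), field_names(new_fragment)
    return sorted(old - new), sorted(new - old), sorted(old & new)


if __name__ == "__main__":
    old_excerpt = """
        <field name="id" type="string" indexed="true" stored="true" required="true" />
        <field name="start" type="tdate" indexed="true" stored="true"/>
        <field name="keyword" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
    """
    new_excerpt = """
        <field name="id" type="string" indexed="true" stored="true" required="true" />
        <field name="starts" type="tdate" indexed="false" stored="true"/>
        <field name="keyword" type="text_lowercase" indexed="false" stored="true" multiValued="true"/>
    """
    removed, added, kept = schema_diff(old_excerpt, new_excerpt)
    print("removed:", removed)
    print("added:", added)
    print("kept:", kept)
```

A name-only comparison does not show that a kept field changed its `type` or `indexed` attribute; comparing the attribute dictionaries of the kept fields would be a natural extension.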
Conclusion:
With the newly proposed schema, iAnn will be able to serve more diverse data. This will serve as a basis for new services on top of the Solr search engine. In addition, migrating to a newer version of Solr will make the service safer.
GitHub Repository
GitHub : GIT REPOSITORY
Link to proposal
Google Docs: PROPOSAL DOCUMENT
Deliverable 2: Migration scripts
We divided our scripting into four parts/scripts:
- select.php - Script for fetching the data from the old schema collection.
- save.php - Script for saving the modified JSON documents, with the new schema naming, to disk.
- update.php - Script for preparing the saved JSON files in the format required for entry into the database, and for indexing them.
- config.php - Configuration script containing the paths and shared variables used by the other scripts.
The first goal was to isolate the functionality of each script, so that we do not need to modify all of them when moving to a new Solr installation, e.g. Solr 5.0. In that case we might only need to modify the script that inserts the new values, update.php. The second goal was to configure all scripts in one place (config.php), so that they work out of the box.
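The save and update steps described above can be sketched in Python as follows. This is an illustration of the idea only, not a port of save.php or update.php: the `FIELD_MAP` entries, the one-file-per-document layout and the decision to drop `_version_` are assumptions, and the sketch assumes document ids are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical old-name -> new-name mapping, for illustration only.
FIELD_MAP = {"start": "starts", "end": "ends", "submission_name": "submitter_name"}


def rename_fields(doc, field_map=FIELD_MAP):
    """Return a copy of doc with old field names replaced by new ones."""
    renamed = {field_map.get(name, name): value for name, value in doc.items()}
    # Assumption: drop the old index's _version_ so the new collection assigns its own.
    renamed.pop("_version_", None)
    return renamed


def save_docs(docs, out_dir):
    """Write each renamed document to <out_dir>/<id>.json and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for doc in docs:
        path = out_dir / f"{doc['id']}.json"
        path.write_text(json.dumps(rename_fields(doc)), encoding="utf-8")
        paths.append(path)
    return paths


def build_update_payload(paths):
    """Combine the saved files into one JSON array of documents."""
    return json.dumps([json.loads(p.read_text(encoding="utf-8")) for p in paths])


if __name__ == "__main__":
    docs = [{"id": "a1", "title": "Workshop", "start": "2015-01-01T00:00:00Z", "_version_": 1}]
    with tempfile.TemporaryDirectory() as tmp:
        print(build_update_payload(save_docs(docs, tmp)))
```

The resulting JSON array is the shape that Solr's JSON update handler accepts when posted with a JSON content type. Keeping the renaming and the payload building in separate functions mirrors the goal of touching only the update step when the Solr version changes.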
Example list of saved indexes in JSON
Example of modified and saved JSON file content
Example of cached JSON file content: prepared for indexing
GitHub Repository
GitHub : SOLR SCHEMA MIGRATION GIT REPOSITORY
Link to Live Solr 4.10.2
Solr: LIVE SOLR EXAMPLE
Link to Migration User Interface
User Interface: LIVE SOLR SCHEMA MIGRATION USER INTERFACE
Deliverable 3: User Interface for controlling scripts (BONUS)
We divided our Web user interface into three parts:
- Configuration - for configuring the script endpoints and uploading the old and new schemas.
- Fetching - for getting the data from the old Solr collection.
- Migration - for modifying the data and migrating it to the new collection (reindexing).
This functionality is a bonus. The goals of the Web user interface are:
- We are not aware of any existing Web solution that lets a user step through a Solr migration with clicks. Creating such a tool was the first goal.
- The user can configure everything in the browser and adapt the migration to specific needs. The migration can be carried out by a plain web user rather than a programmer.
- We move complexity from the server to the browser: all logic, such as type detection, conversion and visualisation, runs in the browser. This keeps the migration efficient, since we do not overload the server.
- Control logic and visualisation logic are separated. The architecture is RESTful.
- Responsive UI.
- If the types in the old and new schema are the same, the select dropdown has a green border. If the types differ but the value was converted successfully, the border is yellow. If the types differ and the conversion failed, the border is red.
- Additionally, it is possible to merge the values of two different data sources into one (even if they are values of different types).
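The border colouring and the merging of two sources, as described in the list above, can be sketched in Python. The function names are invented for this example, and the merge strategy (collect both sources into one list of strings) is an assumption; the real tool may combine values differently.

```python
def border_color(old_type, new_type, converted_ok):
    """Map a field pairing to the border colour of its select dropdown."""
    if old_type == new_type:
        return "green"
    # Types differ: yellow if the conversion worked, red otherwise.
    return "yellow" if converted_ok else "red"


def merge_values(first, second):
    """Merge two source values into one list, turning items into strings
    so that values of different types can sit side by side."""
    merged = []
    for value in (first, second):
        items = value if isinstance(value, list) else [value]
        merged.extend(str(item) for item in items if item is not None)
    return merged


if __name__ == "__main__":
    print(border_color("array", "string", converted_ok=True))
    print(merge_values(["genomics", "proteomics"], 2015))
```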
Example of Fetching and Saving page
Example of Migration page
GitHub Repository
GitHub : SOLR SCHEMA MIGRATION GIT REPOSITORY
Link to Live Solr 4.10.2
Solr: LIVE SOLR EXAMPLE
Link to Migration User Interface
User Interface: LIVE SOLR SCHEMA MIGRATION USER INTERFACE