IPTC EXTRA - The EXTraction Rules Apparatus

An Open Source Platform for Rules-Based Classification of News Content

High Level Plan for Initial Development and Rollout

We envision the following phases for the design and development of EXTRA. The phases are listed in their likely order of execution, but may well overlap.

Recruit EXTRA Team

Hire a developer experienced in natural language processing
Hire a linguist for writing the rules and testing the EXTRA platform
Gather Steering Committee of news publishers from IPTC and beyond

Evaluate existing open source projects and frameworks

Survey other open source efforts, including GATE, UIMA, NLTK, OpenNLP and SRILM, to see whether any could accelerate the development of EXTRA

Design and develop technical approach

Design high-level technical approach, select implementation technologies
Design EXTRA API for maintaining rule sets and classifying documents
Decide which two languages will be supported by the initial prototype
Assemble and annotate two test corpora, one for each language, with desired taxonomy
Design the rule language and rule sets for applying the taxonomy to the two corpora
Develop a minimum viable rules engine
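
To make the idea of a minimum viable rules engine concrete, here is a rough sketch in Python. It is illustrative only: the plan has not yet chosen an implementation technology or a rule language, so the Rule structure, its any_of and none_of fields, the classify function and the sample categories are all assumptions made for this example, not part of any EXTRA design.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: assign `category` when at least one term in
    `any_of` appears in the document and no term in `none_of` appears."""
    category: str
    any_of: list[str]
    none_of: list[str] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    # Lowercased word tokens only; a real rule language would likely
    # also need phrases, proximity operators and language-aware stemming.
    return set(re.findall(r"\w+", text.lower()))


def classify(text: str, rules: list[Rule]) -> list[str]:
    """Return the categories whose rules match, in rule order."""
    tokens = tokenize(text)
    matched: list[str] = []
    for rule in rules:
        has_required = any(term in tokens for term in rule.any_of)
        has_excluded = any(term in tokens for term in rule.none_of)
        if has_required and not has_excluded and rule.category not in matched:
            matched.append(rule.category)
    return matched


# Hypothetical rule set; the category names are placeholders, not a real taxonomy.
rules = [
    Rule("sport", any_of=["football", "goal", "league"]),
    Rule("economy", any_of=["inflation", "market", "bank"], none_of=["river"]),
]

print(classify("The central bank raised rates as inflation climbed.", rules))
```

Keeping rules as plain data, separate from the matching code, would let the linguist maintain rule sets without touching the engine, and running classify over an annotated test corpus would give a direct way to compare its output with the expected taxonomy labels.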

Set up EXTRA as an open source project

Configure source code management for EXTRA on GitHub
Publish documentation - project overview, contribution guidelines
Draft preliminary list of requirements and features
Agree on and publish license for EXTRA
Secure Twitter handles, Launchpad account and domain names
Set up an EXTRA email list and an EXTRA Slack

Develop EXTRA software and rule sets

Publicize releases of EXTRA platform and rule sets
Solicit, prioritize and implement features and bug fixes for EXTRA
Write guidebook for how to integrate EXTRA platform
Write guidebook for how to develop and test EXTRA rule sets

First non-core open source contributor

First production deployment