IPTC EXTRA - The EXTraction Rules Apparatus

An Open Source Platform for Rules-Based Classification of News Content

High Level Plan for Initial Development and Rollout

We envision the following phases for the design and development of EXTRA. (The phases are listed in their likely order of execution, but may well overlap.)

Recruit EXTRA Team

  • Hire a developer experienced in natural language processing
  • Hire a linguist for writing the rules and testing the EXTRA platform
  • Gather a Steering Committee of news publishers from IPTC and beyond

Evaluate existing open source projects and frameworks

  • Survey other open source efforts to see whether any could accelerate the development of EXTRA
  • Candidates include GATE, UIMA, NLTK, OpenNLP and SRILM

Design and develop technical approach

  • Design high-level technical approach, select implementation technologies
  • Design EXTRA API for maintaining rule sets and classifying documents
  • Decide which two languages will be supported by the initial prototype
  • Assemble and annotate two test corpora, one for each language, with the desired taxonomy
  • Design the rule language and rule sets for applying the taxonomy to the two corpora
  • Develop a minimum viable rules engine
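
To make the last three steps concrete, here is a minimal sketch of what a rules engine and its evaluation against an annotated corpus might look like. Everything in it is an assumption made for illustration: the rule language has not been designed yet, so the sketch swaps in plain regular expressions, and the names Rule, RuleSet, classify and evaluate, the topic labels and the four-document corpus are all hypothetical placeholders rather than part of any agreed EXTRA design.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: assign `topic` when `pattern` matches the text."""
    topic: str
    pattern: str  # a regular expression stands in for the future rule language


class RuleSet:
    """Hypothetical container holding the rules for one language."""

    def __init__(self, language, rules):
        self.language = language
        # Compile each pattern once, case-insensitively.
        self._compiled = [
            (rule.topic, re.compile(rule.pattern, re.IGNORECASE))
            for rule in rules
        ]

    def classify(self, text):
        """Return the set of topics whose pattern matches anywhere in `text`."""
        return {topic for topic, regex in self._compiled if regex.search(text)}


def evaluate(rule_set, corpus):
    """Micro-averaged precision and recall over (text, expected_topics) pairs."""
    tp = fp = fn = 0
    for text, expected in corpus:
        predicted = rule_set.classify(text)
        tp += len(predicted & expected)   # topics correctly assigned
        fp += len(predicted - expected)   # topics assigned but not annotated
        fn += len(expected - predicted)   # annotated topics that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up rules and a tiny hand-annotated corpus, for illustration only.
rules_en = RuleSet("en", [
    Rule("sport", r"\b(football|tennis|olympic\w*)\b"),
    Rule("economy", r"\b(inflation|interest rates?|central bank)\b"),
])

corpus_en = [
    ("The central bank raised interest rates again.", {"economy"}),
    ("Olympic tennis final draws record crowd.", {"sport"}),
    ("Football club finances hit by inflation.", {"sport", "economy"}),
    ("Parliament debates new election law.", {"politics"}),
]

precision, recall = evaluate(rules_en, corpus_en)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy corpus every assigned topic is correct, but the "politics" document is missed because no rule covers it. That gap between precision and recall is the kind of signal the annotated corpora are meant to give the linguist writing the rules.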

Set up EXTRA as an open source project

  • Configure source code management for EXTRA on GitHub
  • Publish documentation - project overview, contribution guidelines
  • Draft preliminary list of requirements and features
  • Agree on and publish license for EXTRA
  • Secure Twitter handles, a Launchpad account and domain names
  • Set up an EXTRA email list and an EXTRA Slack

Develop EXTRA software and rule sets

  • Publicize releases of EXTRA platform and rule sets
  • Solicit, prioritize and implement features and bug fixes for EXTRA
  • Write a guidebook on integrating the EXTRA platform
  • Write a guidebook on developing and testing EXTRA rule sets

First non-core open source contributor

First production deployment