An Open Source Platform for Rules-Based Classification of News Content
The aim of the EXTRA project is to build and freely distribute an initial version of EXTRA, an open source rules-based classifier. In addition, we plan to create two rule sets for the EXTRA platform, both of which will also be free and open source. Finally, our goal is to put in place the core technical and marketing foundation to make EXTRA a sustainable open source project.
To provide intelligent search, high-quality topical aggregations, subject-specific alerts and content analytics, many modern news publishers tag content items with relevant topics. To achieve scale and consistency in tagging operations, publishers often employ rule-based tagging software. Such software assigns relevant topics by analyzing the text of a document using a set of human-authored rules. For example, to identify the topic “Kate Middleton (Person)”, a publisher might use the rule:
apply tag “Kate Middleton (Person)” if the document text contains any of the following phrases: “Duchess of Cambridge”, “Catherine Middleton”, “Catherine Elizabeth Middleton”
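As a rough illustration only, the rule above could be expressed in Python as follows. This is a minimal sketch and not the EXTRA rules language, which has yet to be specified: the dictionary layout and the names `RULE`, `rule_matches` and `tags_for` are hypothetical, and the choice to ignore letter case and collapse whitespace is an assumption made for the example.

```python
import re

# Hypothetical rule representation: a tag plus the phrases that trigger it.
RULE = {
    "tag": "Kate Middleton (Person)",
    "phrases": [
        "Duchess of Cambridge",
        "Catherine Middleton",
        "Catherine Elizabeth Middleton",
    ],
}


def rule_matches(rule, text):
    """Return True if the text contains any of the rule's phrases.

    Assumption: matching ignores letter case and treats any run of
    whitespace (including line breaks) as a single space.
    """
    normalized = re.sub(r"\s+", " ", text).casefold()
    return any(phrase.casefold() in normalized for phrase in rule["phrases"])


def tags_for(rule, text):
    """Return the rule's tag in a list if the rule fires, else an empty list."""
    return [rule["tag"]] if rule_matches(rule, text) else []
```

A production rule language would likely need more than substring tests (word boundaries, proximity operators, negation), which is part of what a formal specification would have to settle.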
News publishers have invested an enormous amount of manual effort to create, manage, and maintain sets of these kinds of rules. For example, over the last fifteen years, The New York Times metadata services team has created a rule set containing over half a million manually crafted rules.
Creating and deploying such rule sets requires significant investments in both costly software and specialized personnel. As such, only the largest publishers can afford to acquire and maintain such systems. A freely available open source rule-based information extraction and classification toolkit would, for the first time, put a powerful knowledge management tool into the hands of small-to-medium sized publishers and create a marketplace for the decades-long investment made by larger publishers in their rule sets.
For this reason, the International Press Telecommunications Council (IPTC) proposes to build EXTRA. EXTRA is a rules-based, open source, multilingual information extraction platform. Additionally, to make EXTRA immediately useful to the news publishing community, the IPTC further proposes to create two suites of rules for classifying news documents into the IPTC Media Topics Taxonomy, aimed at two of the languages supported by the Media Topics. Developed over many years by the IPTC and used by several leading news providers, the IPTC Media Topics is an industry-standard taxonomy for classifying news documents by subject. The Media Topics are available in English, French, Spanish and German.
To accomplish these goals, the IPTC proposes to hire both a software development contractor and a linguist. The software contractor will develop the EXTRA engine, a software component that takes as input EXTRA rules and a text document and produces as output a list of rules matched by the text document and their corresponding topics. In developing the rules engine, every effort will be made to identify and build upon existing open source components. The IPTC believes that Elasticsearch Percolator shows great potential to be one such open source component. Other open source frameworks that may be relevant are Apache’s UIMA and Sheffield’s GATE (the General Architecture for Text Engineering). Similarly, the IPTC will explore how to make EXTRA compatible with modern cloud architectures, to simplify the deployment of the system for small-to-medium sized publishers.
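The input/output contract described above (rules and a text document in, matched rules and their topics out) can be sketched in a few lines of Python. This is an illustrative assumption about the shape of the interface, not the planned implementation: the names `Rule`, `Match` and `run_engine` are hypothetical, the rules are simple phrase lists, and the sketch does not use Elasticsearch Percolator, UIMA or GATE.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: an identifier, a topic, and trigger phrases."""
    rule_id: str
    topic: str
    phrases: Tuple[str, ...]


@dataclass(frozen=True)
class Match:
    """One matched rule, its topic, and the first phrase that triggered it."""
    rule_id: str
    topic: str
    phrase: str


def run_engine(rules: List[Rule], document: str) -> List[Match]:
    """Return one Match per rule that fires on the document.

    Assumption: a rule fires when any of its phrases occurs in the
    document, compared case-insensitively.
    """
    text = document.casefold()
    matches = []
    for rule in rules:
        for phrase in rule.phrases:
            if phrase.casefold() in text:
                matches.append(Match(rule.rule_id, rule.topic, phrase))
                break  # report each rule at most once
    return matches
```

Scanning every rule against every document is adequate for a sketch but would not scale to rule sets of the size mentioned earlier; indexing the rules and querying them with each document is one reason a component such as a percolator is attractive.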
The software contractor will also develop and deliver a formal specification for the EXTRA rules language. The linguist will then, based on this formal specification, develop two collections of rules for classifying documents into IPTC Media Topics. All of these items (the rules engine, the language specification and the classification rules) will be openly developed on github.com and released under a permissive open source license.
More broadly, it is the IPTC’s hope that this project will catalyze a migration in the news publishing community away from expensive proprietary document classification systems and towards a common industry-wide open source platform. The IPTC further hopes that the broad adoption of a common rules-based document classification platform will create a marketplace for the many rule sets developed by news publishers over the last several years. Lastly, the IPTC believes that a freely available document classification platform will provide great benefit to small-to-medium sized publishers. The cost of existing document classification technology and the lack of freely available classification rule sets make it extremely difficult for all but the largest publishers to leverage this technology in their operations. As such, small-to-medium sized publishers face a challenge in providing their readership with the kinds of search and aggregation experiences typical of their larger peers.