An Open Source Platform for Rules-Based Classification of News Content
Taxonomies are used by many news organizations to classify their content. Classification is used in various ways, including to structure website navigation menus so that readers can locate relevant content, to organize editorial workflows and to enrich search. News organizations classify news items by automated means, by human effort, or often by a combination of the two.
Some organizations, such as The Associated Press, The New York Times, Thomson Reuters and the BBC, have built their own taxonomies to classify news. Many other news organizations rely on the IPTC’s Media Topics as the basis for classifying their content.
The IPTC taxonomies may be applied manually by journalists or archivists, often with quite inconsistent results. For example, this study discusses the manual classification of Spanish and Portuguese newspaper archives.
There have been various attempts to automatically classify news content using IPTC taxonomies. In general, this involves software that uses natural language processing, semantic analysis and statistical pattern matching techniques to analyze each news item and identify relevant metadata tags. Automated classification systems have been created for news in many languages, including Arabic, English, Catalan, Portuguese and Spanish.
Automated classification systems require regular maintenance to remain relevant and up to date. Most classification workflows allow manual overrides of the resulting classification, with problematic documents set aside for further review. New content must be incorporated into existing or new categories. Subject matter experts are needed to review problematic documents and to maintain the automated classification systems, so that results do not degrade over time.
There are many toolkits and software systems for statistical classification of text content, such as OpenNLP, SRILM and NLTK. However, none of these toolkits supports the rules-based approach we plan for EXTRA. One open source toolkit that may be useful for building EXTRA is the Elasticsearch Percolator, which is designed to efficiently match stored queries (rules) against documents (news items).
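To illustrate how the Percolator could express classification rules, the minimal sketch below stores one rule as an Elasticsearch query and then "percolates" a sample news item against it. It assumes a local Elasticsearch cluster and the 8.x official Python client; the index name extra-rules, the field names, the example rule and the topic label are hypothetical illustrations, not part of EXTRA itself.

```python
# Sketch: rules-based matching with the Elasticsearch Percolator.
# Assumes a local Elasticsearch 8.x cluster and the official Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# An index whose documents are stored *queries* (the classification rules).
# The "body" field describes the shape of the news items the rules run against.
es.indices.create(
    index="extra-rules",
    mappings={
        "properties": {
            "rule": {"type": "percolator"},
            "body": {"type": "text"},
        }
    },
)

# Store one rule: tag items that mention "interest rates" and "central bank".
# The topic value is an illustrative label, not a real Media Topics code.
es.index(
    index="extra-rules",
    id="monetary-policy",
    document={
        "topic": "economy/monetary-policy",
        "rule": {
            "bool": {
                "must": [
                    {"match_phrase": {"body": "interest rates"}},
                    {"match": {"body": "central bank"}},
                ]
            }
        },
    },
    refresh=True,
)

# Percolate a news item: Elasticsearch returns every stored rule it matches.
hits = es.search(
    index="extra-rules",
    query={
        "percolate": {
            "field": "rule",
            "document": {
                "body": "The central bank signalled that interest rates will rise."
            },
        }
    },
)

for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["topic"])
```

Because the rules live in an ordinary index in this approach, editors could add, change or remove them without redeploying code, which fits the ongoing maintenance workflow described above.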
Two open source frameworks that may offer useful ways to deploy EXTRA are GATE and UIMA. Both are designed to allow third-party components to be plugged into a standardized platform, making it easier for developers to discover and work with sophisticated linguistic analysis tools.