An Open Source Platform for Rules-Based Classification of News Content
For news publishers who are dissatisfied with either hand-tagging documents or statistical approaches to automated tagging, EXTRA is an open source, rules-based classification system for annotating news documents with high-quality subject tags, regardless of language. Such tags allow publishers to deliver a variety of valuable services, including content recommendations, improved advertising targeting, and subject-specific content streams such as alerts and topic pages.
Unlike hand-tagging, EXTRA’s rules-based system will allow publishers to tag their news content consistently, at speed and at scale. Unlike statistical approaches, which often require numerous annotated examples, it will allow publishers to adapt rapidly to breaking news and low-frequency topics. EXTRA’s finely tuned rules will avoid problems with ambiguous text (“Police Can’t Stop Gambling”) and will precisely distinguish between similar topics, a task that is more challenging for statistical approaches.
To facilitate adoption and consistency, the IPTC will also create EXTRA extraction rules for tagging documents in two languages with terms from its industry-standard Media Topics vocabulary.