Custom Document-Level Neural Categorization Over The News Using Web NGrams 3.0

Earlier this week we showed how to perform custom thematic categorization of news coverage using keyword searches of the new Web News NGrams 3.0 dataset. What about more advanced document categorization systems that rely on Naïve Bayes or, more commonly today, neural categorization models?

It turns out that a simple workflow allows the ngrams to yield strong results on existing fulltext categorization models right out of the box, by simply concatenating all of the snippets for a given article together. This will result in what amounts to gibberish to a human reader, since randomly ordered snippets from throughout the article are essentially mashed together, but topically it sufficiently resembles the article to enable document classification models to analyze the article and yield a categorization highly similar or identical to what it would yield on the original fulltext.

The workflow is as follows:

  • First, for each record, concatenate its pre, ngram and post fields with a space between them, then end the snippet with a comma and a space. Store each snippet, along with its separate pre and post fields, in a per-URL array keyed by decile. The trailing comma keeps unrelated words from running together and forming unintended phrases when the snippets are concatenated. Some models appear to yield better results with commas, but you can also try ending snippets with periods or carriage returns, depending on what works best with your model.
  • Once all records in the file have been processed, loop through each URL's snippets in decile order. For each snippet, check whether its pre or post field already appears in the text concatenated so far for that URL; if neither does, append the snippet. Remember that ngrams are computed by running a rolling window over the article, so the snippets overlap heavily. This check essentially skips ngrams that sit too close to ones already included, minimizing repeated text. Sorting the snippets by decile ensures that, even though the text is gibberish, the topical focus of its beginning, middle and end closely matches that of the original article.
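The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the demo script itself: it assumes each record has already been parsed into a dict, and the key names used here ("url", "pre", "ngram", "post" and "pos" for the decile) as well as the function name are assumptions that should be adjusted to match your data.

```python
from collections import defaultdict


def build_pseudo_documents(records, terminator=", "):
    """Rebuild one pseudo-document per URL from ngram records.

    Each record is assumed to be a dict with "url", "pre", "ngram", "post"
    and "pos" (decile) keys. These key names are assumptions for this sketch.
    """
    # Step 1: group snippets by URL, keeping pre and post separately
    # so they can be used for the overlap check later.
    by_url = defaultdict(list)
    for rec in records:
        parts = (rec["pre"], rec["ngram"], rec["post"])
        snippet = " ".join(p for p in parts if p)
        by_url[rec["url"]].append((rec["pos"], rec["pre"], rec["post"], snippet))

    # Step 2: walk each URL's snippets in decile order and skip any snippet
    # whose pre or post context already appears in the text built so far.
    docs = {}
    for url, snippets in by_url.items():
        snippets.sort(key=lambda s: s[0])  # stable sort by decile
        text = ""
        for _, pre, post, snippet in snippets:
            if (pre and pre in text) or (post and post in text):
                continue  # overlaps with a snippet that was already appended
            text += snippet + terminator
        docs[url] = text
    return docs
```

The terminator parameter makes it easy to try periods or newlines in place of commas, as suggested above.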

To showcase this, we've created a simple Perl script that does all of this. Download "demo-doccategorize.pl", make it executable and install the Perl modules and supporting tools it needs:

apt-get -y install pigz
apt-get -y install curl
apt-get -y install libjson-xs-perl

The script categorizes each article using Google's Natural Language API, a paid GCP service. You will need the Cloud SDK ("gcloud") installed on your computer and the NLP API enabled. Note that instead of the NLP API you could use any model, whether run locally with something like spaCy or cloud-hosted like AutoML Natural Language. In this case we use the gcloud interface to the API to keep the demo simple to understand:

gcloud ml language classify-text --content-file=./tmp.txt > categories.json

The script will automatically download the latest Web NGrams 3.0 and GDELT Article List files, categorize their articles and write the results to the "./RESULTS/" subdirectory. Remember that not all minutes have data, so you will typically see clusters of output files every 15 minutes, with gaps in between.

You can set this up to run in realtime each minute by following the cron instructions in our tutorial on keyword searching from earlier this week.

You can see the results of a few articles run through this pipeline below. The categorizations are seen at the end of each row.

{ "date": "2022-01-14T12:01:00.000Z", "url": "https://www.finanznachrichten.de/nachrichten-2022-01/54976830-rubicon-capital-advisors-rubicon-advises-elgin-energy-on-the-sale-of-the-uk-s-largest-portfolio-of-solar-pv-projects-004.htm", "domain": "finanznachrichten.de", "outletName": "FinanzNachrichten.de", "outletLogo": "https://fns1.de/g/favicon.ico", "outletTwitter": "", "title": "Rubicon Capital Advisors: Rubicon Advises Elgin Energy on the Sale of the UK's Largest Portfolio of Solar PV Projects", "image": "https://fns1.de/g/fb.png", "desc": "Rubicon Capital Advisors (\"Rubicon\"), one of the world's leading investment banking firms focused on the broader infrastructure, energy and digital sectors, announced its role as exclusive sell-side", "lang": "en", "author": "Business Wire" , "categories": "{\n \"categories\": [\n {\n \"confidence\": 0.95,\n \"name\": \"/Business & Industrial/Energy & Utilities/Renewable & Alternative Energy\"\n }\n ]\n}\n" }
{ "date": "2022-01-14T12:01:00.000Z", "url": "https://www.kpcnews.com/lifestyles/entertainment/article_255e9e61-a72a-59fb-ac59-6588ffe4d607.html", "domain": "kpcnews.com", "outletName": "KPCNews", "outletLogo": "https://www.kpcnews.com/content/tncms/site/icon.ico", "outletTwitter": "@kpcnews", "title": "Muse return with new single Won't Stand Down", "image": "https://bloximages.newyork1.vip.townnews.com/kpcnews.com/content/tncms/assets/v3/editorial/6/3b/63bf7d98-c5e9-5220-9f8e-982d75fe3df1/61e157b96f467.image.jpg?crop=1000%2C525%2C0%2C62&resize=1000%2C525&order=crop%2Cresize", "desc": "British rockers Muse are back with the new single and music video for, 'Wont Stand Down'.", "lang": "en", "author": "By Celebretainment" , "categories": "{\n \"categories\": [\n {\n \"confidence\": 0.94,\n \"name\": \"/Arts & Entertainment/Music & Audio/Rock Music\"\n }\n ]\n}\n" }
{ "date": "2022-01-14T12:01:00.000Z", "url": "https://www.stourbridgenews.co.uk/news/19847961.mark-cavendish-burglary-police-releases-images-men-want-speak/", "domain": "stourbridgenews.co.uk", "outletName": "Stourbridge News", "outletLogo": "https://www.stourbridgenews.co.uk/resources/icon/", "outletTwitter": "", "title": "Mark Cavendish burglary: Police releases images of men they want to speak to", "image": "https://www.stourbridgenews.co.uk/resources/images/13377340/", "desc": "A Louis Vuitton suitcase and two high-value watches were taken during the raid in Essex on November 27.", "lang": "en", "author": "" , "categories": "{\n \"categories\": [\n {\n \"confidence\": 0.67,\n \"name\": \"/Law & Government/Public Safety/Law Enforcement\"\n }\n ]\n}\n" }
{ "date": "2022-01-14T12:01:00.000Z", "url": "https://wxerfm.com/shows-ben-olson/", "domain": "wxerfm.com", "outletName": "104.5 & 96.1 The Point", "outletLogo": "https://media.socastsrm.com/wordpress/wp-content/blogs.dir/2284/files/2021/03/squareicon.png", "outletTwitter": "", "title": "Shows | Ben Olson", "image": "https://media.socastsrm.com/uploads/station/1958/fbShare.png?r=67830", "desc": "Weekdays, 5am - 10am; Saturdays, 10am - 2pm", "lang": "en", "author": "" , "categories": "{\n \"categories\": [\n {\n \"confidence\": 0.58,\n \"name\": \"/Arts & Entertainment/Music & Audio\"\n }\n ]\n}\n" }
{ "date": "2022-01-14T12:01:00.000Z", "url": "http://www.comunicati.net/comunicati/arte/musica/rock/753568.html", "domain": "comunicati.net", "outletName": "comunicati.net", "outletLogo": "", "outletTwitter": "", "title": "Benzene Market Size | Share, Growth, Demand, Industry Analysis, Key Player profile and Regional Outlook by 2030", "image": "", "desc": "Benzene is a toxic, flammable, and volatile hydrocarbon", "lang": "en", "author": "" , "categories": "{\n \"categories\": [\n {\n \"confidence\": 0.88,\n \"name\": \"/Business & Industrial/Chemicals Industry\"\n }\n ]\n}\n" }
{ "date": "2022-01-14T12:01:00.000Z", "url": "https://www.bozemandailychronicle.com/ap_news/international/ukraines-government-websites-targeted-in-a-hacking-attack/article_b4359288-47d0-57f0-b57e-3543b5b5c5b5.html", "domain": "bozemandailychronicle.com", "outletName": "Bozeman Daily Chronicle", "outletLogo": "https://www.bozemandailychronicle.com/content/tncms/site/icon.ico", "outletTwitter": "@bozchron", "title": "Ukraine's government websites targeted in a hacking attack", "image": "https://bloximages.chicago2.vip.townnews.com/bozemandailychronicle.com/content/tncms/assets/v3/editorial/6/f7/6f7ac277-47dc-58c6-8afb-a6b7b8e86f72/61e15f301d31b.image.jpg?crop=1763%2C926%2C0%2C124&resize=1200%2C630&order=crop%2Cresize", "desc": "KYIV, Ukraine (AP) — A number of government websites in Ukraine were temporarily down on Friday after a huge hacking attack, Ukrainian officials said.", "lang": "en", "author": "YURAS KARMANAU Associated Press" , "categories": "{\n \"categories\": [\n {\n \"confidence\": 0.65,\n \"name\": \"/News\"\n }\n ]\n}\n" }
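Note that in the records above the "categories" field is not a nested object but the classifier's JSON response stored as an escaped string, so it has to be decoded a second time. A minimal sketch of reading the strongest category from one output line might look like the following; the function name is hypothetical and the code assumes each line of the output file is a single JSON record shaped like those shown above.

```python
import json


def top_category(record_line):
    """Return (name, confidence) for the highest-confidence category, or None.

    Assumes record_line is one JSON record whose "categories" field holds
    the classifier's JSON response as a string, as in the samples above.
    """
    record = json.loads(record_line)
    # Decode the embedded JSON string; treat a missing or empty field as no categories.
    payload = json.loads(record.get("categories") or "{}")
    categories = payload.get("categories", [])
    if not categories:
        return None
    best = max(categories, key=lambda c: c["confidence"])
    return best["name"], best["confidence"]
```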

Even just skimming the titles and descriptions and comparing them with the NLP API's assigned categories shows that this workflow does extremely well at assigning document-level categories to each article. Most impressively, this is done entirely through the unordered ngram snippets: the model never has access to the original fulltext.

How do the results compare with those from the original fulltext?

When the NLP API is run over the original fulltext of "How the Supreme Court rules during a pandemic shows what may happen next" it yields the following categories:

{
  "categories": [
    {
      "confidence": 0.9,
      "name": "/News/Politics"
    },
    {
      "confidence": 0.66,
      "name": "/Law & Government/Government"
    },
    {
      "confidence": 0.55,
      "name": "/People & Society/Social Issues & Advocacy"
    }
  ]
}

Here are the results of the ngram-based category tags:

{
  "categories": [
    {
      "confidence": 0.77,
      "name": "/Law & Government/Legal"
    },
    {
      "confidence": 0.63,
      "name": "/News/Politics"
    },
    {
      "confidence": 0.6,
      "name": "/Law & Government/Government/Courts & Judiciary"
    }
  ]
}

The results are extremely similar, but the ngram-based categories arguably capture the gist of the article better than the fulltext ones. In place of "Social Issues & Advocacy" there is now "Courts & Judiciary" and in place of "Government" there is now "Legal". Both changes intuitively seem a closer fit for an article about Supreme Court rulings.

Similarly, when the NLP API is run over the original fulltext of "435-acre development planned in Rosemount on former site of U of M munitions plant" it yields the following categories:

{
  "categories": [
    {
      "confidence": 0.75,
      "name": "/Real Estate/Real Estate Listings"
    }
  ]
}

Here are the results of the ngram-based category tags:

{
  "categories": [
    {
      "confidence": 0.76,
      "name": "/Real Estate"
    },
    {
      "confidence": 0.59,
      "name": "/Business & Industrial/Construction & Maintenance"
    }
  ]
}

Once again, this intuitively reflects the article's focus better than the fulltext categories, since the article is not a real estate listing and its primary focus is the construction planned for the site.

When the NLP API is run over the original fulltext of "Volcano In Tonga's Hunga Ha'apai Island Erupts Again" it yields the following categories:

{
  "categories": [
    {
      "confidence": 0.73,
      "name": "/Sensitive Subjects"
    },
    {
      "confidence": 0.66,
      "name": "/Science/Earth Sciences/Geology"
    }
  ]
}

Here are the results of the ngram-based category tags:

{
  "categories": [
    {
      "confidence": 0.76,
      "name": "/Sensitive Subjects"
    },
    {
      "confidence": 0.56,
      "name": "/Science/Earth Sciences/Geology"
    }
  ]
}

In this case the results are nearly identical, with only modest changes in confidence levels.
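Comparisons like these can also be made programmatically. The sketch below is one possible approach and not part of the demo script: the function name and the shape of its return value are assumptions, and it simply takes two responses in the format shown above and reports which category names are shared, which appear on only one side, and how the confidence shifted for the shared ones.

```python
def compare_categories(fulltext, ngram):
    """Compare two classifier responses by category name.

    Both arguments are dicts in the {"categories": [...]} format shown above.
    """
    full = {c["name"]: c["confidence"] for c in fulltext["categories"]}
    ngr = {c["name"]: c["confidence"] for c in ngram["categories"]}
    shared = sorted(full.keys() & ngr.keys())
    return {
        "shared": shared,
        "fulltext_only": sorted(full.keys() - ngr.keys()),
        "ngram_only": sorted(ngr.keys() - full.keys()),
        # Positive values mean the ngram-based run was more confident.
        "confidence_shift": {n: round(ngr[n] - full[n], 2) for n in shared},
    }
```

Run on the volcano example above, both categories would land in "shared" with small confidence shifts, while the other examples would show entries under "fulltext_only" and "ngram_only".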

When the NLP API is run over the original fulltext of "Telehouse expands in Southeast Asia with data centre in Thailand" it yields the following categories:

{
  "categories": [
    {
      "confidence": 0.75,
      "name": "/Internet & Telecom"
    },
    {
      "confidence": 0.54,
      "name": "/Business & Industrial"
    },
    {
      "confidence": 0.51,
      "name": "/Computers & Electronics"
    }
  ]
}

Here are the results of the ngram-based category tags:

{
  "categories": [
    {
      "confidence": 0.51,
      "name": "/Internet & Telecom/Service Providers"
    }
  ]
}

This shows the most dramatic change between fulltext and ngram categories of the four example articles, but once again, the ngram labels intuitively appear to better capture the overall gist of the article than the original fulltext labels. After all, the article is not about industrial applications or electronics, but rather about a new telecom service provider facility.

From these four examples, it is clear that this workflow, despite its simplicity, allows neural document classification models to be applied robustly to news coverage using just the ngrams dataset, yielding results that are highly similar to, and in many cases better than, those obtained from the original fulltext of each article. Indeed, a production application could likely achieve even better results by fine-tuning each model to work specifically with the ngram text, but as the examples above show, you can achieve nearly identical results simply by taking your existing document classification models and applying them to the ngrams data!