Using unsupervised NLP topic modelling and clustering to build a machine learning classifier that can identify misinformation in short-text documents

image by author


The internet — and social media, especially — is rife with political misinformation. One of the main challenges in fighting misinformation is the lack of labelled datasets, especially in low-resource languages. This project presents a unique hybrid approach to topic modelling and applies it to a dataset of 36+ million Arabic tweets, tweeted by users that have been flagged by Twitter as part of state-linked information operations. The hybrid topic clustering model is able to successfully extract the political content. …

A comparative analysis of two NLP topic modelling approaches for short-text documents, using Arabic Twitter data

image under license to author via iStock

In this article, I present a comparative analysis of two topic modelling approaches as applied to short-text documents, such as tweets: Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM). I explain the main differences in the algorithms, provide intuitions about how they operate under the hood, explain the pre-processing requirements for each, and evaluate their comparative performance on clustering varying amounts of short-text documents.

This post is an offshoot of a larger project [LINK] in which I used topic modelling and clustering to identify political misinformation in a dataset of 36 million Arabic-language tweets.
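The main modelling difference between the two approaches is that LDA treats each document as a mixture of topics, while GSDMM assumes each short document belongs to exactly one topic. The block below is a minimal pure-Python sketch of the GSDMM collapsed Gibbs sampler under that assumption. It is not the code used in the article: the function name `gsdmm`, the hyperparameter defaults (`K`, `alpha`, `beta`, `iters`) and the toy documents are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def gsdmm(docs, K=4, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Assign each tokenised document to exactly one of at most K clusters."""
    rng = random.Random(seed)
    D = len(docs)
    V = len({w for d in docs for w in d})   # vocabulary size
    m = [0] * K                             # documents per cluster
    n = [0] * K                             # total word count per cluster
    nw = [Counter() for _ in range(K)]      # per-cluster word counts
    labels = []

    # Random initial assignment (hypothetical choice; any start works).
    for d in docs:
        z = rng.randrange(K)
        labels.append(z)
        m[z] += 1
        n[z] += len(d)
        nw[z].update(d)

    for _ in range(iters):
        for i, d in enumerate(docs):
            # Take the document out of its current cluster.
            z = labels[i]
            m[z] -= 1
            n[z] -= len(d)
            nw[z].subtract(d)

            # Unnormalised probability of each cluster for this document.
            weights = []
            for k in range(K):
                p = (m[k] + alpha) / (D - 1 + K * alpha)
                for w, c in Counter(d).items():
                    for j in range(c):
                        p *= nw[k][w] + beta + j
                for j in range(len(d)):
                    p /= n[k] + V * beta + j
                weights.append(p)

            # Sample a single cluster: one topic per document.
            z = rng.choices(range(K), weights=weights)[0]
            labels[i] = z
            m[z] += 1
            n[z] += len(d)
            nw[z].update(d)
    return labels


# Toy, already-tokenised "tweets" (invented for this example).
docs = [
    ["election", "vote", "fraud"],
    ["vote", "ballot", "election"],
    ["football", "goal", "match"],
    ["match", "goal", "league"],
]
labels = gsdmm(docs, K=4)
print(labels)
```

`K` is only an upper bound here: clusters that end up empty simply go unused. The direct product of probabilities is fine for tweet-length documents; longer texts would need log-space arithmetic to avoid underflow.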



How to create delayed objects around functions you only want to run once per worker

Photo by Joshua Sortino on Unsplash


once_per_worker is a utility to create dask.delayed objects around functions that you only want to ever run once per distributed worker. This is useful when you have some large data baked into your docker image and need to use that data as auxiliary input to another dask operation (df.map_partitions, for example). Rather than transfer the serialised data between workers in the cluster — which will be slow because of the size of the data — once_per_worker allows you to call the parsing function once per worker, then use the same parsed object downstream.
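To illustrate the idea of parsing once and reusing the result, here is a minimal single-process sketch. It does not use dask or the `once_per_worker` package and does not reflect that package's API; it only mimics the "parse once, reuse many times" behaviour with a standard-library cache. The names `load_lookup`, `process_partition` and the toy data are hypothetical.

```python
from functools import lru_cache

# Hypothetical counter so we can see how often the expensive parse runs.
PARSE_CALLS = 0


@lru_cache(maxsize=None)
def load_lookup():
    """Stand-in for an expensive parse of data baked into the image."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return {"ar": "Arabic", "en": "English"}


def process_partition(rows):
    """Stand-in for the function you would pass to something like map_partitions."""
    lookup = load_lookup()  # parsed on the first call in this process, cached afterwards
    return [lookup.get(code, "unknown") for code in rows]


# Three "partitions" processed in the same process share one parsed object.
partitions = [["ar", "en"], ["en", "fr"], ["ar"]]
results = [process_partition(p) for p in partitions]
print(results, PARSE_CALLS)
```

On a real cluster each worker is a separate process, so a cache like this would be filled once per worker rather than once overall, which is the behaviour the utility is described as providing.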

See use case below.

Use Case

For my Arabic-Language…

Pre-processing Arabic text for machine-learning using the camel-tools Python package

image under license to Richard Pelgrim

In this article, I provide a concise and to-the-point overview of the challenges of working with Arabic text in NLP projects…and the tools available to overcome them. I rely heavily on the camel-tools Python package developed at the NYU Abu Dhabi CAMeL Lab and this excellent webinar by its director, Dr. Nizar Habash. Big shout-out to them for doing groundbreaking work in the field and making their tools accessible to the public!


Working with Arabic text in NLP projects presents (at least) 5 unique challenges:

  1. The form of characters and spelling of words can vary depending on their context (fancy…

It’s time for my second and final capstone project, with which I’ll be completing my Springboard Data Science Career Track. Cat believe I’m almost at the end of this thing already; these 6 months have paw-n by.

For my final project, I’m setting myself some technical challenges — things I want to learn that go beyond the curriculum. Specifically, I want to:

  1. Use distributed processing to work with larger-than-memory datasets hosted in the cloud,
  2. Work with Arabic Natural Language Processing, and
  3. Wrap my brain cells around deep-learning networks using TensorFlow, which I’ll need for the Arabic NLP.

The Dataset

Large Arabic…

Data for Change

Building a machine learning model to predict the intensity of conflicts using a century of climate change data

Image via iStock under license to Richard Pelgrim

This story is part of a linked series tracking my progress through my first independent data science project. Find the previous post here and Jupyter Notebooks here.


Climate change is leading to increased political tensions and, some researchers speculate, is therefore driving increased armed conflict across the world. This project attempts to build a machine learning model to predict conflict intensity (measured as number of deaths per day) in India based on available precipitation and temperature data from the surrounding area (< 300 km). The project concludes that it is not possible to accurately predict conflict intensity using local climate data.

This story is part of a linked series tracking my progress through my first independent data science project. Find the previous post here, next post here, and Jupyter Notebooks here.

Last week, I officially hit the 50% mark on my Springboard Data Science Career Track curriculum. That means I’ve put in (at least) 300 hours of work so far.

So…what do you have to show for it, Richard?!

Is this where I show off my fancy coding skills and rave on about all of the technical lingo I’ve mastered to prove to you that I really am one bad-ass panda-wrangling…

This story is part of a linked series documenting my progress through my first independent data science project. Find the previous post here and Jupyter Notebooks here.

I’ve got some first results to show! Very early days (I’m still mostly wrangling my datasets into shape), but I’ve got some maps; and as we all know, where there are maps there’s a good chance there will be………………

…dots. LOTS of dots.

Three-hundred-and-forty-thousand-four-hundred-and-sixty-nine dots, to be exact. The blue dots are all the GHCN weather stations; the orange ones are the UCDP conflict incidents. Even just eyeballing this, it’s clear that some geographical…

Yes, it’s true. I wrangle with Pandas. On the daily.

Except my Pandas are purely digital, imported into my digital wrangling environment with a simple line of code. I don’t even break a sweat.

And to make sure I really don’t exert myself too much here, I even chop a six-letter word into a two-letter abbreviation. That’s how lazy (accomplished!) a wrangler I have become in the span of just a few weeks.

But what about all that cute, fuzzy fur, I hear you ask? What’s the point of wrangling pandas if you can’t bury your face into that…

This week I’ll be starting work on my first independent data science project. After quite a few rabbit-hole sessions across the internet, I’ve finally settled on a topic: I’ll be exploring the correlations between incidents of armed conflict and measures of climate change.

For some background, here’s a short video describing a Stanford study on the topic, published last year. You can find links to an article about the study and the study itself at the end of this post.

Disclaimer: just want to put out there that I’m neither an expert on climate change nor on…

Richard Pelgrim

Data Scientist & Communicator | M.Sc. Human Geography & Planning
