It’s time for my second and final capstone project, with which I’ll be completing my Springboard Data Science Career Track. Cat believe I’m almost at the end of this thing already; these 6 months have paw-n by.
For my final project, I’m setting myself some technical challenges — things I want to learn that go beyond the curriculum. Specifically, I want to:
- Use distributed processing to work with larger-than-memory datasets hosted in the cloud,
- Work with Arabic Natural Language Processing, and
- Wrap my brain cells around building deep-learning networks in TensorFlow, which the Arabic NLP work will require.
Large Arabic NLP datasets are not easy to come by, so it took some digging before I found what I was looking for.
In the end, I discovered Twitter’s Information Operations archive: a collection of accounts and tweets that Twitter has removed because they are considered to be part of state-linked ‘information operations’ (read: propaganda). I decided to focus on a subset of this dataset containing accounts and tweets linked to information operations promoting Saudi Arabian policies, published by Twitter in April 2020.
The dataset contains more than 35 million tweets (25+ GB) from ca. 4,500 accounts, around 95% of which are in Arabic. It’s important to note that it’s the accounts that have been identified as (potentially) being part of state-linked information operations, not necessarily the tweets themselves.
The Research Questions
Remember this one?
If I learned anything from my first capstone project, it is this: define your research question before diving into your data…and be specific.
So here goes:
- Given that the accounts, and not necessarily the Tweets, have been identified as compromised, can we identify just the politically motivated ‘misinformation’ Tweets using unsupervised topic modelling?
- Using sentiment analysis, can we identify whether these politically motivated misinformation Tweets are predominantly positively or negatively framed?
- Finally, using the labels generated by the topic modelling and the sentiment analysis as additional features, can we build a classifier that will identify Arabic-language ‘misinformation’ Tweets?
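To make the first question a little more concrete: here’s a minimal sketch of what unsupervised topic modelling looks like, using scikit-learn’s TF-IDF + NMF as a stand-in. The four-document English toy corpus and the choice of NMF are my own assumptions for illustration; the actual project runs on millions of Arabic Tweets and may well use a different model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy English stand-in corpus: two 'political' docs, two 'football' docs.
docs = [
    "the government announced a new economic policy today",
    "great match last night, what a goal by the striker",
    "new sanctions and foreign policy moves by the government",
    "the striker scored twice in the football match",
]

# TF-IDF features, factorised into 2 latent 'topics' with NMF.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)  # shape: (n_docs, n_topics)

# Each document's dominant topic becomes its unsupervised label.
labels = doc_topics.argmax(axis=1)
print(labels)
```

On this corpus the two political documents end up sharing one topic and the two football documents the other — exactly the kind of grouping that, at scale, should let the politically themed Tweets be separated out.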
More Research Questions
Another thing I learned from the first project is to keep the scope of the project manageable and realistic. So I’m going to try really, really hard to leave it at just the three questions listed above.
But because we both know I can’t actually rein in my curiosity very well (at all!)…
…I’m gonna go ahead and also jot down some ‘bonus’ research questions that I’ll let myself get around to once I’ve completed the first three.
- Bonus I: Using network analysis, can we identify clusters of users (re)Tweeting similar content and views (topics + sentiments)?
- Bonus II: Using these clusters, can we then identify whether Tweets from different accounts are actually authored by the same agent / originating from the same source?
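For Bonus I, the basic idea can be sketched with NetworkX: treat (re)Tweet relationships as edges in a graph and look for groups of connected accounts. The edge list below is entirely hypothetical, and connected components are only the crudest possible notion of a ‘cluster’ — real work would weight edges by shared topics and sentiments and use a proper community-detection algorithm.

```python
import networkx as nx

# Hypothetical retweet edges: (retweeter, original_author).
retweets = [
    ("user_a", "user_b"), ("user_b", "user_a"), ("user_c", "user_a"),
    ("user_x", "user_y"), ("user_y", "user_x"),
]

G = nx.Graph()
G.add_edges_from(retweets)

# Connected components as a first, crude cut at clusters of accounts.
clusters = [sorted(c) for c in nx.connected_components(G)]
print(clusters)
```

Here the five accounts fall into two clusters (one of three accounts, one of two) — the sort of structure Bonus II would then probe for signs of a single agent behind multiple accounts.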
To make this project come together, I’ll be diving into working with some new tools.
I recently discovered Coiled Cloud, a Python library paired with a slickly designed, intuitive interface that lets you spin up and manage Dask clusters on AWS in less than 3 minutes. Their public beta is free, so I’ve been having some fun spinning up 50-worker clusters and processing my 35+ million Tweets (25+ GB) in no time. If you regularly work with larger-than-memory datasets, then stay tuned — I’ve got a how-to-get-started article on this coming out soon, too!
There’s lots of exciting work happening in the field of Arabic NLP (as in other low-resource languages). One of the most promising developments is the release, just over a year ago, of AraBERT by the MIND Lab at the American University of Beirut. I’ll be using this model to analyse the Arabic Tweets — my first foray into working with deep-learning networks in TensorFlow. May the BERT be with me! And if you have experience working with AraBERT and want to compare notes, please do get in touch — I’d love to hear from you!