Politics on Twitter
Politics is being fought, sometimes viciously, on social media, especially Twitter. Politicians are trying to reach as many people as possible, and they take different approaches: some concentrate on a particular agenda, while others spread a mostly negative message.
Estimating the number of retweets and likes a tweet will get is an extremely difficult, probably impossible, task. However, even a rough estimate can help us understand what is guiding voters and how to get their attention.
What makes voters tick? Is it a particular policy agenda (e.g. immigration, education)? Are they more receptive over the weekend or during the week? Do they respond more to negative or positive messages? These are some of the questions we will try to answer by building a machine learning model to estimate both retweets and likes.
Luckily, several political tweet datasets exist; the one used in this project can be found here. It has approximately 1,500,000 tweets from Senators and Representatives, with fields including the tweet's text, tokenized text, and policy area, among others.
A little preprocessing was needed to obtain all the information for the model. For this step, the Twitter API was used to obtain the number of hashtags and mentions, the presence of media, and the number of likes and retweets. Luckily, the data had a tweet ID field, which made the process much easier. Another feature was a sentiment analysis score for each tweet; more on this in the following section.
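As a rough illustration of this step, the sketch below pulls those fields out of a single tweet payload. It assumes the tweet arrives as a dictionary laid out like the v1.1 API tweet object (`entities.hashtags`, `entities.user_mentions`, `entities.media`, `favorite_count`, `retweet_count`); the API version actually used in the project is not stated, and the function name and example payload are hypothetical.

```python
def extract_tweet_features(tweet):
    """Pull engagement counts and simple content features from one tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "n_hashtags": len(entities.get("hashtags", [])),
        "n_mentions": len(entities.get("user_mentions", [])),
        # The "media" key is only present when the tweet has attached media.
        "has_media": int(bool(entities.get("media"))),
        "likes": tweet.get("favorite_count", 0),
        "retweets": tweet.get("retweet_count", 0),
    }


# Hand-written payload for illustration only, not real data.
example_tweet = {
    "id": 1,
    "favorite_count": 12,
    "retweet_count": 3,
    "entities": {
        "hashtags": [{"text": "education"}],
        "user_mentions": [{"screen_name": "a"}, {"screen_name": "b"}],
    },
}
features = extract_tweet_features(example_tweet)
```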
This is an example of the final data fields available for each tweet.
Before jumping to the main model, we first have to find a sentiment analysis model to score our tweets. Due to the sheer size of the dataset, training a model on our own data was infeasible within our timeline, so we needed a pre-trained one. In this case we used Flair, from Zalando Research, which offers a pre-trained model that classifies a sentence as positive or negative and gives a confidence score. The confidence score was used as a proxy for how negative or positive a tweet was. This might not be ideal, but it was what worked best in this case. If you are interested in a more detailed approach, you could use some of the models here.
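A minimal sketch of how such a score could be produced with Flair follows. Turning the label and confidence into a single signed number in [-1, 1] is an assumption about how the proxy was built, and the helper names `signed_sentiment` and `score_tweet` are hypothetical.

```python
def signed_sentiment(label, confidence):
    """Map a (label, confidence) pair to a score in [-1, 1].

    Assumption: positive tweets get +confidence, negative ones -confidence.
    """
    if label == "POSITIVE":
        return confidence
    if label == "NEGATIVE":
        return -confidence
    raise ValueError(f"unexpected label: {label!r}")


def score_tweet(text, classifier):
    """Score one tweet with an already loaded Flair text classifier."""
    # Imported lazily so signed_sentiment can be used without Flair installed.
    from flair.data import Sentence

    sentence = Sentence(text)
    classifier.predict(sentence)
    top = sentence.labels[0]  # predicted label with its confidence
    return signed_sentiment(top.value, top.score)
```

With Flair installed, `TextClassifier.load("en-sentiment")` from `flair.models` loads the pre-trained sentiment model, and `score_tweet(text, classifier)` then returns the signed score for a tweet.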
The idea behind the model was to treat the problem as a Multi-Target Regression (MTR) because of the interdependency of retweets and likes.
First, we have to decide what type of regressor to use. Here we implement an algorithm based on Deep Regressor Stacking (DRS), which can be found here. If you are interested in MTR, this article proposes an improved algorithm based on the same idea.
The idea behind DRS is to use several layers, each of which estimates all the target variables (in this case retweets and likes) using any kind of regressor (here we used a Random Forest, RF), and to feed each layer's predictions as features to the next layer. In this way, we hope to exploit the relationship between likes and retweets.
The algorithm follows these steps. First, use DRS to obtain predictions for all targets. Next, find the layer that gives the best prediction for the current target, add that prediction to the features, and save the DRS models up to that layer. Then repeat for the remaining targets.
Here is some code for the DRS implementation; you can play around with the type of regressor or its parameters:
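The layered idea can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the project's exact implementation: it uses scikit-learn's `RandomForestRegressor`, builds each layer's training features from out-of-fold predictions, and passes only the previous layer's predictions forward. The names `make_rf`, `fit_drs`, `predict_drs`, and `best_layer` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict


def make_rf():
    # Any scikit-learn regressor could be swapped in here.
    return RandomForestRegressor(n_estimators=20, random_state=0)


def fit_drs(X, Y, n_layers=2, make_regressor=make_rf, cv=3):
    """Fit n_layers layers; each layer holds one regressor per target."""
    layers = []
    X_aug = X
    for _ in range(n_layers):
        models, oof_preds = [], []
        for t in range(Y.shape[1]):
            # Out-of-fold predictions keep the next layer from seeing
            # predictions made on rows the model was trained on.
            oof_preds.append(cross_val_predict(make_regressor(), X_aug, Y[:, t], cv=cv))
            models.append(make_regressor().fit(X_aug, Y[:, t]))
        layers.append(models)
        # Assumption: the next layer sees the original features plus
        # only the previous layer's predictions for all targets.
        X_aug = np.column_stack([X] + oof_preds)
    return layers


def predict_drs(layers, X):
    """Return one (n_samples, n_targets) prediction array per layer."""
    per_layer, X_aug = [], X
    for models in layers:
        preds = np.column_stack([m.predict(X_aug) for m in models])
        per_layer.append(preds)
        X_aug = np.column_stack([X, preds])
    return per_layer


def best_layer(per_layer_preds, Y_true, target):
    """Index of the layer with the lowest RMSE for one target, plus all RMSEs."""
    rmse = [float(np.sqrt(np.mean((p[:, target] - Y_true[:, target]) ** 2)))
            for p in per_layer_preds]
    return int(np.argmin(rmse)), rmse


# Synthetic stand-in for (retweets, likes); not the project's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
retweets = X[:, 0] + 0.1 * rng.normal(size=120)
likes = 3.0 * retweets + X[:, 1] + 0.1 * rng.normal(size=120)
Y = np.column_stack([retweets, likes])
X_train, X_val, Y_train, Y_val = X[:90], X[90:], Y[:90], Y[90:]

layers = fit_drs(X_train, Y_train)
val_preds = predict_drs(layers, X_val)
layer_idx, rmse_per_layer = best_layer(val_preds, Y_val, target=0)
```

In the full procedure described above, the prediction from the selected layer for the first target would be appended to the input features before repeating the process for the next target.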
As a baseline for our results, we use the performance of a Random Forest regressor with the same parameters as in the DRS model.
Voilà! Our approach worked. Once the final prediction of the first target (always retweets, because the counts were lower and therefore so was the RMSE) is included in the initial input features, the second target gets a boost in performance. The gain might not be jaw-dropping, but it is still an improvement.
Let's see if the sentiment analysis score was of any use.
The sentiment score calculated with Flair's pre-trained model was the second most important feature (after followers count, which was expected), which is notable given how rudimentary this score was.
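A ranking like this can be read from a fitted Random Forest's `feature_importances_` attribute. The sketch below is illustrative only: the feature names and the synthetic data are hypothetical and do not reproduce the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rank_features(model, feature_names):
    """Return (name, importance) pairs sorted from most to least important."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(feature_names[i], float(importances[i])) for i in order]


# Synthetic data: the target is driven mostly by the first column.
rng = np.random.default_rng(0)
feature_names = ["followers_count", "sentiment_score", "n_hashtags"]
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
ranking = rank_features(model, feature_names)
```

Impurity-based importances like these sum to one across features, so they are best read as relative shares rather than absolute effects.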
Overall, it was better to tweet on Saturday or Sunday. The policy areas that translated into higher engagement were civil rights, immigration, and international affairs, while including mentions or hashtags had a negative effect. Regarding differences between Democrats and Republicans, the most striking one is that more positive tweets translated into more retweets in the Republican dataset, while the opposite happened in the Democrat dataset (maybe who was president during 2017–2019 had something to do with this…). Another difference is that tweets about health and the environment increased popularity for Democrats, while the opposite happened for Republicans.
There are several directions to follow. One is to take a similar or refined approach to tweets by political figures other than elected representatives, like here, e.g. commentators, and see how they cater to their followers; one would expect a much less restrained and sometimes callous approach.
Another interesting direction would be to build a stronger set of NLP features, for example classifying tweets by emotion (anger, fear, disgust, etc.) instead of just negative and positive. This could shed some light on how different sets of political figures pander to their followers.