This is the story of four ADAnauts who, armed only with their laptops, embarked on a long and ambitious project, aiming to shed light on what it takes to be cited by other people.
What data we have got
- Quotebank: A dataset collecting 178 million unique quotes extracted from 196 million English news articles, published between January 2015 and April 2020. For each quote, the dataset provides the text of the quote, the number of times it was encountered in the news, who is being quoted, and more.
- Wikidata: Have you ever heard of Wikipedia? Wikidata is like Wikipedia, but for computers: its data is structured so that it can be retrieved easily, without reading and understanding long paragraphs of text.
What we could have done
(Almost) anything! The newspaper data spans the years 2015-2020, so we could have studied specific events such as the 2016 US presidential election, looked for correlations between specific quotes and market volatility, examined which newspapers cite whom most often, …
What we chose to do
We were curious to see whether it is possible to predict a quote's virality from information about the quote itself and its speaker. Naturally, this is a very difficult challenge: it involves determining which parameters may have an impact, extracting them, training machine learning models to make the prediction, and more.
This project involves multiple stages, each with its own choices and challenges, which we will present to you one by one in the following sections.
Why?
Answering this research question would allow us to build guidelines on which speaker characteristics people react to most, as well as which topics, which sentiment (aggressive or positive), which quote length, and more. And even if the models do not work as we expect them to, there is still a lot of interesting information to extract from failures: such is the essence of Computer Science!
Enjoy the read!