Would you like to know how to extract the main emotions from any text and visualize its emotional trajectory using the R programming language?
And learn this in a fun way?
In this new tutorial, we will implement a simple but efficient method, lexicon-based sentiment analysis, which is commonly used to analyse consumer feedback, social media posts or survey responses. But, to make it more fun, we will instead analyse the novel Nineteen Eighty-Four by George Orwell.
If you prefer to watch this tutorial, it is also available as a video on my YouTube channel.
The analysis in this tutorial is strongly inspired by the chapter “Sentiment analysis with tidy data” from the excellent book Text Mining with R by Julia Silge and David Robinson. Have a look at it if you want to dig deeper.
If you want to get the code of this tutorial, you can join my newsletter on felixanalytix.com. Once you have subscribed, you will receive an automatic email from me with a link to my GitHub account, where you can download the code.
How to read a text dataset in R
First we will install and attach the R packages tidytext for text mining and textdata to get sentiment lexicons, as well as the tidyverse for general data transformation and visualization.
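As a minimal setup sketch, the install line is commented out because it only needs to run once:

```r
# Run once if the packages are not installed yet:
# install.packages(c("tidyverse", "tidytext", "textdata"))

library(tidyverse) # readr, dplyr, stringr, ggplot2, ...
library(tidytext)  # unnest_tokens(), get_sentiments()
library(textdata)  # fetches the AFINN and NRC lexicons on first use
```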
The full text of the novel 1984 by George Orwell is accessible on the Gutenberg Australia website at the following URL. Using the function read_lines from the readr R package, which is loaded with the tidyverse, we can easily import the text into R. We will use the arguments of the function to skip the empty rows, remove the metadata of the book by starting at line 38, and remove the appendix of the book, which starts at line 8500.
Calling head on “text_raw”, you can see the first sentences of the novel in the R console.
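Here is a small sketch of the same idea on a toy file, so that it runs without an internet connection. The file content and the name book_lines are made up for the illustration. The appendix is cut after reading here, which is one option among several (the n_max argument of read_lines is another).

```r
library(tidyverse)

# Toy stand-in for the downloaded book:
# two metadata lines, the text itself, then an appendix.
toy_file <- tempfile(fileext = ".txt")
write_lines(
  c("Title: A toy book", "Author: Nobody",
    "", "Chapter 1", "", "It was a bright cold day.",
    "The clocks were striking thirteen.", "",
    "APPENDIX", "Notes we do not want to analyse."),
  toy_file
)

book_lines <- read_lines(
  toy_file,
  skip = 2,               # drop the two metadata lines at the top
  skip_empty_rows = TRUE  # drop blank lines
)

# Cut everything from the appendix onwards, then store as a data frame
appendix_start <- which(book_lines == "APPENDIX")
text_raw <- tibble(text = book_lines[seq_len(appendix_start - 1)])

head(text_raw)
```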
Creation of new text variables
As we plan to give an emotion or sentiment score by chapter, we will also add a new chapter variable to text_raw. We will also create an index for every 50 lines of the book, so that we can analyse the emotions within each chapter in more detail.
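One possible way to build these two variables is sketched below on toy data. It assumes chapter headings look like "Chapter 1" (the real headings of the book may differ), and it uses blocks of 2 lines instead of 50 so that the toy example shows several blocks.

```r
library(tidyverse)

# Toy stand-in for text_raw; the "Chapter N" heading format is an assumption
text_raw <- tibble(text = c(
  "Chapter 1", "It was a bright cold day.", "The clocks were striking.",
  "Chapter 2", "The door opened.", "Nobody spoke."
))

block_size <- 2 # the tutorial uses 50 lines per block

text_raw <- text_raw %>%
  mutate(
    linenumber = row_number(),
    # every heading line increments the running chapter counter
    chapter = cumsum(str_detect(text, regex("^chapter \\d+", ignore_case = TRUE))),
    # integer division groups consecutive lines into blocks
    index = linenumber %/% block_size
  )

text_raw
```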
How to get sentiment lexicons in R
We can easily load the three main sentiment lexicons to analyse the emotions in the novel.
The BING lexicon classifies words in a binary way, as either “positive” or “negative”. The AFINN lexicon gives each word a graded score from -5 (most negative) to +5 (most positive). Finally, the NRC lexicon sorts words into emotion categories such as joy, sadness and anger, in addition to labelling words as positive or negative.
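As a rough sketch, the lexicons can be loaded with get_sentiments from tidytext. Only the BING call is run here, because that lexicon ships with tidytext itself. AFINN and NRC are fetched through textdata, which asks you to accept each lexicon's license the first time.

```r
library(tidytext)

bing <- get_sentiments("bing") # columns: word, sentiment
head(bing)

# These two need the textdata package and an interactive confirmation
# the first time they are called:
# afinn <- get_sentiments("afinn") # columns: word, value
# nrc   <- get_sentiments("nrc")   # columns: word, sentiment
```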
Limitations when using a sentiment lexicon
I know what you are probably thinking: using these lexicons to infer the emotions of a text is, at the very best, a dangerous simplification. And you would be right!
Here is a quick example to show you an important limitation of these lexicons. As you can see, the AFINN and BING lexicons classify both texts as positive, while one of them is obviously negative.
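To make this concrete, here is a sketch with two made-up sentences (not necessarily the ones used in the original example), scored with the BING lexicon only:

```r
library(tidyverse)
library(tidytext)

# Two made-up sentences: the second one negates the first
texts <- tibble(
  id = c("plain", "negated"),
  text = c("I am happy", "I am not happy")
)

negation_check <- texts %>%
  unnest_tokens(word, text) %>%                       # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>% # keep lexicon words only
  count(id, sentiment)

negation_check
```

Both sentences end up with one positive word, because "happy" is matched on its own and "not" is simply ignored.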
If these lexicons don’t even catch negations in texts, you can imagine all the subtleties of language they miss when detecting emotions, such as irony or sarcasm. And if you have read the novel, you know that a mention of the “Ministry of Love”, for example, should not actually be classified as something positive.
However, as we will see, these sentiment lexicons still give us a fairly accurate overview of the main emotions of the novel, and even a surprisingly good visualization of the general emotional trajectory of the book.
How to tokenize a text using R
Now we will tokenize our text data frame and join it with the lexicons.
Tokenization restructures the data frame so that each row holds a single word. This transformation is done by the unnest_tokens function of the tidytext package. The inner_join function then binds the words of the text to a lexicon while keeping only the words that appear in that lexicon. Let’s do this for each lexicon.
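The two steps can be sketched as follows on a toy data frame. The names text_tokens and text_bing are chosen for the illustration, and only the BING lexicon is used so that the example runs without any download.

```r
library(tidyverse)
library(tidytext)

# Toy stand-in for the text data frame
text_raw <- tibble(
  chapter = c(1, 1),
  text = c("A bright cold day in April.", "He felt fear and hatred.")
)

# One lowercased word per row, punctuation removed
text_tokens <- text_raw %>%
  unnest_tokens(word, text)

# Keep only the words found in the lexicon, with their sentiment label
text_bing <- text_tokens %>%
  inner_join(get_sentiments("bing"), by = "word")

text_bing
```

The same inner_join works with the AFINN and NRC lexicons once they have been downloaded.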
General statistics by sentiment dictionary
Let’s have a look at general statistics using each sentiment dictionary.
Using the BING text data frame, we observe more negative than positive words in the novel. By contrast, the NRC text data frame registers more positive words, and also identifies a lot of words related to “trust” and “fear”. The AFINN text data frame gives an overall negative score for the book.
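These statistics boil down to a count per category (BING, NRC) or a sum of scores (AFINN). Here is a sketch on a handful of words. The small toy_afinn table is a hand-written stand-in with AFINN-style scores, used only to avoid the download.

```r
library(tidyverse)
library(tidytext)

words <- tibble(text = "fear hate pain love good") %>%
  unnest_tokens(word, text)

# BING: number of words per polarity
bing_counts <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)

# AFINN-style overall score (illustrative values, not the real lexicon)
toy_afinn <- tibble(
  word  = c("fear", "hate", "pain", "love", "good"),
  value = c(-2, -3, -2, 3, 3)
)

afinn_score <- words %>%
  inner_join(toy_afinn, by = "word") %>%
  summarize(score = sum(value))

bing_counts
afinn_score
```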
How to visualize the top words by emotion in R
We can now extract the top words by emotion.
Let’s visualize the top 10 words of the BING text data frame using ggplot2. We can recognize some general themes of the book.
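A possible version of this plot is sketched below on toy data. The data frame text_bing stands in for the tokenized novel joined with the BING lexicon, and the plot layout is one choice among many.

```r
library(tidyverse)
library(tidytext)

# Toy stand-in for the tokenized novel joined with BING
text_bing <- tibble(text = "fear fear fear hate hate pain love love good") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")

top_words <- text_bing %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>% # top 10 words per sentiment
  ungroup()

p <- ggplot(top_words, aes(x = n, y = reorder(word, n), fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Number of occurrences", y = NULL)

p
```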
We can do the same with the NRC text data frame, which gives a more detailed picture by emotion.
As you can see, the words labelled as “positive” and “negative” differ from those picked out by the BING lexicon. Interestingly, the category “anticipation” is prominent, with the words “time” and “thought”. Indeed, the notions of time and memory are very important in the novel. Overall, we can recognize the most important topics of the book.
How to visualize the emotional trend of the text
Finally we will visualize the emotional trajectory of the text by chapter.
Let’s begin by calculating the emotional score by chapter, simply using the summarize function from dplyr on the AFINN text data frame.
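The calculation can be sketched like this. The data frame text_afinn is a made-up stand-in for the novel joined with AFINN, with one row per scored word, and grouping by chapter before summarizing is my assumption about how the score is computed.

```r
library(tidyverse)

# Toy stand-in: words and values are made up for the sketch
text_afinn <- tibble(
  chapter = c(1, 1, 2, 2, 2, 3),
  word    = c("fear", "hope", "love", "happy", "pain", "torture"),
  value   = c(-2, 2, 3, 3, -2, -4)
)

score_by_chapter <- text_afinn %>%
  group_by(chapter) %>%
  summarize(score = sum(value)) # sum of word scores in each chapter

p_chapter <- ggplot(score_by_chapter,
                    aes(x = factor(chapter), y = score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Chapter", y = "Sentiment score")

p_chapter
```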
The score by chapter reflects relatively well what happens in the book. Chapter 10 is the moment when the main character discovers the shop, and chapter 17 is the moment of the big revelation. Sorry for the small spoiler!
We can make the analysis even finer-grained by scoring blocks of 50 lines, using the “index” column we previously created.
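The only change from the chapter-level score is the grouping variable. A short sketch with made-up line numbers and scores:

```r
library(tidyverse)

# Toy stand-in: scored words with the line number they came from
text_afinn <- tibble(
  linenumber = c(3, 40, 55, 90, 120, 149),
  value      = c(2, -1, -3, 1, -2, -4)
)

score_by_index <- text_afinn %>%
  mutate(index = linenumber %/% 50) %>% # blocks of 50 lines: 0, 1, 2, ...
  group_by(index) %>%
  summarize(score = sum(value))

p_index <- ggplot(score_by_index, aes(x = index, y = score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Block of 50 lines", y = "Sentiment score")

p_index
```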
We can see the same general trend again, but with more variation in the emotional scores. As we can observe, the emotional tone of the passages is rarely neutral, sometimes positive, but generally and increasingly negative. Once again, the visualization reflects the general tone of the book surprisingly well, given our relatively basic lexicon-based analysis.
I want to conclude by highlighting once again the limitations of the use of the lexicons in this text analysis.
The sentiment dictionaries allow us to draw only a vague and approximate picture of the emotional tone of a text. They can be misleading, as they even fail to catch negations. More advanced machine learning algorithms should be used to detect emotions in texts more accurately.
That being said, as we saw, the lexicon-based approach allowed us to correctly identify the general tone of the novel and provided a surprisingly informative overview of its emotional trajectory.
If you found this tutorial useful or if you have any questions, please let me know by giving me a clap or by writing a comment. You can download all the code of the tutorial by joining my newsletter on felixanalytix.com.
See you in another tutorial, bye bye!