In this article we will see how to create interactive networks using the R programming language.
You can play with the interactive network we will build at the following URL: https://felixanalytix.com/vis/marvel-network
We will build the interactive network from scratch in four steps:
- Scraping the Marvel data from Wikipedia with rvest.
- Data cleaning and pivoting using the tidyverse.
- Exploratory data analysis and quality check with tidygraph.
- Creating the interactive network with visNetwork.
To download all the code of this tutorial you can join my newsletter on my website www.felixanalytix.com. Once you subscribe, you will receive an automatic email from me with the URL of my GitHub account, where you can get the R script of this article.
You can also watch the full tutorial (with additional details) in my YouTube video:
Install and attach R packages
The first thing we want to do is to install the necessary R packages for this data analysis.
We will use the tidyverse, a collection of R packages for data wrangling and visualization, rvest to do the web scraping in a tidy way, tidygraph to do network analysis (also in a tidy way), and visNetwork to make the interactive network visualization.
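As a minimal sketch of this step, the snippet below installs whichever of the four packages are missing and then attaches them. The CRAN mirror URL and the helper variable names are my own choices here, not part of the original script.

```r
# Packages used throughout this tutorial
pkgs <- c("tidyverse", "rvest", "tidygraph", "visNetwork")

# Install only the packages that are not already available
# (the mirror below is an assumption; any CRAN mirror works)
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Attach the packages for the current session
library(tidyverse)
library(rvest)
library(tidygraph)
library(visNetwork)
```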
Web Scraping Wikipedia using rvest
Now we want to scrape Marvel data from Wikipedia. After a quick search online I found three nice tables on Wikipedia.
The first table lists the most recurring characters in Phase 1 of the Marvel Cinematic Universe.
The table contains the main characters and the names of the movies. The idea is to take this data and transform it into a tidy structure, i.e. the movies in a “movie” column and the Marvel characters in a “character” column.
In your web browser you can right-click on the page, select “Inspect”, and you should see that Wikipedia uses the class wikitable for its tables.
Using rvest we can scrape only the tables with this specific class. We will do this operation for each Marvel Cinematic Universe phase, each having its own Wikipedia page.
We will create a function, get_table, which extracts the specific table from the web page. The function reads the HTML page, keeps all the nodes that have the class wikitable, then extracts all the tables using the “html_table” function from rvest, with empty strings read as NAs.
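Here is a rough sketch of the reading part only. To keep it runnable offline, it parses a small inline HTML string instead of a real Wikipedia URL; the helper name get_wikitables and the sample table are hypothetical, and the real function in the script may differ.

```r
library(rvest)

# Hypothetical helper: read a page (URL, file path or literal HTML) and
# return every table with class "wikitable" as a list of tibbles
get_wikitables <- function(src) {
  read_html(src) %>%
    html_elements(".wikitable") %>%    # keep only nodes with the wikitable class
    html_table(na.strings = "")        # empty cells become NA
}

# Made-up stand-in for a Wikipedia page: one wikitable and one unrelated table
sample_page <- '
<html><body>
  <table class="wikitable">
    <tr><th>Character</th><th>Iron Man</th><th>The Incredible Hulk</th></tr>
    <tr><td>Tony Stark</td><td>Robert Downey Jr.</td><td>Robert Downey Jr.</td></tr>
    <tr><td>Bruce Banner</td><td></td><td>Edward Norton</td></tr>
  </table>
  <table class="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
</body></html>'

tables <- get_wikitables(sample_page)
tables[[1]]
```

With a real page you would pass the Wikipedia URL instead of the HTML string and pick the table you need from the returned list.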
We also want to pivot the data using the “pivot_longer” function, sending the column names to a movie variable and the cell values to an actor variable. We use some regular expressions to clean the actor names as well as the character names, and we fix the spacing where needed.
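The pivoting and cleaning could look roughly like this. The column names (Character, Movie, Actor), the sample rows and the exact regular expression are assumptions for illustration; the patterns in the original script may be different.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up wide table, shaped like the scraped Wikipedia table
wide <- tibble(
  Character             = c("Tony Stark", "Bruce Banner[a]"),
  `Iron Man`            = c("Robert Downey Jr.", NA),
  `The Incredible Hulk` = c("Robert  Downey Jr.", "Edward Norton")
)

long <- wide %>%
  # One row per character/movie pair: column names go to Movie, cells to Actor
  pivot_longer(cols = -Character, names_to = "Movie", values_to = "Actor") %>%
  mutate(
    # Drop bracketed footnote markers such as "[a]" (assumed pattern)
    Character = str_remove_all(Character, "\\[.*?\\]"),
    # Collapse repeated whitespace and trim both ends
    Actor = str_squish(Actor)
  )

long
```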
We can now loop over the URLs using the “map_dfr” function from the purrr package. The “map_dfr” function also binds the rows of the three resulting data frames into a single one, called here “df”.
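To illustrate how “map_dfr” stacks the results, here is a small offline sketch. The function get_table_stub and the phase labels are hypothetical stand-ins for the real scraping function and the three Wikipedia URLs.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the scraping function: returns one small
# data frame per input instead of downloading a page
get_table_stub <- function(phase) {
  tibble(
    Phase     = phase,
    Character = c("Character A", "Character B")
  )
}

# Placeholder inputs; in the tutorial these are the three Wikipedia URLs
phases <- c("Phase One", "Phase Two", "Phase Three")

# Apply the function to each input and row-bind the results
df <- map_dfr(phases, get_table_stub)
df
```

In purrr 1.0.0 and later, “map_dfr” is marked as superseded; map() followed by list_rbind() gives the same result if you prefer the newer style.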
Our dataset is now in a tidy format, with NAs where a character is not in a specific movie. For example, Hulk is not in the “Iron Man” movie, but he is of course in “The Incredible Hulk”.
We also need to do some additional cleaning because a “c” sometimes appears at the end of a name. The little “c” indicates an uncredited cameo role, meaning the character appeared only very briefly in the movie. Let’s remove all the cameo appearances using a regular expression: the “$” sign anchors the match to the end of the string, so we only detect a “c” that comes last, and we replace those values with NA.
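A sketch of the cameo removal, under the assumption that the marker sits at the end of the Actor value and is an uppercase “C”. Check the scraped text and adjust the pattern (case, column) to what you actually see.

```r
library(dplyr)
library(stringr)

# Made-up rows; the second actor value carries a trailing cameo marker
df <- tibble(
  Character = c("Tony Stark", "Tony Stark", "Bruce Banner"),
  Movie     = c("Iron Man", "The Incredible Hulk", "Iron Man"),
  Actor     = c("Robert Downey Jr.", "Robert Downey Jr.C", NA)
)

# Assumed marker: a "C" at the very end of the string ("$" = end of string)
cameo_pattern <- "C$"

df_clean <- df %>%
  mutate(Actor = if_else(str_detect(Actor, cameo_pattern), NA_character_, Actor))

df_clean
```

Note that a pattern like this also matches any real name ending in the same letter, so it is worth checking which rows it hits before replacing them.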
We also rename the movies that have the same name as a Marvel character, because a movie and a character sharing one name would be merged into a single node in the network. That’s why I decided to rename the movie “Thor” as “Thor 1” and the movie “Iron Man” as “Iron Man 1”. Finally, we remove all the rows where the character does not appear in the movie using “filter(!is.na(Actor))”.
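These two operations can be sketched as follows. The sample rows are made up, and case_when is just one way to do the renaming; the original script may use another function.

```r
library(dplyr)

# Made-up rows: two movies share their name with a character
df <- tibble(
  Character = c("Thor", "Iron Man", "Hulk"),
  Movie     = c("Thor", "Iron Man", "Iron Man"),
  Actor     = c("Chris Hemsworth", "Robert Downey Jr.", NA)
)

df_final <- df %>%
  mutate(Movie = case_when(
    Movie == "Thor"     ~ "Thor 1",
    Movie == "Iron Man" ~ "Iron Man 1",
    TRUE                ~ Movie          # leave every other movie unchanged
  )) %>%
  filter(!is.na(Actor))                  # keep only actual appearances

df_final
```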
Quick Exploratory Data Analysis
Now let’s do a very quick exploratory data analysis. We can count the most recurring characters and the movies with the most Marvel characters using the “count()” function from dplyr.
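For example, on a small made-up data frame with the same Character and Movie columns (the column names are assumed), the two counts look like this:

```r
library(dplyr)

# Made-up appearances, one row per character/movie pair
df <- tibble(
  Character = c("Iron Man", "Iron Man", "Hulk"),
  Movie     = c("Iron Man 1", "The Avengers", "The Avengers")
)

# Most recurring characters, most frequent first
top_characters <- df %>% count(Character, sort = TRUE)

# Movies with the most characters, most crowded first
top_movies <- df %>% count(Movie, sort = TRUE)

top_characters
top_movies
```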
We can compute different metrics using the tidygraph R package. We first need to transform our data frame into a tbl_graph using the “as_tbl_graph” function from tidygraph.
The tidygraph package makes it easy to get metrics such as degree, betweenness, closeness, etc. You can even access a variety of other metrics, such as the PageRank algorithm originally used by Google.
For example, we can analyze the nodes (by first activating the nodes), then get a degree centrality metric using “centrality_degree()”. This tells us which characters are the most connected, i.e. appear in the most movies.
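A minimal sketch of these steps on a toy edge list. The from/to column names and the sample edges are my own; with the real data the edges are the character and movie pairs built earlier.

```r
library(dplyr)
library(tidygraph)

# Toy edge list: each row links a character (from) to a movie (to)
edges <- tibble(
  from = c("Iron Man", "Iron Man", "Captain America", "Hulk"),
  to   = c("Iron Man 1", "The Avengers", "The Avengers", "The Avengers")
)

# Build an undirected graph; node names are taken from the edge list
graph <- as_tbl_graph(edges, directed = FALSE)

node_metrics <- graph %>%
  activate(nodes) %>%                       # work on the node table
  mutate(
    degree   = centrality_degree(),         # number of connections per node
    pagerank = centrality_pagerank()        # PageRank score per node
  ) %>%
  as_tibble() %>%                           # extract the active (node) table
  arrange(desc(degree))

node_metrics
```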
We see that the degree of both Iron Man and Captain America is 9, exactly as expected (we already saw this result using the “count()” function from dplyr).
I will not go further into the different metrics of tidygraph. I could write another article if you’re interested (let me know in the comments below). Feel free to explore the other functions of the tidygraph package by yourself.
How to create the interactive network
Before building the interactive network we will add a “group” variable, which will allow us to display the Marvel characters and the movies differently in the interactive network. We can then transform our graph into a list of nodes and edges using the “toVisNetworkData()” function.
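One possible way to do this, sketched on a toy edge list. The group labels “Movie” and “Character”, and the rule used to assign them (a node is a movie if it appears in the “to” column), are assumptions for this example.

```r
library(dplyr)
library(tidygraph)
library(visNetwork)

# Toy edge list: characters in "from", movies in "to"
edges <- tibble(
  from = c("Iron Man", "Iron Man", "Hulk"),
  to   = c("Iron Man 1", "The Avengers", "The Avengers")
)

vis_network <- edges %>%
  as_tbl_graph(directed = FALSE) %>%
  # Label each node as a movie or a character (assumed rule)
  mutate(group = if_else(name %in% edges$to, "Movie", "Character")) %>%
  # Convert the graph into a list with a "nodes" and an "edges" data frame
  toVisNetworkData()

vis_network$nodes
vis_network$edges
```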
Our “vis_network” object can now be read by the “visNetwork()” function to create the interactive visualization automatically. We pass the nodes of this object to the “nodes” argument of “visNetwork()”, and the edges to the “edges” argument.
You can choose the “width”, the “height” and even give it a title, i.e. “The Marvel Cinematic Universe Network”. We also set a random seed because the network layout is generated randomly. I decided to add some icons to differentiate the movies and the characters using Font Awesome icons. I also want to highlight the nearest edges and nodes when hovering with my mouse, using the “highlightNearest” argument.
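Put together, the call could look roughly like the sketch below. The sample nodes and edges, the sizes, the colors, the seed value and the two Font Awesome icon codes (f008 for a film icon, f007 for a user icon in Font Awesome 4.7) are my own picks for illustration, not necessarily those of the original script.

```r
library(visNetwork)

# Toy nodes and edges in the format visNetwork expects
nodes <- data.frame(
  id    = c("Iron Man", "Hulk", "Iron Man 1", "The Avengers"),
  label = c("Iron Man", "Hulk", "Iron Man 1", "The Avengers"),
  group = c("Character", "Character", "Movie", "Movie")
)
edges <- data.frame(
  from = c("Iron Man", "Iron Man", "Hulk"),
  to   = c("Iron Man 1", "The Avengers", "The Avengers")
)

net <- visNetwork(
  nodes  = nodes,
  edges  = edges,
  width  = "100%",
  height = "600px",
  main   = "The Marvel Cinematic Universe Network"
) %>%
  addFontAwesome() %>%                       # load the Font Awesome icon set
  visGroups(groupname = "Movie", shape = "icon",
            icon = list(code = "f008", color = "darkred")) %>%
  visGroups(groupname = "Character", shape = "icon",
            icon = list(code = "f007", color = "steelblue")) %>%
  # Highlight neighbouring nodes and edges on hover
  visOptions(highlightNearest = list(enabled = TRUE, hover = TRUE)) %>%
  visLayout(randomSeed = 123)                # fixed seed for a reproducible layout

net
```

Printing the object in RStudio or in an R Markdown document renders the interactive widget.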
You can see the result and play with this interactive network at the following URL: https://felixanalytix.com/vis/marvel-network
If you found this tutorial useful, consider giving me a clap or even subscribing. You can download the full R script of this tutorial by joining my newsletter on www.felixanalytix.com.
See you in another tutorial, bye!