Advanced Data Filtering in R using dplyr
In this article I will show you how to use the filter() function from the dplyr package. We will see different ways of keeping or removing specific rows of a dataset using 5 simple examples. Then we will explore more advanced filtering methods, such as filtering data using logical conditions on multiple columns, how to remove missing values or how to use specific words or strings to filter rows of character variables.
If you like “learning by doing” and want the code of this tutorial, you can join my newsletter on felixanalytix.com. After you confirmed your subscription, you will receive an email from me with the URL of my GitHub account where you can get the full script of this tutorial.
Install and attach R packages
If you haven’t installed the dplyr and stringr R packages, you can run the lines 8 and 9 by clicking on the button “run” (highlighted). Then you can attach dplyr and stringr by running the lines 14 and 15.
The filter() function that I will use in this video is from the dplyr package, but we also need stringr for advanced filtering on character variables.
In this tutorial we will use the dataset `starwars`, accessible when you attach dplyr. As you can see in the R Console of RStudio, the data contains information about 87 Starwars characters from the movie saga. For example, row 1 is about Luke Skywalker, with data about his height, mass, hair color, skin color, eye color, etc.
How to filter one element
- Which Starwars characters have/don’t have blue eyes?
Let’s start filtering this dataset with a simple example. Let’s say we want to keep only the Starwars characters that have blue eyes. We use the filter() function from dplyr. So we write “filter”, open parenthesis, call the `starwars` data for the argument, and for the second argument write the condition for the filtering. For this example we want that `eye_color` , the name of the column, equal, written two times `==` , the category “blue”.
Note that equality in R is written with the double equals signe
==. A single `=` is for assignment. If we run line 23, we see that 19 observations satisfy the condition `eye_color == “blue”` , as you can check by looking at the column `eye_color`.
Line 26 is equivalent of line 23, but it use the pipe operator, written
%>%. The pipe operator allows write code in a way that is easier to read and to understand. When you run multiple functions on the same data, each data wrangling steps is easier to identify when looking at your code. We will see in the advanced filtering section of this video multiple functions chained using the pipe operator. If we run line 26, we have the same output as line 23.
Now let’s say we want to do this opposite: keep only the rows where `eye_color`is not equal to “blue”. We can use the operator `!=` , where `!` means “not”. We see in the R console of RStudio that indeed “Luke Skywalker” has been filtered out of the dataset.
How to filter multiple elements
- Which characters have/don’t have blue and yellow eyes?
What if you not only want to keep “blue” eye color characters, but also the ones that have “yellow” eyes? You can use the `%in%` operator and create a vector containing the strings “blue” and “yellow” with the `c()` function. By running line 33 and 34, we will keep only the Starwars characters where `eye_color` is equal “blue” or “yellow”.
If we want to do the opposite, and filter out all the characters that have “blue” or “yellow” eyes, we can add `!` before `eye_color` .
Filtering with logical operator
- Which Starwars characters have a height greater than/less than 2 meters?
We saw how to filter categories. But what about numerical variables? Let’s say now that we want to keep only the Starwars characters taller that 2 meters. Using the variable `height`, we will use the `>` greater than sign to keep only the rows where `height` is greater than 200 centimetres.
We can do the opposite using the less than operator less than `<` to filter characters smaller than 100 centimetres. To have the list of all the relational operators in R, you can open the help documentation using the `?` by running line 48.
Filter using multiple variables
- Which Starwars characters have blue eyes AND/OR blond hair?
What about if you want to filter data based on multiple variables? I will show you now how to use logical operator to filter data based on two conditions. Let’s say we want to know which characters have “blue” eyes and “blond” hair. We can use the “AND” conjunction operator `&` . As you can see in the R console of RStudio, we have only the characters that have blue eyes and have blond hair.
A simpler way to use the AND logical operator is to use the coma `,` between two conditions. You can also use two times the `filter()` function to have the same result with the use of the pipe operator.
What about if you want all the characters that have blue eyes as well as all the characters that have blond hair? You have to use the disjunction operator OR `|`. If you look at the output in the R Console of RStudio, we have all the characters that have blue eyes as wells as the ones that have blond hair.
Filter character variables
- Which Starwars characters have/only have “blue” in eye_color?
What if you want to be more precise when filtering character variables? Maybe some Starwars characters have an eye color that contains the word “blue”, but are described differently. To keep all the observations that contain the string “blue” you can use the function `str_detect()` from the stringr R package. Let’s try it.
As you can see in the R console of RStudio, “Obi-Wan Kenobi” has “blue-grey” eyes. Do we have other Starwars characters that have eyes not described only as “blue” but still contains the word “blue”? To know that we will remove all the observations that are identical to “blue” using the “not equal” `!=` operator. As you can see, we also have the Starwars character named “R4-P17”.
Filter missing values
- Which Starwars characters have/don’t have missing values (NA) in hair_color?
If you want to know which rows have missing values in a specific row, you can use the function `is.na()`. Let’s see if we have missing values in the variable named `hair_color`. Yes we have some.
What if you want to remove these NAs from your dataset? You can use the exclamation point `!` before the `is.na()` to keep all the observations that don’t have missing values.
- Which Starwars characters are greater than the height mean (by species)?
Finally you can also use more advanced conditions within the filter() function. For example keep only the Starwars characters that have a height greater than the mean using the `mean`function.
If you want to be more specific and keep the Starwars characters that are greater that the height mean of their own species, you can use the `group_by` function the get the mean for each species. We grouped by “species”, but you could also try by gender to experiment.
That’s all for this tutorial. If you are interested in more tutorials about dplyr functions, let me know in the comments below. If you have any questions or want to share other code examples, feel free to write them also in the comments.
Don’t forget to clap and subscribe to my newsletter!