Advanced Data Filtering in R using dplyr

FelixAnalytix
7 min readMar 16, 2022

--

Would you like to learn how to filter() datasets with simple and more advanced examples?

In this tutorial we will start with 5 simple examples showing how to keep and removing rows of a dataset. Then you will see more advanced filtering methods such as:

  • using logical conditions on multiple columns of a dataset,
  • how to remove missing values using dplyr,
  • how to use specific words or strings to filter rows of character variables.

You can also watch this tutorial on YouTube:

Advanced Filtering using R

Download the R code of this tutorial by joining my newsletter on www.felixanalytix.com. You will receive an automatic email from me to access the R script.

Install and attach R packages

RStudio — Install and attach R packages

If you haven’t installed the dplyr and stringr R packages, you can run the lines 8 and 9 by clicking on the button “run” (highlighted). Then you can attach dplyr and stringr by running the lines 14 and 15.

The filter() function that I will use in this video is from the dplyr package, but we also need stringr for advanced filtering on character variables.

Demo dataset

In this tutorial we will use the dataset `starwars`, accessible when you attach dplyr. As you can see in the R Console of RStudio, the data contains information about 87 Starwars characters from the movie saga. For example, row 1 is about Luke Skywalker, with data about his height, mass, hair color, skin color, eye color, etc.

RStudio — Starwars dataset

How to filter one element

  • Which Starwars characters have/don’t have blue eyes?

Let’s start filtering this dataset with a simple example. Let’s say we want to keep only the Starwars characters that have blue eyes. We use the filter() function from dplyr. So we write “filter”, open parenthesis, call the `starwars` data for the argument, and for the second argument write the condition for the filtering. For this example we want that `eye_color` , the name of the column, equal, written two times `==` , the category “blue”.

Note that equality in R is written with the double equals signe ==. A single `=` is for assignment. If we run line 23, we see that 19 observations satisfy the condition `eye_color == “blue”` , as you can check by looking at the column `eye_color`.

RStudio —keep one element

Line 26 is equivalent of line 23, but it use the pipe operator, written %>%. The pipe operator allows write code in a way that is easier to read and to understand. When you run multiple functions on the same data, each data wrangling steps is easier to identify when looking at your code. We will see in the advanced filtering section of this video multiple functions chained using the pipe operator. If we run line 26, we have the same output as line 23.

Now let’s say we want to do this opposite: keep only the rows where `eye_color`is not equal to “blue”. We can use the operator `!=` , where `!` means “not”. We see in the R console of RStudio that indeed “Luke Skywalker” has been filtered out of the dataset.

RStudio — remove one element

How to filter multiple elements

  • Which characters have/don’t have blue and yellow eyes?

What if you not only want to keep “blue” eye color characters, but also the ones that have “yellow” eyes? You can use the `%in%` operator and create a vector containing the strings “blue” and “yellow” with the `c()` function. By running line 33 and 34, we will keep only the Starwars characters where `eye_color` is equal “blue” or “yellow”.

RStudio — filter multiple elements

If we want to do the opposite, and filter out all the characters that have “blue” or “yellow” eyes, we can add `!` before `eye_color` .

Filtering with logical operator

  • Which Starwars characters have a height greater than/less than 2 meters?

We saw how to filter categories. But what about numerical variables? Let’s say now that we want to keep only the Starwars characters taller that 2 meters. Using the variable `height`, we will use the `>` greater than sign to keep only the rows where `height` is greater than 200 centimetres.

RStudio — filtering using logical elements

We can do the opposite using the less than operator less than `<` to filter characters smaller than 100 centimetres. To have the list of all the relational operators in R, you can open the help documentation using the `?` by running line 48.

Filter using multiple variables

  • Which Starwars characters have blue eyes AND/OR blond hair?

What about if you want to filter data based on multiple variables? I will show you now how to use logical operator to filter data based on two conditions. Let’s say we want to know which characters have “blue” eyes and “blond” hair. We can use the “AND” conjunction operator `&` . As you can see in the R console of RStudio, we have only the characters that have blue eyes and have blond hair.

RStudio — filter multiple variables with “and” operator

A simpler way to use the AND logical operator is to use the coma `,` between two conditions. You can also use two times the `filter()` function to have the same result with the use of the pipe operator.

What about if you want all the characters that have blue eyes as well as all the characters that have blond hair? You have to use the disjunction operator OR `|`. If you look at the output in the R Console of RStudio, we have all the characters that have blue eyes as wells as the ones that have blond hair.

RStudio — filter with the logical operator “or”

Filter character variables

  • Which Starwars characters have/only have “blue” in eye_color?

What if you want to be more precise when filtering character variables? Maybe some Starwars characters have an eye color that contains the word “blue”, but are described differently. To keep all the observations that contain the string “blue” you can use the function `str_detect()` from the stringr R package. Let’s try it.

RStudio — filtering using str_detect

As you can see in the R console of RStudio, “Obi-Wan Kenobi” has “blue-grey” eyes. Do we have other Starwars characters that have eyes not described only as “blue” but still contains the word “blue”? To know that we will remove all the observations that are identical to “blue” using the “not equal” `!=` operator. As you can see, we also have the Starwars character named “R4-P17”.

Filter missing values

  • Which Starwars characters have/don’t have missing values (NA) in hair_color?

If you want to know which rows have missing values in a specific row, you can use the function `is.na()`. Let’s see if we have missing values in the variable named `hair_color`. Yes we have some.

RStudio — filtering missing values

What if you want to remove these NAs from your dataset? You can use the exclamation point `!` before the `is.na()` to keep all the observations that don’t have missing values.

Advanced filtering

  • Which Starwars characters are greater than the height mean (by species)?

Finally you can also use more advanced conditions within the filter() function. For example keep only the Starwars characters that have a height greater than the mean using the `mean`function.

RStudio — advanced filtering

If you want to be more specific and keep the Starwars characters that are greater that the height mean of their own species, you can use the `group_by` function the get the mean for each species. We grouped by “species”, but you could also try by gender to experiment.

Final Remarks

That’s all for this tutorial. If you are interested in more tutorials about dplyr functions, let me know in the comments below. If you have any questions or want to share other code examples, feel free to write them also in the comments.

Don’t forget to clap and subscribe to my newsletter!

--

--

No responses yet