Before starting this workshop, we recommend that you download the following two files:
Data visualisation in Rstudio
Luisa M. Arias, Guillermo G., Pascal Nuijten
Before starting
As covered in the introduction, you need to do the following for all the practicals to work:
+ Install R
+ Install RStudio
+ Install all the necessary packages
Introduction
RStudio is a must-know tool for everyone who works with the R programming language. It’s used in data analysis to import, access, transform, explore, plot, and model data, and even for machine learning to make predictions on data. RStudio is a flexible and multifunctional open-source IDE (integrated development environment) that is extensively used as a graphical front-end to work with the R programming language, giving it a user-friendly interface that makes programming more accessible.
If you’re just getting started with learning R, RStudio is definitely for you. So let’s get started with some basic functionalities of RStudio.
Creating a dataframe
What is a data frame? A data frame is the most common way of storing data in R and is generally the data structure most often used for data analysis. It works much like a table; however, in addition to storing your data, you can also perform operations on it and visualize it.
In our case, we are going to have information on bird observations in a specific region in the Netherlands, including the names of the species that were spotted and the number of times they were seen.
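The code that defines the two vectors does not appear in this text. A minimal reconstruction, assuming the values are exactly those shown in the printed data frame further down, could look like this:

```r
# Reconstructed from the printed output of df; the original chunk is not shown here.
# Species names are text, so each value is quoted.
Name <- c("C. ciconia", "P. porzana", "C. pugnax", "L. marinus", "G. stellata")

# Observation counts are numbers, so no quotes are needed.
Observations <- c(12, 4, 16, 6, 10)
```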
The two variables, Name and Observations, are vectors. A vector is an ordered sequence of values of the same type, and it is the simplest data structure in R.
Note that it’s necessary to place quotes around text (for the values under the “Name” column), but it’s not required to use quotes around numeric values (for the values under the “Observations” column).
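To see what the quotes change, the short sketch below compares a quoted and an unquoted vector. The values and the variable names `quoted` and `unquoted` are made up for illustration and are not part of the workshop data.

```r
# Hypothetical example values, not taken from the workshop data.
quoted   <- c("12", "4")   # quotes make these character strings
unquoted <- c(12, 4)       # no quotes: numeric values

class(quoted)    # "character"
class(unquoted)  # "numeric"

# Arithmetic only works on the numeric vector.
sum(unquoted)    # 16
# sum(quoted) would stop with an error, because text cannot be added up.
```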
df <- data.frame(Name, Observations)
print(df)
##          Name Observations
## 1  C. ciconia           12
## 2  P. porzana            4
## 3   C. pugnax           16
## 4  L. marinus            6
## 5 G. stellata           10
You can also create data frames by importing data into R (suggestion: save your Excel files as .csv before doing this):
df_from_excel <- read.csv("data_vis.csv", sep = ";")
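The `sep = ";"` argument matters because some Excel locales export .csv files with semicolons instead of commas. The sketch below shows the round trip on a small made-up table written to a temporary file; the names `example`, `csv_path` and `df_check` are only illustrative.

```r
# Small example table (hypothetical values) written to a temporary file.
example <- data.frame(Name = c("A", "B"), Observations = c(3, 7))
csv_path <- tempfile(fileext = ".csv")

# Write with ";" as the field separator, as some Excel locales do.
write.table(example, csv_path, sep = ";", row.names = FALSE, quote = FALSE)

# Read it back; sep must match the separator used in the file.
df_check <- read.csv(csv_path, sep = ";")
str(df_check)
```

If the file also uses a comma as the decimal mark, `read.csv2()` assumes both conventions by default.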
Operations with our dataframe
Now that the data is loaded, we can perform operations on it or plot it. Which of these you do depends on what you aim to obtain from the analysis.
For numerical operations, there are already several built-in functions in R. Let’s try to use some.
We could check how many observations there were in total, without discriminating between species:
total_Obs <- sum(df$Observations)
print(total_Obs)
## [1] 48
Now that we have a total, we can express our observations as percentages instead of counts and add that information to our table as a new column:
df$Percentage <- df$Observations * 100 / total_Obs
After having done the operation, we can check our dataframe and see if it was actually added:
print(df)
##          Name Observations Percentage
## 1  C. ciconia           12  25.000000
## 2  P. porzana            4   8.333333
## 3   C. pugnax           16  33.333333
## 4  L. marinus            6  12.500000
## 5 G. stellata           10  20.833333
Question
If you wanted to check that the Percentage column sums up to 100% (which would indicate the calculation was done correctly), how would you do it?
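One possible answer is to sum the column, just as we did for the counts. The sketch below rebuilds the numbers from the table above so that it runs on its own; `total_pct` is an illustrative name.

```r
# Rebuild the example data so this block runs on its own.
Observations <- c(12, 4, 16, 6, 10)
Percentage <- Observations * 100 / sum(Observations)

total_pct <- sum(Percentage)
print(total_pct)

# Floating-point results can differ from 100 by a tiny amount,
# so compare with a tolerance instead of using ==.
isTRUE(all.equal(total_pct, 100))
```

With the data frame from this workshop, the same check would be `sum(df$Percentage)`.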
Making plots with our data
Many times, we prefer to summarize and present our findings in a visual way, be it for presentations at conferences or for publishing an article. Luckily, we can also do this in an easy way with RStudio.
Let’s try to plot the data we are working with now. For the bird observations, perhaps a bar plot would be a good way to start:
plot_bars <- barplot(df$Observations, names.arg = df$Name, main = "Bar plot of observed birds")
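Besides drawing the chart, `barplot()` returns the x positions of the bar midpoints, which is why its result can be stored in a variable. As a sketch of how that value could be used, the block below writes each count above its bar; the name `bar_mids` and the extra `ylim` headroom are choices made for this example.

```r
# Rebuild the example data so this block runs on its own.
Name <- c("C. ciconia", "P. porzana", "C. pugnax", "L. marinus", "G. stellata")
Observations <- c(12, 4, 16, 6, 10)

# barplot() draws the chart and returns the x positions of the bar midpoints.
bar_mids <- barplot(Observations, names.arg = Name,
                    main = "Bar plot of observed birds",
                    ylim = c(0, max(Observations) * 1.2))  # headroom for labels

# Use the midpoints to write each count just above its bar.
text(x = bar_mids, y = Observations, labels = Observations, pos = 3)
```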
To summarize our results even further, we could even make a pie chart:
pie(df$Observations, labels = df$Name, main = "Pie chart of observed birds")
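Pie charts are easier to read when each slice shows its share. A minimal sketch, assuming we want labels of the form "name (percentage%)", builds them with `paste0()` and `round()`; `pct` and `pie_labels` are illustrative names.

```r
# Rebuild the example data so this block runs on its own.
Name <- c("C. ciconia", "P. porzana", "C. pugnax", "L. marinus", "G. stellata")
Observations <- c(12, 4, 16, 6, 10)

# Percentage per species, rounded to one decimal to keep labels short.
pct <- round(Observations * 100 / sum(Observations), 1)

# Combine species name and percentage into one label per slice.
pie_labels <- paste0(Name, " (", pct, "%)")

pie(Observations, labels = pie_labels, main = "Pie chart of observed birds")
```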
But here we are merely scratching the surface. So keep learning and trying new things with RStudio!
Example: Boxplot (https://r-graph-gallery.com/boxplot.html)
# Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(viridis)
## Loading required package: viridisLite
# Create a dataset. Note that "B" is repeated twice, so group B holds 1000 values
# drawn from two different normal distributions.
data <- data.frame(
  name = c(rep("A", 500), rep("B", 500), rep("B", 500), rep("C", 20), rep("D", 100)),
  value = c(rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1))
)

# Basic box plot, one box per group
data %>%
  ggplot(aes(x = name, y = value, fill = name)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("Basic boxplot") +
  xlab("")
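A box plot does not show how many values sit behind each box. In the dataset above, "B" appears twice in `name`, so that group is much larger than the others. The sketch below counts the values per group with `table()` and draws a comparable plot with base R's `boxplot()`, so no extra packages are needed. The `set.seed()` call and the name `group_sizes` are additions for this example and are not part of the original code.

```r
set.seed(42)  # assumption: fixed seed so the random values are reproducible

data <- data.frame(
  name  = c(rep("A", 500), rep("B", 500), rep("B", 500), rep("C", 20), rep("D", 100)),
  value = c(rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1),
            rnorm(20, 25, 4), rnorm(100, 12, 1))
)

# Number of values per group; the box plot alone does not reveal this.
group_sizes <- table(data$name)
print(group_sizes)

# Base R box plot with one box per group.
boxplot(value ~ name, data = data, main = "Basic boxplot", xlab = "")
```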
We can also perform some basic operations on numbers and save the results directly in variables: