Asst. Prof. Teerasak E-kobon
5 Feb 2021
This chapter introduces graphical packages and libraries in R that are useful for data presentation and visualization. Well-designed data presentation helps readers understand and communicate the data and the analytic results. The plot() function is the basic, general-purpose plotting function built into R. The COVID-19 data provided by the covid19.analytics package will be used as the model dataset. The covid19.data() function gives access to worldwide COVID-19 information; more details can be read with the ?covid19.data command. The first example shows the first steps of how to download and install the package and obtain a specific dataset from it. The dataset selected in this first example is another type of data object, called a time series, which is used for storing time-dependent data such as the number of confirmed COVID-19 cases per date.
Example 1 The installation of the covid19.analytics package and specific data acquisition
install.packages("covid19.analytics")
library(covid19.analytics)
?covid19.data
# select only the time series data of the confirmed cases
covid19.confirmed.cases <- covid19.data("ts-confirmed")
head(covid19.confirmed.cases)
tail(covid19.confirmed.cases)
# select a part of the data table
covid19.confirmed.cases[,2]
covid19.confirmed.cases[1:5,6:10]
nrow(covid19.confirmed.cases)
ncol(covid19.confirmed.cases)
In many cases, we do not need to visualize the whole dataset; instead, we select a subset and display it as a graph. Several arguments of the plot() function can be adjusted to produce clear, good-looking graphs, as in the second example.
Example 2 The use of plot() function
dat = covid19.confirmed.cases[244,1:380]
typeof(dat)
unlist(dat)
plot(unlist(dat), type = "p")
plot(unlist(dat), type = "l")
plot(unlist(dat), type = "h")
plot(unlist(dat), type = "b")
plot(unlist(dat), type = "b", col = "red", lwd = 2, pch = 17, xlab = "Days", ylab = "Number of the cases")
legend("topleft", legend = c("Thailand"), col = c("red"), pch = c(17), bty = "n", pt.cex = 2, cex = 1.2, text.col = "black", horiz = FALSE, inset = c(0.1, 0.1))
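Example 2 picks Thailand by its row number (244), which may shift whenever the dataset is updated. The sketch below shows one way to select a row by country name instead. It runs on a small made-up table, not on the real data: the column names (Province.State, Country.Region, Lat, Long, followed by one column per date) are an assumption about the layout returned by covid19.data("ts-confirmed"), and all numbers are invented for illustration.

```r
# Toy table with made-up values; it only mimics the layout assumed for
# covid19.data("ts-confirmed"): four metadata columns, then one column per date.
toy <- data.frame(
  Province.State = c("", "", ""),
  Country.Region = c("Malaysia", "Thailand", "Vietnam"),
  Lat  = c(4.2, 15.9, 14.1),
  Long = c(101.9, 100.9, 108.3),
  `2021-01-01` = c(10, 5, 1),
  `2021-01-02` = c(14, 6, 1),
  `2021-01-03` = c(21, 9, 2),
  check.names = FALSE
)

# Select the row by country name rather than by position
th <- toy[toy$Country.Region == "Thailand", ]

# Drop the four metadata columns so that only the numeric daily counts remain
counts <- unlist(th[, -(1:4)])

plot(counts, type = "b", col = "red", pch = 17,
     xlab = "Days", ylab = "Number of cases")
```

Dropping the metadata columns before unlist() keeps the vector numeric; mixing text and numbers in one vector would otherwise turn every value into text.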
Question 1 The "ts-confirmed" dataset in Example 1 contains the confirmed COVID-19 cases of many countries worldwide, updated to the current date (February 2021). If we would like to compare the number of confirmed cases over the past 30 days among the Southeast Asian countries, can you write the code to visualize this selected data with suitable graph types? What can you explain by looking at the plot?
Another popular package for graphical plots in R is ggplot2. This package has a layered structure that allows users to add and adjust the components of a graph separately, using the + operator and functions (e.g. geom_point(), labs(), geom_line(), facet_wrap(), geom_bar(), geom_boxplot()) whose arguments customize the graph. The ggplot() function creates the base plot, and the accessory functions add layers to it. The package provides many plot types, such as the boxplot, violin plot, dot plot, strip chart, histogram, density plot, scatter plot, bar plot, line plot, and pie chart. The third example demonstrates how to use functions in the ggplot2 package to visualize the covid19.confirmed.cases data. The covid19.analytics package also provides its own analysis and visualization tools for the data, as shown at the end of the third example.
Example 3 The use of ggplot() function for data visualization
library(ggplot2)
length(unlist(dat))
# convert data type
unlist(dat)[5:380]
names(unlist(dat))[5:380]
# create a new data frame
newTable = data.frame(date = names(unlist(dat))[5:380], number = unlist(dat)[5:380])
newTable
ncol(newTable)
rownames(newTable)
ggplot(newTable, aes(x=date, y=number)) + geom_point()
# create an object to store the plot result
g = ggplot(newTable, aes(x=date, y=number)) +
  geom_point(col="steelblue", size=3) +
  labs(title="Confirmed cases in Thailand", x="Dates", y="Case number")
plot(g)
# subset new data from the whole data table
dataTH = covid19.confirmed.cases[244,1:380]
dataMa = covid19.confirmed.cases[175,1:380]
# create another data frame
newTable2 = data.frame(date = names(unlist(dat))[5:380], TH = unlist(dataTH)[5:380], MA = unlist(dataMa)[5:380])
head(newTable2)
ggplot(newTable2, aes(x=date)) +
  geom_point(aes(y = TH), color = "darkred") +
  geom_point(aes(y = MA), color="darkgreen") +
  labs(title="bw Theme", x = "Dates", y = "Number") +
  theme(axis.text.x = element_blank(), axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid"))
# convert the covid19.confirmed.cases dataset to a data frame
newTable3 = data.frame(covid19.confirmed.cases)
head(newTable3)
nrow(newTable3)
ncol(newTable3)
newTable3[1:5,1:7]
?growth.rate
# read data for confirmed cases
data <- covid19.data("ts-confirmed")
head(data)
# compute changes and growth rates per location for 'Thailand'
growth.rate(data, geo.loc="Thailand")
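Example 3 compares two countries by adding one geom_point() layer per column. When more countries are involved, a common alternative is to stack the data into a "long" table with one row per date and country, so that a single layer and an automatic legend cover all of them. The sketch below illustrates the idea on made-up numbers; the data frame names wide, long, and g2 are hypothetical, and the date labels of the real dataset may need a different format in as.Date().

```r
library(ggplot2)

# Made-up cumulative counts for two series, in the wide layout
# used by newTable2 (one column per country)
wide <- data.frame(
  date = c("2021-01-01", "2021-01-02", "2021-01-03"),
  TH = c(5, 6, 9),
  MA = c(10, 14, 21)
)

# Stack the country columns into long format: one row per (date, country)
long <- data.frame(
  date    = as.Date(rep(wide$date, times = 2)),
  country = rep(c("TH", "MA"), each = nrow(wide)),
  number  = c(wide$TH, wide$MA)
)

# One geom_line() layer now draws every country, coloured by the country column
g2 <- ggplot(long, aes(x = date, y = number, colour = country)) +
  geom_line() +
  geom_point() +
  labs(title = "Confirmed cases (toy data)", x = "Date", y = "Case number")
print(g2)
```

Converting the date text to the Date class lets ggplot2 draw a continuous time axis instead of treating every date as a separate category.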
Question 2 For the covid19.confirmed.cases dataset, suppose the user would like to see a table showing the cumulative number of COVID-19 cases in each available country over the past 100 days. Please write the code to generate this table and produce a graphic visualization that helps to understand the data.
Question 3 The covid19.analytics package also has other datasets for death cases (ts-deaths) and recovered cases (ts-recovered). If we want to compare the numbers of deaths and recovered cases in at least ten Asian countries over the past 300 days, please write the code to analyse the data and visualize the result. Which are the top five Asian countries with the highest numbers of deaths and recovered cases? Please produce the tables or graphs that explain your answer.
The datasets in the covid19.analytics package are mostly time series data, which can be analyzed with several specific objects and functions in R, such as the ts object shown in the fourth example.
Example 4 The use of the time series data object in R
?ts
data1.ts = ts(unlist(data[1,5:100]), frequency = 1)
data2.ts = ts(unlist(data[2,5:100]), frequency = 1)
plot.ts(data1.ts)
plot.ts(data2.ts)
par(mfrow = c(1,2))
plot.ts(data1.ts)
plot.ts(data2.ts)
par(mfrow = c(1,1))
data1_2.ts <- cbind(data1.ts, data2.ts)
plot(aggregate.ts(data1_2.ts, FUN=sum), xlab = "Time (days)", ylab = "Number of cases", plot.type = "single", col = c(1:3), lwd = 2)
title("Number of confirmed cases")
legend("topleft", c("country 1", "country 2"), lty = c(1,1,1), col = c(1:3), lwd = 2)
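Beyond plotting, a ts object carries its own time index, which base R functions such as window() and diff() can use. The following is a minimal sketch on made-up cumulative counts (the numbers and the names cumulative, cases.ts, and new.cases are invented for illustration); it extracts a sub-period and converts cumulative totals into day-to-day increments.

```r
# Made-up cumulative counts for ten consecutive days
cumulative <- c(2, 3, 5, 8, 8, 12, 20, 26, 31, 40)

# frequency = 1 means one observation per time unit (here, one per day)
cases.ts <- ts(cumulative, start = 1, frequency = 1)

# window() extracts a sub-period by time index: days 4 to 7
window(cases.ts, start = 4, end = 7)

# diff() turns cumulative totals into day-to-day increments;
# the result starts at day 2 because day 1 has no previous value
new.cases <- diff(cases.ts)
new.cases

plot(new.cases, type = "h", xlab = "Time (days)", ylab = "New cases per day")
```

Because the result of diff() is still a ts object, it keeps the day index and can be passed directly to plot().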
Question 4 Time series data record values over a period of time, from which a growth rate can be computed. Can you show how to write the R code to calculate the growth rate of the covid19.analytics data? Although this package can compute the growth rate with the growth.rate() function, can you show the equations and write your own code for the same computation?
Question 5 If we want to show the total number of confirmed cases by country on a world map, can you propose a solution or code for this?
By the end of this chapter, students should be able to visualize data and present results with suitable graphic options, or build their own graphic ideas using R code.