A brief overview of common R commands for Introduction to Statistics.
This notebook illustrates some R commands that relate directly to the statistical concepts covered in MATH 204 Introduction to Statistics. The presentation is terse and there is little discussion on how to interpret results as these details are covered in lectures.
It is necessary to load any package (other than base R) that contains commands that we want to use. Note that these packages have to be installed before they can be loaded. One can install packages using the “Packages” tab and then clicking the “Install” icon.
# load packages
library(tidyverse)
library(openintro)
library(ggformula)
library(ggmosaic)
library(broom)
library(GGally)
library(skimr)
Here is a brief overview on each of the packages we have loaded:
tidvyverse
contains commands for working with data frames and creating plots with ggplot
,
openintro
is an R package that is associated with our textbook OpenIntro Statistics. Mostly it just contains data sets.
ggformula
, ggmosaic
, and GGally
each add enhanced plotting capabilities,
broom
has commands that help clean up output from linear models,
skimr
contains a command called skim
that prints out a nice data summary.
Now that we’ve loaded our packages, we can import some data.
Here we load some data, the EPA 2021 vehicle data.
# load data
my_data <- read.csv("data/epa2021data.csv")
# if you don't have the data file, uncomment and use:
#my_data <- epa2021
This data contains vehicle information from the EPA for 2021. The size of this data set is reported by the following command:
dim(my_data)
[1] 1108 28
We see that there are 1108 observations (rows) and 28 variables (columns).
It is a good idea to try to learn more about our data set. One helpful command along these lines is the summary
command that prints out summary statistics for each variable in the data set.
summary(my_data)
model_yr mfr_name division
Min. :2021 Length:1108 Length:1108
1st Qu.:2021 Class :character Class :character
Median :2021 Mode :character Mode :character
Mean :2021
3rd Qu.:2021
Max. :2021
carline mfr_code model_type_index
Length:1108 Length:1108 Min. : 1.0
Class :character Class :character 1st Qu.: 40.0
Mode :character Mode :character Median :114.0
Mean :259.4
3rd Qu.:507.0
Max. :999.0
engine_displacement no_cylinders transmission_speed
Min. :1.000 Min. : 3.000 Length:1108
1st Qu.:2.000 1st Qu.: 4.000 Class :character
Median :3.000 Median : 6.000 Mode :character
Mean :3.125 Mean : 5.604
3rd Qu.:3.800 3rd Qu.: 6.000
Max. :8.000 Max. :16.000
city_mpg hwy_mpg comb_mpg guzzler
Min. : 8.00 Min. :13.0 Min. :10.00 Length:1108
1st Qu.:17.00 1st Qu.:23.0 1st Qu.:19.00 Class :character
Median :20.00 Median :27.0 Median :22.50 Mode :character
Mean :20.89 Mean :27.3 Mean :23.29
3rd Qu.:23.00 3rd Qu.:31.0 3rd Qu.:26.00
Max. :58.00 Max. :60.0 Max. :59.00
air_aspir_method air_aspir_method_desc transmission
Length:1108 Length:1108 Length:1108
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
transmission_desc no_gears trans_lockup
Length:1108 Min. : 1.000 Length:1108
Class :character 1st Qu.: 6.000 Class :character
Mode :character Median : 8.000 Mode :character
Mean : 7.501
3rd Qu.: 8.000
Max. :10.000
trans_creeper_gear drive_sys drive_desc
Length:1108 Length:1108 Length:1108
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
fuel_usage fuel_usage_desc class
Length:1108 Length:1108 Length:1108
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
car_truck release_date fuel_cell
Length:1108 Length:1108 Length:1108
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
One benefit of the summary command is that it helps us to identify the type of each variable. The glimpse
command from the pillar
package (which is installed as part of tidyverse
) is also helpful in this regard.
pillar::glimpse(my_data)
Rows: 1,108
Columns: 28
$ model_yr <int> 2021, 2021, 2021, 2021, 2021, 2021, 20…
$ mfr_name <chr> "Honda", "aston martin", "aston martin…
$ division <chr> "Acura", "Aston Martin Lagonda Ltd", "…
$ carline <chr> "NSX", "Vantage Manual", "Vantage V8",…
$ mfr_code <chr> "HNX", "ASX", "ASX", "VGA", "VGA", "VG…
$ model_type_index <int> 39, 5, 4, 5, 7, 6, 8, 73, 352, 350, 75…
$ engine_displacement <dbl> 3.5, 4.0, 4.0, 5.2, 5.2, 5.2, 5.2, 2.0…
$ no_cylinders <int> 6, 8, 8, 10, 10, 10, 10, 4, 6, 4, 16, …
$ transmission_speed <chr> "Auto(AM-S9)", "Manual(M7)", "Auto(S8)…
$ city_mpg <int> 21, 14, 18, 13, 14, 13, 14, 23, 22, 25…
$ hwy_mpg <int> 22, 21, 24, 20, 23, 20, 23, 31, 30, 32…
$ comb_mpg <int> 21, 17, 20, 16, 17, 16, 17, 26, 25, 28…
$ guzzler <chr> "N", "N", "N", "Y", "Y", "Y", "Y", "N"…
$ air_aspir_method <chr> "TC", "TC", "TC", NA, NA, NA, NA, "TC"…
$ air_aspir_method_desc <chr> "Turbocharged", "Turbocharged", "Turbo…
$ transmission <chr> "AMS", "M", "SA", "AMS", "AMS", "AMS",…
$ transmission_desc <chr> "Automated Manual- Selectable (e.g. Au…
$ no_gears <int> 9, 7, 8, 7, 7, 7, 7, 7, 8, 8, 7, 7, 8,…
$ trans_lockup <chr> "Y", "N", "Y", "N", "N", "N", "N", "N"…
$ trans_creeper_gear <chr> "N", "N", "N", "N", "N", "N", "N", "N"…
$ drive_sys <chr> "A", "R", "R", "A", "R", "A", "R", "A"…
$ drive_desc <chr> "All Wheel Drive", "2-Wheel Drive, Rea…
$ fuel_usage <chr> "GPR", "GP", "GP", "GPR", "GPR", "GPR"…
$ fuel_usage_desc <chr> "Gasoline (Premium Unleaded Required)"…
$ class <chr> "Two Seaters", "Two Seaters", "Two Sea…
$ car_truck <chr> "car", "car", "car", "car", "car", "ca…
$ release_date <chr> "2021-02-01T00:00:00Z", "2020-08-10T00…
$ fuel_cell <chr> "N", "N", "N", "N", "N", "N", "N", "N"…
The next step in a data analysis is to conduct a so-called exploratory data analysis (EDA). This allows us to assess the data for any problems and also helps to suggest what types of statistical or analytical questions can be asked and/or answered.
The first steps in EDA are to compute summary statistics for variables of interest and to build appropriate summary visualizations.
For example, the comb_mpg
variable (combined mileage) is a continuous numerical variable so it makes sense to compute its mean, median, standard deviation, and variance.
(comb_mpg_mean <- mean(my_data$comb_mpg))
[1] 23.2852
(comb_mpg_median <- median(my_data$comb_mpg))
[1] 22.5
(comb_mpg_sd <- sd(my_data$comb_mpg))
[1] 6.445931
(comb_mpg_var <- var(my_data$comb_mpg))
[1] 41.55002
Histograms and boxplots are appropriate ways to provide a visual summary for a numerical variable. The following plotting commands illustrate some of the ways in which such plots can be obtained.
gf_histogram(~comb_mpg,data=my_data)
Let’s change the number of bins and also add color to show where the bins are.
gf_histogram(~comb_mpg,data=my_data,bins=25,color="black")
A boxplot is another way to visualize numerical data.
gf_boxplot(~comb_mpg,data=my_data) + coord_flip()
The following subsections illustrate the R commands for conducting a statistical hypothesis test for a single proportion, a difference of proportions, a single mean, paired means, and a difference of two means. Only the syntax is illustrated, refer to the lectures for more context.
prop.test(18,45,p=0.32,correct = FALSE)
1-sample proportions test without continuity correction
data: 18 out of 45, null probability 0.32
X-squared = 1.3235, df = 1, p-value = 0.25
alternative hypothesis: true p is not equal to 0.32
95 percent confidence interval:
0.2702489 0.5454815
sample estimates:
p
0.4
To conduct the same test “by hand,” we use
n <- 45
p_hat <- 18/n
p0 <- 0.32
SE <- sqrt((p0*(1-p0))/n)
(z_val <- (p_hat-p0)/SE )
[1] 1.150447
(p_val <- 2*(1-pnorm(z_val))) # since 2-sided and z_val > 0
[1] 0.2499596
2-sample test for equality of proportions without continuity
correction
data: c(18, 15) out of c(45, 27)
X-squared = 1.6448, df = 1, p-value = 0.1997
alternative hypothesis: two.sided
95 percent confidence interval:
-0.39138965 0.08027854
sample estimates:
prop 1 prop 2
0.4000000 0.5555556
To conduct the same test “by hand,” we use
n1 <- 45
n2 <- 27
p_hat1 <- 18/n1
p_hat2 <- 15/n2
p_pooled <- (p_hat1*n1 + p_hat2*n2)/(n1+n2)
SE <- sqrt((p_pooled*(1-p_pooled))/n1 + (p_pooled*(1-p_pooled))/n2)
(z_val <- (p_hat1-p_hat2)/SE)
[1] -1.28248
(p_val <- 2*pnorm(z_val)) # since 2-sided and z_val < 0
[1] 0.1996743
One Sample t-test
data: sim_data$x
t = 4.9179, df = 17, p-value = 0.0001302
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
1.414160 2.036512
sample estimates:
mean of x
1.725336
To conduct this test “by hand”, use
n <- length(sim_data$x)
mu_x <- mean(sim_data$x)
sd_x <- sd(sim_data$x)
SE <- sd_x/sqrt(n)
mu0 <- 1
(t_val <- (mu_x-mu0)/SE)
[1] 4.917873
(p_val <- 2*(1-pt(t_val,df=n-1))) # since 2-sided and t_val > 0
[1] 0.0001301682
t.test(sim_data$x,sim_data$y,paired = TRUE)
Paired t-test
data: sim_data$x and sim_data$y
t = -1.2347, df = 17, p-value = 0.2337
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.6445010 0.1686316
sample estimates:
mean difference
-0.2379347
To conduct this test “by hand”, use
xy_diff <- sim_data$x - sim_data$y
n <- length(xy_diff)
mu_diff <- mean(xy_diff)
sd_diff <- sd(xy_diff)
SE <- sd_diff/sqrt(n)
(t_val <- mu_diff/SE)
[1] -1.234727
(p_val <- 2*pt(t_val,df=n-1)) # since 2-sided and t_val < 0
[1] 0.2337271
Welch Two Sample t-test
data: sim_data$x and z
t = 0.91667, df = 37.426, p-value = 0.3652
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2292265 0.6082575
sample estimates:
mean of x mean of y
1.725336 1.535820
To do this test “by hand”, use
n1 <- length(sim_data$x)
n2 <- length(z)
mu_1 <- mean(sim_data$x)
sd_1 <- sd(sim_data$x)
mu_2 <- mean(z)
sd_2 <- sd(z)
mu_diff <- mu_1 - mu_2
SE <- sqrt(sd_1^2/n1 + sd_2^2/n2)
(t_val <- (mu_diff)/SE)
[1] 0.9166679
[1] 0.3721383
Note: The results are slightly different from what we see with the t.test
command. This is due to the fact that the t.test
implementation utilizes a pooled standard deviation which is a special topic that we did not cover in lecture.
For a single categorical variable, a one-way table is used to summarize the counts for how many times each level of the variable occurs as an observation in the sample data. For example,
table(my_data$drive_desc)
2-Wheel Drive, Front 2-Wheel Drive, Rear
264 276
4-Wheel Drive All Wheel Drive
180 358
Part-time 4-Wheel Drive
30
The addmargins
command totals the counts.
addmargins(table(my_data$drive_desc))
2-Wheel Drive, Front 2-Wheel Drive, Rear
264 276
4-Wheel Drive All Wheel Drive
180 358
Part-time 4-Wheel Drive Sum
30 1108
Instead of counts, we can also compute proportions.
proportions(table(my_data$drive_desc))
2-Wheel Drive, Front 2-Wheel Drive, Rear
0.23826715 0.24909747
4-Wheel Drive All Wheel Drive
0.16245487 0.32310469
Part-time 4-Wheel Drive
0.02707581
or with the total
addmargins(proportions(table(my_data$drive_desc)))
2-Wheel Drive, Front 2-Wheel Drive, Rear
0.23826715 0.24909747
4-Wheel Drive All Wheel Drive
0.16245487 0.32310469
Part-time 4-Wheel Drive Sum
0.02707581 1.00000000
A visualization that corresponds to a one-way table is provided by a bar plot.
gf_bar(~drive_desc,data=my_data)
Note that the x-axis labels are difficult to read. This can be addressed as follows.
gf_bar(~drive_desc,data=my_data) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
The proportions version of a bar plot is obtained with a mosaic plot.
ggplot(data = my_data) +
geom_mosaic(aes(x = product(drive_desc), fill=drive_desc))
or with easier to read labeling:
ggplot(data = my_data) +
geom_mosaic(aes(x = product(drive_desc), fill=drive_desc)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Sometimes we are interested in the association of two variables. There are three typical situations:
Two numerical variables. To study this, we use correlation, scatter plots, and regression.
One numerical variable (as response) and one categorical variable (as explanatory). To study this, we use grouped summaries, side-by-side boxplots, and two-sample t-tests (when categorical variable is binary) or ANOVA.
Two categorical variables. Here, we use two-way tables, mosaic plots, and chi-square tests.
Consider for example, the variables comb_mpg
(combined mileage) and engine_displacement
(engine displacement). We may want to know if fuel efficiency is realted to engine size.
The correlation of these two variables is
(R_val <- cor(my_data$engine_displacement,my_data$comb_mpg))
[1] -0.7027217
This value squared is R2:
R_val^2
[1] 0.4938177
Let’s examine the scatter plot for these variables:
# scatterplot for two numerical variables
gf_point(comb_mpg~engine_displacement,data=my_data)
The following code fits a linear model for these variables and returns the inferential statistics:
Call:
lm(formula = comb_mpg ~ engine_displacement, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-9.0920 -2.7068 -0.7068 1.9080 30.5539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.8624 0.3503 96.68 <2e-16 ***
engine_displacement -3.3852 0.1031 -32.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.588 on 1106 degrees of freedom
Multiple R-squared: 0.4938, Adjusted R-squared: 0.4934
F-statistic: 1079 on 1 and 1106 DF, p-value: < 2.2e-16
We can clean this output up using the commands from the broom
package. For example,
tidy(lm_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 33.9 0.350 96.7 0
2 engine_displacement -3.39 0.103 -32.8 1.03e-165
outputs the parameter estimates, standard errors, and p-values for the intercept and slope parameters. The glance
command provides additional information such as the R2 value and the degrees of freedom df.residual
(notice that this is n−2 where n is the number of observations).
glance(lm_fit)
# A tibble: 1 × 12
r.squared adj.r.s…¹ sigma stati…² p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.494 0.493 4.59 1079. 1.03e-165 1 -3259. 6524. 6539.
# … with 3 more variables: deviance <dbl>, df.residual <int>,
# nobs <int>, and abbreviated variable names ¹adj.r.squared,
# ²statistic
Residual plots are very helpful to assess potential issues with a linear fit. Here is code that produces a residual plot. Note that the first step is to obtain the residuals using the augment
function.
lm_fit %>% augment() %>%
gf_point(.resid~engine_displacement) %>% gf_hline(yintercept = 0,linetype="dashed")
Note that the residuals are not evenly distributed around the 0 line which indicates that a linear model is not appropriate for this data.
How does the distribution of a numerical variable depend on the category to which the samples belong? For example, how does the combined gas mileage depend on what kind of drive system a vehicle uses?
We can compute grouped summary statistics. For example, the mean gas mileage per class of drive system is computed as follows:
# A tibble: 5 × 2
drive_desc mean_comb_mpg
<chr> <dbl>
1 2-Wheel Drive, Front 29.9
2 2-Wheel Drive, Rear 20.6
3 4-Wheel Drive 20.0
4 All Wheel Drive 22.3
5 Part-time 4-Wheel Drive 21.3
A side-by-side boxplot allows us to see the distribution of a numerical variable by category.
gf_boxplot(comb_mpg~drive_desc,data=my_data)
or
gf_boxplot(comb_mpg~drive_desc,data=my_data) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
In order to test if the observed difference in means is likely true, or is simply a result of sampling, we use an ANOVA. This is conducted as follows:
Df Sum Sq Mean Sq F value Pr(>F)
drive_desc 4 16025 4006 147.4 <2e-16 ***
Residuals 1103 29970 27
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A two-way or cross table compares combinations of categories for two categorical variables. For example, conside the data set hsb2
from the openintro
package which records samples from the High School and Beyond survey, a survey conducted on high school seniors by the National Center of Education Statistics.
table(hsb2$ses,hsb2$prog)
general academic vocational
low 16 19 12
middle 20 44 31
high 9 42 7
We can also add the margin sum totals
addmargins(table(hsb2$ses,hsb2$prog))
general academic vocational Sum
low 16 19 12 47
middle 20 44 31 95
high 9 42 7 58
Sum 45 105 50 200
Here ses
is the socio economic status of student’s family and prog
is the type of program.
A mosaic plot can be used to visualize this data.
ggplot(data = hsb2) +
geom_mosaic(aes(x = product(ses, prog), fill=ses)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
We can ask if the variables ses
and prog
are independent. A chi-square test helps to answer this.
chisq.test(hsb2$ses,hsb2$prog)
Pearson's Chi-squared test
data: hsb2$ses and hsb2$prog
X-squared = 16.604, df = 4, p-value = 0.002307
In order to conduct hypothesis test “by hand”, it helpful to be able to work with the basic distribution functions in R. For each distribution there are four functions:
r_dist_name
which draws random samples
d_dist_name
which represents the density function or probability mass function
p_dist_name
which represents the cumulative distribution function, this is used to compute left-tail areas
q_dist_name
is the quantile function which is the inverse function of p_dist_name
Suppose for example that we want to work with a normal distribution. Then we replace _dist_name
with norm
. For example,
(sim_x <- rnorm(75,10,2.5))
[1] 14.014774 7.105479 11.641471 16.372478 9.913099 8.325916
[7] 9.980988 14.442711 7.153481 13.419568 13.323912 10.841182
[13] 10.017232 8.861328 9.083690 11.620716 15.175677 9.616504
[19] 6.523248 8.191046 10.645654 9.207352 9.555525 9.575015
[25] 6.569245 9.565532 12.125581 11.744022 11.374993 8.993170
[31] 9.521016 7.013680 9.867103 10.637990 14.264910 12.503783
[37] 8.761041 10.888876 7.163480 12.195509 12.432292 15.302793
[43] 11.036309 8.813204 10.164984 8.743806 7.935004 10.417473
[49] 7.759338 10.420463 10.887421 9.869737 9.510163 8.377326
[55] 7.225582 12.123186 10.055906 12.077852 6.889280 10.422566
[61] 11.682916 9.934309 9.521520 8.045233 15.145405 11.876254
[67] 14.560521 10.200149 8.421477 6.216780 8.409750 10.565754
[73] 12.534226 10.631875 7.070129
Let’s make a histogram of this data:
gf_histogram(~sim_x,binwidth = 1.0,color="black")
Here are illustrations of the uses of the other distribution functions:
The value of the standard normal density function at z=0 isdnorm(0.0)
[1] 0.3989423
Let’s plot the standard normal density curve:
gf_dist("norm")
The use of the p
and q
functions are illustrated with
pnorm(1.5)
[1] 0.9331928
and
qnorm(0.3)
[1] -0.5244005
Note Depending on the type of distribution you want to work with, you need to specify appropriate values for an parameters. For example, to work with a t distribution to need to specify the appropriate values for the degrees of freedom df
.
An even fancier summary of our data is provided by the skim
function from the skimr
package. Let’s see how it works.
skimr::skim(my_data)
Name | my_data |
Number of rows | 1108 |
Number of columns | 28 |
_______________________ | |
Column type frequency: | |
character | 20 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
mfr_name | 0 | 1.00 | 3 | 20 | 0 | 22 | 0 |
division | 0 | 1.00 | 3 | 32 | 0 | 41 | 0 |
carline | 0 | 1.00 | 2 | 32 | 0 | 788 | 0 |
mfr_code | 0 | 1.00 | 3 | 3 | 0 | 22 | 0 |
transmission_speed | 0 | 1.00 | 8 | 12 | 0 | 25 | 0 |
guzzler | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
air_aspir_method | 439 | 0.60 | 2 | 2 | 0 | 4 | 0 |
air_aspir_method_desc | 0 | 1.00 | 5 | 25 | 0 | 5 | 0 |
transmission | 0 | 1.00 | 1 | 3 | 0 | 7 | 0 |
transmission_desc | 0 | 1.00 | 6 | 65 | 0 | 7 | 0 |
trans_lockup | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
trans_creeper_gear | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
drive_sys | 0 | 1.00 | 1 | 1 | 0 | 5 | 0 |
drive_desc | 0 | 1.00 | 13 | 23 | 0 | 5 | 0 |
fuel_usage | 0 | 1.00 | 1 | 3 | 0 | 6 | 0 |
fuel_usage_desc | 0 | 1.00 | 28 | 42 | 0 | 6 | 0 |
class | 0 | 1.00 | 10 | 36 | 0 | 22 | 0 |
car_truck | 507 | 0.54 | 1 | 3 | 0 | 3 | 0 |
release_date | 0 | 1.00 | 20 | 20 | 0 | 159 | 0 |
fuel_cell | 856 | 0.23 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
model_yr | 0 | 1 | 2021.00 | 0.00 | 2021 | 2021 | 2021.0 | 2021.0 | 2021 | ▁▁▇▁▁ |
model_type_index | 0 | 1 | 259.36 | 271.47 | 1 | 40 | 114.0 | 507.0 | 999 | ▇▂▂▂▁ |
engine_displacement | 0 | 1 | 3.12 | 1.34 | 1 | 2 | 3.0 | 3.8 | 8 | ▇▇▂▂▁ |
no_cylinders | 0 | 1 | 5.60 | 1.92 | 3 | 4 | 6.0 | 6.0 | 16 | ▇▇▁▁▁ |
city_mpg | 0 | 1 | 20.89 | 6.60 | 8 | 17 | 20.0 | 23.0 | 58 | ▆▇▁▁▁ |
hwy_mpg | 0 | 1 | 27.30 | 6.42 | 13 | 23 | 27.0 | 31.0 | 60 | ▃▇▃▁▁ |
comb_mpg | 0 | 1 | 23.29 | 6.45 | 10 | 19 | 22.5 | 26.0 | 59 | ▃▇▂▁▁ |
no_gears | 0 | 1 | 7.50 | 1.87 | 1 | 6 | 8.0 | 8.0 | 10 | ▁▁▃▇▃ |
Essentially what this shows is that we have 20 categorical variables and 8 numerical variables. For each of the numerical variables, the skim
function reports all of the quantitative summary statistic values.
We might want to study the associations of multiple numerical variables simultaneously. The ggpairs
command from the GGally
package is a way to obtain a lot of information about several numerical varaibles with one plot. For example,
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".