R Commands

A brief overview of common R commands for Introduction to Statistics.

Author

Affiliation

Jason M. Graham

University of Scranton

Published

Not published yet.

DOI

Some Useful R Commands

This notebook illustrates some R commands that relate directly to the statistical concepts covered in MATH 204 Introduction to Statistics. The presentation is terse and there is little discussion on how to interpret results as these details are covered in lectures.

Loading Packages

It is necessary to load any package (other than base R) that contains commands that we want to use. Note that these packages have to be installed before they can be loaded. One can install packages using the “Packages” tab and then clicking the “Install” icon.

# load packages
library(tidyverse)
library(openintro)
library(ggformula)
library(ggmosaic)
library(broom)
library(GGally)
library(skimr)

Here is a brief overview on each of the packages we have loaded:

tidvyverse contains commands for working with data frames and creating plots with ggplot,
openintro is an R package that is associated with our textbook OpenIntro Statistics. Mostly it just contains data sets.
ggformula, ggmosaic, and GGally each add enhanced plotting capabilities,
broom has commands that help clean up output from linear models,
skimr contains a command called skim that prints out a nice data summary.

Now that we’ve loaded our packages, we can import some data.

Load the Data

Here we load some data, the EPA 2021 vehicle data.

# load data
my_data <- read.csv("data/epa2021data.csv")
# if you don't have the data file, uncomment and use:
#my_data <- epa2021

This data contains vehicle information from the EPA for 2021. The size of this data set is reported by the following command:

dim(my_data)

[1] 1108   28

We see that there are 1108 observations (rows) and 28 variables (columns).

It is a good idea to try to learn more about our data set. One helpful command along these lines is the summary command that prints out summary statistics for each variable in the data set.

summary(my_data)

    model_yr      mfr_name           division        
 Min.   :2021   Length:1108        Length:1108       
 1st Qu.:2021   Class :character   Class :character  
 Median :2021   Mode  :character   Mode  :character  
 Mean   :2021                                        
 3rd Qu.:2021                                        
 Max.   :2021                                        
   carline            mfr_code         model_type_index
 Length:1108        Length:1108        Min.   :  1.0   
 Class :character   Class :character   1st Qu.: 40.0   
 Mode  :character   Mode  :character   Median :114.0   
                                       Mean   :259.4   
                                       3rd Qu.:507.0   
                                       Max.   :999.0   
 engine_displacement  no_cylinders    transmission_speed
 Min.   :1.000       Min.   : 3.000   Length:1108       
 1st Qu.:2.000       1st Qu.: 4.000   Class :character  
 Median :3.000       Median : 6.000   Mode  :character  
 Mean   :3.125       Mean   : 5.604                     
 3rd Qu.:3.800       3rd Qu.: 6.000                     
 Max.   :8.000       Max.   :16.000                     
    city_mpg        hwy_mpg        comb_mpg       guzzler         
 Min.   : 8.00   Min.   :13.0   Min.   :10.00   Length:1108       
 1st Qu.:17.00   1st Qu.:23.0   1st Qu.:19.00   Class :character  
 Median :20.00   Median :27.0   Median :22.50   Mode  :character  
 Mean   :20.89   Mean   :27.3   Mean   :23.29                     
 3rd Qu.:23.00   3rd Qu.:31.0   3rd Qu.:26.00                     
 Max.   :58.00   Max.   :60.0   Max.   :59.00                     
 air_aspir_method   air_aspir_method_desc transmission      
 Length:1108        Length:1108           Length:1108       
 Class :character   Class :character      Class :character  
 Mode  :character   Mode  :character      Mode  :character  
                                                            
                                                            
                                                            
 transmission_desc     no_gears      trans_lockup      
 Length:1108        Min.   : 1.000   Length:1108       
 Class :character   1st Qu.: 6.000   Class :character  
 Mode  :character   Median : 8.000   Mode  :character  
                    Mean   : 7.501                     
                    3rd Qu.: 8.000                     
                    Max.   :10.000                     
 trans_creeper_gear  drive_sys          drive_desc       
 Length:1108        Length:1108        Length:1108       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
  fuel_usage        fuel_usage_desc       class          
 Length:1108        Length:1108        Length:1108       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
  car_truck         release_date        fuel_cell        
 Length:1108        Length:1108        Length:1108       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character

One benefit of the summary command is that it helps us to identify the type of each variable. The glimpse command from the pillar package (which is installed as part of tidyverse) is also helpful in this regard.

pillar::glimpse(my_data)

Rows: 1,108
Columns: 28
$ model_yr              <int> 2021, 2021, 2021, 2021, 2021, 2021, 20…
$ mfr_name              <chr> "Honda", "aston martin", "aston martin…
$ division              <chr> "Acura", "Aston Martin Lagonda Ltd", "…
$ carline               <chr> "NSX", "Vantage Manual", "Vantage V8",…
$ mfr_code              <chr> "HNX", "ASX", "ASX", "VGA", "VGA", "VG…
$ model_type_index      <int> 39, 5, 4, 5, 7, 6, 8, 73, 352, 350, 75…
$ engine_displacement   <dbl> 3.5, 4.0, 4.0, 5.2, 5.2, 5.2, 5.2, 2.0…
$ no_cylinders          <int> 6, 8, 8, 10, 10, 10, 10, 4, 6, 4, 16, …
$ transmission_speed    <chr> "Auto(AM-S9)", "Manual(M7)", "Auto(S8)…
$ city_mpg              <int> 21, 14, 18, 13, 14, 13, 14, 23, 22, 25…
$ hwy_mpg               <int> 22, 21, 24, 20, 23, 20, 23, 31, 30, 32…
$ comb_mpg              <int> 21, 17, 20, 16, 17, 16, 17, 26, 25, 28…
$ guzzler               <chr> "N", "N", "N", "Y", "Y", "Y", "Y", "N"…
$ air_aspir_method      <chr> "TC", "TC", "TC", NA, NA, NA, NA, "TC"…
$ air_aspir_method_desc <chr> "Turbocharged", "Turbocharged", "Turbo…
$ transmission          <chr> "AMS", "M", "SA", "AMS", "AMS", "AMS",…
$ transmission_desc     <chr> "Automated Manual- Selectable (e.g. Au…
$ no_gears              <int> 9, 7, 8, 7, 7, 7, 7, 7, 8, 8, 7, 7, 8,…
$ trans_lockup          <chr> "Y", "N", "Y", "N", "N", "N", "N", "N"…
$ trans_creeper_gear    <chr> "N", "N", "N", "N", "N", "N", "N", "N"…
$ drive_sys             <chr> "A", "R", "R", "A", "R", "A", "R", "A"…
$ drive_desc            <chr> "All Wheel Drive", "2-Wheel Drive, Rea…
$ fuel_usage            <chr> "GPR", "GP", "GP", "GPR", "GPR", "GPR"…
$ fuel_usage_desc       <chr> "Gasoline (Premium Unleaded Required)"…
$ class                 <chr> "Two Seaters", "Two Seaters", "Two Sea…
$ car_truck             <chr> "car", "car", "car", "car", "car", "ca…
$ release_date          <chr> "2021-02-01T00:00:00Z", "2020-08-10T00…
$ fuel_cell             <chr> "N", "N", "N", "N", "N", "N", "N", "N"…

The next step in a data analysis is to conduct a so-called exploratory data analysis (EDA). This allows us to assess the data for any problems and also helps to suggest what types of statistical or analytical questions can be asked and/or answered.

The first steps in EDA are to compute summary statistics for variables of interest and to build appropriate summary visualizations.

Summary Stats for a Numerical Variable

For example, the comb_mpg variable (combined mileage) is a continuous numerical variable so it makes sense to compute its mean, median, standard deviation, and variance.

(comb_mpg_mean <- mean(my_data$comb_mpg))

[1] 23.2852

(comb_mpg_median <- median(my_data$comb_mpg))

[1] 22.5

(comb_mpg_sd <- sd(my_data$comb_mpg))

[1] 6.445931

(comb_mpg_var <- var(my_data$comb_mpg))

[1] 41.55002

Visual Summaries for Numerical Variables

Histograms and boxplots are appropriate ways to provide a visual summary for a numerical variable. The following plotting commands illustrate some of the ways in which such plots can be obtained.

gf_histogram(~comb_mpg,data=my_data)

Let’s change the number of bins and also add color to show where the bins are.

gf_histogram(~comb_mpg,data=my_data,bins=25,color="black")

A boxplot is another way to visualize numerical data.

gf_boxplot(~comb_mpg,data=my_data) + coord_flip()

Inference for Proportions and Means

The following subsections illustrate the R commands for conducting a statistical hypothesis test for a single proportion, a difference of proportions, a single mean, paired means, and a difference of two means. Only the syntax is illustrated, refer to the lectures for more context.

Single Proportion

prop.test(18,45,p=0.32,correct = FALSE)


    1-sample proportions test without continuity correction

data:  18 out of 45, null probability 0.32
X-squared = 1.3235, df = 1, p-value = 0.25
alternative hypothesis: true p is not equal to 0.32
95 percent confidence interval:
 0.2702489 0.5454815
sample estimates:
  p 
0.4

To conduct the same test “by hand,” we use

n <- 45
p_hat <- 18/n
p0 <- 0.32
SE <- sqrt((p0*(1-p0))/n)
(z_val <- (p_hat-p0)/SE )

[1] 1.150447

(p_val <- 2*(1-pnorm(z_val))) # since 2-sided and z_val > 0

[1] 0.2499596

Difference of Proportions

prop.test(c(18,15),c(45,27),correct=FALSE)


    2-sample test for equality of proportions without continuity
    correction

data:  c(18, 15) out of c(45, 27)
X-squared = 1.6448, df = 1, p-value = 0.1997
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.39138965  0.08027854
sample estimates:
   prop 1    prop 2 
0.4000000 0.5555556

To conduct the same test “by hand,” we use

n1 <- 45
n2 <- 27
p_hat1 <- 18/n1
p_hat2 <- 15/n2
p_pooled <- (p_hat1*n1 + p_hat2*n2)/(n1+n2)
SE <- sqrt((p_pooled*(1-p_pooled))/n1 + (p_pooled*(1-p_pooled))/n2)
(z_val <- (p_hat1-p_hat2)/SE)

[1] -1.28248

(p_val <- 2*pnorm(z_val)) # since 2-sided and z_val < 0

[1] 0.1996743

Single Mean

sim_data <- tibble(x=rnorm(18,2,0.75),y=rnorm(18,2.2,0.75))

t.test(sim_data$x,mu=1)


    One Sample t-test

data:  sim_data$x
t = 4.9179, df = 17, p-value = 0.0001302
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
 1.414160 2.036512
sample estimates:
mean of x 
 1.725336

To conduct this test “by hand”, use

n <- length(sim_data$x)
mu_x <- mean(sim_data$x)
sd_x <- sd(sim_data$x)
SE <- sd_x/sqrt(n)
mu0 <- 1
(t_val <- (mu_x-mu0)/SE)

[1] 4.917873

(p_val <- 2*(1-pt(t_val,df=n-1))) # since 2-sided and t_val > 0

[1] 0.0001301682

Paired Samples

t.test(sim_data$x,sim_data$y,paired = TRUE)


    Paired t-test

data:  sim_data$x and sim_data$y
t = -1.2347, df = 17, p-value = 0.2337
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.6445010  0.1686316
sample estimates:
mean difference 
     -0.2379347

To conduct this test “by hand”, use

xy_diff <- sim_data$x - sim_data$y
n <- length(xy_diff)
mu_diff <- mean(xy_diff)
sd_diff <- sd(xy_diff)
SE <- sd_diff/sqrt(n)
(t_val <- mu_diff/SE)

[1] -1.234727

(p_val <- 2*pt(t_val,df=n-1)) # since 2-sided and t_val < 0

[1] 0.2337271

Difference of Means

z <- rnorm(22,2,0.75)

t.test(sim_data$x,z,paired=FALSE)


    Welch Two Sample t-test

data:  sim_data$x and z
t = 0.91667, df = 37.426, p-value = 0.3652
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2292265  0.6082575
sample estimates:
mean of x mean of y 
 1.725336  1.535820

To do this test “by hand”, use

n1 <- length(sim_data$x)
n2 <- length(z)
mu_1 <- mean(sim_data$x)
sd_1 <- sd(sim_data$x)
mu_2 <- mean(z)
sd_2 <- sd(z)
mu_diff <- mu_1 - mu_2
SE <- sqrt(sd_1^2/n1 + sd_2^2/n2)
(t_val <- (mu_diff)/SE)

[1] 0.9166679

(p_val <- 2*(1-pt(t_val,df=min(n1-1,n2-1)))) # since 2-sided and t_val > 0

[1] 0.3721383

Note: The results are slightly different from what we see with the t.test command. This is due to the fact that the t.test implementation utilizes a pooled standard deviation which is a special topic that we did not cover in lecture.

Summary Stats for a Categorical Variable

For a single categorical variable, a one-way table is used to summarize the counts for how many times each level of the variable occurs as an observation in the sample data. For example,

table(my_data$drive_desc)


   2-Wheel Drive, Front     2-Wheel Drive, Rear 
                    264                     276 
          4-Wheel Drive         All Wheel Drive 
                    180                     358 
Part-time 4-Wheel Drive 
                     30

The addmargins command totals the counts.

addmargins(table(my_data$drive_desc))


   2-Wheel Drive, Front     2-Wheel Drive, Rear 
                    264                     276 
          4-Wheel Drive         All Wheel Drive 
                    180                     358 
Part-time 4-Wheel Drive                     Sum 
                     30                    1108

Instead of counts, we can also compute proportions.

proportions(table(my_data$drive_desc))


   2-Wheel Drive, Front     2-Wheel Drive, Rear 
             0.23826715              0.24909747 
          4-Wheel Drive         All Wheel Drive 
             0.16245487              0.32310469 
Part-time 4-Wheel Drive 
             0.02707581

or with the total

addmargins(proportions(table(my_data$drive_desc)))


   2-Wheel Drive, Front     2-Wheel Drive, Rear 
             0.23826715              0.24909747 
          4-Wheel Drive         All Wheel Drive 
             0.16245487              0.32310469 
Part-time 4-Wheel Drive                     Sum 
             0.02707581              1.00000000

Visual Summaries for a Single Categorical Variable

A visualization that corresponds to a one-way table is provided by a bar plot.

gf_bar(~drive_desc,data=my_data)

Note that the x-axis labels are difficult to read. This can be addressed as follows.

gf_bar(~drive_desc,data=my_data) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

The proportions version of a bar plot is obtained with a mosaic plot.

ggplot(data = my_data) +
  geom_mosaic(aes(x = product(drive_desc), fill=drive_desc))

or with easier to read labeling:

ggplot(data = my_data) +
  geom_mosaic(aes(x = product(drive_desc), fill=drive_desc)) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Variable Associations

Sometimes we are interested in the association of two variables. There are three typical situations:

Two numerical variables. To study this, we use correlation, scatter plots, and regression.
One numerical variable (as response) and one categorical variable (as explanatory). To study this, we use grouped summaries, side-by-side boxplots, and two-sample t-tests (when categorical variable is binary) or ANOVA.
Two categorical variables. Here, we use two-way tables, mosaic plots, and chi-square tests.

Two Numeric Variables

Consider for example, the variables comb_mpg (combined mileage) and engine_displacement (engine displacement). We may want to know if fuel efficiency is realted to engine size.

The correlation of these two variables is

(R_val <- cor(my_data$engine_displacement,my_data$comb_mpg))

[1] -0.7027217

This value squared is :

R_val^2

[1] 0.4938177

Let’s examine the scatter plot for these variables:

# scatterplot for two numerical variables
gf_point(comb_mpg~engine_displacement,data=my_data)

If we want to include the regression line, it can be added as follows:

gf_point(comb_mpg~engine_displacement,data=my_data) %>%
  gf_lm()

The following code fits a linear model for these variables and returns the inferential statistics:

lm_fit <- lm(comb_mpg~engine_displacement,data=my_data)
summary(lm_fit)


Call:
lm(formula = comb_mpg ~ engine_displacement, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0920 -2.7068 -0.7068  1.9080 30.5539 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          33.8624     0.3503   96.68   <2e-16 ***
engine_displacement  -3.3852     0.1031  -32.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.588 on 1106 degrees of freedom
Multiple R-squared:  0.4938,    Adjusted R-squared:  0.4934 
F-statistic:  1079 on 1 and 1106 DF,  p-value: < 2.2e-16

We can clean this output up using the commands from the broom package. For example,

tidy(lm_fit)

# A tibble: 2 × 5
  term                estimate std.error statistic   p.value
  <chr>                  <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)            33.9      0.350      96.7 0        
2 engine_displacement    -3.39     0.103     -32.8 1.03e-165

outputs the parameter estimates, standard errors, and p-values for the intercept and slope parameters. The glance command provides additional information such as the value and the degrees of freedom df.residual (notice that this is where is the number of observations).

glance(lm_fit)

# A tibble: 1 × 12
  r.squared adj.r.s…¹ sigma stati…²   p.value    df logLik   AIC   BIC
      <dbl>     <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.494     0.493  4.59   1079. 1.03e-165     1 -3259. 6524. 6539.
# … with 3 more variables: deviance <dbl>, df.residual <int>,
#   nobs <int>, and abbreviated variable names ¹adj.r.squared,
#   ²statistic

Residual plots are very helpful to assess potential issues with a linear fit. Here is code that produces a residual plot. Note that the first step is to obtain the residuals using the augment function.

lm_fit %>% augment() %>%
  gf_point(.resid~engine_displacement) %>% gf_hline(yintercept = 0,linetype="dashed")

Note that the residuals are not evenly distributed around the 0 line which indicates that a linear model is not appropriate for this data.

Numerical Response and Categorical Explanatory

How does the distribution of a numerical variable depend on the category to which the samples belong? For example, how does the combined gas mileage depend on what kind of drive system a vehicle uses?

We can compute grouped summary statistics. For example, the mean gas mileage per class of drive system is computed as follows:

my_data %>% group_by(drive_desc) %>% summarise(mean_comb_mpg=mean(comb_mpg))

# A tibble: 5 × 2
  drive_desc              mean_comb_mpg
  <chr>                           <dbl>
1 2-Wheel Drive, Front             29.9
2 2-Wheel Drive, Rear              20.6
3 4-Wheel Drive                    20.0
4 All Wheel Drive                  22.3
5 Part-time 4-Wheel Drive          21.3

A side-by-side boxplot allows us to see the distribution of a numerical variable by category.

gf_boxplot(comb_mpg~drive_desc,data=my_data)

gf_boxplot(comb_mpg~drive_desc,data=my_data) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Inference with ANOVA

In order to test if the observed difference in means is likely true, or is simply a result of sampling, we use an ANOVA. This is conducted as follows:

summary(aov(comb_mpg~drive_desc,data=my_data))

              Df Sum Sq Mean Sq F value Pr(>F)    
drive_desc     4  16025    4006   147.4 <2e-16 ***
Residuals   1103  29970      27                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Association for a Pair of Categorical Variables

A two-way or cross table compares combinations of categories for two categorical variables. For example, conside the data set hsb2 from the openintro package which records samples from the High School and Beyond survey, a survey conducted on high school seniors by the National Center of Education Statistics.

table(hsb2$ses,hsb2$prog)

        
         general academic vocational
  low         16       19         12
  middle      20       44         31
  high         9       42          7

We can also add the margin sum totals

addmargins(table(hsb2$ses,hsb2$prog))

        
         general academic vocational Sum
  low         16       19         12  47
  middle      20       44         31  95
  high         9       42          7  58
  Sum         45      105         50 200

Here ses is the socio economic status of student’s family and prog is the type of program.

A mosaic plot can be used to visualize this data.

ggplot(data = hsb2) +
  geom_mosaic(aes(x = product(ses, prog), fill=ses)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Inference with chi-square

We can ask if the variables ses and prog are independent. A chi-square test helps to answer this.

chisq.test(hsb2$ses,hsb2$prog)


    Pearson's Chi-squared test

data:  hsb2$ses and hsb2$prog
X-squared = 16.604, df = 4, p-value = 0.002307

Working with Distributions.

In order to conduct hypothesis test “by hand”, it helpful to be able to work with the basic distribution functions in R. For each distribution there are four functions:

r_dist_name which draws random samples
d_dist_name which represents the density function or probability mass function
p_dist_name which represents the cumulative distribution function, this is used to compute left-tail areas
q_dist_name is the quantile function which is the inverse function of p_dist_name

Suppose for example that we want to work with a normal distribution. Then we replace _dist_name with norm. For example,

(sim_x <- rnorm(75,10,2.5))

 [1] 14.014774  7.105479 11.641471 16.372478  9.913099  8.325916
 [7]  9.980988 14.442711  7.153481 13.419568 13.323912 10.841182
[13] 10.017232  8.861328  9.083690 11.620716 15.175677  9.616504
[19]  6.523248  8.191046 10.645654  9.207352  9.555525  9.575015
[25]  6.569245  9.565532 12.125581 11.744022 11.374993  8.993170
[31]  9.521016  7.013680  9.867103 10.637990 14.264910 12.503783
[37]  8.761041 10.888876  7.163480 12.195509 12.432292 15.302793
[43] 11.036309  8.813204 10.164984  8.743806  7.935004 10.417473
[49]  7.759338 10.420463 10.887421  9.869737  9.510163  8.377326
[55]  7.225582 12.123186 10.055906 12.077852  6.889280 10.422566
[61] 11.682916  9.934309  9.521520  8.045233 15.145405 11.876254
[67] 14.560521 10.200149  8.421477  6.216780  8.409750 10.565754
[73] 12.534226 10.631875  7.070129

Let’s make a histogram of this data:

gf_histogram(~sim_x,binwidth = 1.0,color="black")

Here are illustrations of the uses of the other distribution functions:

The value of the standard normal density function at

dnorm(0.0)

[1] 0.3989423

Let’s plot the standard normal density curve:

gf_dist("norm")

The use of the p and q functions are illustrated with

pnorm(1.5)

[1] 0.9331928

and

qnorm(0.3)

[1] -0.5244005

Note Depending on the type of distribution you want to work with, you need to specify appropriate values for an parameters. For example, to work with a t distribution to need to specify the appropriate values for the degrees of freedom df.

Two Bonus Commands

Data Summary with skim

An even fancier summary of our data is provided by the skim function from the skimr package. Let’s see how it works.

skimr::skim(my_data)

Table 1: Data summary
Name	my_data
Number of rows	1108
Number of columns	28
_______________________
Column type frequency:
character	20
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
mfr_name	0	1.00	3	20	22
division	0	1.00	3	32	41
carline	0	1.00	2	32	788
mfr_code	0	1.00	3	3	22
transmission_speed	0	1.00	8	12	25
guzzler	0	1.00	1	1	2
air_aspir_method	439	0.60	2	2	4
air_aspir_method_desc	0	1.00	5	25	5
transmission	0	1.00	1	3	7
transmission_desc	0	1.00	6	65	7
trans_lockup	0	1.00	1	1	2
trans_creeper_gear	0	1.00	1	1	2
drive_sys	0	1.00	1	1	5
drive_desc	0	1.00	13	23	5
fuel_usage	0	1.00	1	3	6
fuel_usage_desc	0	1.00	28	42	6
class	0	1.00	10	36	22
car_truck	507	0.54	1	3	3
release_date	0	1.00	20	20	159
fuel_cell	856	0.23	1	1	1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
model_yr	1	2021.00	0.00	2021	2021	2021.0	2021.0	2021	▁▁▇▁▁
model_type_index	1	259.36	271.47	1	40	114.0	507.0	999	▇▂▂▂▁
engine_displacement	1	3.12	1.34	1	2	3.0	3.8	8	▇▇▂▂▁
no_cylinders	1	5.60	1.92	3	4	6.0	6.0	16	▇▇▁▁▁
city_mpg	1	20.89	6.60	8	17	20.0	23.0	58	▆▇▁▁▁
hwy_mpg	1	27.30	6.42	13	23	27.0	31.0	60	▃▇▃▁▁
comb_mpg	1	23.29	6.45	10	19	22.5	26.0	59	▃▇▂▁▁
no_gears	1	7.50	1.87	1	6	8.0	8.0	10	▁▁▃▇▃

Essentially what this shows is that we have 20 categorical variables and 8 numerical variables. For each of the numerical variables, the skim function reports all of the quantitative summary statistic values.

Mulitple Scatter Plots with ggpairs

We might want to study the associations of multiple numerical variables simultaneously. The ggpairs command from the GGally package is a way to obtain a lot of information about several numerical varaibles with one plot. For example,

GGally::ggpairs(my_data[,c(6,7,8,12)])

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".