R for Hockey Analysis — Part 2: Tidyverse Basics

This is the second part of a series where I’ll show you some ways in which you can use R to analyze hockey data. If you haven’t seen the first tutorial, where I talk about R/RStudio installation and R basics, I think it’d be a good idea to check it out.

In this tutorial, we’ll play around with real 2018 NHL draft data, scraped from EliteProspects using my recently released R package, elite.

But before we get started, we’re going to have to install the tidyverse.

Installing Packages

The lingo for installing packages in R is to run:

install.packages("name_of_the_package")

where you put the “name of the package” between quotes. Note that this works for packages housed on CRAN, which is a repository for R packages. My package, “elite”, isn’t on CRAN, so installation for “elite” and similar packages is slightly different, for whatever it’s worth.

Anyways, to install the tidyverse, we run:

install.packages("tidyverse")

You should see a lot of words in your console. That’s good! When that’s all finished, you should see something along the lines of:

package ‘tidyverse’ successfully unpacked...

If that’s what you see, great! If not, try again and possibly google-search an error message if R provided you with one.

Loading Packages

Once you install a package, you need to load it to actually use it.

Loading packages is nearly identical to installing packages. The lingo is:

library(name_of_the_package)

Where “name_of_the_package” is the name of the package you’re trying to load. Note that you don’t need quotes around the package name, but it would still work if you used quotes. Personally, I don’t use quotes around the package names for loading packages.
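To see that the quoted and unquoted forms behave the same, here’s a quick sketch. It uses stats, a package that ships with R, purely as a stand-in so nothing needs to be installed; swap in whatever package you’re actually loading.

```r
# "stats" ships with every R installation; it is only a stand-in here.
library(stats)    # unquoted package name
library("stats")  # quoted package name does the same thing

# search() lists what is currently attached; a loaded package shows up
# with a "package:" prefix.
stats_attached <- "package:stats" %in% search()
stats_attached
```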

To load the tidyverse, you can run:

library(tidyverse)

And your output should look something like this:

> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.0.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.6
v tidyr   0.8.1     v stringr 1.3.1
v readr   1.1.1     v forcats 0.2.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning messages:
1: package ‘tidyverse’ was built under R version 3.4.4
2: package ‘ggplot2’ was built under R version 3.4.4
3: package ‘tidyr’ was built under R version 3.4.4
4: package ‘purrr’ was built under R version 3.4.4
5: package ‘dplyr’ was built under R version 3.4.4
6: package ‘stringr’ was built under R version 3.4.4

It’s possible that you’ll get some warning messages like the ones above; if so, that’s completely fine. R is just telling you that the tidyverse packages were built under a newer version of R than the one you’re running.
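If you’re curious which versions you have, a small sketch like this prints them. It assumes nothing beyond base R; the stats package is used as the example only because it’s always installed, and you could pass "tidyverse" instead once that’s installed.

```r
# Version of R you are currently running.
r_version <- getRversion()
r_version

# Version of an installed package ("stats" is just an always-available example).
pkg_version <- packageVersion("stats")
pkg_version

# Versions can be compared directly, e.g. to see whether R is older than 3.4.4.
r_version < "3.4.4"
```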

Wait — what the h*ck is the tidyverse?

I’m glad you asked! The tidyverse is a collection of packages that make working with data in R a lot easier than it could be. Each of the packages works in conjunction with the others, so when you load tidyverse, you actually load eight different packages at once.

The eight packages are:

  • ggplot2 — for visualizations
  • purrr — for working with functions
  • tibble — for working with a type of data frame called the tibble
  • dplyr — for data manipulation
  • tidyr — for reshaping your data
  • stringr — for working with strings
  • readr — for loading data
  • forcats — for working with a type of vector called “factors”

Awesome, so… what’s next?

Reading in Data

To use a dataset in R, you need to “read” it in some way. In this situation, “reading” means loading the dataset in R.

Bear in mind that, when you load a dataset in R, you’re never editing the actual dataset. By that, I mean that the original dataset will be unchanged no matter what you do in R.
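Here’s a small sketch of that idea, assuming a made-up two-player dataset saved to a temporary file (the player labels and numbers are placeholders, not real draft data). It uses base R’s saveRDS() and readRDS(), which handle the same “.rds” format we’ll be loading in a moment.

```r
# Save a tiny, made-up dataset to a temporary .rds file.
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(player = c("Player A", "Player B"), goals = c(10, 20)), path)

# Read it into R and change the copy that lives in R.
in_r <- readRDS(path)
in_r$goals <- in_r$goals * 2
in_r$goals

# Read the file again: the values on disk are untouched.
on_disk <- readRDS(path)
on_disk$goals
```

The file only changes if you explicitly write over it, for example by saving to the same path again.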

To read in data, I use functions from the readr package, which is part of the tidyverse. These functions begin with read_. For example, read_csv, read_delim, and read_rds each read in their respective file formats. For this tutorial, we’ll be reading in a file of the “.rds” format. And don’t worry if any of this file formatting talk is confusing; this is only a quick introduction.

So, let’s load the data. I made the dataset available on my GitHub. Download it, and copy/drag the dataset to your working directory. In case you forgot, the working directory, which you set in the last tutorial, is the folder where R looks for the files you want to load.
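If you’re not sure the file landed in the right place, a quick check like the sketch below can help before you try to load it. The file name is the one used in this tutorial; adjust it if you saved the file under a different name.

```r
# Where is R currently looking for files?
working_dir <- getwd()
working_dir

# Which .rds files are in that folder?
list.files(pattern = "\\.rds$")

# TRUE if the draft dataset is where R expects it, FALSE otherwise.
file_found <- file.exists("2018_nhl_draft_data.rds")
file_found
```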

Once you’ve copied the dataset to the working directory, we can load it into R with the read_rds() function. Let’s name this dataset draft_data so that it’s easier to manipulate later. Remember, to name something in R, we can use the assignment operator (<-).

Here’s the code to run:

draft_data <- read_rds("2018_nhl_draft_data.rds")

Great! And to check on the look of the data, let’s run draft_data:

> draft_data
# A tibble: 217 x 16
   draft_league draft_year pick_number round draft_team name
   <chr>        <chr>            <dbl> <dbl> <chr>      <chr>
 1 NHL Entry D~ 2018                 1     1 Buffalo S~ Rasm~
 2 NHL Entry D~ 2018                 2     1 Carolina ~ Andr~
 3 NHL Entry D~ 2018                 3     1 Montréal ~ Jesp~
 4 NHL Entry D~ 2018                 4     1 Ottawa Se~ Brad~
 5 NHL Entry D~ 2018                 5     1 Arizona C~ Barr~
 6 NHL Entry D~ 2018                 6     1 Detroit R~ Fili~
 7 NHL Entry D~ 2018                 7     1 Vancouver~ Quin~
 8 NHL Entry D~ 2018                 8     1 Chicago B~ Adam~
 9 NHL Entry D~ 2018                 9     1 New York ~ Vita~
10 NHL Entry D~ 2018                10     1 Edmonton ~ Evan~
# ... with 207 more rows, and 10 more variables:
#   position <chr>, shot_handedness <chr>, birth_place <chr>,
#   birth_country <chr>, birthday <chr>, height <dbl>,
#   weight <dbl>, age <dbl>, player_url <chr>,
#   player_statistics <list>

Woah. Wait. What? What’s a tibble?

Tibble

Remember the last tutorial where I wrote about the data frame? If you forgot, a data frame is basically a spreadsheet of data; it’s a collection of vectors of equal length.

A tibble is a fancy type of data frame. The tibble has many advantages over the typical R data frame (created with data.frame()). In brief, the tibble is cleaner and more consistent than the typical R data frame is. From here on out, I’ll almost always be using tibbles rather than the typical R data frame.
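As one example of that consistency, here’s a minimal sketch comparing the two, assuming the tidyverse (or at least the tibble package) is installed. The player labels are placeholders. Pulling out a single column with [ , ] quietly turns a regular data frame into a plain vector, while a tibble stays a tibble.

```r
library(tibble)

# The same made-up data as a regular data frame and as a tibble.
df <- data.frame(player = c("A", "B", "C"), goals = c(16, 27, 25))
tb <- tibble(player = c("A", "B", "C"), goals = c(16, 27, 25))

# A tibble is still a data frame underneath.
is.data.frame(tb)

# Selecting one column: the data frame drops down to a plain vector...
class(df[, "goals"])

# ...while the tibble keeps its shape.
class(tb[, "goals"])
```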

Now, that’s great. But what are the <chr> and <dbl> near the top of the tibble?

Vector Classes

Do you remember, in the last tutorial, when I taught you how to create a vector? We used the function c() to do this.

In case you forgot, here’s a vector of “goals” scored by three Rangers last season. Try running this:

goals <- c(16, 27, 25)

Cool. So what?

In R, vectors have classes, and these classes have implications for what you can do with those vectors.

Last tutorial, we also had a vector called assists for those same players. Let’s run this:

assists <- c(37, 20, 19)

And we added those 2 vectors together to get points, right?

> points <- goals + assists
> points
[1] 53 47 44

That vector addition works because goals and assists are both numeric (also known as double) vectors, and R knows how to add numbers together.

There are many types of vector classes in R, but the 3 vector classes that you’ll see most often are numeric vectors, character vectors, and logical vectors.

Numeric Vectors

  • Any numbers
  • EX: 1, 1.0001, 0

Character Vectors

  • Anything with quotes (“ “)
  • EX: "hockey", "NY rangers", "lias andersson > casey mittelstadt"

Logical Vectors

  • Generally, either a TRUE or a FALSE
  • EX: TRUE, FALSE

To see what class a vector is, you can use the function class().

Try these:

> class(c(1, 2, 3))
[1] "numeric"
> class(c(TRUE, FALSE))
[1] "logical"
> class(c("casey mittelstadt is not that amazing", "TRUE"))
[1] "character"

And, in the last tutorial, vector addition worked with the goals and assists vectors because both vectors are of class numeric. It wouldn’t work if one vector were of class character and the other of class numeric.

We can even test this:

> goals <- c(16, 27, 25)
> goals
[1] 16 27 25
> names <- c("Mats Zuccarello", "Mika Zibanejad", "Kevin Hayes")
> names
[1] "Mats Zuccarello" "Mika Zibanejad"
[3] "Kevin Hayes"
> goals + names
Error in goals + names : non-numeric argument to binary operator

While vector classes aren’t hugely important right now, they are often the cause of some very annoying errors in R — as you can see above — and it’s best to try and understand the concept of vector classes right away.
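One habit that can save you some of that annoyance is checking classes before doing math. Here’s a small sketch, reusing the goals and names vectors from above. The second one is called player_names here (a label chosen for this sketch) to avoid clashing with R’s built-in names() function.

```r
goals <- c(16, 27, 25)
player_names <- c("Mats Zuccarello", "Mika Zibanejad", "Kevin Hayes")

# is.numeric() answers TRUE or FALSE, so you can check before adding.
is.numeric(goals)
is.numeric(player_names)

# Mixing numbers and text inside c() silently turns everything into text,
# which is a common way to end up with a character vector by accident.
mixed <- c(16, "27", 25)
class(mixed)
```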

And let’s go back to the dataset — the tibble — that we loaded earlier, and see what it looks like:

> draft_data
# A tibble: 217 x 16
   draft_league draft_year pick_number round draft_team name
   <chr>        <chr>            <dbl> <dbl> <chr>      <chr>
 1 NHL Entry D~ 2018                 1     1 Buffalo S~ Rasm~
 2 NHL Entry D~ 2018                 2     1 Carolina ~ Andr~
 3 NHL Entry D~ 2018                 3     1 Montréal ~ Jesp~
 4 NHL Entry D~ 2018                 4     1 Ottawa Se~ Brad~
 5 NHL Entry D~ 2018                 5     1 Arizona C~ Barr~
 6 NHL Entry D~ 2018                 6     1 Detroit R~ Fili~
 7 NHL Entry D~ 2018                 7     1 Vancouver~ Quin~
 8 NHL Entry D~ 2018                 8     1 Chicago B~ Adam~
 9 NHL Entry D~ 2018                 9     1 New York ~ Vita~
10 NHL Entry D~ 2018                10     1 Edmonton ~ Evan~
# ... with 207 more rows, and 10 more variables:
#   position <chr>, shot_handedness <chr>, birth_place <chr>,
#   birth_country <chr>, birthday <chr>, height <dbl>,
#   weight <dbl>, age <dbl>, player_url <chr>,
#   player_statistics <list>

Do you notice the <chr> and <dbl>? Those are the vector classes! That’s part of why the tibble is nice: it tells you the class of each column right when you print it.
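Notice, for instance, that draft_year is stored as <chr> even though it looks like a number. Here’s a minimal sketch of how you might check and convert a column like that, using a small made-up data frame rather than the real draft_data:

```r
# A tiny stand-in for draft_data (values are placeholders).
small_draft <- data.frame(
  draft_year = c("2018", "2018", "2018"),
  pick_number = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Class of every column at once.
sapply(small_draft, class)

# Math on a character column fails, so convert it to numeric first.
small_draft$draft_year <- as.numeric(small_draft$draft_year)
class(small_draft$draft_year)
small_draft$draft_year + 1
```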

Awesome. Vector classes and data frames are a bit complicated, so it’s totally fine if you don’t get it right away. But if you do, fantastic! Now let’s finally get onto the best parts of the tidyv

Functions

erse… Sorry. Let’s talk about functions, what they are, and what you need to know about them at this moment. Sorry folks, we need to do this.

A function is a verb. It’s a way of saying “do this” or “get that” or “calculate these”. You supply an object — whether that be a data frame or a vector or something else — and the function does something.

There are literally thousands of functions in R. You can even create your own — I’ll show you how to do that in a later tutorial.

One function you’ve already used in this tutorial is c(), and that creates vectors. Another function you’ve used is class(). You’re basically an expert!

Now, what goes into a function? How do you know what you should be typing inside of c() to get it to work?

What you type inside of a function is called an argument, and the arguments that you need to supply a function depend on what the function is and what you want that function to do.
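For a quick feel of how arguments work, here’s a short sketch using sort(), a base R function that hasn’t come up yet, on the goals vector from earlier. The first argument is the object to act on; decreasing is an optional argument that changes what the function does.

```r
goals <- c(16, 27, 25)

# One argument: the vector to sort (smallest to largest by default).
sort(goals)

# A second, named argument changes the behavior.
most_first <- sort(goals, decreasing = TRUE)
most_first

# args() shows which arguments a function accepts.
args(sort)
```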

Let’s take a quick look at the arguments needed for class(). We can do this by looking at the documentation for the function. Trust me, there’s a reason behind all of this — I promise!

To look at the documentation for a function, you can run ?function_name, where function_name is the name of the function. So, to view the documentation for class(), you can run:

?class

This is what should pop up in the viewer (the bottom right quadrant):