Sports Analytics in Practice with R. Ted Kwartler
an optional aesthetic is added declaring the `size` for each dot in the scatterplot. The code below adds another aspect to the `qplot` to improve the overall look. Specifically, another “layer” is added from the `ggthemes` library to adjust many parameters within a single function call. Here, the `theme_hc` function, called with no arguments, emulates the popular “Highcharts” JavaScript theme. As is standard with `ggplot2` objects, additional parameters such as aesthetics are added in layers using the `+` sign. This is not the arithmetic addition sign but an operator that appends layers to `ggplot` objects. Figure 1.8 is the result of the `qplot` and `theme_hc` adjustment using the Dallas basketball data to explore the relationship between average minutes per game and average points per game.
Figure 1.8 As expected, the more minutes a player averages, the higher the average points.
qplot(x = MPG_min_per_game, y = POINTS_PER_GAME, size = 5, data = nbaData) + theme_hc()
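To make the layering idea concrete, here is a minimal sketch (not the book's code) that builds a comparable scatterplot with the fuller `ggplot` interface, one layer at a time. The `toyData` data frame and its values are invented stand-ins for `nbaData`; only the column names are borrowed from the call above. Note that `size = 5` is set here as a constant on the point layer rather than declared as an aesthetic.

```r
library(ggplot2)
library(ggthemes)

# Hypothetical stand-in for nbaData; the values are made up for illustration
toyData <- data.frame(
  MPG_min_per_game = c(12.5, 24.1, 33.6),
  POINTS_PER_GAME  = c(4.2, 11.8, 28.7)
)

# Base layer: declare the data and which columns map to the axes
basePlot <- ggplot(data = toyData,
                   mapping = aes(x = MPG_min_per_game, y = POINTS_PER_GAME))

# Each `+` appends a layer: first the points, then the Highcharts-style theme
scatterPlot <- basePlot + geom_point(size = 5) + theme_hc()

scatterPlot  # printing the object draws the plot
```

Storing the intermediate `basePlot` object is optional; it is kept here only to show that each `+` returns a new plot object that can itself be extended.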
Let’s add a bit more complexity to the visualization by creating a heatmap. The heatmap chart has x and y axes but represents data amounts as color intensity. A heatmap allows the audience to comprehend complex data quickly and concisely. To begin, let’s use the `data.frame` function to create a smaller data set. Here, the column names are being renamed and each individual column from the `nbaData` object is explicitly selected. The new object has the same number of rows but a subset of the columns. There are additional functions to perform this operation but this is straightforward. As the book continues, more concise though complex examples will perform the same operation.
smallerStats <- data.frame(player = nbaData$ï.PLAYER, FTA = nbaData$FTA_free_throws_attempted, TWO_PA = nbaData$TWO_PA, THREE_PA = nbaData$THREE_PA)
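As one illustration of the “additional functions” mentioned above, the sketch below selects columns by name with bracket indexing and then renames them. It uses an invented `toyData` data frame with a simplified `PLAYER` column name; both the values and that column name are assumptions, not the book's data.

```r
# Hypothetical stand-in for nbaData with invented values
toyData <- data.frame(
  PLAYER = c("Player A", "Player B"),
  FTA_free_throws_attempted = c(120, 45),
  TWO_PA = c(300, 150),
  THREE_PA = c(210, 60),
  MPG_min_per_game = c(33.6, 18.2)
)

# Select a subset of columns by name
keepCols <- c("PLAYER", "FTA_free_throws_attempted", "TWO_PA", "THREE_PA")
smallerToy <- toyData[, keepCols]

# Rename two of the selected columns
names(smallerToy)[names(smallerToy) == "PLAYER"] <- "player"
names(smallerToy)[names(smallerToy) == "FTA_free_throws_attempted"] <- "FTA"

dim(toyData)     # 2 rows, 5 columns
dim(smallerToy)  # 2 rows, 4 columns: same rows, fewer columns
```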
In order to construct a heatmap with `ggplot2`, the `smallerStats` data frame must be rearranged into a “tidy” format. This type of data organization can be difficult to comprehend for novice R programmers, but the main point is that the data is not being changed, merely rearranged. The `tidyr` library function `pivot_longer` accepts the data frame first. Next, the `cols` parameter defines which columns to pivot. In this case, `-c(player)` selects every column except `player`, so the three statistic columns are pivoted while the `player` column is kept as the identifier. This will result in each player’s name being repeated and two new columns being created. These columns are named in the function with `names_to` and `values_to`, respectively. In the end, each player and corresponding statistic name and value are captured as a row. Whereas the `smallerStats` data frame had 19 observations with 4 columns, the `nbaDataLong` object, which has been pivoted around the `player` column, has 57 rows (19 players × 3 statistics) and 3 columns. After the pivot, the `head` function is executed to demonstrate the difference.
nbaDataLong <- pivot_longer(data = smallerStats, cols = -c(player), names_to = "stat", values_to = "value")
head(nbaDataLong)
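The rearrangement is easier to see on a tiny example. The following sketch applies the same `pivot_longer` call to an invented two-player data frame (`wideToy`, with made-up values) so the before and after dimensions can be checked by hand.

```r
library(tidyr)

# Hypothetical wide data: one row per player, one column per statistic
wideToy <- data.frame(
  player   = c("Player A", "Player B"),
  FTA      = c(120, 45),
  TWO_PA   = c(300, 150),
  THREE_PA = c(210, 60)
)

# Pivot every column except `player`; names go to "stat", values to "value"
longToy <- pivot_longer(data = wideToy, cols = -c(player),
                        names_to = "stat", values_to = "value")

dim(wideToy)  # 2 rows, 4 columns
dim(longToy)  # 6 rows, 3 columns: 2 players x 3 statistics
longToy
```

No value is altered by the pivot: the six numbers in `wideToy` are the same six numbers in the `value` column of `longToy`, each now labeled by its player and statistic name.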
Now that the data has been rearranged, it will be readily accepted by the `ggplot` function. Instead of the previous `qplot` function, now the more expansive `ggplot` function is called. The first parameter is the `data` object. The next parameter is the `mapping` aesthetics. This is a multi-part input declared with yet another function, `aes`. Within the `aes` function, the column names to be plotted are defined. Specifically, the x-axis column name, `stat`, is followed by the y-axis column name, `player`, and finally the fill value, which corresponds to the `value` column. Thus, the visual is set up so that player statistics are arranged on the x-axis, each individual player is a single row along the y-axis, and the color intensity is scaled by the player’s corresponding statistical value. Once the base layer plot has been defined, another layer is added with the `+` sign to declare the type of plot needed. In this case, the heatmap is called using `geom_tile`. In subsequent chapters, additional visuals are illustrated, including ggplot2 and more dynamic interactive graphics. Since this text requires gray-scale graphics, another layer is added to define the color intensity between `lightgrey` and `black`. Finally, another layer is added to retitle the x-axis label as “Scoring Statistics,” encased in quotes because it is a label, not an object or column name. For simplicity, this is captured in an object called `heatPlot`.
heatPlot <- ggplot(data = nbaDataLong, mapping = aes(x = stat, y = player, fill = value)) + geom_tile() + scale_fill_gradient(low="lightgrey", high="black") + xlab(label = "Scoring Statistics")
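For readers without the Dallas data at hand, the same layers can be tried on a small invented data set. This is a sketch under assumed values: `longToy` and its numbers are hypothetical, and the plot is written with one layer per line only for readability.

```r
library(ggplot2)

# Hypothetical long-format data: every player/statistic pair is one row
longToy <- data.frame(
  player = rep(c("Player A", "Player B"), each = 3),
  stat   = rep(c("FTA", "TWO_PA", "THREE_PA"), times = 2),
  value  = c(120, 300, 210, 45, 150, 60)
)

toyHeat <- ggplot(data = longToy,
                  mapping = aes(x = stat, y = player, fill = value)) +
  geom_tile() +                                             # one tile per row
  scale_fill_gradient(low = "lightgrey", high = "black") +  # gray-scale fill
  xlab(label = "Scoring Statistics")                        # x-axis title

toyHeat  # a 2 x 3 grid; darker tiles mark larger values
```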
Although calling `heatPlot` now in the console will create the visual, some additional layers can be added. First, a predefined theme for Highcharts is added, just as before, using `theme_hc`. Next, a chart title is declared with `ggtitle` along with the quoted “Dallas Team Offensive Stats.” Lastly, a `theme` is appended as the final layer that simply removes the legend altogether. Now when the `heatPlot` object is called, a clean, visually compelling plot is created that clearly shows the most offensively productive player for the three statistics on the team. Additionally, other players’ strengths in these statistics are easily understood because their sections are darker compared to teammates. Conversely, weaker players in these stats have a lighter color. These facts are more quickly understood in a visual than by reviewing a table of player data. The result of the `heatPlot` object is shown in Figure 1.9.
Figure 1.9 The Dallas team statistics represented in a heatmap illustrating the most impactful players among these statistics in the 2019–2020 regular NBA season.
heatPlot <- heatPlot + theme_hc() + ggtitle('Dallas Team Offensive Stats') + theme(legend.position = "none")
There are multiple ways to extend the lessons of this chapter to improve R coding fluency. For example, the data itself can be explored further or subset by position. Additional visualizations are also possible although many of these topics are covered in subsequent chapters with expanded explanations.
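As a starting point for the subsetting exercise, the sketch below filters and summarizes an invented data frame by position. The `POSITION` column name and all values are assumptions made for illustration; the column in the actual data set may be named differently.

```r
# Hypothetical stand-in data; the POSITION column name and values are assumed
toyData <- data.frame(
  player = c("Player A", "Player B", "Player C", "Player D"),
  POSITION = c("G", "F", "G", "C"),
  POINTS_PER_GAME = c(28.7, 11.8, 9.4, 6.1)
)

# Keep only the rows where the position is guard
guards <- subset(toyData, POSITION == "G")
guards

# Average points per game within each position
positionMeans <- aggregate(POINTS_PER_GAME ~ POSITION, data = toyData, FUN = mean)
positionMeans
```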
Positives and Negatives of R
In the end, R should be known as a scripting language. It is not a compiled, lower-level language like C, where the code is translated into machine instructions that execute directly on hardware such as a central processing unit (CPU). As a scripting language, R is interpreted rather than compiled into an executable standalone program. As a result, R has some constraints: it is slower than compiled languages such as C or Java, and building a standalone application is generally not possible. However, the benefit is that R is capable of executing multiple operations, borrowing from other languages as needed, and it is well suited to statistical tasks. For example, machine learning operations like Random Forest are called with R functions yet executed in Fortran. Still other functions borrow from C, SQL, Weka, and so on. This diversity makes the functionality high because the functional tool set is vast, but with so much going on under the hood, R can be slow.
Another drawback to R is that objects are stored in active memory. As a result, data objects are limited by the amount of RAM on your laptop or RStudio Server. For many data tasks this is not a problem, especially as you transition to scalable cloud servers. In fact, in this text, all the data tasks and computations are relatively small. Constraints start to appear after loading millions of rows and columns into a data frame or when performing computationally complex calculations, as is the case with Deep Neural Networks. However, this book is focused on sports analytics rather than big data or machine learning exclusively. Thus, these drawbacks to the R language will not be an issue.
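The in-memory constraint described above can be made concrete with the base R function `object.size`, which estimates how much memory an object occupies. The sketch below uses invented objects (`bigVector`, `bigFrame`) purely to show the scale involved.

```r
# A numeric vector of one million values; each double occupies 8 bytes
bigVector <- numeric(1e6)

# object.size() reports an estimate of the memory an object uses
vectorBytes <- as.numeric(object.size(bigVector))
format(object.size(bigVector), units = "MB")

# A data frame with two such columns needs roughly twice the memory
bigFrame <- data.frame(a = numeric(1e6), b = numeric(1e6))
frameBytes <- as.numeric(object.size(bigFrame))
format(object.size(bigFrame), units = "MB")
```

A million-row numeric column costs on the order of 8 MB, so data sets of the size used in this text sit comfortably within the RAM of an ordinary laptop.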
Besides the ability to execute functions drawn from multiple specialized languages, R has other benefits. For example, R has a well-developed support community. Often when you are presented with an error or unknown operation, a simple online search will identify the solution. Additionally, R is optimized