| Screencast | Date | Notable Topics | Annotated | Link | Data |
|---|---|---|---|---|---|
| College Majors and Income | 2018-10-15 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| Horror Movie Profits | 2018-10-23 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| R Downloads | 2018-10-30 | Data manipulation (especially time series using lubridate package) | ✔️ | 🔗 | 📈 |
| US Wind Turbines | 2018-11-06 | Animated map using gganimate | ✔️ | 🔗 | 📈 |
| Malaria Incidence | 2018-11-12 | Map visualization, Animated map using gganimate package | ✔️ | 🔗 | 📈 |
| Thanksgiving Dinner | 2018-11-21 | Survey data, Network graphing | ✔️ | 🔗 | 📈 |
| Maryland Bridges | 2018-11-27 | Data manipulation, Map visualization | ✔️ | 🔗 | 📈 |
| Medium Articles | 2018-12-04 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| Riddler: Monte Carlo Simulation | 2018-12-04 | Simulation | ✔️ | 🔗 | 📈 |
| NYC Restaurant Inspections | 2018-12-11 | Multiple t-test models using broom package, Principal Component Analysis (PCA) | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Week of Rain | 2018-12-12 | Simulation | ✔️ | 🔗 | 📈 |
| Dolphins | 2018-12-18 | Survival analysis using survival package | ✔️ | 🔗 | 📈 |
| TidyTuesday Tweets | 2019-01-07 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| TV Golden Age | 2019-01-09 | Data manipulation, Logistic regression | ✔️ | 🔗 | 📈 |
| Space Launches | 2019-01-15 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| US Incarceration | 2019-01-25 | Animated map using gganimate package, Dealing with missing data | ✔️ | 🔗 | 📈 |
| US Dairy Consumption | 2019-01-29 | Time series analysis, Forecasting using sweep package | ✔️ | 🔗 | 📈 |
| US PhDs | 2019-02-22 | Tidying very un-tidy data | ✔️ | 🔗 | 📈 |
| French Train Delays | 2019-02-26 | Heat map | ✔️ | 🔗 | 📈 |
| Women in the Workplace | 2019-03-05 | Interactive scatterplot using plotly and shiny packages | ✔️ | 🔗 | 📈 |
| Board Game Reviews | 2019-03-15 | Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| Seattle Pet Names | 2019-03-16 | Hypergeometric hypothesis testing, Adjusting for multiple hypothesis testing | ✔️ | 🔗 | 📈 |
| Seattle Bike Counts | 2019-04-05 | Data manipulation (especially time series using lubridate package) | ✔️ | 🔗 | 📈 |
| Tennis Tournaments | 2019-04-09 | Data manipulation (especially using dplyr for groups within dataframes) | ✔️ | 🔗 | 📈 |
| Bird Collisions | 2019-05-03 | Bootstrapping | ✔️ | 🔗 | 📈 |
| Student Teacher Ratios | 2019-05-10 | WDI package (World Development Indicators) | ✔️ | 🔗 | 📈 |
| Nobel Prize Winners | 2019-05-24 | Data manipulation, Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| Plastic Waste | 2019-05-27 | Choropleth map | ✔️ | 🔗 | 📈 |
| Wine Ratings | 2019-05-31 | Text mining using tidytext package, Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| Ramen Reviews | 2019-06-04 | Web scraping using rvest package | ✔️ | 🔗 | 📈 |
| Media Franchise Revenue | 2019-06-22 | Data manipulation (especially re-ordering factors) | ✔️ | 🔗 | 📈 |
| Women's World Cup | 2019-07-22 | Data manipulation and exploratory graphing | ✔️ | 🔗 | 📈 |
| Bob Ross Paintings | 2019-08-12 | Network graphs, Principal Component Analysis (PCA) | ✔️ | 🔗 | 📈 |
| Simpsons Guest Stars | 2019-08-30 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| Pizza Ratings | 2019-10-01 | Statistical testing with t.test | ✔️ | 🔗 | 📈 |
| Car Fuel Efficiency | 2019-10-15 | Natural splines for regression | ✔️ | 🔗 | 📈 |
| Horror Movies | 2019-10-22 | ANOVA, Text mining using tidytext package, Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| NYC Squirrel Census | 2019-11-01 | Map visualization using ggmap package | ✔️ | 🔗 | 📈 |
| CRAN Package Code | 2019-12-30 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| Riddler: Spelling Bee Honeycomb | 2020-01-06 | Simulation with matrices | ✔️ | 🔗 | 📈 |
| The Office | 2020-03-16 | Text mining using tidytext package, Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| COVID-19 Open Research Dataset (CORD-19) | 2020-03-18 | JSON formatted data | ✔️ | 🔗 | 📈 |
| CORD-19 Data Package | 2020-03-19 | R package development and documentation-writing | ✔️ | 🔗 | 📈 |
| R trick: Creating Pascal's Triangle with accumulate() | 2020-03-29 | accumulate() for recursive formulas | ✔️ | 🔗 | 📈 |
| Riddler: Simulating Replacing Die Sides | 2020-03-30 | accumulate() for simulation | ✔️ | 🔗 | 📈 |
| Beer Production | 2020-04-01 | tidymetrics package demonstrated, Animated map (gganimate package) | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Non-increasing Sequence | 2020-04-06 | Simulation | ✔️ | 🔗 | 📈 |
| Tour de France | 2020-04-07 | Survival analysis, Animated bar graph (gganimate package) | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Branching Process | 2020-04-13 | Simulation, Exponential and Geometric distributions | ✔️ | 🔗 | 📈 |
| GDPR Violations | 2020-04-21 | Data manipulation, Interactive dashboard with shinymetrics and tidymetrics | ✔️ | 🔗 | 📈 |
| Broadway Musicals | 2020-04-28 | Creating an interactive dashboard with shinymetrics and tidymetrics, moving windows, period aggregation | ✔️ | 🔗 | 📈 |
| Riddler: Simulating and Optimizing Coin Flipping | 2020-05-03 | Simulation | ✔️ | 🔗 | 📈 |
| Animal Crossing | 2020-05-05 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| Volcano Eruptions | 2020-05-12 | Static map with ggplot2, Interactive map with leaflet, Animated map with gganimate | ✔️ | 🔗 | 📈 |
| Beach Volleyball | 2020-05-19 | Data cleaning, Logistic regression | ✔️ | 🔗 | 📈 |
| Cocktails | 2020-05-26 | Pairwise correlation, Network diagram, Principal component analysis (PCA) | ✔️ | 🔗 | 📈 |
| African-American Achievements | 2020-06-09 | plotly interactive timeline, Wikipedia web scraping | ✔️ | 🔗 | 📈 |
| African-American History | 2020-06-16 | Network diagram, Wordcloud | ✔️ | 🔗 | 📈 |
| Caribou Locations | 2020-06-23 | Maps with ggplot2, Calculating distance and speed with geosphere | ✔️ | 🔗 | 📈 |
| X-Men Comics | 2020-06-30 | Data manipulation, Lollipop graph, floor function | ✔️ | 🔗 | 📈 |
| Coffee Ratings | 2020-07-07 | Ridgeline plot, Pairwise correlation, Network plot, Singular value decomposition (SVD), Linear model | ✔️ | 🔗 | 📈 |
| Australian Animal Outcomes | 2020-07-21 | Data manipulation, Web scraping (rvest package) and SelectorGadget, Animated choropleth map | ✔️ | 🔗 | 📈 |
| Palmer Penguins | 2020-07-08 | Modeling (logistic regression, k-nearest neighbors, decision tree, multiclass logistic regression) with cross validated accuracy | ✔️ | 🔗 | 📈 |
| European Energy | 2020-08-04 | Data manipulation, Country flags, Slope graph, Function creation | ✔️ | 🔗 | 📈 |
| Plants in Danger | 2020-08-18 | Data manipulation, Web scraping using rvest package | ✔️ | 🔗 | 📈 |
| Chopped | 2020-08-25 | Data manipulation, Modelling (Linear Regression, Random Forest, and Natural Splines) | ✔️ | 🔗 | 📈 |
| Global Crop Yields | 2020-09-01 | Interactive Shiny dashboard | ✔️ | 🔗 | 📈 |
| Friends | 2020-09-08 | Data Manipulation, Linear Modeling, Pairwise Correlation, Text Mining | ✔️ | 🔗 | 📈 |
| Government Spending on Kids | 2020-09-15 | Data Manipulation, Functions, Embracing, Reading in Many .csv Files, Pairwise Correlation | ✔️ | 🔗 | 📈 |
| Himalayan Climbers | 2020-09-22 | Data Manipulation, Empirical Bayes, Logistic Regression Model | ✔️ | 🔗 | 📈 |
| Beyoncé and Taylor Swift Lyrics | 2020-09-29 | Text analysis, tf_idf, Log odds ratio, Diverging bar graph, Lollipop graph | ✔️ | 🔗 | 📈 |
| NCAA Women's Basketball | 2020-10-06 | Heatmap, Correlation analysis | ✔️ | 🔗 | 📈 |
| Great American Beer Festival | 2020-10-20 | Log odds ratio, Logistic regression, TIE Fighter plot | ✔️ | 🔗 | 📈 |
| IKEA Furniture | 2020-11-03 | Linear model, Coefficient/TIE fighter plot, Boxplots, Log scale discussion, Calculating volume | ✔️ | 🔗 | 📈 |
| Historical Phones | 2020-11-10 | Joining tables, Animated world choropleth, Adding IQR to geom_line, World development indicators package | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Circular Random Walk | 2020-11-23 | Simulation | ✔️ | 🔗 | 📈 |
| Ninja Warrior | 2020-12-15 | Log-odds with tidylo package, Graphing with ggplot2 | ✔️ | 🔗 | 📈 |
| Screencast | Time | Description |
|---|---|---|
| College Majors and Income | 1:45 | Using read_csv function to import data directly from Github to R (without cloning the repository) |
| College Majors and Income | 7:20 | Creating a histogram (geom_histogram), then a boxplot (geom_boxplot), to explore the distribution of salaries |
| College Majors and Income | 8:55 | Using fct_reorder function to sort boxplot of college majors by salary |
| College Majors and Income | 9:35 | Using dollar_format function from scales package to convert scientific notation to dollar format (e.g., "4e+04" becomes "$40,000") |
| College Majors and Income | 14:10 | Creating a dotplot (geom_point) of 20 top-earning majors (includes adjusting axis, using the colour aesthetic, and adding error bars) |
| College Majors and Income | 17:45 | Using str_to_title function to convert string from ALL CAPS to Title Case |
| College Majors and Income | 20:45 | Creating a Bland-Altman graph to explore relationship between sample size and median salary |
| College Majors and Income | 21:45 | Using geom_text_repel function from ggrepel package to get text labels on scatter plot points |
| College Majors and Income | 28:30 | Using count function's wt argument to specify what should be counted (default is number of rows) |
| College Majors and Income | 30:00 | Spicing up a dull bar graph by adding a redundant colour aesthetic (trick from Julia Silge) |
| College Majors and Income | 36:20 | Starting to explore relationship between gender and salary |
| College Majors and Income | 37:10 | Creating a stacked bar graph (geom_col) of gender breakdown within majors |
| College Majors and Income | 40:15 | Using summarise_at to aggregate men and women from majors into categories of majors |
| College Majors and Income | 45:30 | Graphing scatterplot (geom_point) of share of women and median salary |
| College Majors and Income | 47:10 | Using geom_smooth function to add a line of best fit to scatterplot above |
| College Majors and Income | 48:40 | Explanation of why not to aggregate first when performing a statistical test (including explanation of Simpson's Paradox) |
| College Majors and Income | 49:55 | Fixing geom_smooth so that we get one overall line while still being able to map to the colour aesthetic |
| College Majors and Income | 51:10 | Predicting median salary from share of women with weighted linear regression (to take sample sizes into account) |
| College Majors and Income | 56:05 | Using nest function and tidy function from the broom package to apply a linear model to many categories at once |
| College Majors and Income | 58:05 | Using p.adjust function to adjust p-values to correct for multiple testing (using FDR, False Discovery Rate) |
| College Majors and Income | 1:04:50 | Showing how to add an appendix to an Rmarkdown file with code that doesn't run when compiled |
| College Majors and Income | 1:09:00 | Using fct_lump function to aggregate major categories into the top four and an "Other" category |
| College Majors and Income | 1:10:05 | Adding sample size to the size aesthetic within the aes function |
| College Majors and Income | 1:10:50 | Using ggplotly function from plotly package to create an interactive scatterplot (tooltips appear when moused over) |
| College Majors and Income | 1:15:55 | Exploring IQR (Inter-Quartile Range) of salaries by major |
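A minimal sketch of the multiple-testing correction used in this screencast (58:05), using base R's p.adjust with made-up p-values (one per category, purely illustrative):

```r
# Hypothetical p-values, e.g. one per major category (illustrative only)
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.62)

# Benjamini-Hochberg correction controls the FDR (False Discovery Rate)
p_adjusted <- p.adjust(p_values, method = "fdr")
```

The adjusted values are always at least as large as the originals, so fewer categories cross a given significance threshold.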
| Screencast | Time | Description |
|---|---|---|
| Horror Movie Profits | 2:50 | Using parse_date function from readr package to convert date formatted as character to date class (should have used lubridate's mdy function though) |
| Horror Movie Profits | 7:45 | Using fct_lump function to aggregate distributors into top 6 (by number of movies) and an "Other" category |
| Horror Movie Profits | 8:50 | Investigating strange numbers in the data and discovering duplication |
| Horror Movie Profits | 12:40 | Using problems function to look at parsing errors when importing data |
| Horror Movie Profits | 14:35 | Using arrange and distinct functions (with the .keep_all argument) to de-duplicate observations |
| Horror Movie Profits | 16:10 | Using geom_boxplot function to create a boxplot of budget by distributor |
| Horror Movie Profits | 19:20 | Using floor function to bin release years into decades (e.g., "1970" and "1973" both become "1970") |
| Horror Movie Profits | 21:30 | Using summarise_at function to apply the same function to multiple variables at the same time |
| Horror Movie Profits | 24:10 | Using geom_line to visualize multiple metrics at the same time |
| Horror Movie Profits | 26:00 | Using facet_wrap function to graph small multiples of genre-budget boxplots by distributor |
| Horror Movie Profits | 28:35 | Starting analysis of profit ratio of movies |
| Horror Movie Profits | 32:50 | Using paste0 function in a custom function to show labels as multiples (e.g., "4X" or "6X" to mean "4 times" or "6 times") |
| Horror Movie Profits | 41:20 | Starting analysis of the most common genres over time |
| Horror Movie Profits | 45:55 | Starting analysis of the most profitable individual horror movies |
| Horror Movie Profits | 51:45 | Using paste0 function to add release date of movie to labels in a bar graph |
| Horror Movie Profits | 53:25 | Using geom_text function, along with its check_overlap argument, to add labels to some points on a scatterplot |
| Horror Movie Profits | 58:10 | Using ggplotly function from plotly package to create an interactive scatterplot |
| Horror Movie Profits | 1:00:55 | Reviewing unexplored areas of investigation |
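The decade binning at 19:20 can be sketched in base R (the years here are illustrative, not from the dataset):

```r
release_year <- c(1970, 1973, 1986, 1999)

# Round each year down to the start of its decade
decade <- 10 * floor(release_year / 10)
# 1970 and 1973 both become 1970; 1986 becomes 1980
```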
| Screencast | Time | Description |
|---|---|---|
| R Downloads | 5:20 | Using geom_line function to visualize changes over time |
| R Downloads | 7:35 | Starting to decompose time series data into day-of-week trend and overall trend (lots of lubridate package functions) |
| R Downloads | 9:50 | Using floor_date function from lubridate package to round dates down to the week level |
| R Downloads | 10:05 | Using min function to drop incomplete/partial week at the start of the dataset |
| R Downloads | 12:20 | Using countrycode function from countrycode package to replace two-letter country codes with full names (e.g., "CA" becomes "Canada") |
| R Downloads | 17:20 | Using fct_lump function to get top N categories within a categorical variable and classify the rest as "Other" |
| R Downloads | 20:30 | Using hour function from lubridate package to pull out integer hour value from a datetime variable |
| R Downloads | 22:20 | Using facet_wrap function to graph small multiples of downloads by country, then changing its scales argument to allow different scales on y-axis |
| R Downloads | 31:00 | Starting analysis of downloads by IP address |
| R Downloads | 35:20 | Using as.POSIXlt to combine separate date and time variables to get a single datetime variable |
| R Downloads | 36:35 | Using lag function to calculate time between downloads (time between events) per IP address (comparable to SQL window function) |
| R Downloads | 38:05 | Using as.numeric function to convert variable from a time interval object to a numeric variable (number in seconds) |
| R Downloads | 38:40 | Explanation of a bimodal log-normal distribution |
| R Downloads | 39:05 | Handy trick for setting easy-to-interpret intervals for time data on scale_x_log10 function's breaks argument |
| R Downloads | 47:40 | Starting to explore package downloads |
| R Downloads | 52:15 | Adding 1 to the numerator and denominator when calculating a ratio to get around dividing by zero |
| R Downloads | 57:55 | Showing how to look at package download data over time using cran_downloads function from the cranlogs package |
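The time-between-events step at 36:35 uses dplyr's lag on datetimes; the same one-position shift can be sketched in base R with hypothetical timestamps (seconds, already sorted within one IP address):

```r
times <- c(30, 95, 100, 2400, 2405)

# Base-R equivalent of dplyr::lag(): shift everything one position
previous <- c(NA, head(times, -1))

# Gap since the previous download; the first event has no predecessor (NA)
gap <- times - previous
```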
| Screencast | Time | Description |
|---|---|---|
| US Wind Turbines | 3:50 | Using count function to explore categorical variables |
| US Wind Turbines | 5:00 | Creating a quick-and-dirty map using geom_point function and latitude and longitude data |
| US Wind Turbines | 6:10 | Explaining need for mapproj package when plotting maps in ggplot2 |
| US Wind Turbines | 7:35 | Using borders function to add US state borders to map |
| US Wind Turbines | 10:45 | Using fct_lump function to get the top 6 project categories and put the rest in a lumped "Other" category |
| US Wind Turbines | 11:30 | Changing data so that certain categories' points appear in front of other categories' points on the map |
| US Wind Turbines | 14:15 | Taking the centroid (average longitude and latitude) of points across a geographic area as a way to aggregate categories to one point |
| US Wind Turbines | 19:40 | Using ifelse function to clean missing data that is coded as "-9999" |
| US Wind Turbines | 26:00 | Asking, "How has turbine capacity changed over time?" |
| US Wind Turbines | 33:15 | Exploring different models of wind turbines |
| US Wind Turbines | 38:00 | Using mutate_if function to find NA values (coded as -9999) in multiple columns and replace them with an actual NA |
| US Wind Turbines | 45:40 | Reviewing documentation for gganimate package |
| US Wind Turbines | 47:00 | Attempting to set up gganimate map |
| US Wind Turbines | 48:55 | Understanding gganimate package using a "Hello World" / toy example, then trying to debug turbine animation |
| US Wind Turbines | 56:45 | Using is.infinite function to get rid of troublesome Inf values |
| US Wind Turbines | 57:55 | Quick hack for getting cumulative data from a table using crossing function (though it does end up with some duplication) |
| US Wind Turbines | 1:01:45 | Diagnosis of gganimate issue (points between integer years are being interpolated) |
| US Wind Turbines | 1:04:35 | Pseudo-successful gganimate map (cumulative points show up, but some points are missing) |
| US Wind Turbines | 1:05:40 | Summary of screencast |
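A minimal sketch of the sentinel-value cleanup at 19:40, with made-up capacity values (the screencast later applies the same idea across many columns with mutate_if):

```r
capacity <- c(1500, -9999, 2300, -9999, 1800)

# Replace the -9999 "missing" code with a real NA
capacity <- ifelse(capacity == -9999, NA, capacity)
```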
| Screencast | Time | Description |
|---|---|---|
| Malaria Incidence | 2:45 | Importing data using the malariaAtlas package |
| Malaria Incidence | 14:10 | Using geom_line function to visualize malaria prevalence over time |
| Malaria Incidence | 15:10 | Quick map visualization using longitude and latitude coordinates and the geom_point function |
| Malaria Incidence | 18:40 | Using borders function to add Kenyan country borders to map |
| Malaria Incidence | 19:50 | Using scale_colour_gradient2 function to change the colour scale of points on the map |
| Malaria Incidence | 20:40 | Using arrange function to ensure that certain points on a map appear in front of/behind other points |
| Malaria Incidence | 21:50 | Aggregating data into decades using the truncated division operator %/% |
| Malaria Incidence | 24:45 | Starting to look at aggregated malaria data (instead of country-specific data) |
| Malaria Incidence | 26:50 | Using sample and unique functions to randomly select a few countries, which are then graphed |
| Malaria Incidence | 28:30 | Using last function to select the most recent observation from a set of arranged data |
| Malaria Incidence | 32:55 | Creating a Bland-Altman plot to explore relationship between current incidence and change in incidence in past 15 years |
| Malaria Incidence | 35:45 | Using anti_join function to find which countries are not in the malaria dataset |
| Malaria Incidence | 36:40 | Using the iso3166 dataset in the maps package to match three-letter country codes (i.e., ISO 3166 codes) with country names |
| Malaria Incidence | 38:30 | Creating a world map using geom_polygon function (and eventually theme_void and coord_map functions) |
| Malaria Incidence | 39:00 | Getting rid of Antarctica from world map |
| Malaria Incidence | 42:35 | Using facet_wrap function to create small multiples of world map for different time periods |
| Malaria Incidence | 47:30 | Starting to create an animated map of malaria deaths (actual code writing starts at 57:45) |
| Malaria Incidence | 51:25 | Starting with a single year after working through some bugs |
| Malaria Incidence | 52:10 | Using regex_inner_join function from the fuzzyjoin package to join map datasets because one of them stores its values as regular expressions |
| Malaria Incidence | 55:15 | As alternative to fuzzyjoin package in above step, using str_remove function to get rid of unwanted regex |
| Malaria Incidence | 57:45 | Starting to turn static map into an animation using gganimate package |
| Malaria Incidence | 1:02:00 | The actual animated map |
| Malaria Incidence | 1:02:35 | Using countrycode package to filter down to countries in a specific continent (Africa, in this case) |
| Malaria Incidence | 1:03:55 | Summary of screencast |
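The decade aggregation at 21:50 relies on truncated division; a base-R sketch with illustrative years:

```r
year <- c(2003, 2007, 2014)

# %/% drops the remainder, so multiplying back by 10 snaps to the decade
decade <- (year %/% 10) * 10
```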
| Screencast | Time | Description |
|---|---|---|
| Thanksgiving Dinner | 4:10 | Exploratory bar chart of age distribution (and gender) of survey respondents |
| Thanksgiving Dinner | 7:40 | Using count function on multiple columns to get detailed counts |
| Thanksgiving Dinner | 11:25 | Parsing numbers from text using parse_number function, then using those numbers to re-level an ordinal factor (income bands) |
| Thanksgiving Dinner | 13:05 | Exploring relationship between income and using homemade (vs. canned) cranberry sauce |
| Thanksgiving Dinner | 14:00 | Adding group = 1 argument to the aes function to properly display a line chart |
| Thanksgiving Dinner | 14:30 | Rotating text for axis labels that overlap |
| Thanksgiving Dinner | 16:50 | Getting confidence intervals for proportions using Jeffreys interval (using beta distribution with an uninformative prior) |
| Thanksgiving Dinner | 17:55 | Explanation of Clopper-Pearson approach as alternative to Jeffreys interval |
| Thanksgiving Dinner | 18:30 | Using geom_ribbon function to add a shaded region to line chart that shows confidence intervals |
| Thanksgiving Dinner | 21:55 | Using starts_with function to select fields with names that start with a certain string (e.g., using "pie" selects "pie1" and "pie2") |
| Thanksgiving Dinner | 22:55 | Using gather function to get wide-format data to tidy (tall) format |
| Thanksgiving Dinner | 23:45 | Using str_remove and regex to remove digits from field values (e.g., "dessert1" and "dessert2" get turned into "dessert") |
| Thanksgiving Dinner | 27:00 | "What are people eating?" Graphing pies, sides, and desserts |
| Thanksgiving Dinner | 28:00 | Using fct_reorder function to reorder foods based on how popular they are |
| Thanksgiving Dinner | 28:45 | Using n_distinct function to count the number of unique respondents |
| Thanksgiving Dinner | 30:25 | Using facet_wrap function to facet food types into their own graphs |
| Thanksgiving Dinner | 32:50 | Using parse_number function to convert age ranges as character string into a numeric field |
| Thanksgiving Dinner | 35:35 | Exploring relationship between US region and food types |
| Thanksgiving Dinner | 36:15 | Using group_by, then mutate, then count to calculate a complicated summary |
| Thanksgiving Dinner | 40:35 | Exploring relationship between praying at Thanksgiving (yes/no) and food types |
| Thanksgiving Dinner | 42:30 | Empirical Bayes binomial estimation for calculating binomial confidence intervals (see Dave's book on Empirical Bayes) |
| Thanksgiving Dinner | 45:30 | Asking, "What sides/desserts/pies are eaten together?" |
| Thanksgiving Dinner | 46:20 | Calculating pairwise correlation of food types |
| Thanksgiving Dinner | 49:05 | Network graph of pairwise correlation |
| Thanksgiving Dinner | 51:40 | Adding text labels to nodes using geom_node_text function |
| Thanksgiving Dinner | 53:00 | Getting rid of unnecessary graph elements (e.g., axes, gridlines) with theme_void function |
| Thanksgiving Dinner | 53:25 | Explanation of network graph relationships |
| Thanksgiving Dinner | 55:05 | Adding dimension to network graph (node colour) to represent the type of food |
| Thanksgiving Dinner | 57:45 | Fixing overlapping text labels using the geom_node_text function's repel argument |
| Thanksgiving Dinner | 58:55 | Tweaking display of percentage legend to be in more readable format (e.g., "40%" instead of "0.4") |
| Thanksgiving Dinner | 1:00:05 | Summary of screencast |
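The Jeffreys interval at 16:50 comes straight from beta quantiles; a sketch with hypothetical survey counts (30 "yes" out of 80, not real data):

```r
k <- 30   # hypothetical "yes" responses
n <- 80   # hypothetical respondents

# Jeffreys 95% interval: beta quantiles under a Beta(1/2, 1/2) prior
interval <- qbeta(c(0.025, 0.975), k + 0.5, n - k + 0.5)
```

The interval brackets the observed proportion k/n without ever escaping [0, 1], which is why it works well for the shaded ribbons in the line charts.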
| Screencast | Time | Description |
|---|---|---|
| Maryland Bridges | 9:15 | Using geom_line to create an exploratory line graph |
| Maryland Bridges | 10:10 | Using %/% operator (truncated division) to bin years into decades (e.g., 1980, 1984, and 1987 would all become "1980") |
| Maryland Bridges | 12:30 | Converting two-digit year to four-digit year (e.g., "16" becomes "2016") by adding 2000 to each one |
| Maryland Bridges | 15:40 | Using percent_format function from scales package to get nice-looking axis labels |
| Maryland Bridges | 19:55 | Using geom_col to create an ordered nice bar/column graph |
| Maryland Bridges | 21:35 | Using replace_na to replace NA values with "Other" |
| Maryland Bridges | 27:15 | Starting exploration of average daily traffic |
| Maryland Bridges | 29:05 | Using comma_format function from scales package to get more readable axis labels (e.g., "1e+05" becomes "100,000") |
| Maryland Bridges | 31:15 | Using cut function to bin continuous variable into customized breaks (also does a mutate within a group_by!) |
| Maryland Bridges | 34:30 | Starting to make a map |
| Maryland Bridges | 37:00 | Encoding a continuous variable to colour, then using scale_colour_gradient2 function to specify colours and midpoint |
| Maryland Bridges | 38:20 | Specifying the trans argument (transformation) of the scale_colour_gradient2 function to get a logarithmic scale |
| Maryland Bridges | 45:55 | Using str_to_title function to get values to Title Case (first letter of each word capitalized) |
| Maryland Bridges | 48:35 | Predicting whether bridges are in "Good" condition using logistic regression (remember to specify the family argument! Dave fixes this at 52:54) |
| Maryland Bridges | 50:30 | Explanation of why we should NOT be using an OLS linear regression |
| Maryland Bridges | 51:10 | Using the augment function from the broom package to illustrate why a linear model is not a good fit |
| Maryland Bridges | 52:05 | Specifying the type.predict argument in the augment function so that we get the actual predicted probability |
| Maryland Bridges | 54:40 | Explanation of why the sigmoidal shape of logistic regression can be a drawback |
| Maryland Bridges | 55:05 | Using a cubic spline model (a type of GAM, Generalized Additive Model) as an alternative to logistic regression |
| Maryland Bridges | 56:00 | Explanation of the shape that a cubic spline model can take (which logistic regression cannot) |
| Maryland Bridges | 1:02:15 | Visualizing the model in a different way, using a coefficient plot |
| Maryland Bridges | 1:04:35 | Using geom_vline function to add a red reference line to a graph |
| Maryland Bridges | 1:04:50 | Adding confidence intervals to the coefficient plot by specifying conf.int argument of tidy function and graphing using the geom_errorbarh function |
| Maryland Bridges | 1:05:35 | Brief explanation of log-odds coefficients |
| Maryland Bridges | 1:09:10 | Summary of screencast |
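The binning step at 31:15 uses base R's cut; a sketch with made-up daily traffic counts and breakpoints:

```r
traffic <- c(120, 5600, 48000, 250000)

# cut() bins a continuous variable into labelled intervals
bins <- cut(traffic,
            breaks = c(0, 1000, 10000, 100000, Inf),
            labels = c("<1,000", "1,000-10,000", "10,000-100,000", "100,000+"))
```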
| Screencast | Time | Description |
|---|---|---|
| Medium Articles | 5:40 | Using summarise_at and starts_with functions to quickly sum up all variables starting with "tag_" |
| Medium Articles | 6:55 | Using gather function (now pivot_longer) to convert topic tag variables from wide to tall (tidy) format |
| Medium Articles | 8:10 | Explanation of how gathering step above will let us find the most/least common tags |
| Medium Articles | 9:00 | Explanation of using median (instead of mean) as measure of central tendency for number of claps an article got |
| Medium Articles | 9:50 | Visualizing log-normal (ish) distribution of number of claps an article gets |
| Medium Articles | 12:05 | Using pmin function to bin reading times of 10 minutes or more to cap out at 10 minutes |
| Medium Articles | 12:35 | Changing scale_x_continuous function's breaks argument to get custom labels and tick marks on a histogram |
| Medium Articles | 14:35 | Discussion of using mean vs. median as measure of central tendency for reading time (he decides on mean) |
| Medium Articles | 16:00 | Starting text mining analysis |
| Medium Articles | 16:40 | Using unnest_tokens function from tidytext package to split character string into individual words |
| Medium Articles | 17:50 | Explanation of stop words and using anti_join function to get rid of them |
| Medium Articles | 20:20 | Using str_detect function to filter out "words" that are just numbers (e.g., "2", "35") |
| Medium Articles | 22:35 | Quick analysis of which individual words are associated with more/fewer claps ("What are the hype words?") |
| Medium Articles | 25:15 | Using geometric mean as alternative to median to get more distinction between words (note 27:33 where he makes a quick fix) |
| Medium Articles | 28:10 | Starting analysis of clusters of related words (e.g., "neural" is linked to "network") |
| Medium Articles | 30:30 | Finding correlations between pairs of words using pairwise_cor function from widyr package |
| Medium Articles | 34:00 | Using ggraph and igraph packages to make network plot of correlated pairs of words |
| Medium Articles | 35:00 | Using geom_node_text to add labels for points (vertices) in the network plot |
| Medium Articles | 38:40 | Filtering original data to only include words that appear in the network plot (150 word pairs with the most correlation) |
| Medium Articles | 40:10 | Adding colour as a dimension to the network plot, representing geometric mean of claps |
| Medium Articles | 40:50 | Changing default colour scale to one with Blue = Low and Red = High with scale_colour_gradient2 function |
| Medium Articles | 43:15 | Adding dark outlines to points on network plot with a hack |
| Medium Articles | 44:45 | Starting to predict number of claps based on title tag (Lasso regression) |
| Medium Articles | 45:50 | Explanation of data format needed to conduct Lasso regression (and using cast_sparse function to get sparse matrix) |
| Medium Articles | 47:45 | Bringing in number of claps to the sparse matrix (un-tidy methods) |
| Medium Articles | 49:00 | Using cv.glmnet function (cv = cross validated) from glmnet package to run Lasso regression |
| Medium Articles | 49:55 | Finding and fixing mistake in defining Lasso model |
| Medium Articles | 51:05 | Explanation of Lasso model |
| Medium Articles | 52:35 | Using tidy function from the broom package to tidy up the Lasso model |
| Medium Articles | 54:35 | Visualizing how specific words affect the prediction of claps as lambda (Lasso's penalty parameter) changes |
| Medium Articles | 1:00:20 | Summary of screencast |
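The geometric mean at 25:15 can be sketched in base R with hypothetical clap counts; the +1 offset here is an assumption, one common way to handle zero-clap articles before taking logs:

```r
claps <- c(0, 3, 10, 200, 4500)   # hypothetical clap counts

# Geometric mean with a +1 offset so that log(0) never occurs
geom_mean <- exp(mean(log(claps + 1))) - 1
```

Unlike the arithmetic mean, this is barely moved by the one viral article, which is the point of using it here.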
| Screencast | Time | Description |
|---|---|---|
| Riddler: Monte Carlo Simulation | 3:10 | Using crossing function to set up structure of simulation (1,000 trials, each with 12 chess games) |
| Riddler: Monte Carlo Simulation | 4:00 | Adding result to the tidy simulation dataset |
| Riddler: Monte Carlo Simulation | 6:45 | Using sample function to simulate win/loss/draw for each game (good explanation of individual arguments within sample) |
| Riddler: Monte Carlo Simulation | 7:05 | Using group_by and summarise to get total points for each trial |
| Riddler: Monte Carlo Simulation | 8:10 | Adding red vertical reference line to histogram to know when a player wins a matchup |
| Riddler: Monte Carlo Simulation | 10:00 | Answering second piece of riddle (how many games would need to be played for better player to win 90% or 99% of the time?) |
| Riddler: Monte Carlo Simulation | 10:50 | Using unnest and seq_len functions to create groups of number of games (20, 40, …, 100), each with one game per row |
| Riddler: Monte Carlo Simulation | 12:15 | Creating a win field based on the simulated data, then summarising win percentage for each group of number of games (20, 40, …, 100) |
| Riddler: Monte Carlo Simulation | 13:55 | Using seq function to create groups of number of games programmatically |
| Riddler: Monte Carlo Simulation | 15:05 | Explanation of using logarithmic scale for this riddle |
| Riddler: Monte Carlo Simulation | 15:45 | Changing spacing of number of games from even spacing (20, 40, …, 100) to exponential (doubles every time, 12, 24, 48, …, 1536) |
| Riddler: Monte Carlo Simulation | 18:00 | Changing spacing of number of games to be finer |
| Riddler: Monte Carlo Simulation | 19:00 | Introduction of interpolation as the last step we will do |
| Riddler: Monte Carlo Simulation | 19:30 | Introducing approx function as method to linearly interpolate data |
| Riddler: Monte Carlo Simulation | 22:35 | Break point for the next riddle |
| Riddler: Monte Carlo Simulation | 24:30 | Starting recursive approach to this riddle |
| Riddler: Monte Carlo Simulation | 25:35 | Setting up a N x N matrix (N = 4 to start) |
| Riddler: Monte Carlo Simulation | 25:55 | Explanation of approach (random ball goes into random cup, represented by matrix) |
| Riddler: Monte Carlo Simulation | 26:25 | Using sample function to pick a random element of the matrix |
| Riddler: Monte Carlo Simulation | 27:15 | Using for loop to iterate random selection 100 times |
| Riddler: Monte Carlo Simulation | 28:25 | Converting for loop to while loop, using colSums to keep track of number of balls in cups |
| Riddler: Monte Carlo Simulation | 30:05 | Starting to code the pruning phase |
| Riddler: Monte Carlo Simulation | 30:15 | Using diag function to pick matching matrix elements (e.g., the element in the 4th row and 4th column) |
| Riddler: Monte Carlo Simulation | 31:50 | Turning code up to this point into a custom simulate_round function |
| Riddler: Monte Carlo Simulation | 32:25 | Using custom simulate_round function to simulate 100 rounds |
| Riddler: Monte Carlo Simulation | 33:30 | Using all function to perform logic check on whether all cups in a round are not empty |
| Riddler: Monte Carlo Simulation | 34:05 | Converting loop approach to tidy approach |
| Riddler: Monte Carlo Simulation | 35:10 | Using rerun and map_lgl functions from purrr to simulate a round for each row in a dataframe |
| Riddler: Monte Carlo Simulation | 36:20 | Explanation of the tidy approach |
| Riddler: Monte Carlo Simulation | 37:05 | Using cumsum and lag functions to keep track of the number of rounds until you win a "game" |
| Riddler: Monte Carlo Simulation | 39:45 | Creating histogram of number of rounds until winning a game |
| Riddler: Monte Carlo Simulation | 40:10 | Setting boundary argument of geom_histogram function to include count of zeros |
| Riddler: Monte Carlo Simulation | 40:30 | Brief explanation of geometric distribution |
| Riddler: Monte Carlo Simulation | 41:25 | Extending custom simulate_round function to include number of balls thrown to win (in addition to whether we won a round) |
| Riddler: Monte Carlo Simulation | 46:10 | Extending to two values of N (N = 3 or N = 4) |
| Riddler: Monte Carlo Simulation | 49:50 | Reviewing results of N = 3 and N = 4 |
| Riddler: Monte Carlo Simulation | 52:20 | Extending to N = 5 |
| Riddler: Monte Carlo Simulation | 53:55 | Checking results of chess riddle with Riddler solution |
| Riddler: Monte Carlo Simulation | 55:10 | Checking results of ball-cup riddle with Riddler solution (Dave slightly misinterpreted what the riddle was asking) |
| Riddler: Monte Carlo Simulation | 56:35 | Changing simulation code to correct the misinterpretation |
| Riddler: Monte Carlo Simulation | 1:01:40 | Reviewing results of corrected simulation |
| Riddler: Monte Carlo Simulation | 1:03:30 | Checking results of ball-cup riddle with corrected simulation with Riddler solutions |
| Riddler: Monte Carlo Simulation | 1:06:00 | Visualizing number of balls thrown and rounds played |
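The chess portion of the screencast (crossing a trial grid, sampling win/loss/draw per game, summarising points per trial) translates almost directly out of tidyverse R. A stdlib Python sketch under illustrative probabilities — 0.55 win / 0.25 draw are my placeholder numbers, not the riddle's actual parameters:

```python
import random

def win_probability(n_trials=10_000, n_games=12, p_win=0.55, p_draw=0.25):
    """Fraction of simulated matches the better player wins outright,
    i.e. scores more than half the available points
    (win = 1 point, draw = 0.5, loss = 0)."""
    rng = random.Random(42)  # seeded for reproducibility
    wins = 0
    for _ in range(n_trials):
        points = sum(rng.choices([1.0, 0.5, 0.0],
                                 weights=[p_win, p_draw, 1 - p_win - p_draw],
                                 k=n_games))
        wins += points > n_games / 2
    return wins / n_trials
```

Increasing `n_games` then sweeping it over a sequence (as at 10:50 and 15:45 above) shows how many games the better player needs before winning 90% or 99% of matches.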
| Screencast | Time | Description |
|---|---|---|
| NYC Restaurant Inspections | 18:45 | Separating column using separate function |
| NYC Restaurant Inspections | 21:15 | Taking distinct observations, but keeping the remaining variables using distinct function with .keep_all argument |
| NYC Restaurant Inspections | 25:00 | Using broom package and nest function to perform multiple t-tests at the same time |
| NYC Restaurant Inspections | 26:20 | Tidying nested t-test models using broom |
| NYC Restaurant Inspections | 27:00 | Creating TIE fighter plot of estimates of means and their confidence intervals |
| NYC Restaurant Inspections | 28:45 | Recode long description using regex to remove everything after a parenthesis |
| NYC Restaurant Inspections | 33:45 | Using cut function to manually bin data along user-specified intervals |
| NYC Restaurant Inspections | 42:00 | Asking, "What type of violations tend to occur more in some cuisines than others?" |
| NYC Restaurant Inspections | 42:45 | Using semi_join function to get the most recent inspection of all the restaurants |
| NYC Restaurant Inspections | 52:00 | Asking, "What violations tend to occur together?" |
| NYC Restaurant Inspections | 53:00 | Using widyr package function pairwise_cor (pairwise correlation) to find co-occurrence of violation types |
| NYC Restaurant Inspections | 55:30 | Beginning of PCA (Principal Component Analysis) using widely_svd function |
| NYC Restaurant Inspections | 58:00 | Actually typing in the widely_svd function |
| NYC Restaurant Inspections | 58:15 | Reviewing and explaining output of widely_svd function |
| NYC Restaurant Inspections | 1:01:30 | Creating graph of opposing elements of a PCA dimension |
| NYC Restaurant Inspections | 1:02:00 | Shortening string using str_sub function |
| NYC Restaurant Inspections | 1:04:00 | Reference to Julia Silge's PCA walkthrough using StackOverflow data |
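widyr's pairwise_cor (53:00 above) computes, for every pair of violation types, the Pearson correlation of their presence/absence across inspections. A self-contained Python sketch of the same idea — the data layout (a dict of 0/1 presence vectors) is my assumption, not the screencast's format:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors
    (vectors must not be constant, or the denominator is zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_cor(presence):
    """presence maps each violation type to a 0/1 vector over inspections;
    returns the correlation for every pair, like widyr::pairwise_cor."""
    return {(a, b): pearson(presence[a], presence[b])
            for a, b in combinations(presence, 2)}
```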
| Screencast | Time | Description |
|---|---|---|
| Riddler: Simulating a Week of Rain | 1:20 | Using crossing function to get all combinations of specified variables (100 trials of 5 days) |
| Riddler: Simulating a Week of Rain | 2:35 | Using rbinom function to simulate whether it rains or not |
| Riddler: Simulating a Week of Rain | 3:15 | Using ifelse function to set starting number of umbrellas at beginning of week |
| Riddler: Simulating a Week of Rain | 4:20 | Explanation of structure of simulation and approach to determining number of umbrellas in each location |
| Riddler: Simulating a Week of Rain | 5:30 | Changing structure so that we have a row for each day's morning or evening |
| Riddler: Simulating a Week of Rain | 7:10 | Using group_by, ifelse, and row_number functions to set starting number of umbrellas for each trial |
| Riddler: Simulating a Week of Rain | 8:45 | Using case_when function to return different values for multiple logical checks (allows for more outputs than ifelse) |
| Riddler: Simulating a Week of Rain | 10:20 | Using cumsum function to create a running tally of number of umbrellas in each location |
| Riddler: Simulating a Week of Rain | 11:25 | Explanation of output of simulated data |
| Riddler: Simulating a Week of Rain | 12:30 | Using any function to check if any day had a negative "umbrella count" (indicating there wasn't an umbrella available when raining) |
| Riddler: Simulating a Week of Rain | 15:40 | Asking, "When was the first time Louie got wet?" |
| Riddler: Simulating a Week of Rain | 17:10 | Creating a custom vector to convert an integer to a weekday (e.g., 2 = Tue) |
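The screencast builds the umbrella simulation in tidy R (crossing, rbinom, cumsum, any). The same logic as an explicit Python loop — the rain probability and starting umbrella counts here are illustrative assumptions, not the riddle's stated values:

```python
import random

def week_gets_wet(rng, p_rain=0.5, umbrellas_home=2, umbrellas_office=1, days=5):
    """Simulate one work week; return True if Louie is ever caught in the
    rain with no umbrella at his current location."""
    home, office = umbrellas_home, umbrellas_office
    for _ in range(days):
        # morning commute: home -> office
        if rng.random() < p_rain:
            if home == 0:
                return True
            home -= 1
            office += 1
        # evening commute: office -> home
        if rng.random() < p_rain:
            if office == 0:
                return True
            office -= 1
            home += 1
    return False

rng = random.Random(1)
p_wet = sum(week_gets_wet(rng) for _ in range(10_000)) / 10_000
```

The negative-running-total check at 12:30 above and this `== 0` check are two ways of detecting the same event.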
| Screencast | Time | Description |
|---|---|---|
| Dolphins | 6:25 | Using year function from lubridate package to simplify calculating age of dolphins |
| Dolphins | 8:30 | Combining count and fct_lump functions to get counts of top 5 species (with other species lumped in "Other") |
| Dolphins | 9:55 | Creating boxplot of species and age |
| Dolphins | 11:50 | Dealing with different types of NA (double, logical) (he doesn't get it in this case, but it's still useful) |
| Dolphins | 15:30 | Adding acquisition type as colour dimension to histogram |
| Dolphins | 16:00 | Creating a spinogram of acquisition type over time (alternative to histogram) using geom_area |
| Dolphins | 17:25 | Binning year into decade using truncated division operator %/% |
| Dolphins | 19:10 | Fixing annoying triangular gaps in spinogram using complete function to fill in gaps in data |
| Dolphins | 21:15 | Using fct_reorder function to reorder acquisition type (bigger categories are placed on the bottom of the spinogram) |
| Dolphins | 23:25 | Adding vertical dashed reference line using geom_vline function |
| Dolphins | 24:05 | Starting analysis of acquisition location |
| Dolphins | 27:05 | Matching messy text data with regex to aggregate values into a few categories using the fuzzyjoin package |
| Dolphins | 31:30 | Using distinct function's .keep_all argument to keep only one row per animal ID |
| Dolphins | 33:10 | Using coalesce function to conditionally replace NAs (same functionality as SQL verb) |
| Dolphins | 40:00 | Starting survival analysis |
| Dolphins | 46:25 | Using survfit function from survival package to get a baseline survival curve (i.e., not regressed on any independent variables) |
| Dolphins | 47:30 | Fixing cases where death year is before birth year |
| Dolphins | 48:30 | Fixing specification of survfit model to better fit the format of our data (right-censored data) |
| Dolphins | 50:10 | Built-in plot of baseline survival model (estimation of percentage survival at a given age) |
| Dolphins | 50:30 | Using broom package to tidy the survival model data (which is better for ggplot2 plotting) |
| Dolphins | 52:20 | Fitting survival curve based on sex |
| Dolphins | 54:25 | Cox proportional hazards model (to investigate association of survival time and one or more predictors) |
| Dolphins | 55:50 | Explanation of why dolphins with unknown sex likely have a systematic bias with their data |
| Dolphins | 57:25 | Investigating whether being born in captivity is associated with different survival rates |
| Dolphins | 1:00:10 | Summary of screencast |
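The baseline curve that survfit produces (46:25 above) is a Kaplan-Meier estimator, which handles right-censored records (dolphins still alive at the end of the data). It is small enough to sketch in plain Python — a toy illustration, not the survival package's implementation:

```python
def kaplan_meier(observations):
    """observations: list of (time, event) pairs, where event is 1 for an
    observed death and 0 for right-censoring (animal still alive).
    Returns [(time, survival_probability)] -- the curve survfit draws.
    Ties are processed one at a time, which yields the same product."""
    at_risk = len(observations)
    survival = 1.0
    curve = []
    # sort by time; at equal times, deaths (event=1) come before censorings
    for time, event in sorted(observations, key=lambda o: (o[0], -o[1])):
        if event:
            survival *= 1 - 1 / at_risk
            curve.append((time, survival))
        at_risk -= 1
    return curve
```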
| Screencast | Time | Description |
|---|---|---|
| TidyTuesday Tweets | 1:20 | Importing an rds file using read_rds function |
| TidyTuesday Tweets | 2:55 | Using floor_date function from lubridate package to round dates down (that's what the floor part does) to the month level |
| TidyTuesday Tweets | 5:25 | Asking, "Which tweets get the most re-tweets?" |
| TidyTuesday Tweets | 5:50 | Using contains function to select only columns that contain a certain string ("retweet" in this case) |
| TidyTuesday Tweets | 8:05 | Exploring likes/re-tweets ratio, including dealing with one or the other being 0 (which would cause divide by zero error) |
| TidyTuesday Tweets | 11:00 | Starting exploration of actual text of tweets |
| TidyTuesday Tweets | 11:35 | Using unnest_tokens function from tidytext package to break tweets into individual words (using token argument specifically for tweet-style text) |
| TidyTuesday Tweets | 12:55 | Using anti_join function to filter out stop words (e.g., "and", "or", "the") from tokenized data frame |
| TidyTuesday Tweets | 14:45 | Calculating summary statistics per word (average retweets and likes), then looking at distributions |
| TidyTuesday Tweets | 16:00 | Explanation of Poisson log normal distribution (number of retweets fits this distribution) |
| TidyTuesday Tweets | 17:45 | Additional example of Poisson log normal distribution (number of likes) |
| TidyTuesday Tweets | 18:20 | Explanation of geometric mean as better summary statistic than median or arithmetic mean |
| TidyTuesday Tweets | 25:20 | Using floor_date function from lubridate package to floor dates to the week level and tweaking so that a week starts on Monday (default is Sunday) |
| TidyTuesday Tweets | 30:20 | Asking, "What topic is each week about?" using just the tweet text |
| TidyTuesday Tweets | 31:30 | Calculating TF-IDF of tweets, with week as the "document" |
| TidyTuesday Tweets | 33:45 | Using top_n and group_by functions to select the top tf-idf score for each week |
| TidyTuesday Tweets | 37:55 | Using str_detect function to filter out "words" that are just numbers (e.g., 16, 36) |
| TidyTuesday Tweets | 41:00 | Using distinct function with .keep_all argument to ensure only top 1 result, as alternative to top_n function (which includes ties) |
| TidyTuesday Tweets | 42:30 | Making Jenny Bryan disappointed |
| TidyTuesday Tweets | 42:55 | Using geom_text function to add text labels to graph to show the word associated with each week |
| TidyTuesday Tweets | 44:10 | Using geom_text_repel function from ggrepel package as an alternative to geom_text function for adding text labels to graph |
| TidyTuesday Tweets | 46:30 | Using rvest package to scrape web data from a table in Tidy Tuesday README |
| TidyTuesday Tweets | 51:00 | Starting to look at #rstats tweets |
| TidyTuesday Tweets | 56:35 | Spotting signs of fake accounts with purchased followers (lots of hashtags) |
| TidyTuesday Tweets | 59:15 | Explanation of spotting fake accounts |
| TidyTuesday Tweets | 1:00:45 | Using str_detect to filter out web URLs |
| TidyTuesday Tweets | 1:03:55 | Using str_count function and some regex to count how many hashtags a tweet has |
| TidyTuesday Tweets | 1:07:25 | Creating a Bland-Altman plot (total on x-axis, variable of interest on y-axis) |
| TidyTuesday Tweets | 1:08:45 | Using geom_text function with check_overlap argument to add labels to scatterplot |
| TidyTuesday Tweets | 1:12:20 | Asking, "Who are the most active #rstats tweeters?" |
| TidyTuesday Tweets | 1:15:00 | Summary of screencast |
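The geometric mean discussed at 18:20 above is usually computed on count data (retweets, likes) with a +1 shift so zeros don't blow up the logarithm; whether the screencast uses exactly this variant is my assumption, but it is the common one:

```python
import math

def shifted_geometric_mean(values):
    """exp(mean(log(x + 1))) - 1: a geometric mean that tolerates zeros,
    useful for skewed count data like retweets and likes."""
    logs = [math.log(v + 1) for v in values]
    return math.exp(sum(logs) / len(logs)) - 1
```

Unlike the arithmetic mean, one viral tweet can't dominate the summary; unlike the median, it still moves when the upper tail changes.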
| Screencast | Time | Description |
|---|---|---|
| TV Golden Age | 2:25 | Quick tip on how to start exploring a new dataset |
| TV Golden Age | 7:30 | Investigating inconsistency of shows having a count of seasons that is different from the number of seasons given in the data |
| TV Golden Age | 10:10 | Using %in% operator and all function to only get shows that have a first season and don't have skipped seasons in the data |
| TV Golden Age | 15:30 | Asking, "Which seasons have the most variation in ratings?" |
| TV Golden Age | 20:25 | Using facet_wrap function to separate different shows on a line graph into multiple small graphs |
| TV Golden Age | 20:50 | Writing a custom inline function to make the width of breaks on the x-axis always even (e.g., season 2, 4, 6, etc.) |
| TV Golden Age | 23:50 | Committing, finding, and explaining a common error of using the same variable name when summarizing multiple things |
| TV Golden Age | 28:20 | Using truncated division operator %/% to bin data into two-year bins instead of annual (e.g., 1990 and 1991 get binned to 1990) |
| TV Golden Age | 31:30 | Using subsetting (with square brackets) within the mutate function to calculate mean on only a subset of data (without needing to filter) |
| TV Golden Age | 33:50 | Using gather function (now pivot_longer) to get metrics as columns into tidy format, in order to graph them all at once with a facet_wrap |
| TV Golden Age | 36:30 | Using pmin function to lump all seasons after 4 into one row (it still shows "4", but it represents "4+") |
| TV Golden Age | 39:00 | Asking, "If season 1 is good, do you get a second season?" (show survival) |
| TV Golden Age | 40:35 | Using paste0 and spread functions to get season 1-3 ratings into three columns, one for each season |
| TV Golden Age | 42:05 | Using distinct function with .keep_all argument to remove duplicates by only keeping the first one that appears |
| TV Golden Age | 45:50 | Using logistic regression to answer, "Does season 1 rating affect the probability of getting a second season?" (note he forgets to specify the family argument, fixed at 57:25) |
| TV Golden Age | 48:35 | Using ntile function to divide data into N bins (5 in this case), then eventually using cut function instead |
| TV Golden Age | 57:00 | Adding year as an independent variable to the logistic regression model |
| TV Golden Age | 58:50 | Adding an interaction term (season 1 interacting with year) to the logistic regression model |
| TV Golden Age | 59:55 | Using augment function as a method of visualizing and interpreting coefficients of regression model |
| TV Golden Age | 1:00:30 | Using crossing function to create new data to test the logistic regression model on and interpret model coefficients |
| TV Golden Age | 1:03:40 | Fitting natural splines using the splines package, which would capture a non-linear relationship |
| TV Golden Age | 1:06:15 | Summary of screencast |
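The renewal question at 45:50 above is fit in R with glm(..., family = "binomial"). To make the mechanics concrete, here is a bare-bones logistic regression fit by gradient ascent in stdlib Python — a pedagogical stand-in, not how glm actually solves it (glm uses iteratively reweighted least squares):

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent
    on the log-likelihood; returns the intercept a and slope b."""
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b
```

With toy data where higher season-1 ratings go with renewal, the fitted slope `b` comes out positive, matching the screencast's finding.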
| Screencast | Time | Description |
|---|---|---|
| Space Launches | 4:40 | Using str_detect function to find missions with "Apollo" in their name |
| Space Launches | 6:20 | Starting EDA (exploratory data analysis) |
| Space Launches | 15:10 | Using fct_collapse function to recode factors (similar to case_when function) |
| Space Launches | 16:45 | Using countrycode function from countrycode package to get full country names from country codes (e.g. "RU" becomes "Russia") |
| Space Launches | 18:15 | Using replace_na function to convert NA (missing) observations to "Other" |
| Space Launches | 19:10 | Creating a line graph using geom_line function with different colours for different categories |
| Space Launches | 21:05 | Using fct_reorder function to reorder factors in line graph above, in order to make legend more readable |
| Space Launches | 32:00 | Creating a bar graph, using geom_col function, of most active (by number of launches) private or startup agencies |
| Space Launches | 35:05 | Using truncated division operator %/% to bin data into decades |
| Space Launches | 35:35 | Using complete function to turn implicit zeros into explicit zeros (makes for a cleaner line graph) |
| Space Launches | 37:15 | Using facet_wrap function to create small multiples of a line graph, then proceeding to tweak the graph |
| Space Launches | 42:50 | Using semi_join function as a filtering step |
| Space Launches | 43:15 | Using geom_point to create a timeline of launches by vehicle type |
| Space Launches | 47:20 | Explanation of why boxplots over time might not be a good visualization choice |
| Space Launches | 48:00 | Using geom_jitter function to tweak the timeline graph to be more readable |
| Space Launches | 51:30 | Creating a second timeline graph for US vehicles and launches |
| Space Launches | 56:35 | Summary of screencast |
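The truncated-division binning at 35:05 above (and the modulo trick used in other screencasts) maps directly onto Python's `//` and `%` operators:

```python
def bin_to_decade(year):
    """R's year %/% 10 * 10: integer division bins 1994 -> 1990."""
    return year // 10 * 10

def every_fifth_year(years):
    """R's year %% 5 == 0: keeps 1990, 1995, 2000, ..."""
    return [y for y in years if y % 5 == 0]
```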
| Screencast | Time | Description |
|---|---|---|
| US Incarceration | 4:30 | Creating a facetted (small multiples) line graph of incarceration rate by urbanicity and race over time |
| US Incarceration | 7:45 | Discussion of statistical testing of incarceration rates by urbanicity (e.g., rural, suburban) |
| US Incarceration | 11:25 | Exploring the extent of missing data on prison population |
| US Incarceration | 14:15 | Using any function to filter down to states that have at least one (hence the any function) row of non-missing data |
| US Incarceration | 18:40 | Using cut function to manually bin data along user-specified intervals |
| US Incarceration | 24:15 | Starting to create a choropleth map of incarceration rate by state |
| US Incarceration | 26:20 | Using match function to match two-letter state abbreviation to full state name, in order to get data needed to create a map |
| US Incarceration | 28:00 | Actually typing the code (now that we have the necessary data) to create a choropleth map |
| US Incarceration | 33:05 | Using str_remove function and regex to chop off the end of county names (e.g., "Allen Parish" becomes "Allen") |
| US Incarceration | 33:30 | Making choropleth more specific by drilling down to county-level data |
| US Incarceration | 41:10 | Starting to make an animated choropleth map using gganimate package |
| US Incarceration | 42:20 | Using modulo operator %% to choose every 5th year |
| US Incarceration | 43:45 | Using scale_fill_gradient2 function's limits argument to exclude unusually high values that were blowing out the scale |
| US Incarceration | 48:15 | Using summarise_at function to apply the same function to multiple fields at the same time |
| US Incarceration | 50:10 | Starting to investigate missing data (how much is missing, where is it missing, etc.) |
| US Incarceration | 54:50 | Creating a line graph that excludes counties with missing data |
| US Incarceration | 57:05 | Summary of screencast |
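The filter at 14:15 above keeps only states with at least one non-missing row, via R's `any`. The equivalent idea in Python, assuming a simple dict-of-lists layout (my assumption, not the dataset's actual shape):

```python
def states_with_data(populations):
    """Keep only states that have at least one non-missing prison
    population value, mirroring filter(any(!is.na(...))) after group_by."""
    return {state: values
            for state, values in populations.items()
            if any(v is not None for v in values)}
```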
| Screencast | Time | Description |
|---|---|---|
| US Dairy Consumption | 2:50 | Identifying the need for a gather step |
| US Dairy Consumption | 4:40 | Changing snake case to title case using str_to_title and str_replace_all functions |
| US Dairy Consumption | 6:20 | Identifying need for separating categories into major and minor categories (e.g., "Cheese Other" can be divided into "Cheese" and "Other") |
| US Dairy Consumption | 7:10 | Using separate function to split categories into major and minor categories (good explanation of "extra" argument, which merges additional separations into one field) |
| US Dairy Consumption | 8:20 | Using coalesce function to deal with NAs resulting from above step |
| US Dairy Consumption | 10:30 | Dealing with graph of minor category that is linked to multiple major categories ("Other" linked to "Cheese" and "Frozen") |
| US Dairy Consumption | 13:10 | Introducing fct_lump function as an approach to work with many categories |
| US Dairy Consumption | 14:50 | Introducing facetting (facet_wrap function) as second alternative to working with many categories |
| US Dairy Consumption | 15:50 | Dealing with "Other" category having two parts to it by using ifelse function in the cleaning step (e.g., go from "Other" to "Other Cheese") |
| US Dairy Consumption | 19:45 | Looking at page for the sweep package |
| US Dairy Consumption | 21:20 | Using tk_ts function to coerce a tibble to a timeseries |
| US Dairy Consumption | 22:10 | Turning year column (numeric) into a date by adding number of years to Jan 1, 0001 |
| US Dairy Consumption | 26:00 | Nesting time series object into each combination of category and product |
| US Dairy Consumption | 27:50 | Applying ETS (Error, Trend, Seasonal) model to each time series |
| US Dairy Consumption | 28:10 | Using sw_glance function (sweep package's version of glance function) to pull out model parameters from model field created in above step |
| US Dairy Consumption | 29:45 | Using sw_augment function to append fitted values and residuals from the model to the original data |
| US Dairy Consumption | 30:50 | Visualising actual and fitted values on the same graph to get a look at the ETS model |
| US Dairy Consumption | 32:10 | Using Arima function (note the capital A) as alternative to ETS (not sure what the difference is between arima and Arima) |
| US Dairy Consumption | 35:00 | Forecasting into the future using an ETS model using various functions: unnest, sw_sweep, forecast |
| US Dairy Consumption | 37:45 | Using geom_ribbon function to add confidence bounds to forecast |
| US Dairy Consumption | 40:20 | Forecasting using auto-ARIMA (instead of ETS) |
| US Dairy Consumption | 40:55 | Applying two forecasting methods at the same time (auto-ARIMA and ETS) using the crossing function |
| US Dairy Consumption | 41:55 | Quick test of how invoke function works (used to call a function, e.g., when it is given as a character string rather than called directly) |
| US Dairy Consumption | 47:35 | Removing only one part of legend (line type of solid or dashed) using scale_linetype_discrete function |
| US Dairy Consumption | 51:25 | Using gather function to clean up new dataset |
| US Dairy Consumption | 52:05 | Using fct_recode to fix a typo in a categorical variable |
| US Dairy Consumption | 56:00 | Copy-pasting previous forecasting code to cheese and reviewing any changes needed |
| US Dairy Consumption | 57:20 | Discussing alternative approach: creating interactive visualisation using shiny package to do direct comparisons |
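The ETS models fit at 27:50 above combine error, trend, and seasonal components; the simplest member of that family is level-only simple exponential smoothing, small enough to sketch in plain Python (a toy illustration of the idea, not what the forecast package fits):

```python
def simple_exponential_smoothing(series, alpha=0.3):
    """Level-only exponential smoothing: each new level is a weighted
    blend of the latest observation and the previous level.
    Returns the fitted levels; the final level is the flat forecast."""
    level = series[0]
    fitted = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        fitted.append(level)
    return fitted
```

Larger `alpha` tracks the data more closely; smaller `alpha` smooths harder — the trade-off the ETS machinery estimates automatically.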
| Screencast | Time | Description |
|---|---|---|
| US PhDs | 3:15 | Using read_xlsx function to read in Excel spreadsheet, including skipping first few rows that don't have data |
| US PhDs | 7:25 | Overview of starting very messy data |
| US PhDs | 8:20 | Using gather function to clean up wide dataset |
| US PhDs | 9:20 | Using fill function to fill in NA values with entries from the previous observation |
| US PhDs | 10:10 | Cleaning a variable that has number and percent stacked on top of one another, using a combination of ifelse and fill functions |
| US PhDs | 12:00 | Using spread function on cleaned data to separate number and percent by year |
| US PhDs | 13:50 | Spotted a mistake where he had the wrong string on str_detect function |
| US PhDs | 16:50 | Using sample function to get 6 random fields of study to graph |
| US PhDs | 18:50 | Cleaning another dataset, which is much easier to clean |
| US PhDs | 19:05 | Renaming the first field, even without knowing the exact name |
| US PhDs | 21:55 | Cleaning another dataset |
| US PhDs | 23:10 | Discussing challenge of when indentation is used in original dataset (for group / sub-group distinction) |
| US PhDs | 25:20 | Starting to separate out data that is appended to one another in the original dataset (all, male, female) |
| US PhDs | 27:30 | Removing field with long name using contains function |
| US PhDs | 28:10 | Using fct_recode function to rename an oddly-named category in a categorical variable (ifelse function is probably a better alternative) |
| US PhDs | 35:30 | Discussing solution to broad major field description and fine major field description (meaningfully indented in original data) |
| US PhDs | 39:40 | Using setdiff function to separate broad and fine major fields |
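The forward-fill step at 9:20 above (tidyr's fill) carries the last seen value down over runs of missing entries — the standard fix for spreadsheet exports where a group label appears only on its first row. A stdlib Python equivalent:

```python
def fill_down(values):
    """Forward-fill missing entries, like tidyr::fill(direction = "down").
    None stands in for R's NA; leading Nones stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```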
| Screencast | Time | Description |
|---|---|---|
| French Train Delays | 10:20 | Boxplots of departure stations using fct_lump function |
| French Train Delays | 14:25 | Creating heat map of departure and arrival delays, then cleaning up a sparse heat map |
| French Train Delays | 15:30 | Using fct_reorder function and length function to reorder stations based on how frequently they appear |
| French Train Delays | 16:30 | Using fct_infreq to reorder stations by how frequently they appear (same as above, but without needing a trick) |
| French Train Delays | 17:45 | Using fct_lump function to lump based on proportion instead of number of top categories desired |
| French Train Delays | 18:45 | Using scale_fill_gradient2 function to specify diverging colour scale |
| French Train Delays | 26:00 | Checking another person's take on the data, which is a heatmap over time |
| French Train Delays | 28:40 | Converting year and month (as digits) into date-class variable using sprintf function and padding month number with extra zero when necessary |
| French Train Delays | 34:50 | Using summarise_at function to quickly sum multiple columns |
| French Train Delays | 39:35 | Creating heatmap using geom_tile function for percentage of late trains by station over time |
| French Train Delays | 45:05 | Using fill function to fill in missing NA values with data from previous observations |
| French Train Delays | 50:35 | Grouping multiple variables into a single category using paste0 function |
| French Train Delays | 51:40 | Grouping heatmap into International / National chunks with a weird hack |
| French Train Delays | 52:20 | Further separating International / National visually |
| French Train Delays | 53:30 | Less hacky way of separating International / National (compared to previous two rows) |
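The date construction at 28:40 above hinges on zero-padding the month so "2017-3" becomes the parseable "2017-03-01". R's sprintf("%d-%02d-01", ...) has a direct f-string analogue in Python:

```python
def to_month_date(year, month):
    """Build a parseable ISO-style date string from numeric year/month,
    padding single-digit months with a leading zero."""
    return f"{year}-{month:02d}-01"
```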
| Screencast | Time | Description |
|---|---|---|
| Women in the Workplace | 5:50 | Writing a custom function that summarizes variables based on their names (then abandoning the idea) |
| Women in the Workplace | 9:15 | Using complete.cases function to find observations that have an NA value in any variable |
| Women in the Workplace | 9:50 | Using subsetting within a summarise function to calculate a weighted mean when dealing with 0 or NA values in some observations |
| Women in the Workplace | 12:20 | Debugging what is causing NA values to appear in the summarise output (finds the error at 13:25) |
| Women in the Workplace | 17:50 | Hypothesizing about one sector illustrating a variation of Simpson's Paradox |
| Women in the Workplace | 25:25 | Creating a scatterplot with a logarithmic scale and using scale_colour_gradient2 function to encode data to point colour |
| Women in the Workplace | 30:00 | Creating an interactive plot (tooltips show up on hover) using ggplotly function from plotly package |
| Women in the Workplace | 33:20 | Fiddling with scale_size_continuous function's range argument to specify point size on a scatterplot (which are encoded to total workers) |
| Women in the Workplace | 34:50 | Explanation of why healthcare sector is a good example of Simpson's Paradox |
| Women in the Workplace | 43:15 | Starting to create a shiny app with "occupation" as only input (many tweaks in subsequent minutes to make it work) |
| Women in the Workplace | 47:55 | Tweaking size (height) of graph in shiny app |
| Women in the Workplace | 54:05 | Summary of screencast |
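Subsetting inside summarise (9:50 above) lets you compute a weighted mean over only the usable rows without a separate filter step. The same pattern in Python — the exact exclusion rule (drop missing values and non-positive weights) is my illustrative assumption:

```python
def weighted_mean_complete(values, weights):
    """Weighted mean over only complete, positively weighted pairs --
    the effect of bracket-subsetting inside summarise in R."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None and w > 0]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight
```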
| Screencast | Time | Description |
|---|---|---|
| Board Game Reviews | 2:50 | Starting EDA (exploratory data analysis) with counts of categorical variables |
| Board Game Reviews | 7:25 | Specifying scale_x_log10 function's breaks argument to get sensible tick marks for time on histogram |
| Board Game Reviews | 8:45 | Tweaking geom_histogram function's binwidth argument to get something that makes sense for log scale |
| Board Game Reviews | 10:10 | Using separate_rows to break down comma-separated values for three different categorical variables |
| Board Game Reviews | 15:55 | Using top_n to get top 20 observations from each of several categories (not quite right, fixed at 17:47) |
| Board Game Reviews | 16:15 | Troubleshooting various issues with facetted graph (e.g., ordering, values appearing in multiple categories) |
| Board Game Reviews | 19:55 | Starting prediction of average rating with a linear model |
| Board Game Reviews | 20:50 | Splitting data into train/test sets (training/holdout) |
| Board Game Reviews | 22:55 | Investigating relationship between max number of players and average rating (to determine if it should be in linear model) |
| Board Game Reviews | 25:05 | Exploring average rating over time ("Do newer games tend to be rated higher/lower?") |
| Board Game Reviews | 27:35 | Discussing necessity of controlling for year a game was published in the linear model |
| Board Game Reviews | 28:30 | Non-model approach to exploring relationship between game features (e.g., card game, made in Germany) on average rating |
| Board Game Reviews | 30:50 | Using geom_boxplot function to create boxplot of average ratings for most common game features |
| Board Game Reviews | 34:05 | Using unite function to combine multiple variables into one |
| Board Game Reviews | 37:25 | Introducing Lasso regression as good option when you have many features likely to be correlated with one another |
| Board Game Reviews | 38:15 | Writing code to set up Lasso regression using glmnet and tidytext packages |
| Board Game Reviews | 40:05 | Adding average rating to the feature matrix (warning: method is messy) |
| Board Game Reviews | 41:40 | Using setdiff function to find games that are in one set, but not in another (while setting up matrix for Lasso regression) |
| Board Game Reviews | 44:15 | Spotting the error stemming from the step above (calling row names from the wrong data) |
| Board Game Reviews | 45:45 | Explaining what a Lasso regression does, including the penalty parameter lambda |
| Board Game Reviews | 48:35 | Using a cross-validated Lasso model to choose the level of the penalty parameter (lambda) |
| Board Game Reviews | 51:35 | Adding non-categorical variables to the Lasso model to control for them (e.g., max number of players) |
| Board Game Reviews | 55:15 | Using unite function to combine multiple variables into one, separated by a colon |
| Board Game Reviews | 58:45 | Graphing the top 20 coefficients in the Lasso model that have the biggest effect on predicted average rating |
| Board Game Reviews | 1:00:55 | Mentioning the yardstick package as a way to evaluate the model's performance |
| Board Game Reviews | 1:01:15 | Discussing drawbacks of linear models like Lasso (can't do non-linear relationships or interaction effects) |
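The cross-validated Lasso workflow covered above (38:15–54:00) can be sketched as follows. This is a minimal sketch using simulated data; the variable names are illustrative and not from the screencast.

```r
library(glmnet)

# Simulated feature matrix: 100 games x 10 binary features (illustrative)
set.seed(2019)
features <- matrix(rbinom(1000, 1, 0.3), nrow = 100,
                   dimnames = list(NULL, paste0("feature", 1:10)))
avg_rating <- rnorm(100, mean = 7)

# Cross-validated Lasso: cv.glmnet chooses the penalty parameter lambda
cv_fit <- cv.glmnet(features, avg_rating, alpha = 1)

# Coefficients at the largest lambda within one SE of the minimum CV error
coef(cv_fit, s = "lambda.1se")
```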
| Screencast | Time | Description |
|---|---|---|
| Seattle Pet Names | 2:40 | Using mdy function from lubridate package to convert character-formatted date to date-class |
| Seattle Pet Names | 4:20 | Exploratory bar graph showing top species of cats, using geom_col function |
| Seattle Pet Names | 6:30 | Specifying facet_wrap function's ncol argument to get graphs stacked vertically (instead of side-by-side) |
| Seattle Pet Names | 9:55 | Asking, "Are some animal names associated with particular dog breeds?" |
| Seattle Pet Names | 11:15 | Explanation of add_count function |
| Seattle Pet Names | 12:35 | Adding up various metrics (e.g., number of names overall, number of breeds overall), but note a mistake that gets fixed at 17:05 |
| Seattle Pet Names | 16:10 | Calculating a ratio for names that appear over-represented within a breed, then explaining how small samples can be misleading |
| Seattle Pet Names | 17:05 | Spotting and fixing an aggregation mistake |
| Seattle Pet Names | 17:55 | Explanation of how to investigate which names might be over-represented within a breed |
| Seattle Pet Names | 18:55 | Explanation of how to use hypergeometric distribution to test for name over-representation |
| Seattle Pet Names | 20:40 | Using phyper function to calculate p-values for a one-sided hypergeometric test |
| Seattle Pet Names | 23:30 | Additional explanation of hypergeometric distribution |
| Seattle Pet Names | 24:00 | First investigation of why and how to interpret a p-value histogram (second at 29:45, third at 37:45, and answer at 39:30) |
| Seattle Pet Names | 25:15 | Noticing that we are missing zeros (i.e., having a breed/name combination with 0 dogs), which is important for the hypergeometric test |
| Seattle Pet Names | 27:10 | Using complete function to turn implicit zeros (for breed/name combination) into explicit zeros |
| Seattle Pet Names | 29:45 | Second investigation of p-value histogram (after adding in implicit zeros) |
| Seattle Pet Names | 31:55 | Explanation of multiple hypothesis testing and correction methods (e.g., Bonferroni, Holm), and applying using p.adjust function |
| Seattle Pet Names | 34:25 | Explanation of False Discovery Rate (FDR) control as a method for correcting for multiple hypothesis testing, and applying using p.adjust function |
| Seattle Pet Names | 37:45 | Third investigation of p-value histogram, to hunt for under-represented names |
| Seattle Pet Names | 39:30 | Answer to why the p-value distribution is not well-behaved |
| Seattle Pet Names | 42:40 | Using crossing function to create a simulated dataset to explore how different values affect the p-value |
| Seattle Pet Names | 44:55 | Explanation of how total number of names and total number of breeds affects p-value |
| Seattle Pet Names | 46:00 | More general explanation of what different shapes of p-value histogram might indicate |
| Seattle Pet Names | 47:30 | Renaming variables within a transmute function, using backticks to get names with spaces in them |
| Seattle Pet Names | 49:20 | Using kable function from the knitr package to create a nice-looking table |
| Seattle Pet Names | 50:00 | Explanation of one-sided p-value (as opposed to two-sided p-value) |
| Seattle Pet Names | 53:55 | Summary of screencast |
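The one-sided hypergeometric test used above (18:55–23:30) boils down to a single phyper call. A minimal sketch with illustrative numbers, not values from the dataset:

```r
# One-sided hypergeometric test for name over-representation within a breed
# (all counts below are illustrative)
name_total  <- 50     # dogs with this name overall
breed_total <- 200    # dogs of this breed
all_dogs    <- 5000   # all dogs in the dataset
observed    <- 8      # dogs of this breed with this name

# P(seeing >= observed name/breed matches by chance alone)
p_value <- phyper(observed - 1, name_total, all_dogs - name_total,
                  breed_total, lower.tail = FALSE)
p_value
```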
| Screencast | Time | Description |
|---|---|---|
| Seattle Bike Counts | 6:15 | Using summarise_all and summarise_at functions to aggregate multiple variables at the same time |
| Seattle Bike Counts | 8:15 | Using magnitude instead of absolute numbers to see trends in time of day |
| Seattle Bike Counts | 12:00 | Dividing time into categories (four categories for times of day, e.g., morning commute, night) using between function |
| Seattle Bike Counts | 15:00 | Looking for systematically missing data (which would bias the results of the analysis) |
| Seattle Bike Counts | 19:45 | Summarising using a filter in the arguments based on whether the time window is during a commute time |
| Seattle Bike Counts | 22:45 | Combining day of week and hour using functions in the lubridate package and as.difftime function (but then he uses facetting as an easier method) |
| Seattle Bike Counts | 26:30 | Normalizing day of week data to percent of weekly traffic |
| Seattle Bike Counts | 42:00 | Starting analysis of directions of travel by time of day (commute vs. reverse-commute) |
| Seattle Bike Counts | 43:45 | Filtering out weekend days using wday function from lubridate package |
| Seattle Bike Counts | 45:30 | Using spread function to create new variable of ratio of bike counts at different commute times |
| Seattle Bike Counts | 47:30 | Visualizing ratio of bike counts by time of day |
| Seattle Bike Counts | 50:15 | Visualizing ratio by hour instead of time of day |
| Seattle Bike Counts | 52:50 | Ordering crossing in graph by when the average trip happened using mean of hour weighted by bike count |
| Seattle Bike Counts | 54:50 | Quick and dirty filter when creating a new variable within a mutate function |
| Screencast | Time | Description |
|---|---|---|
| Tennis Tournaments | 5:00 | Identifying duplicated rows and fixing them |
| Tennis Tournaments | 11:15 | Using add_count and fct_reorder functions to order categories that are broken down into sub-categories for graphing |
| Tennis Tournaments | 13:00 | Tidying graph titles (e.g., replacing underscores with spaces) using str_to_title and str_replace functions |
| Tennis Tournaments | 15:00 | Using inner_join function to merge datasets |
| Tennis Tournaments | 15:30 | Calculating age from date of birth using difftime and as.numeric functions |
| Tennis Tournaments | 16:35 | Adding simple calculations like mean and median into the text portion of markdown document |
| Tennis Tournaments | 17:45 | Looking at distribution of wins by sex using overlapping histograms |
| Tennis Tournaments | 18:55 | Binning years into decades using truncated division %/% |
| Tennis Tournaments | 20:15 | Splitting up boxplots so that they are separated into pairs (M/F) across a different group (decade) using interaction function |
| Tennis Tournaments | 20:30 | Analyzing distribution of ages across decades, looking specifically at the effect of Serena Williams (one individual having a disproportionate effect on the data, making it look like there's a trend) |
| Tennis Tournaments | 24:30 | Avoiding double-counting of individuals by counting their average age instead of their age at each win |
| Tennis Tournaments | 30:20 | Starting analysis to predict winner of Grand Slam tournaments |
| Tennis Tournaments | 35:00 | Creating rolling count using row_number function to make a count of previous tournament experience |
| Tennis Tournaments | 39:45 | Creating rolling win count using cumsum function |
| Tennis Tournaments | 41:00 | Lagging rolling win count using lag function (otherwise we get information about a win before a player has actually won, for prediction purposes) |
| Tennis Tournaments | 43:30 | Asking, "When someone is a finalist, what is their probability of winning as a function of previous tournaments won?" |
| Tennis Tournaments | 48:00 | Asking, "How does the number of wins a finalist has affect their chance of winning?" |
| Tennis Tournaments | 49:00 | Backtesting simple classifier where person with more tournament wins is predicted to win the given tournament |
| Tennis Tournaments | 51:45 | Creating classifier that gives points based on how far a player got in previous tournaments |
| Tennis Tournaments | 52:55 | Using match function to turn name of round reached (1st round, 2nd round, …) into a number score (1, 2, …) |
| Tennis Tournaments | 54:20 | Using cummean function to get score of average past performance (instead of cumsum function) |
| Tennis Tournaments | 1:04:10 | Pulling names of rounds (1st round, 2nd round, … ) based on the rounded numeric score of previous performance |
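The rolling experience and win counts described above (35:00–41:00) follow a common dplyr pattern: cumulate within a group, then lag so a row only sees information available before that event. A minimal sketch with toy data:

```r
library(dplyr)

# Toy data: one row per tournament appearance (illustrative)
appearances <- tibble(
  player = c("A", "A", "A", "B", "B"),
  year   = c(2001, 2002, 2003, 2001, 2002),
  won    = c(1, 0, 1, 0, 1)
)

appearances %>%
  arrange(player, year) %>%
  group_by(player) %>%
  mutate(
    previous_tournaments = row_number() - 1,       # experience before this event
    previous_wins = lag(cumsum(won), default = 0)  # wins *before* this event
  ) %>%
  ungroup()
```

The lag is the crucial step for prediction: without it, a row would leak the outcome of its own tournament into the predictor.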
| Screencast | Time | Description |
|---|---|---|
| Bird Collisions | 2:45 | Analyzing when NAs appear in a dimension |
| Bird Collisions | 7:30 | Looking at multiple categorical variables at the same time by gathering them into one column and eventually graphing each as a different facet |
| Bird Collisions | 9:30 | Re-ordering facet graphs from those with the fewest categories to those with the most |
| Bird Collisions | 20:45 | Geometric mean for estimating counts when there are a lot of low values (1-3 bird collisions, in this case) |
| Bird Collisions | 23:15 | Filling in "blank" observations where there were no observations made |
| Bird Collisions | 27:00 | Using log+1 to convert a dimension with values of 0 into a log scale |
| Bird Collisions | 29:00 | Adding confidence bounds for data using a geometric mean (where he first gets the idea of bootstrapping) |
| Bird Collisions | 32:00 | Actual coding of bootstrap starts |
| Bird Collisions | 38:30 | Adding confidence bounds using bootstrap data |
| Bird Collisions | 42:00 | Investigating potential confounding variables |
| Bird Collisions | 44:15 | Discussing approaches to dealing with confounding variables |
| Bird Collisions | 46:45 | Using complete function to get explicit NA values |
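The bootstrap confidence bounds on a geometric mean (29:00–38:30) can be sketched like this. The data here is simulated; the screencast uses actual collision counts.

```r
library(purrr)

set.seed(42)
counts <- rpois(200, lambda = 2) + 1   # simulated low counts (illustrative)

# Geometric mean: appropriate when many values are small (1-3 collisions)
geom_mean <- function(x) exp(mean(log(x)))

# Bootstrap: resample with replacement, recompute the geometric mean
boot_means <- map_dbl(1:1000, ~ geom_mean(sample(counts, replace = TRUE)))

# 95% confidence bounds from the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))
```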
| Screencast | Time | Description |
|---|---|---|
| Student-Teacher Ratios | 7:30 | Using slice function to select 10 highest and 10 lowest student-teacher ratios (like a filter using row numbers) |
| Student-Teacher Ratios | 12:35 | Adding GDP per capita to a dataset using WDI package |
| Student-Teacher Ratios | 17:40 | Using geom_text to add labels to points on a scatterplot |
| Student-Teacher Ratios | 19:00 | Using WDIsearch function from WDI package to search for country population data |
| Student-Teacher Ratios | 23:20 | Explanation of trick with geom_text function's check_overlap argument to get label for US to appear by rearranging row order |
| Student-Teacher Ratios | 25:45 | Using comma_format function from scales package to get more readable numeric legend (e.g., "500,000,000" instead of "5e+08") |
| Student-Teacher Ratios | 27:55 | Exploring different education-related indicators in the WDI package |
| Student-Teacher Ratios | 31:55 | Using spread function (now pivot_wider) to turn data from tidy to wide format |
| Student-Teacher Ratios | 32:15 | Using to_snake_case function from snakecase package to convert field names to snake_case |
| Student-Teacher Ratios | 48:30 | Exploring female/male secondary school enrollment |
| Student-Teacher Ratios | 51:50 | Note of caution on keeping confounders in mind when interpreting scatterplots |
| Student-Teacher Ratios | 52:30 | Creating a linear regression of secondary school enrollment to explore confounders |
| Student-Teacher Ratios | 54:30 | Discussing the actual confounder (GDP per capita) in the linear regression above |
| Student-Teacher Ratios | 57:20 | Adding world region as another potential confounder |
| Student-Teacher Ratios | 58:00 | Using aov function (ANOVA) to explore confounders further |
| Student-Teacher Ratios | 1:06:50 | Reviewing and interpreting the final linear regression model |
| Student-Teacher Ratios | 1:08:00 | Using cor function (correlation) to get correlation matrix for three variables (and brief explanation of multi-collinearity) |
| Student-Teacher Ratios | 1:10:10 | Summary of screencast |
| Screencast | Time | Description |
|---|---|---|
| Nobel Prize Winners | 2:00 | Creating a stacked bar plot using geom_col and the aes function's fill argument (also bins years into decades with truncated division operator %/%) |
| Nobel Prize Winners | 3:30 | Using n_distinct function to quickly count unique years in a group |
| Nobel Prize Winners | 9:00 | Using distinct function and its .keep_all argument to de-duplicate data |
| Nobel Prize Winners | 10:50 | Using coalesce function to replace NAs in a variable (similar to SQL COALESCE verb) |
| Nobel Prize Winners | 16:10 | Using year function from lubridate package to calculate (approx.) age of laureates at time of award |
| Nobel Prize Winners | 16:50 | Using fct_reorder function to arrange boxplot graph by the median age of winners |
| Nobel Prize Winners | 22:50 | Defining a new variable within the count function (like doing a mutate in the count function) |
| Nobel Prize Winners | 23:40 | Creating a small multiples bar plot using geom_col and facet_wrap functions |
| Nobel Prize Winners | 26:15 | Importing income data from WDI package to explore relationship between high/low income countries and winners |
| Nobel Prize Winners | 33:45 | Using fct_relevel to change the levels of a categorical income variable (e.g., "Upper middle income") so that the ordering makes sense |
| Nobel Prize Winners | 36:25 | Starting to explore new dataset of nobel laureate publications |
| Nobel Prize Winners | 44:25 | Taking the mean of a subset of data without needing to fully filter the data beforehand |
| Nobel Prize Winners | 49:15 | Using rank function and its ties.method argument to add the ordinal number of a laureate's publication (e.g., 1st paper, 2nd paper) |
| Nobel Prize Winners | 1:05:10 | Lots of playing around with exploratory histograms (geom_histogram) |
| Nobel Prize Winners | 1:06:45 | Discussion of right-censoring as an issue (people winning the Nobel prize but still having active careers) |
| Nobel Prize Winners | 1:10:20 | Summary of screencast |
| Screencast | Time | Description |
|---|---|---|
| Plastic Waste | 1:45 | Using summarise_all to get proportion of NA values across many variables |
| Plastic Waste | 16:50 | Adding text labels to scatter plot for some points using check_overlap argument |
| Plastic Waste | 21:45 | Using pmin function to get the lower of two possible numbers for a percentage variable that was showing > 100% |
| Plastic Waste | 29:00 | Starting to make a choropleth map |
| Plastic Waste | 29:30 | Connecting ISO country names (used in mapping code) to country names given in the dataset |
| Plastic Waste | 32:00 | Actual code to create the map using given longitude and latitude |
| Plastic Waste | 33:45 | Using fuzzyjoin package to join on regular expression matches instead of exact character matches (using regex_right_join / regex_left_join functions) |
| Plastic Waste | 36:15 | Using coord_fixed function as a hack to get proper ratios for maps |
| Plastic Waste | 39:30 | Bringing in additional data using WDI package |
| Plastic Waste | 47:30 | Using patchwork package to show multiple graphs in the same plot |
| Plastic Waste | 53:00 | Importing and renaming multiple indicators from the WDI package at the same time |
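The regex join used to reconcile country names (33:45) can be sketched with fuzzyjoin. The tables and patterns below are toy examples, not the screencast's data:

```r
library(dplyr)
library(fuzzyjoin)

# Map data often names countries differently than the dataset does;
# regex_left_join matches on a regular expression instead of exact equality
map_names <- tibble(region = c("USA", "UK", "South Korea"))
iso_patterns <- tibble(
  pattern = c("USA|United States", "UK|United Kingdom", "Korea"),
  iso3c   = c("USA", "GBR", "KOR")
)

map_names %>%
  regex_left_join(iso_patterns, by = c(region = "pattern"))
```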
| Screencast | Time | Description |
|---|---|---|
| Wine Ratings | 3:15 | Using extract function from tidyr package to pull out year from text field |
| Wine Ratings | 9:15 | Changing extract function to pull out year column more accurately |
| Wine Ratings | 13:00 | Starting to explore prediction of points |
| Wine Ratings | 17:00 | Using fct_lump on country variable to collapse countries into an "Other" category, then fct_relevel to set the baseline category for a linear model |
| Wine Ratings | 21:30 | Investigating year as a potential confounding variable |
| Wine Ratings | 24:45 | Investigating "taster_name" as a potential confounding variable |
| Wine Ratings | 27:45 | Coefficient (TIE fighter) plot to see effect size of terms in a linear model, using tidy function from broom package |
| Wine Ratings | 30:45 | Polishing category names for presentation in graph using str_replace function |
| Wine Ratings | 32:15 | Using augment function to add predictions of linear model to original data |
| Wine Ratings | 33:30 | Plotting predicted points vs. actual points |
| Wine Ratings | 34:45 | Using ANOVA to determine the amount of variation explained by different terms |
| Wine Ratings | 36:45 | Using tidytext package to set up wine review text for Lasso regression |
| Wine Ratings | 40:00 | Setting up and using pairwise_cor function to look at words that appear in reviews together |
| Wine Ratings | 45:00 | Creating sparse matrix using cast_sparse function from tidytext package; used to perform a regression on positive/negative words |
| Wine Ratings | 46:45 | Checking if row names of sparse matrix correspond to the wine_id values they represent |
| Wine Ratings | 47:00 | Setting up sparse matrix for using glmnet package to do sparse regression using Lasso method |
| Wine Ratings | 48:15 | Actually writing code for doing Lasso regression |
| Wine Ratings | 49:45 | Basic explanation of Lasso regression |
| Wine Ratings | 51:00 | Putting Lasso model into tidy format |
| Wine Ratings | 53:15 | Explaining how the number of terms increases as lambda (penalty parameter) decreases |
| Wine Ratings | 54:00 | Answering how we choose a lambda value (penalty parameter) for Lasso regression |
| Wine Ratings | 56:45 | Using parallelization for intensive computations |
| Wine Ratings | 58:30 | Adding price (from original linear model) to Lasso regression |
| Wine Ratings | 1:02:15 | Showing the glmnet.fit piece of a Lasso model (using glmnet package) |
| Wine Ratings | 1:03:30 | Picking a lambda value (penalty parameter) and explaining which one to pick |
| Wine Ratings | 1:08:15 | Taking most extreme coefficients (positive and negative) by grouping them by direction |
| Wine Ratings | 1:10:30 | Demonstrating tidytext package's sentiment lexicon, then looking at individual reviews to demonstrate the model |
| Wine Ratings | 1:17:30 | Visualizing each coefficient's effect on a single review |
| Wine Ratings | 1:20:30 | Using str_trunc to truncate character strings |
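The year-extraction step near the start (3:15, refined at 9:15) uses tidyr's extract function with a regex capturing group. A minimal sketch with made-up titles; the regex is illustrative and only matches 2000s vintages:

```r
library(dplyr)
library(tidyr)

wines <- tibble(title = c("Chateau Example 2013 Red",
                          "Example Winery 2009 Pinot Noir"))

# Pull a four-digit year out of the free-text title into its own column
wines %>%
  extract(title, "year", "(20\\d\\d)", convert = TRUE, remove = FALSE)
```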
| Screencast | Time | Description |
|---|---|---|
| Ramen Reviews | 1:45 | Looking at the website the data came from |
| Ramen Reviews | 2:55 | Using gather function (now pivot_longer) to convert wide data to long (tidy) format |
| Ramen Reviews | 4:15 | Graphing counts of all categorical variables at once, then exploring them |
| Ramen Reviews | 5:35 | Using fct_lump function to lump three categorical variables to the top N categories and "Other" |
| Ramen Reviews | 7:45 | Using reorder_within function to re-order factors that have the same name across multiple facets |
| Ramen Reviews | 9:10 | Using lm function (linear model) to predict star rating |
| Ramen Reviews | 9:50 | Visualising effects (and 95% CI) of independent variables in linear model with a coefficient plot (TIE fighter plot) |
| Ramen Reviews | 11:30 | Using fct_relevel function to get "Other" as the base reference level for categorical independent variables in a linear model |
| Ramen Reviews | 13:05 | Using extract function and regex to split a camelCase variable into two separate variables |
| Ramen Reviews | 14:45 | Using facet_wrap function to split coefficient / TIE fighter plot into three separate plots, based on type of coefficient |
| Ramen Reviews | 15:40 | Using geom_vline function to add reference line to graph |
| Ramen Reviews | 17:20 | Using unnest_tokens function from tidytext package to explore the relationship between variety (a sparse categorical variable) and star rating |
| Ramen Reviews | 18:55 | Explanation of how he would approach variety variable with Lasso regression |
| Ramen Reviews | 19:35 | Web scraping using the rvest package and SelectorGadget (Chrome Extension CSS selector) |
| Ramen Reviews | 21:20 | Actually writing code for web scraping, using read_html, html_node, and html_table functions |
| Ramen Reviews | 22:25 | Using clean_names function from janitor package to clean up names of variables |
| Ramen Reviews | 23:05 | Explanation of web scraping task: get full review text using the links from the review summary table scraped above |
| Ramen Reviews | 25:40 | Using parse_number function as alternative to as.integer function to cleverly drop extra weird text in review number |
| Ramen Reviews | 26:45 | Using SelectorGadget (Chrome Extension CSS selector) to identify part of page that contains review text |
| Ramen Reviews | 27:35 | Using html_nodes, html_text, and str_subset functions to write custom function to scrape review text identified in step above |
| Ramen Reviews | 29:15 | Adding message function to custom scraping function to display URLs as they are being scraped |
| Ramen Reviews | 30:15 | Using unnest_tokens and anti_join functions to split review text into individual words and remove stop words (e.g., "the", "or", "and") |
| Ramen Reviews | 31:05 | Catching a mistake in the custom function causing it to read the same URL every time |
| Ramen Reviews | 31:55 | Using str_detect function to filter out review paragraphs that don't contain a keyword |
| Ramen Reviews | 32:40 | Using str_remove function and regex to get rid of string that follows a specific pattern |
| Ramen Reviews | 34:10 | Explanation of possibly and safely functions in purrr package |
| Ramen Reviews | 37:45 | Reviewing output of the URL that failed to scrape, including using character(0) as a default null value |
| Ramen Reviews | 48:00 | Using pairwise_cor function from widyr package to see which words tend to appear in reviews together |
| Ramen Reviews | 51:05 | Using igraph and ggraph packages to make network plot of word correlations |
| Ramen Reviews | 51:55 | Using geom_node_text function to add labels to network plot |
| Ramen Reviews | 52:35 | Including all words (not just those connected to others) as vertices in the network plot |
| Ramen Reviews | 54:40 | Tweaking and refining network plot aesthetics (vertex size and colour) |
| Ramen Reviews | 56:00 | Weird hack for getting a dark outline on hard-to-see vertex points |
| Ramen Reviews | 59:15 | Summary of screencast |
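The scraping pattern covered above (21:20–37:45) combines rvest with purrr's failure handling. A minimal sketch: the CSS selector here is a placeholder assumption, since the real selector comes from SelectorGadget in the screencast, and no URL is hard-coded.

```r
library(rvest)
library(purrr)

# Scrape the text of review paragraphs from a single page
# ("p" is an assumed selector, not the one used in the screencast)
scrape_review <- function(url) {
  message("Scraping ", url)      # show progress as URLs are scraped
  read_html(url) %>%
    html_nodes("p") %>%
    html_text()
}

# possibly() returns a default instead of erroring when a page fails,
# so one bad URL doesn't break a long scraping run
scrape_review_safely <- possibly(scrape_review, otherwise = character(0))
```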
| Screencast | Time | Description |
|---|---|---|
| Media Franchise Revenue | 9:15 | Explaining use of semi_join function to aggregate and filter groups |
| Media Franchise Revenue | 11:00 | Putting the largest categories on the bottom of a stacked bar chart |
| Media Franchise Revenue | 14:30 | Using glue function as alternative to paste for combining text, plus good explanation of it |
| Media Franchise Revenue | 19:30 | Multiple rounds of re-ordering a facetted graph using fct_reorder function (he works through several obstacles) |
| Media Franchise Revenue | 20:40 | Re-ordering the position of facetted graphs so that highest total revenue is at top left |
| Media Franchise Revenue | 26:00 | Investigating relationship between year created and revenue |
| Media Franchise Revenue | 26:40 | Creating scatter plot with points scaled by size and labelled points (geom_text function) |
| Media Franchise Revenue | 29:30 | Summary of screencast up to this point |
| Media Franchise Revenue | 29:50 | Starting analysis of franchises' original media (e.g., novel, video game, animated film) and revenue type (e.g., box office, merchandise) |
| Media Franchise Revenue | 33:35 | Graphing original media and revenue category as facetted bar plot with lots of reordering (ends at around 38:40) |
| Media Franchise Revenue | 40:30 | Alternative visualization of original media/revenue category using heat map |
| Media Franchise Revenue | 41:20 | Using scale_fill_gradient2 function to specify custom colour scale |
| Media Franchise Revenue | 42:05 | Getting rid of gridlines in graph using theme function's panel.grid argument |
| Media Franchise Revenue | 44:05 | Using fct_rev function to reverse levels of factors |
| Media Franchise Revenue | 44:35 | Fixing overlapping axis text with tweaks to theme function's axis.text argument |
| Media Franchise Revenue | 46:05 | Reviewing visualization that inspired this dataset |
| Media Franchise Revenue | 47:25 | Adding text of total revenue to the end of each bar in a previous graph |
| Media Franchise Revenue | 50:20 | Using paste0 function to add a "B" (for "billions") to the end of text labels on graph |
| Media Franchise Revenue | 51:35 | Using expand_limits function to give text labels more space so they don't get cut off |
| Media Franchise Revenue | 53:45 | Summary of screencast |
| Screencast | Time | Description |
|---|---|---|
| Women's World Cup | 2:15 | Adding country names using countrycode package |
| Women's World Cup | 3:45 | Web scraping country codes from Wikipedia |
| Women's World Cup | 6:00 | Combining tables that are separate lists into one dataframe |
| Women's World Cup | 14:00 | Using rev function (reverse) to turn multiple rows of soccer match scores into one row (base team and opposing team) |
| Women's World Cup | 26:30 | Applying a geom_smooth linear model line to a scatter plot, then facetting it |
| Women's World Cup | 28:30 | Adding a line with a slope of 1 (x = y) using geom_abline |
| Women's World Cup | 40:00 | Pulling out elements of a list that is embedded in a dataframe |
| Women's World Cup | 1:09:45 | Using glue function to add context to facet titles |
| Screencast | Time | Description |
|---|---|---|
| Bob Ross Paintings | 1:40 | Using clean_names function in janitor package to get field names to snake_case |
| Bob Ross Paintings | 1:50 | Using gather function (now pivot_longer) to get wide elements into tall (tidy) format |
| Bob Ross Paintings | 2:35 | Cleaning text (str_to_title, str_replace) to get into nicer-to-read format |
| Bob Ross Paintings | 3:30 | Using str_remove_all function to remove quotation marks and backslashes |
| Bob Ross Paintings | 4:40 | Using extract function to extract the season number and episode number from episode field; uses regex capturing groups |
| Bob Ross Paintings | 14:00 | Using add_count function's name argument to specify field's name |
| Bob Ross Paintings | 15:35 | Getting into whether the elements of Ross's paintings changed over time (e.g., are mountains more/less common over time?) |
| Bob Ross Paintings | 20:00 | Quick point: could have used logistic regression to see change over time of elements |
| Bob Ross Paintings | 21:10 | Asking, "What elements tend to appear together?" prompting clustering analysis |
| Bob Ross Paintings | 22:15 | Using pairwise_cor to see which elements tend to appear together |
| Bob Ross Paintings | 22:50 | Discussion of a blind spot of pairwise correlation (high or perfect correlation on elements that only appear once or twice) |
| Bob Ross Paintings | 28:05 | Asking, "What are clusters of elements that belong together?" |
| Bob Ross Paintings | 28:30 | Creating network plot using ggraph and igraph packages |
| Bob Ross Paintings | 30:15 | Reviewing network plot for interesting clusters (e.g., beach cluster, mountain cluster, structure cluster) |
| Bob Ross Paintings | 31:55 | Explanation of Principal Component Analysis (PCA) |
| Bob Ross Paintings | 34:35 | Start of actual PCA coding |
| Bob Ross Paintings | 34:50 | Using acast function to create matrix of painting titles x painting elements (initially wrong, corrected at 36:30) |
| Bob Ross Paintings | 36:55 | Centering the matrix data using t function (transpose of matrix), colSums function, and colMeans function |
| Bob Ross Paintings | 38:15 | Using svd function to perform singular value decomposition, then tidying with broom package |
| Bob Ross Paintings | 39:55 | Exploring one principal component to get a better feel for what PCA is doing |
| Bob Ross Paintings | 43:20 | Using reorder_within function to re-order factors within a grouping |
| Bob Ross Paintings | 48:00 | Exploring different matrix names in PCA (u, v, d) |
| Bob Ross Paintings | 56:50 | Looking at top 6 principal components of painting elements |
| Bob Ross Paintings | 57:45 | Showing percentage of variation that each principal component is responsible for |
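The by-hand PCA covered above (34:35–57:45) — center a binary painting-by-element matrix, decompose with svd, then inspect the variance explained — can be sketched like this with simulated data (the real matrix comes from acast in the screencast):

```r
# Simulated binary painting x element matrix (illustrative)
set.seed(1)
mat <- matrix(rbinom(200, 1, 0.4), nrow = 20,
              dimnames = list(paste0("painting", 1:20),
                              paste0("element", 1:10)))

# Center each column (element) before decomposing:
# transpose, subtract column means, transpose back
centered <- t(t(mat) - colMeans(mat))

s <- svd(centered)

# Share of variation captured by each principal component
s$d^2 / sum(s$d^2)
```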
| Screencast | Time | Description |
|---|---|---|
| Simpsons Guest Stars | 4:15 | Using str_detect function to find guests that played themselves |
| Simpsons Guest Stars | 7:55 | Using separate_rows function and regex to get delimited values onto different rows (e.g., "Edna Krabappel; Ms. Melon" gets split into two rows) |
| Simpsons Guest Stars | 9:55 | Using parse_number function to convert a numeric variable coded as character to a proper numeric variable |
| Simpsons Guest Stars | 14:45 | Downloading and importing supplementary dataset of dialogue |
| Simpsons Guest Stars | 16:10 | Using semi_join function to filter dataframe based on values that appear in another dataframe |
| Simpsons Guest Stars | 18:05 | Using anti_join function to check which values in a dataframe do not appear in another dataframe |
| Simpsons Guest Stars | 20:50 | Using ifelse function to recode a single value with another (i.e., "Edna Krapabbel" becomes "Edna Krabappel-Flanders") |
| Simpsons Guest Stars | 26:20 | Explaining the goal of all the data cleaning steps |
| Simpsons Guest Stars | 31:25 | Using sample function to get an example line for each character |
| Simpsons Guest Stars | 33:20 | Setting geom_histogram function's binwidth and center arguments to get specific bin sizes |
| Simpsons Guest Stars | 37:25 | Using unnest_tokens and anti_join functions from tidytext package to split dialogue into individual words and remove stop words (e.g., "the", "or", "and") |
| Simpsons Guest Stars | 38:55 | Using bind_tf_idf function from tidytext package to get the TF-IDF (term frequency-inverse document frequency) of individual words |
| Simpsons Guest Stars | 42:50 | Using top_n function to get the top 1 TF-IDF value for each role |
| Simpsons Guest Stars | 44:05 | Using paste0 function to combine two character variables (e.g., "Groundskeeper Willie" and "ach" (separate variables) become "Groundskeeper Willie: ach") |
| Simpsons Guest Stars | 48:10 | Explanation of what TF-IDF (term frequency-inverse document frequency) tells us and how it is a "catchphrase detector" |
| Simpsons Guest Stars | 56:40 | Summary of screencast |
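The TF-IDF step above (38:55) is a single bind_tf_idf call on per-character word counts. A minimal sketch with toy counts, not the actual dialogue data:

```r
library(dplyr)
library(tidytext)

# Toy word counts per character (illustrative)
words <- tibble(
  character = c("Willie", "Willie", "Edna", "Edna"),
  word      = c("ach", "lad", "ha", "class"),
  n         = c(10, 4, 7, 3)
)

# TF-IDF highlights words one character uses that others don't:
# arguments are term column, document column, count column
words %>%
  bind_tf_idf(word, character, n) %>%
  arrange(desc(tf_idf))
```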
| Screencast | Time | Description |
|---|---|---|
| Pizza Ratings | 4:45 | Transforming time into something more readable (from time value of seconds since Unix epoch 1970-01-01), then converting it into a date |
| Pizza Ratings | 9:05 | Formatting x-axis text so that it is rotated and readable, then re-ordering using fct_relevel function so that it is in its proper ordinal order |
| Pizza Ratings | 11:00 | Converting string answers to integer counterparts to get an overall numeric value for how good each place is |
| Pizza Ratings | 12:30 | Commentary on speed of mutate calculation with and without grouping (non-grouped is slightly faster) |
| Pizza Ratings | 15:30 | Re-ordering groups by total votes using fct_reorder function, while still maintaining the groups themselves |
| Pizza Ratings | 19:15 | Using glue package to combine place name and total respondents |
| Pizza Ratings | 20:30 | Using statistical test to give confidence intervals on average score |
| Pizza Ratings | 22:15 | Actually using the t.test function with toy example |
| Pizza Ratings | 23:15 | Using weighted linear model instead (which doesn't end up working) |
| Pizza Ratings | 26:00 | Using custom function with rep function to get vector of repeated scores (sneaky way of weighting) so that we can perform a proper t-test |
| Pizza Ratings | 27:30 | Summarizing t.test output into a list (alternative to nesting) |
| Pizza Ratings | 31:20 | Adding error bars using geom_errorbarh to make a TIE fighter plot that shows confidence intervals |
| Pizza Ratings | 36:30 | Bringing in additional data from Barstool ratings (to supplement survey of Open R meetup NY) |
| Pizza Ratings | 39:45 | Getting survey data to the place level so that we can add an additional dataset |
| Pizza Ratings | 41:15 | Checking for duplicates in the joined data |
| Pizza Ratings | 42:15 | Calling off the planned analysis due to low sample sizes (too much noise, not enough overlap between datasets) |
| Pizza Ratings | 45:15 | Looking at Barstool data on its own |
| Pizza Ratings | 55:15 | Renaming all variables with a certain string pattern in them |
| Pizza Ratings | 58:00 | Comparing Dave's reviews with all other critics |
| Pizza Ratings | 59:15 | Adding geom_abline showing x = y as comparison for geom_smooth linear model line |
| Pizza Ratings | 1:02:30 | Changing the location of the aes function to change what the legend icons look like for size aesthetic |
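
The rep() weighting trick at 26:00 can be sketched like this; the answer scale and vote counts below are invented, not the actual survey data:

```r
library(dplyr)

# Hypothetical answer scale with vote counts (not the real survey data)
ratings <- tibble(
  answer = c("Poor", "Fair", "Good", "Excellent"),
  score  = 1:4,
  votes  = c(2, 5, 10, 3)
)

# Repeat each score by its vote count; t.test on the expanded vector
# then yields a vote-weighted mean with a confidence interval
scores <- rep(ratings$score, ratings$votes)
result <- t.test(scores)
result$estimate   # the weighted mean score
result$conf.int   # confidence interval around it
```

The expanded vector makes an ordinary one-sample t-test behave like a weighted one, which is what makes the TIE fighter plot at 31:20 possible.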
| Screencast | Time | Description |
|---|---|---|
| Car Fuel Efficiency | 3:20 | Using select, sort, and colnames functions to sort variables in alphabetical order |
| Car Fuel Efficiency | 10:00 | Adding geom_abline for y = x to a scatter plot for comparison |
| Car Fuel Efficiency | 18:00 | Visualising using geom_boxplot for mpg by vehicle class (size of car) |
| Car Fuel Efficiency | 24:45 | Start of explanation of prediction goals |
| Car Fuel Efficiency | 27:00 | Creating train and test sets, along with trick using sample_frac function to randomly re-arrange all rows in a dataset |
| Car Fuel Efficiency | 28:35 | First step of developing linear model: visually adding geom_smooth |
| Car Fuel Efficiency | 30:00 | Using augment function to add extra variables from model to original dataset (fitted values and residuals, especially) |
| Car Fuel Efficiency | 30:45 | Creating residuals plot and explaining what you want and don't want to see |
| Car Fuel Efficiency | 31:50 | Explanation of splines |
| Car Fuel Efficiency | 33:30 | Visualising effect of regressing using natural splines |
| Car Fuel Efficiency | 35:10 | Creating a tibble to test different degrees of freedom (1:10) for natural splines |
| Car Fuel Efficiency | 36:30 | Using unnest function to get tidy versions of different models |
| Car Fuel Efficiency | 37:55 | Visualising fitted values of all 6 different models at the same time |
| Car Fuel Efficiency | 42:10 | Investigating whether the model got "better" as we added degrees of freedom to the natural splines, using the glance function |
| Car Fuel Efficiency | 47:45 | Using ANOVA to perform a statistical test on whether natural splines as a group explain variation in MPG |
| Car Fuel Efficiency | 48:30 | Exploring collinearity of independent variables (displacement and cylinders) |
| Car Fuel Efficiency | 55:10 | Binning years into every two years using floor function |
| Car Fuel Efficiency | 56:40 | Using summarise_at function to do quick averaging of multiple variables |
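
The many-models workflow from 35:10–42:10 (a tibble of spline degrees of freedom, one model per row, then glance to compare fits) can be sketched on mtcars as stand-in data; the column name `dof` is arbitrary:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(splines)
library(broom)

# One natural-spline model per degrees-of-freedom value, fit to mtcars
# (a stand-in for the screencast's fuel-efficiency dataset)
spline_fits <- crossing(dof = 1:6) %>%
  mutate(model   = map(dof, ~ lm(mpg ~ ns(disp, df = .x), data = mtcars)),
         glanced = map(model, glance)) %>%
  unnest(glanced)

spline_fits %>% select(dof, r.squared, AIC)
```

Comparing r.squared against AIC across rows shows whether extra flexibility is actually buying anything, which is the question asked at 42:10.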
| Screencast | Time | Description |
|---|---|---|
| Horror Movies | 4:15 | Extracting digits (release year) from character string using regex, along with good explanation of extract function |
| Horror Movies | 8:00 | Quick check on why parse_number is unable to parse some values -- is it because they are NA or some other reason? |
| Horror Movies | 9:45 | Visually investigating correlation between budget and rating |
| Horror Movies | 11:50 | Investigating correlation between MPAA rating (PG-13, R, etc.) and rating using boxplots |
| Horror Movies | 12:50 | Using pull function to quickly check levels of a factor |
| Horror Movies | 13:30 | Using ANOVA to compare variation within groups (MPAA rating) to variation between groups |
| Horror Movies | 15:40 | Separating genre using separate_rows function (instead of str_split and unnest) |
| Horror Movies | 18:00 | Removing boilerplate "Directed by..." and "With..." part of plot variable and isolating plot, first using regex, then by using separate function with periods as separator |
| Horror Movies | 20:40 | Unnesting word tokens, removing stop words, and counting appearances |
| Horror Movies | 21:20 | Aggregating by word to find words that appear in high- or low-rated movies |
| Horror Movies | 23:00 | Discussing potential confounding factors for ratings associated with specific words |
| Horror Movies | 24:50 | Searching for duplicated movie titles |
| Horror Movies | 25:50 | De-duping using distinct function |
| Horror Movies | 26:55 | Loading in and explaining glmnet package |
| Horror Movies | 28:00 | Using movie titles to pull out ratings using rownames and match functions to create an index of which rating to pull out of the original dataset |
| Horror Movies | 29:10 | Actually using glmnet function to create lasso model |
| Horror Movies | 34:05 | Showing built-in plot of lasso lambda against mean-squared error |
| Horror Movies | 37:05 | Explaining when certain terms appeared in the lasso model as the lambda value dropped |
| Horror Movies | 41:10 | Gathering all variables except for title, so that the dataset is very tall |
| Horror Movies | 42:35 | Using unite function to combine two variables (better alternative to paste) |
| Horror Movies | 45:45 | Creating a new lasso with tons of new variables other than plot words |
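
The lasso setup from 26:55 onward can be sketched on simulated data; the 100×20 design matrix of "plot word" indicators is invented, and cross-validation is used here to pick the penalty:

```r
library(glmnet)
set.seed(2019)

# Toy design: 100 "movies" x 20 binary "plot word" indicators;
# only word1 truly affects the rating
x <- matrix(rbinom(100 * 20, 1, 0.3), nrow = 100,
            dimnames = list(NULL, paste0("word", 1:20)))
y <- 2 * x[, 1] + rnorm(100)

# Cross-validated lasso; cv.glmnet picks the lambda penalty
cv_fit <- cv.glmnet(x, y)
plot(cv_fit)                     # mean-squared error against log(lambda)
coef(cv_fit, s = "lambda.min")   # terms kept at the best penalty
```

Terms with zero coefficients have been dropped by the penalty, which is what the "when did each term enter the model" discussion at 37:05 is tracing.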
| Screencast | Time | Description |
|---|---|---|
| NYC Squirrel Census | 5:45 | Starter EDA of latitude and longitude using geom_point |
| NYC Squirrel Census | 6:45 | Aggregating squirrel counts by hectare to get a "binned" map |
| NYC Squirrel Census | 9:00 | Investigating colour notes |
| NYC Squirrel Census | 10:30 | Asking the question, "Are there areas of the park where we see certain-coloured squirrels?" |
| NYC Squirrel Census | 12:45 | Plotting latitude and percentage of gray squirrels to answer, "Do we get a lower proportion of gray squirrels as we go farther north?" |
| NYC Squirrel Census | 13:30 | Using logistic regression to test whether the proportion of gray squirrels changes as we go farther north |
| NYC Squirrel Census | 16:30 | Noting that he could have used original data sets as input for logistic regression function |
| NYC Squirrel Census | 19:30 | "Does a squirrel run away?" based on location in the park (latitude), using logistic regression |
| NYC Squirrel Census | 20:45 | Using summarise_at function to apply same function to multiple variables |
| NYC Squirrel Census | 25:25 | Loading ggmap package |
| NYC Squirrel Census | 27:00 | Start using ggmap, with the get_map function |
| NYC Squirrel Census | 28:20 | Decision to not set up Google API key to use ggmap properly |
| NYC Squirrel Census | 30:15 | Using the sf package to read in a shapefile of Central Park |
| NYC Squirrel Census | 30:40 | Using read_sf function from sf package to import a shapefile into R |
| NYC Squirrel Census | 31:30 | Using geom_sf function from sf package to visualise the imported shapefile |
| NYC Squirrel Census | 32:45 | Combining shapefile "background" with relevant squirrel data in one plot |
| NYC Squirrel Census | 34:40 | Visualising pathways (footpaths, bicycle paths) in the shapefile |
| NYC Squirrel Census | 37:55 | Finishing visualisation and moving on to analysing activity types |
| NYC Squirrel Census | 38:45 | Selecting fields based on whether they end with "ing", then gathering those fields into tidy format |
| NYC Squirrel Census | 39:50 | Decision to create a shiny visualisation |
| NYC Squirrel Census | 41:30 | Setting shiny app settings (e.g., slider for minimum number of squirrels) |
| NYC Squirrel Census | 42:15 | Setting up shiny app options / variables |
| NYC Squirrel Census | 43:50 | Explanation of why setting up options in shiny app the way he did |
| NYC Squirrel Census | 46:00 | Solving error "Discrete value supplied to continuous scale" |
| NYC Squirrel Census | 46:50 | First draft of shiny app |
| NYC Squirrel Census | 48:35 | Creating a dynamic midpoint for the two-gradient scale in the shiny app |
| NYC Squirrel Census | 51:30 | Adding additional variables of more behaviours to shiny app (kuks, moans, runs from, etc.) |
| NYC Squirrel Census | 53:10 | "What are the distributions of some of these behaviours?" |
| NYC Squirrel Census | 56:50 | Adding ground location (above ground, ground plane) to shiny app |
| NYC Squirrel Census | 58:20 | Summary of screencast |
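
The logistic regressions at 13:30 and 19:30 follow one pattern: a binary outcome modelled against latitude with glm. A minimal sketch on simulated squirrels (field names and the latitude effect are invented):

```r
library(broom)

# Simulated data: assume the true probability of running away rises
# with latitude (hypothetical effect, for illustration only)
set.seed(42)
lat <- runif(500, 40.76, 40.80)
runs_from <- rbinom(500, 1, plogis(40 * (lat - 40.78)))

fit <- glm(runs_from ~ lat, family = "binomial")
tidy(fit)   # a positive lat coefficient: squirrels farther north flee more
```

The coefficient is on the log-odds scale, so a positive estimate means the odds of running away grow as latitude increases.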
| Screencast | Time | Description |
|---|---|---|
| CRAN Package Code | 4:30 | Summarizing many things by language (e.g., lines of code, comment/code ratio) |
| CRAN Package Code | 9:35 | Using gather function (now pivot_longer) to consolidate multiple metrics into one dimension, then visualizing by facetting by metric |
| CRAN Package Code | 11:20 | Setting ncol = 1 within facet_wrap function to get facetted graphs to stack vertically |
| CRAN Package Code | 11:30 | Using reorder_within function from tidytext package to properly reorder factors within each facet |
| CRAN Package Code | 16:00 | Using geom_text label to add language name as label to scatter points |
| CRAN Package Code | 20:00 | Completing preliminary overview and looking at distribution of R code in packages |
| CRAN Package Code | 26:15 | Using str_extract to extract only letters and names from character vector (using regex) |
| CRAN Package Code | 34:00 | Re-ordering the order of categorical variables in the legend using guides function |
| CRAN Package Code | 36:00 | Investigating comment/code ratio |
| CRAN Package Code | 43:05 | Importing additional package data (looking around for a bit, then starting to actually import ~46:00) |
| CRAN Package Code | 54:40 | Importing even more additional data (available packages) |
| CRAN Package Code | 57:50 | Using separate_rows function to separate delimited values |
| CRAN Package Code | 58:45 | Using extract function and regex to pull out specific types of characters from a string |
| CRAN Package Code | 1:05:35 | Summary of screencast |
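
The facetting pattern from 9:35–11:30 (gather metrics, facet by metric, reorder within each facet) can be sketched with invented per-language numbers; this uses the modern y-axis form rather than coord_flip:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Toy per-language metrics (values made up); reorder_within sorts the
# bars separately inside each facet
lang_metrics <- tribble(
  ~language, ~metric,    ~value,
  "R",       "lines",       100,
  "C",       "lines",       300,
  "R",       "comments",     40,
  "C",       "comments",     20
)

p <- lang_metrics %>%
  mutate(language = reorder_within(language, value, metric)) %>%
  ggplot(aes(value, language)) +
  geom_col() +
  scale_y_reordered() +                 # strips the ___metric suffix from labels
  facet_wrap(~ metric, ncol = 1, scales = "free_y")
p
```

Without reorder_within, a factor level shared across facets would be forced into a single global order; the ___suffix trick gives each facet its own ordering.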
| Screencast | Time | Description |
|---|---|---|
| Riddler: Spelling Bee Honeycomb | 2:00 | Using read_lines function to import a plain text file (.txt) |
| Riddler: Spelling Bee Honeycomb | 2:35 | Using str_detect function to filter out words that do not contain the letter "g" |
| Riddler: Spelling Bee Honeycomb | 3:25 | Using str_split function to get a list of a word's individual letters |
| Riddler: Spelling Bee Honeycomb | 3:55 | Using setdiff function to find words with invalid letters (letters that are not in the puzzle honeycomb) -- also needs map function (at 4:35) |
| Riddler: Spelling Bee Honeycomb | 10:45 | Changing existing code to make a function that will calculate scores for letter combinations |
| Riddler: Spelling Bee Honeycomb | 14:10 | Noticing the rule about bonus points for pangrams and using n_distinct function to determine if a word gets those points |
| Riddler: Spelling Bee Honeycomb | 17:25 | Using map function to eliminate duplicate letters from each word's list of component letters |
| Riddler: Spelling Bee Honeycomb | 25:55 | Using acast function from reshape2 package to create a matrix of words by letters |
| Riddler: Spelling Bee Honeycomb | 27:50 | Using the words/letters matrix to find valid words for a given letter combination |
| Riddler: Spelling Bee Honeycomb | 29:55 | Using the matrix multiplication operator %*% to find the number of "forbidden" letters for each word |
| Riddler: Spelling Bee Honeycomb | 42:05 | Using microbenchmark function from microbenchmark package to test how long it takes to run a function |
| Riddler: Spelling Bee Honeycomb | 43:35 | Using combn function to get the actual combinations of 6 letters (not just the count) |
| Riddler: Spelling Bee Honeycomb | 45:15 | Using map function to get scores for different combinations of letters created above |
| Riddler: Spelling Bee Honeycomb | 47:30 | Using which.max function to find the position of the max value in a vector |
| Riddler: Spelling Bee Honeycomb | 1:05:10 | Using t function to transpose a matrix |
| Riddler: Spelling Bee Honeycomb | 1:19:15 | Summary of screencast |
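
The matrix-multiplication idea from 25:55–29:55 can be sketched directly; the two words and the honeycomb letters below are hypothetical stand-ins for the puzzle data:

```r
# Build a words-by-letters 0/1 matrix, then one matrix multiplication
# counts the forbidden letters in every word at once
words <- c("gaggle", "goblin")
letter_sets <- lapply(strsplit(words, ""), unique)

word_matrix <- matrix(0L, nrow = length(words), ncol = 26,
                      dimnames = list(words, letters))
for (i in seq_along(words)) word_matrix[i, letter_sets[[i]]] <- 1L

honeycomb <- c("g", "a", "e", "l", "n", "p", "x")   # hypothetical puzzle letters
forbidden <- as.integer(!letters %in% honeycomb)

n_forbidden <- word_matrix %*% forbidden   # 0 means the word is playable
n_forbidden
```

Because the multiplication scores every word against a letter set in one vectorized step, trying all letter combinations (the combn step at 43:35) stays fast.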
| Screencast | Time | Description |
|---|---|---|
| The Office | 1:45 | Overview of transcripts data |
| The Office | 2:25 | Overview of ratings data |
| The Office | 4:10 | Using fct_inorder function to create a factor with levels based on when they appear in the dataframe |
| The Office | 4:50 | Using theme and element_text functions to turn axis labels 90 degrees |
| The Office | 5:55 | Creating a line graph with points at each observation (using geom_line and geom_point) |
| The Office | 7:10 | Adding text labels to very high and very low-rated episodes |
| The Office | 8:50 | Using theme function's panel.grid.major argument to get rid of some extraneous gridlines, using element_blank function |
| The Office | 10:15 | Using geom_text_repel from ggrepel package to experiment with different labelling (before abandoning this approach) |
| The Office | 12:45 | Using row_number function to add episode_number field to make graphing easier |
| The Office | 14:05 | Explanation of why number of ratings (votes) is relevant to interpreting the graph |
| The Office | 19:10 | Using unnest_tokens function from tidytext package to split full-sentence text field to individual words |
| The Office | 20:10 | Using anti_join function to filter out stop words (e.g., and, or, the) |
| The Office | 22:25 | Using str_remove_all function to get rid of quotation marks from character names (quirks that might pop up when parsing) |
| The Office | 25:40 | Asking, "Are there words that are specific to certain characters?" (using bind_tf_idf function) |
| The Office | 32:25 | Using reorder_within function to re-order factors within a grouping (when a term appears in multiple groups) and scale_x_reordered function to graph |
| The Office | 37:05 | Asking, "What affects the popularity of an episode?" |
| The Office | 37:55 | Dealing with inconsistent episode names between datasets |
| The Office | 41:25 | Using str_remove function and some regex to remove "(Parts 1&2)" from some episode names |
| The Office | 42:45 | Using str_to_lower function to further align episode names (addresses inconsistent capitalization) |
| The Office | 52:20 | Setting up dataframe of features for a LASSO regression, with director and writer each being a feature with its own line |
| The Office | 52:55 | Using separate_rows function to separate episodes with multiple writers so that each has their own row |
| The Office | 58:25 | Using log2 function to transform number of lines fields to something more useable (since it is log-normally distributed) |
| The Office | 1:00:20 | Using cast_sparse function from tidytext package to create a sparse matrix of features by episode |
| The Office | 1:01:55 | Using semi_join function as a "filtering join" |
| The Office | 1:02:30 | Setting up dataframes (after we have our features) to run LASSO regression |
| The Office | 1:03:50 | Using cv.glmnet function from glmnet package to run a cross-validated LASSO regression |
| The Office | 1:05:35 | Explanation of how to pick a lambda penalty parameter |
| The Office | 1:05:55 | Explanation of output of LASSO model |
| The Office | 1:09:25 | Outline of why David likes regularized linear models (which is what LASSO is) |
| The Office | 1:10:55 | Summary of screencast |
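
The tf-idf question at 25:40 ("which words are specific to certain characters?") can be sketched with invented line counts:

```r
library(dplyr)
library(tidytext)

# Toy word counts per character (numbers invented); bind_tf_idf scores
# words by how specific they are to one character
word_counts <- tribble(
  ~character, ~word,    ~n,
  "Michael",  "party",  10,
  "Michael",  "office", 20,
  "Dwight",   "beets",  15,
  "Dwight",   "office", 18
)

tfidf <- word_counts %>%
  bind_tf_idf(word, character, n) %>%
  arrange(desc(tf_idf))
tfidf
```

A word every character uses ("office" here) gets an idf of zero and drops out, while character-specific words float to the top, which is exactly what the faceted plot at 32:25 shows.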
| Screencast | Time | Description |
|---|---|---|
| COVID-19 Open Research Dataset (CORD-19) | 0:55 | Disclaimer that David's not an epidemiologist |
| COVID-19 Open Research Dataset (CORD-19) | 2:55 | Overview of dataset |
| COVID-19 Open Research Dataset (CORD-19) | 7:50 | Using dir function with its full.names argument to get file paths for all files in a folder |
| COVID-19 Open Research Dataset (CORD-19) | 9:45 | Inspecting JSON-formatted data |
| COVID-19 Open Research Dataset (CORD-19) | 10:40 | Introducing hoist function as a way to deal with nested lists (typical for JSON data) |
| COVID-19 Open Research Dataset (CORD-19) | 11:40 | Continuing to use the hoist function |
| COVID-19 Open Research Dataset (CORD-19) | 13:10 | Brief explanation of pluck specification |
| COVID-19 Open Research Dataset (CORD-19) | 16:35 | Using object.size function to check size of JSON data |
| COVID-19 Open Research Dataset (CORD-19) | 17:40 | Using map_chr and str_c functions together to combine paragraphs of text in a list into a single character string |
| COVID-19 Open Research Dataset (CORD-19) | 20:00 | Using unnest_tokens function from tidytext package to split full paragraphs into individual words |
| COVID-19 Open Research Dataset (CORD-19) | 22:50 | Overview of scispaCy package for Python, which has named entity recognition features |
| COVID-19 Open Research Dataset (CORD-19) | 24:40 | Introducing spacyr package, an R wrapper around Python's spaCy (used here with scispaCy models) |
| COVID-19 Open Research Dataset (CORD-19) | 28:50 | Showing how tidytext can use a custom tokenization function (David uses spacyr package's named entity recognition) |
| COVID-19 Open Research Dataset (CORD-19) | 32:20 | Demonstrating the tokenize_words function from the tokenizers package |
| COVID-19 Open Research Dataset (CORD-19) | 37:00 | Actually using a custom tokenizer in unnest_tokens function |
| COVID-19 Open Research Dataset (CORD-19) | 39:45 | Using sample_n function to get a random sample of n rows |
| COVID-19 Open Research Dataset (CORD-19) | 43:25 | Asking, "What are groups of words that tend to occur together?" |
| COVID-19 Open Research Dataset (CORD-19) | 44:30 | Using pairwise_cor from widyr package to find correlation between named entities |
| COVID-19 Open Research Dataset (CORD-19) | 45:40 | Using ggraph and igraph packages to create a network plot |
| COVID-19 Open Research Dataset (CORD-19) | 52:05 | Starting to look at papers' references |
| COVID-19 Open Research Dataset (CORD-19) | 53:30 | Using unnest_longer then unnest_wider function to convert lists into a tibble |
| COVID-19 Open Research Dataset (CORD-19) | 59:30 | Using str_trunc function to truncate long character strings to a certain number of characters |
| COVID-19 Open Research Dataset (CORD-19) | 1:06:25 | Using glue function for easy combination of strings and R code |
| COVID-19 Open Research Dataset (CORD-19) | 1:19:15 | Summary of screencast |
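
The hoist pattern from 10:40–13:10 can be sketched on a small hand-built list; the nesting below only mimics the shape of one parsed JSON paper:

```r
library(dplyr)
library(tidyr)

# A nested list mimicking one parsed JSON paper per row (structure invented)
papers <- tibble(
  json = list(
    list(metadata = list(title = "Paper A"),
         abstract = list(list(text = "First paragraph."))),
    list(metadata = list(title = "Paper B"),
         abstract = list(list(text = "Another paragraph.")))
  )
)

# hoist pulls named pieces out of the nested list by a pluck-style path
flat <- papers %>%
  hoist(json,
        title         = c("metadata", "title"),
        first_passage = list("abstract", 1L, "text"))
flat
```

The path spec mixes names and positions just like purrr::pluck, which is the "pluck specification" explained at 13:10.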
| Screencast | Time | Description |
|---|---|---|
| CORD-19 Data Package | 1:10 | Overview of JSON files with the data David will make a package of |
| CORD-19 Data Package | 3:05 | Starting to create a new package with "New Project" in RStudio |
| CORD-19 Data Package | 5:40 | Creating a file to reference the license for the dataset |
| CORD-19 Data Package | 7:25 | Using use_data_raw function from usethis package to set up a folder structure and preliminary function for raw data |
| CORD-19 Data Package | 8:30 | Explanation that we want to limit the number of packages we load when building a package (e.g., no library(tidyverse) ) |
| CORD-19 Data Package | 9:00 | Using use_package function from usethis package to add "Suggested packages" |
| CORD-19 Data Package | 10:15 | Reviewing import and cleaning code already completed |
| CORD-19 Data Package | 14:55 | Using roxygen2 package to write documentation |
| CORD-19 Data Package | 19:35 | More documentation writing |
| CORD-19 Data Package | 24:50 | Using use_data function from usethis package to create a folder structure and datafile for (finished/cleaned) data |
| CORD-19 Data Package | 26:10 | Making a mistake clicking "Install and Restart" button on the "Build" tab (because of huge objects in the environment) (see 26:50 for alternative) |
| CORD-19 Data Package | 26:50 | Using load_all function from devtools package as an alternative to "Install and Restart" from above step |
| CORD-19 Data Package | 27:35 | Using document function from devtools package to process written documentation |
| CORD-19 Data Package | 32:20 | De-duplicating paper data in a way that keeps records that have fewer missing values than other records for the same paper |
| CORD-19 Data Package | 39:50 | Using use_data function with its overwrite argument to overwrite existing data |
| CORD-19 Data Package | 47:30 | Writing documentation for paragraphs data |
| CORD-19 Data Package | 57:55 | Testing an install of the package |
| CORD-19 Data Package | 59:30 | Adding link to code in documentation |
| CORD-19 Data Package | 1:03:00 | Writing examples of how to use the package (in documentation) |
| CORD-19 Data Package | 1:08:45 | Discussion of outstanding items that David hasn't done yet (e.g., readme, vignettes, tests) |
| CORD-19 Data Package | 1:09:20 | Creating a simple readme, including examples, with use_readme_rmd function from usethis package |
| CORD-19 Data Package | 1:16:10 | Using knit function from the knitr package to knit the readme into a markdown file |
| CORD-19 Data Package | 1:17:10 | Creating a GitHub repository to host the package (includes how to commit to a GitHub repo using RStudio's GUI) |
| CORD-19 Data Package | 1:18:15 | Explanation that version 0.0.0.9000 means that the package is in early development |
| CORD-19 Data Package | 1:20:30 | Actually creating the GitHub repository |
| CORD-19 Data Package | 1:22:25 | Overview of remaining tasks |
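
The usethis calls from this screencast can be sketched in order against a throwaway package in a temp directory; the package name "cord19demo", the dataset name "papers", and the dplyr dependency are all placeholders:

```r
library(usethis)

# Throwaway package skeleton in a temp directory (names are placeholders)
path <- file.path(tempdir(), "cord19demo")
create_package(path, open = FALSE)        # DESCRIPTION, R/, NAMESPACE
proj_set(path)                            # make it the active usethis project
use_data_raw("papers", open = FALSE)      # data-raw/papers.R scaffold (7:25)
use_package("dplyr", type = "Suggests")   # suggested dependency (9:00)
```

After the cleaning script in data-raw/ produces a finished object, use_data() saves it into data/ for the package, as shown at 24:50.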
| Screencast | Time | Description |
|---|---|---|
| R trick: Creating Pascal's Triangle with accumulate() | 1:10 | Simple explanation of accumulate function |
| R trick: Creating Pascal's Triangle with accumulate() | 1:30 | Example using letters |
| R trick: Creating Pascal's Triangle with accumulate() | 2:55 | Using tilde ~ to create an anonymous function |
| R trick: Creating Pascal's Triangle with accumulate() | 4:35 | Introducing Pascal's Triangle |
| R trick: Creating Pascal's Triangle with accumulate() | 6:25 | Starting to create Pascal's triangle in R |
| R trick: Creating Pascal's Triangle with accumulate() | 8:05 | Converting the conceptual solution into an accumulate function |
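
The finished trick fits in one line (assuming the purrr package): each step pads the previous row with a zero on both sides and adds the two copies:

```r
library(purrr)

# Each row of Pascal's triangle is the previous row padded with a 0 on
# each side, with the two padded copies added together; accumulate keeps
# every intermediate row, i.e. the whole triangle
triangle <- accumulate(1:6, ~ c(0, .x) + c(.x, 0))
triangle
```

Here `.x` is the accumulated value (the previous row); the elements of `1:6` only control how many rows get produced.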
| Screencast | Time | Description |
|---|---|---|
| Riddler: Simulating Replacing Die Sides | 0:45 | Explaining why the recursive nature of this problem is well-suited to simulation |
| Riddler: Simulating Replacing Die Sides | 2:05 | Introducing the accumulate function as a tool for simulation |
| Riddler: Simulating Replacing Die Sides | 3:50 | Creating a condition to call the done function |
| Riddler: Simulating Replacing Die Sides | 7:00 | After creating a function to simulate one round of the problem, using replicate function to run simulation many times |
| Riddler: Simulating Replacing Die Sides | 7:15 | Using qplot function to quickly create a histogram of simulations |
| Riddler: Simulating Replacing Die Sides | 7:40 | Making observations on the distribution of simulations (looks kind of like a gamma distribution) |
| Riddler: Simulating Replacing Die Sides | 10:05 | Observing that the distribution is kind of log-normal (but that doesn't really apply because we're using integers) |
| Riddler: Simulating Replacing Die Sides | 10:35 | Using table and sort functions to find the most common number of rolls |
| Riddler: Simulating Replacing Die Sides | 11:20 | Starting the Extra Credit portion of the problem (N-sided die) |
| Riddler: Simulating Replacing Die Sides | 11:40 | Using the crossing function to set up a tibble to run simulations |
| Riddler: Simulating Replacing Die Sides | 12:35 | Using map_dbl function to apply a set of simulations to each possibility of N sides |
| Riddler: Simulating Replacing Die Sides | 13:30 | Spotting an error in the formula for simulating one round (6-sided die was hard-coded) |
| Riddler: Simulating Replacing Die Sides | 16:40 | Using simple linear regression with the lm function to find the relationship between number of sides and average number of rolls |
| Riddler: Simulating Replacing Die Sides | 17:20 | Reviewing distributions for different N-sided dice |
| Riddler: Simulating Replacing Die Sides | 18:00 | Calculating variance, standard deviation, and coefficient of variation to get hints on the distribution (and ruling out Poisson) |
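
The simulate-one-round-then-replicate structure from 7:00 can be sketched as follows; counting each full relabeling as n_sides rolls is an assumption about how the puzzle tallies rolls:

```r
# One game: roll all sides, relabel the die with the results, repeat
# until every side shows the same number; return how many rolls were made
simulate_game <- function(n_sides = 6) {
  sides <- seq_len(n_sides)
  rolls <- 0
  while (length(unique(sides)) > 1) {
    sides <- sample(sides, n_sides, replace = TRUE)
    rolls <- rolls + n_sides
  }
  rolls
}

set.seed(2020)
sims <- replicate(1000, simulate_game())
hist(sims)    # the quick qplot histogram from 7:15, in base R
mean(sims)
```

Parameterizing n_sides is what makes the Extra Credit step at 11:20 a one-line change instead of a rewrite (and avoids the hard-coded-6 bug spotted at 13:30).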
| Screencast | Time | Description |
|---|---|---|
| Beer Production | 4:25 | Asking, "What ingredients are used in beer?" |
| Beer Production | 4:40 | Using filter and max functions to look at the most recent period of time |
| Beer Production | 7:25 | Using paste and ymd functions (ymd is from lubridate package) to convert year-month field into a date-formatted field |
| Beer Production | 9:20 | Spotting potential missing or mis-parsed data |
| Beer Production | 13:50 | Introducing the tidymetrics framework |
| Beer Production | 14:45 | Using install_github function to install tidymetrics from GitHub |
| Beer Production | 15:25 | Using cross_by_dimensions function from tidymetrics package to get aggregations at different levels of multiple dimensions |
| Beer Production | 18:10 | Using cross_by_periods function from tidymetrics package to also get aggregations for different intervals (e.g., month, quarter, year) |
| Beer Production | 22:00 | Using use_metrics_scaffold function from tidymetrics package to create framework for documenting dimensions in RMarkdown YAML header |
| Beer Production | 24:00 | Using create_metrics function from tidymetrics package to save data as a tibble with useful metadata (good for visualizing interactively) |
| Beer Production | 25:15 | Using preview_metric function from shinymetrics package (still under development as of 2020-04-24) to demonstrate shinymetrics |
| Beer Production | 27:35 | Successfully getting shinymetrics to work |
| Beer Production | 28:25 | Explanation of the shinymetrics bug David ran into |
| Beer Production | 34:10 | Changing order of ordinal variable (e.g., "1,000 to 10,000" and "10,000 to 20,000") using the parse_number, fct_lump, and coalesce functions |
| Beer Production | 41:25 | Asking, "Where is beer produced?" |
| Beer Production | 46:45 | Looking up sf package documentation to refresh memory on how to draw state borders for a map |
| Beer Production | 48:55 | Using match function and base R's built-in state.abb vector (state abbreviations) to perform a lookup of state names |
| Beer Production | 51:05 | Using geom_sf function (and working through some hiccoughs) to create a choropleth map |
| Beer Production | 52:30 | Using theme_map function from ggthemes package to get more appropriate styling for maps |
| Beer Production | 55:40 | Experimenting with how to get the legend to display in the bottom right corner |
| Beer Production | 58:25 | Starting to build an animation of consumption patterns over time using gganimate package |
| Beer Production | 1:03:40 | Getting the year being animated to show up in the title of a gganimate map |
| Beer Production | 1:05:40 | Summary of screencast |
| Beer Production | 1:06:50 | Spotting a mistake in a group_by call causing the percentages not to add up properly |
| Beer Production | 1:09:10 | Brief extra overview of tidymetrics code |
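
The state-name lookup at 48:55 is a one-liner with base R's built-in state vectors; the production numbers below are invented:

```r
# Hypothetical state-level totals; match against base R's built-in
# state.abb / state.name vectors to recover full state names
production <- data.frame(state   = c("CO", "WI", "TX"),
                         barrels = c(900, 700, 1200))
production$state_name <- state.name[match(production$state, state.abb)]
production
```

match() returns the position of each abbreviation inside state.abb, and indexing state.name with those positions completes the lookup, ready for joining to map data.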
| Screencast | Time | Description |
|---|---|---|
| Riddler: Simulating a Non-increasing Sequence | 2:20 | Introducing accumulate function as a possible solution (but not used here) |
| Riddler: Simulating a Non-increasing Sequence | 3:20 | Using sample function to simulate 1000 rolls of a 10-sided die |
| Riddler: Simulating a Non-increasing Sequence | 3:40 | Explanation of dividing sample rolls into streaks (instead of using logic similar to a while loop) |
| Riddler: Simulating a Non-increasing Sequence | 4:55 | Using cumsum function to separate 1000 rolls into individual sequences (which end when a 0 is rolled) |
| Riddler: Simulating a Non-increasing Sequence | 5:50 | Using lag function to "shift" sequence numbering down by one row |
| Riddler: Simulating a Non-increasing Sequence | 7:35 | Using cummax and lag functions to check whether a roll is less than the highest value rolled previously in the sequence |
| Riddler: Simulating a Non-increasing Sequence | 9:30 | Fixing previous step with cummin function (instead of cummax) and dropping the lag function |
| Riddler: Simulating a Non-increasing Sequence | 13:05 | Finished simulation code and starting to calculate scores |
| Riddler: Simulating a Non-increasing Sequence | 13:10 | Using -row_number function (note the minus sign!) to calculate decimal position of number in the score |
| Riddler: Simulating a Non-increasing Sequence | 15:30 | Investigating the distribution of scores |
| Riddler: Simulating a Non-increasing Sequence | 16:25 | Using seq function in the breaks argument of scale_x_continuous to set custom, evenly-spaced axis ticks and labels |
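
The streak-splitting approach from 3:20–13:10 can be sketched as one pipeline; this version expresses the non-increasing-prefix check with cumall (the screencast reaches a similar effect with cummin):

```r
library(dplyr)

set.seed(123)
scores <- tibble(roll = sample(0:9, 1000, replace = TRUE)) %>%
  mutate(sequence = lag(cumsum(roll == 0), default = 0L)) %>%  # a rolled 0 closes a sequence
  group_by(sequence) %>%
  filter(cumall(roll <= lag(roll, default = 10L))) %>%         # keep the non-increasing prefix
  mutate(position = row_number()) %>%                          # decimal place of each digit
  summarise(score = sum(roll * 10 ^ -position))

mean(scores$score)
```

Splitting 1000 pre-generated rolls into sequences with cumsum avoids a while loop entirely, which is the point made at 3:40; the 10^-position step is the minus-sign trick from 13:10 for turning kept digits into a decimal score.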
| Screencast | Time | Description |
|---|---|---|
| Tour de France | 3:55 | Getting an overview of the data |
| Tour de France | 8:55 | Aggregating data into decades using the truncated division operator %/% |
| Tour de France | 21:50 | Noting that death data is right-censored (i.e., some winners are still alive) |
| Tour de France | 24:05 | Using transmute function, which combines functionality of mutate (to create new variables) and select (to choose variables to keep) |
| Tour de France | 25:30 | Using survfit function from survival package to conduct survival analysis |
| Tour de France | 27:30 | Using glance function from broom package to get a one-row model summary of the survival model |
| Tour de France | 31:00 | Using extract function to pull out a string matching a regular expression from a variable (stage number in this case) |
| Tour de France | 34:30 | Theorizing that there is a parsing issue with the original data's time field |
| Tour de France | 41:15 | Using group_by function's built-in "peeling" feature, where a summarise call will "peel away" one group but leave other groupings intact |
| Tour de France | 42:05 | Using rank function, then upgrading to percent_rank function to give percentile rankings (between 0 and 1) |
| Tour de France | 47:50 | Using geom_smooth function with method argument as "lm" to plot a linear regression |
| Tour de France | 48:10 | Using cut function to bin numbers (percentiles in this case) into categories |
| Tour de France | 50:25 | Reviewing boxplots exploring relationship between first-stage performance and overall Tour performance |
| Tour de France | 51:30 | Starting to create an animation using gganimate package |
| Tour de France | 56:00 | Actually writing the code to create the animation |
| Tour de France | 58:20 | Using reorder_within function from tidytext package to re-order factors that have the same name across multiple groups |
| Tour de France | 1:02:40 | Summary of screencast |
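
The decade-binning trick at 8:55 uses truncated division; a minimal sketch with invented winner ages:

```r
library(dplyr)

# Invented winner ages; %/% truncates each year down to its decade
winners <- tibble(year = c(1913, 1919, 1924, 1926, 1935),
                  age  = c(27, 30, 24, 28, 26))

by_decade <- winners %>%
  mutate(decade = 10 * (year %/% 10)) %>%
  group_by(decade) %>%
  summarise(avg_age = mean(age), n = n())
by_decade
```

When data is grouped by several variables, each summarise call peels off only the most recently added one, which is the group_by "peeling" behaviour relied on at 41:15.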
| Screencast | Time | Description |
|---|---|---|
| Riddler: Simulating a Branching Process | 0:35 | Explanation of a Poisson process |
| Riddler: Simulating a Branching Process | 2:40 | Asking "How long do you have to wait for X to happen?", which the Exponential distribution can answer |
| Riddler: Simulating a Branching Process | 4:20 | Using rexp function to generate numbers from the Exponential distribution |
| Riddler: Simulating a Branching Process | 5:25 | Using a vector of rates inside the rexp function (to explore consecutive waiting times) |
| Riddler: Simulating a Branching Process | 7:05 | Using cumsum function to calculate total waiting time until hitting a specific number in the Poisson process |
| Riddler: Simulating a Branching Process | 7:35 | Using which function to determine the first instance > 3 in a vector |
| Riddler: Simulating a Branching Process | 9:20 | Using replicate function to do a quick simulation of the function just written |
| Riddler: Simulating a Branching Process | 10:55 | Discussing methods of making the simulation function faster |
| Riddler: Simulating a Branching Process | 12:00 | Using crossing function to set up "tidy" simulation (gives you all possible combinations of values you provide it) |
| Riddler: Simulating a Branching Process | 13:15 | Noting how the consecutive waiting times seem to follow the harmonic series |
| Riddler: Simulating a Branching Process | 17:10 | Noticing that we are missing trials with 0 comments and fixing |
| Riddler: Simulating a Branching Process | 20:25 | Using nls function (non-linear least squares) to test how well the data fits with an exponential curve |
| Riddler: Simulating a Branching Process | 23:05 | Visualizing fit between data and the exponential curve calculated with nls in previous step |
| Riddler: Simulating a Branching Process | 23:50 | Using augment function to add fitted values from the nls model |
| Riddler: Simulating a Branching Process | 26:00 | Exploring whether the data actually follows a Geometric distribution |
| Riddler: Simulating a Branching Process | 30:55 | Explanation of the Geometric distribution as it applies to this question |
| Riddler: Simulating a Branching Process | 34:05 | Generalizing the question to ask how long it takes to get to multiple comments (not just 3) |
| Riddler: Simulating a Branching Process | 38:45 | Explanation of why we subtract 1 when fitting an exponential curve |
| Riddler: Simulating a Branching Process | 46:00 | Summary of screencast |
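The rexp/cumsum/which machinery described above can be sketched in a few lines of base R (a minimal illustration with an assumed rate of 1, not the screencast's exact code):

```r
set.seed(2020)

# Waiting times between events of a rate-1 Poisson process are
# i.i.d. Exponential(1) draws; cumsum() turns them into arrival times
waits <- rexp(100, rate = 1)
arrivals <- cumsum(waits)

# which() gives the index of the first arrival after time 3
first_after_3 <- which(arrivals > 3)[1]

# replicate() repeats the whole simulation to estimate its average
mean(replicate(10000, which(cumsum(rexp(100, rate = 1)) > 3)[1]))
```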
| Screencast | Time | Description |
|---|---|---|
| GDPR Violations | 4:05 | Use the mdy function from the lubridate package to change the date variable from character class to date class. |
| GDPR Violations | 5:35 | Use the rename function from the dplyr package to rename a variable in the dataset. |
| GDPR Violations | 6:15 | Use the fct_reorder function from the forcats package to sort the geom_col in descending order. |
| GDPR Violations | 6:30 | Use the fct_lump function from the forcats package within count to lump together country names except for the 6 most frequent. |
| GDPR Violations | 7:05 | Use the scale_x_continuous function from ggplot2 with the scales package to change the x-axis values to dollar format. |
| GDPR Violations | 8:15 | Use the month and floor_date function from the lubridate package to get the month component from the date variable to count the total fines per month. |
| GDPR Violations | 8:55 | Use the na_if function from the dplyr package to convert specific date value to NA. |
| GDPR Violations | 11:05 | Use the fct_reorder function from the forcats package to sort the stacked geom_col and legend labels in descending order. |
| GDPR Violations | 15:15 | Use the dollar function from the scales package to convert the price variable into dollar format. |
| GDPR Violations | 15:40 | Use the str_trunc function from the stringr package to shorten the summary string values to 140 characters. |
| GDPR Violations | 17:35 | Use the separate_rows function from the tidyr package with a regular expression to separate the values in the article_violated variable with each matching group placed in its own row. |
| GDPR Violations | 19:30 | Use the extract function from the tidyr package with a regular expression to turn each matching group into a new column. |
| GDPR Violations | 27:30 | Use the geom_jitter function from the ggplot2 package to add points to the horizontal box plot. |
| GDPR Violations | 31:55 | Use the inner_join function from the dplyr package to join together article_titles and separated_articles tables. |
| GDPR Violations | 32:55 | Use the paste0 function from base R to concatenate article and article_title. |
| GDPR Violations | 38:48 | Use the str_detect function from the stringr package to detect the presence of a pattern in a string. |
| GDPR Violations | 40:25 | Use the group_by and summarize functions from the dplyr package to aggregate fines that were issued to the same country on the same day allowing for size to be used in geom_point plot. |
| GDPR Violations | 41:14 | Use the scale_size_continuous function from the ggplot2 package to remove the size legend. |
| GDPR Violations | 42:55 | Create an interactive dashboard using the shinymetrics and tidymetrics packages, a tidy approach to business intelligence. |
| GDPR Violations | 47:25 | Use the cross_by_dimensions and cross_by_periods functions from the tidymetrics package, which stack an extra copy of the table for each dimension specified as an argument (country, article_title, type), replace the value of that column with the word All, and group by all the columns. They act as an extended group_by that allows complete summaries across each individual dimension and all possible combinations. |
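The separate_rows and extract steps at 17:35 and 19:30 follow a common tidyr pattern; a toy sketch (the column names, separator, and regex here are illustrative assumptions, not the dataset's actual format):

```r
library(dplyr)
library(tidyr)

fines <- tibble(
  id = 1:2,
  article_violated = c("Art. 5|Art. 32", "Art. 6")  # assumed "|" separator
)

fines %>%
  separate_rows(article_violated, sep = "\\|") %>%   # one row per article
  extract(article_violated, into = "article_number",
          regex = "Art\\. (\\d+)", remove = FALSE)   # regex group -> new column
```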
| Screencast | Time | Description |
|---|---|---|
| Broadway Musicals | 8:15 | Use the cross_by_periods function from the tidymetrics package to aggregate data over time (month, quarter, and year) then visualize with geom_line. |
| Broadway Musicals | 14:00 | Use the cross_by_periods function from the tidymetrics package with windows = c(28) to create a 4-week rolling average across month, quarter, and year. |
| Broadway Musicals | 21:50 | Create an interactive dashboard using the shinymetrics and tidymetrics packages. |
| Broadway Musicals | 25:00 | Use the str_remove function from the stringr package to remove matched pattern in a string. |
| Broadway Musicals | 25:20 | Use the cross_by_dimensions function from the tidymetrics package which acts as an extended group_by that allows complete summaries across each individual dimension and possible combinations. |
| Broadway Musicals | 41:25 | Use the shinybones package to create an interactive dashboard to visualize all 3 metrics at the same time. |
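The str_remove call at 25:00 is plain stringr; a toy example (the show name and pattern are made up):

```r
library(stringr)

show <- "The Phantom of the Opera (Broadway)"
str_remove(show, " \\(Broadway\\)$")  # drop the trailing parenthetical
#> [1] "The Phantom of the Opera"
```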
| Screencast | Time | Description |
|---|---|---|
| Riddler: Simulating and Optimizing Coin Flipping | 2:15 | Using crossing function to set up "tidy" simulation (gives you all possible combinations of values you provide it) |
| Riddler: Simulating and Optimizing Coin Flipping | 3:00 | Using rbinom function to simulate the number of prisoners who choose to flip, then using rbinom again to simulate number of tails |
| Riddler: Simulating and Optimizing Coin Flipping | 7:20 | Using dbinom function (probability mass function) to see probabilities of any given number of prisoners choosing to flip |
| Riddler: Simulating and Optimizing Coin Flipping | 10:15 | Using map_dbl function to iterate a function, making sure to return a dbl-class object |
| Riddler: Simulating and Optimizing Coin Flipping | 11:25 | Using seq_len(n) instead of 1:n to be slightly more efficient |
| Riddler: Simulating and Optimizing Coin Flipping | 12:20 | Using optimise function to conduct single-dimension optimisation (for analytical solution to this question) |
| Riddler: Simulating and Optimizing Coin Flipping | 14:15 | Using backticks (like this) for inline R functions in RMarkdown |
| Riddler: Simulating and Optimizing Coin Flipping | 15:15 | Starting the Extra Credit portion of the problem (N prisoners instead of 4) |
| Riddler: Simulating and Optimizing Coin Flipping | 16:30 | Using map2_dbl function to iterate a function that requires two inputs (and make sure it returns a dbl-class object) |
| Riddler: Simulating and Optimizing Coin Flipping | 20:05 | Reviewing visualisation of probabilities with varying numbers of prisoners |
| Riddler: Simulating and Optimizing Coin Flipping | 21:30 | Tweaking graph to look nicer |
| Riddler: Simulating and Optimizing Coin Flipping | 22:00 | Get the exact optimal probability value for each number of prisoners |
| Riddler: Simulating and Optimizing Coin Flipping | 22:45 | Troubleshooting optimise function to work when iterated over different numbers of prisoners |
| Riddler: Simulating and Optimizing Coin Flipping | 23:45 | Using unnest_wider function to disaggregate a list, but put different elements on separate columns (not separate rows, which unnest does) |
| Riddler: Simulating and Optimizing Coin Flipping | 25:30 | Explanation of what happens to probabilities as number of prisoners increases |
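The dbinom and optimise steps (7:20 and 12:20) can be combined into a short analytic sketch of the puzzle: with 4 prisoners each choosing to flip with probability p, they win when exactly one tail appears among the k coins flipped.

```r
# P(win) = sum over k of P(k prisoners flip) * P(exactly one tail in k flips),
# where exactly one tail among k fair flips has probability k * 0.5^k
p_win <- function(p, n = 4) {
  k <- 1:n
  sum(dbinom(k, n, p) * k * 0.5 ^ k)
}

# Single-dimension optimisation of p over [0, 1]
optimise(p_win, c(0, 1), maximum = TRUE)
```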
| Screencast | Time | Description |
|---|---|---|
| Animal Crossing | 5:05 | Starting text analysis of critic reviews of Animal Crossing |
| Animal Crossing | 7:50 | Using floor_date function from lubridate package to round dates down to nearest month (then week) |
| Animal Crossing | 9:00 | Using unnest_tokens function and anti_join functions from tidytext package to break reviews into individual words and remove stop words |
| Animal Crossing | 10:35 | Taking the average rating associated with individual words (simple approach to gauge sentiment) |
| Animal Crossing | 12:30 | Using geom_line and geom_point to graph ratings over time |
| Animal Crossing | 14:40 | Using mean function and logical statement to calculate percentages that meet a certain condition |
| Animal Crossing | 22:30 | Using geom_text to visualize what words are associated with positive/negative reviews |
| Animal Crossing | 27:00 | Disclaimer that this exploration is not text regression -- wine ratings screencast is a good resource for that |
| Animal Crossing | 28:30 | Starting to do topic modelling |
| Animal Crossing | 30:45 | Explanation of stm function from stm package |
| Animal Crossing | 34:30 | Explanation of stm function's output (topic modelling output) |
| Animal Crossing | 36:55 | Changing the number of topics from 4 to 6 |
| Animal Crossing | 37:40 | Explanation of how topic modelling works conceptually |
| Animal Crossing | 40:55 | Using tidy function from broom package to find which "documents" (reviews) were the "strongest" representation of each topic |
| Animal Crossing | 44:50 | Noting that there might be a scraping issue resulting in review text being repeated |
| Animal Crossing | 46:05 | (Unsuccessfully) Using str_sub function to help fix repeated review text by locating where the repeated text begins in the review |
| Animal Crossing | 48:20 | (Unsuccessfully) Using str_replace and map2_chr functions, as well as regex capturing groups to fix repeated text |
| Animal Crossing | 52:00 | Looking at the association between review grade and gamma of the topic model (how "strong" a review represents a topic) |
| Animal Crossing | 53:55 | Using cor function with method = "spearman" to calculate correlation based on rank instead of actual values |
| Animal Crossing | 57:35 | Summary of screencast |
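The tokenizing step at 9:00 uses the standard tidytext idiom; a self-contained sketch on made-up reviews:

```r
library(dplyr)
library(tidytext)

reviews <- tibble(
  id = 1:2,
  text = c("A wonderful and relaxing game",
           "The daily chores feel repetitive")
)

reviews %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop "a", "and", "the", ...
  count(word, sort = TRUE)
```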
| Screencast | Time | Description |
|---|---|---|
| Volcano Eruptions | 7:00 | Change the last_eruption_year into years_ago by using mutate from the dplyr package with years_ago = 2020 - as.numeric(last_eruption_year). In the plot David includes +1 to account for 0 values in the years_ago variable. |
| Volcano Eruptions | 9:50 | Use str_detect from the stringr package to search the volcano_name variable for Vesuvius when not sure if spelling is correct. |
| Volcano Eruptions | 12:50 | Use the longitude and latitude to create a world map showing where the volcanoes are located. |
| Volcano Eruptions | 15:30 | Use fct_lump from the forcats package to lump together all primary_volcano_type factor levels except for the n most frequent. |
| Volcano Eruptions | 16:25 | Use str_remove from the stringr package with the regular expression "\\(.*\\)" to remove the parenthesized text. |
| Volcano Eruptions | 18:30 | Use the leaflet package to create an interactive map with popup information about each volcano. |
| Volcano Eruptions | 24:10 | Use glue from the glue package to create an HTML string by concatenating volcano_name and primary_volcano_type between HTML <p></p> tags. |
| Volcano Eruptions | 27:15 | Use the DT package to turn the leaflet popup information into a datatable. |
| Volcano Eruptions | 31:40 | Use str_replace_all from the stringr package to replace all the underscores _ in volcano_name with spaces. Then use str_to_title from the stringr package to convert the volcano_name variable to title case. |
| Volcano Eruptions | 32:05 | Use kable with format = "html" from the knitr package instead of DT to make turning the data into HTML much easier. |
| Volcano Eruptions | 34:05 | Use paste0 from base R to bold the Volcano Name, Primary Volcano Type, and Last Eruption Year in the leaflet popup. |
| Volcano Eruptions | 34:50 | Use replace_na from the tidyr package to replace NA with Unknown. |
| Volcano Eruptions | 37:15 | Use addMeasure from the leaflet package to add a tool to the map that allows for the measuring of distance between points. |
| Volcano Eruptions | 39:30 | Use colorNumeric from the leaflet package to color the points based on their population within 5km. To accomplish this, David creates 2 new variables: 1) transformed_pop to get the population on a log2 scale & 2) pop_color which uses the colorNumeric function to generate the color hex values based on transformed_pop. |
| Volcano Eruptions | 46:30 | Use the gganimate package to create an animated map. |
| Volcano Eruptions | 48:45 | Use geom_point from the ggplot2 package with size = .00001 * 10 ^ vei so the size of the points are then proportional to the volume metrics provided in the Volcano Eruption Index. The metrics are in Km^3. |
| Volcano Eruptions | 50:20 | Use scale_size_continuous from the ggplot2 package with range = c(.1, 6) to make the smaller points smaller and larger points larger. |
| Volcano Eruptions | 50:55 | Use scale_color_gradient2 from the ggplot2 package to apply color gradient to each point based on the volcano size and whether its low or high. |
| Volcano Eruptions | 59:40 | Summary of screencast while waiting for gganimate map to render. Also, brief discussion on using transition_reveal instead of transition_time to keep the point on the map instead of replacing them in each frame. |
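The leaflet-plus-glue popup pattern from 18:30–24:10 can be sketched on a couple of hand-typed points (coordinates approximate, column names illustrative):

```r
library(leaflet)
library(glue)

volcanoes <- data.frame(
  volcano_name = c("Vesuvius", "Etna"),
  type = c("Stratovolcano", "Stratovolcano"),
  lat = c(40.82, 37.75),
  lng = c(14.43, 15.00)
)

leaflet(volcanoes) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    popup = ~glue("<b>{volcano_name}</b><p>{type}</p>")  # HTML popup per point
  )
```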
| Screencast | Time | Description |
|---|---|---|
| Beach Volleyball | 5:30 | Use pivot_longer from the tidyr package to pivot the data set from wide to long. |
| Beach Volleyball | 7:20 | Use mutate_at from the dplyr package with starts_with to change the class to character for all columns that start with w_ and l_. |
| Beach Volleyball | 8:00 | Use separate from the tidyr package to separate the name variable into three columns with extra = merge and fill = right. |
| Beach Volleyball | 10:35 | Use rename from the dplyr package to rename w_player1, w_player2, l_player1, and l_player2. |
| Beach Volleyball | 12:50 | Use pivot_wider from the tidyr package to pivot the name variable from long to wide. |
| Beach Volleyball | 15:15 | Use str_to_upper to convert the winner_loser w and l values to uppercase. |
| Beach Volleyball | 20:25 | Add unique row numbers for each match using mutate with row_number from the dplyr package. |
| Beach Volleyball | 21:20 | Separate the score values into multiple rows using separate_rows from the tidyr package. |
| Beach Volleyball | 22:45 | Use separate from the tidyr package to split the actual scores into two columns, one for the winner's score w_score and another for the loser's score l_score. |
| Beach Volleyball | 23:45 | Use na_if from the dplyr package to change the Forfeit or other value from the score variable to NA. |
| Beach Volleyball | 24:35 | Use str_remove from the stringr package to remove scores that include retired. |
| Beach Volleyball | 25:25 | Determine how many times the winner's score w_score is greater than the loser's score l_score at least 1/3 of the time. |
| Beach Volleyball | 28:30 | Use summarize from the dplyr package to create the summary statistics, including the number of matches, winning percentage, date of first match, and date of most recent match. |
| Beach Volleyball | 34:15 | Use type_convert from the readr package to convert character class variables to numeric. |
| Beach Volleyball | 35:00 | Use summarize_all from the dplyr package to calculate which fraction of each column is not NA. |
| Beach Volleyball | 42:00 | Use summarize from the dplyr package to determine each player's number of matches, winning percentage, average attacks, average errors, average kills, average aces, average serve errors, and total rows with data for years prior to 2019. The summary statistics are then used to explore how we could predict whether a player will win in 2019 using geom_point and logistic regression. Initially, David wanted to predict performance based on players' first-year performance. (NOTE - David mistakenly grouped by year and age. He catches this around 1:02:00.) |
| Beach Volleyball | 49:25 | Use year from the lubridate package within a group_by to determine the age for each player given their birthdate. |
| Beach Volleyball | 54:30 | Turn the summary statistics at timestamp 42:00 into an anonymous function using the . %>% dot-pipe shorthand. |
| Beach Volleyball | 1:04:30 | Summary of screencast |
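The score-tidying steps at 21:20–22:45 chain separate_rows and separate; a toy sketch with an invented score string:

```r
library(dplyr)
library(tidyr)

matches <- tibble(match = 1, score = "21-18, 15-21, 15-13")

matches %>%
  separate_rows(score, sep = ", ") %>%       # one row per set
  separate(score, into = c("w_score", "l_score"),
           sep = "-", convert = TRUE)        # convert = TRUE yields integers
```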
| Screencast | Time | Description |
|---|---|---|
| Cocktails | 6:20 | Use fct_reorder from the forcats package to reorder the ingredient factor levels along n. |
| Cocktails | 7:40 | Use fct_lump from the forcats package to lump together all the levels except the n most frequent in the category and ingredient variables. |
| Cocktails | 11:30 | Use pairwise_cor from the widyr package to find the correlation between the ingredients. |
| Cocktails | 16:00 | Use reorder_within from the tidytext package with scale_x_reordered to reorder the columns in each facet. |
| Cocktails | 19:45 | Use the ggraph and igraph packages to create a network diagram |
| Cocktails | 25:15 | Use extract from the tidyr package with the regular expression "(.*) oz" to create a new variable amount which doesn't include the oz. |
| Cocktails | 26:40 | Use extract with regex to turn the strings in the new amount variable into separate columns for the ones, numerator, and denominator. |
| Cocktails | 28:53 | Use replace_na from the tidyr package to replace NA with zeros in the ones, numerator, and denominator columns. David ends up replacing the zeros in the denominator column with ones in order for the calculation to work. |
| Cocktails | 31:49 | Use geom_text_repel from the ggrepel package to add ingredient labels to the geom_point plot. |
| Cocktails | 32:30 | Use na_if from the dplyr package to replace zeros with NA |
| Cocktails | 34:25 | Use scale_size_continuous with labels = percent_format() to convert size legend values to percent. |
| Cocktails | 36:35 | Change the size of the points in the network diagram proportional to n using vertices = ingredient_info within graph_from_data_frame and aes(size = n) within geom_node_point. |
| Cocktails | 48:05 | Use widely_svd from the widyr package to perform principal component analysis on the ingredients. |
| Cocktails | 52:32 | Use paste0 to concatenate PC and dimension in the facet panel titles. |
| Cocktails | 57:00 | Summary of screencast |
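The pairwise_cor step at 11:30 correlates ingredients by co-occurrence within drinks; a minimal sketch with toy data:

```r
library(dplyr)
library(widyr)

cocktails <- tibble(
  drink      = c("a", "a", "b", "b", "c", "c"),
  ingredient = c("gin", "lime", "gin", "tonic", "lime", "gin")
)

# Correlation between ingredients based on which drinks they share
cocktails %>%
  pairwise_cor(ingredient, drink, sort = TRUE)
```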
| Screencast | Time | Description |
|---|---|---|
| African-American Achievements | 8:20 | Use fct_reorder from the forcats package to reorder the category factor levels by sorting along n. |
| African-American Achievements | 11:35 | Use str_remove from the stringr package to remove anything after a bracket or parenthesis from the person variable with the regular expression "[\\[\\(].*" David then discusses how web scraping may be a better option than parsing the strings. |
| African-American Achievements | 12:25 | Use str_trim from the stringr package to remove the whitespace from the person variable. |
| African-American Achievements | 15:50 | Create an interactive plotly timeline. |
| African-American Achievements | 18:20 | Use ylim(c(-.1, 1)) to set scale limits moving the geom_point to the bottom of the graph. |
| African-American Achievements | 19:30 | Use paste0 from base R to concatenate the accomplishment and person with ": " in between the two displayed in the timeline hover label. |
| African-American Achievements | 20:30 | Set y to category in ggplot aesthetics to get 8 separate timelines on one plot, one for each category. Doing this allows David to remove the ylim mentioned above. |
| African-American Achievements | 22:25 | Use the plotly tooltip = "text" parameter to get just a single line of text in the plotly hover labels. |
| African-American Achievements | 26:05 | Use glue from the glue package to reformat text with \n included so that the single line of text can now be broken up into 2 separate lines in the hover labels. |
| African-American Achievements | 33:55 | Use separate_rows from the tidyr package to separate the occupation_s variable from the science dataset into multiple columns delimited by a semicolon with sep = "; " |
| African-American Achievements | 34:25 | Use str_to_title from the stringr package to convert the occupation_s variable to title case. |
| African-American Achievements | 35:15 | Use str_detect from the stringr package to detect the presence of statistician from within the occupation_s variable with regex("statistician", ignore_case = TRUE) to perform a case-insensitive search. |
| African-American Achievements | 41:55 | Use the rvest package with Selector Gadget to scrape additional information about the individual from their Wikipedia infobox. |
| African-American Achievements | 49:15 | Use map and possibly from the purrr package to separate out the downloading of data from parsing the useful information. David then turns the infobox extraction step into an anonymous function using the . %>% dot-pipe shorthand. |
| African-American Achievements | 58:40 | Summary of screencast |
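The scraping pattern from 41:55–49:15 pairs rvest with purrr::possibly so one failed download doesn't abort a whole map; a hedged sketch (the URL and the ".infobox" selector are illustrative, not necessarily what was used in the screencast):

```r
library(rvest)
library(purrr)

# possibly() returns NULL on failure instead of throwing an error
safe_read <- possibly(read_html, otherwise = NULL)

page <- safe_read("https://en.wikipedia.org/wiki/Katherine_Johnson")
if (!is.null(page)) {
  page %>%
    html_element(".infobox") %>%  # CSS selector found with SelectorGadget
    html_table()
}
```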
| Screencast | Time | Description |
|---|---|---|
| African-American History | 6:55 | Use fct_lump from the forcats package to lump together all the factor levels in ship_name except the n most frequent. Then use filter with != "Other" to remove the lumped level. |
| African-American History | 8:00 | Use fct_reorder from the forcats package to reorder the ship_name factor levels by sorting along the n_slaves_arrived variable. |
| African-American History | 10:20 | Add geom_vline to geom_histogram to annotate the plot with a vertical line indicating the Revolutionary War and the Civil War. |
| African-American History | 13:00 | Use truncated division within count to create a new decade variable equal to 10 * (year_arrival %/% 10) |
| African-American History | 17:20 | Use str_trunc from the stringr package to truncate the titles in each facet panel accounting for the slave ports with really long names. |
| African-American History | 18:05 | Another option for accounting for long titles in the facet panels is to use strip.text within theme with element_text(size = 6) |
| African-American History | 26:55 | Use the ggraph package to create a network diagram using port_origin and port_arrival. |
| African-American History | 29:05 | Use arrow from the grid package to add directional arrows to the points in the network diagram. |
| African-American History | 29:40 | Use scale_edge_width_continuous from the ggraph package to adjust the width of the edges in the network diagram. |
| African-American History | 35:25 | Within summarize use mean(n_slaves_arrived, na.rm = TRUE) * n() to come up with an estimated total number of slaves, since 49% of the data is missing. |
| African-American History | 48:20 | Create a faceted stacked percent barplot (spinogram) showing the percentage of black_free, black_slaves, white, and other for each region. |
| African-American History | 51:00 | Use the wordcloud package to create a wordcloud with the african_names dataset. David has issues with the wordcloud package and opts to use ggwordcloud instead. Also mentions the wordcloud2 package. |
| African-American History | 55:20 | Use fct_recode from the forcats package to change the factor levels for the gender variable while renaming Man = "Boy" and Woman = "Girl" |
| African-American History | 57:20 | Use reorder_within from the tidytext package to reorder the geom_col by n within gender variable for each facet panel. |
| African-American History | 59:00 | Summary of screencast |
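The decade bucketing at 13:00 relies on truncated division; a tiny sketch with invented years:

```r
library(dplyr)

voyages <- tibble(year_arrival = c(1712, 1719, 1725, 1841))

# %/% is integer (truncated) division, so 10 * (1719 %/% 10) is 1710
voyages %>%
  count(decade = 10 * (year_arrival %/% 10))
```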
| Screencast | Time | Description |
|---|---|---|
| Caribou Locations | 4:00 | Use summarize and across to calculate the proportion of NA values in the individuals dataset. Note, you do not need to use list(). |
| Caribou Locations | 9:00 | Use ggplot and borders from the ggplot2 package to create a map of Canada with deploy_on_longitude and deploy_on_latitude from the individuals dataset. |
| Caribou Locations | 13:50 | Import Canada province shapefile using the sf package. [Unsuccessful] |
| Caribou Locations | 25:00 | Use min and max from base R within summarize to find out the start and end dates for each caribou in the locations dataset. |
| Caribou Locations | 27:15 | Use sample from base R to pick one single caribou at a time, then use that subset with geom_path from ggplot2 to track the path that caribou takes over time. color = factor(floor_date(timestamp, "quarter")) is used to color the path according to the quarter the observation occurred in. |
| Caribou Locations | 35:15 | Use as.Date from base R and floor_date from the lubridate package to convert the timestamp variable into quarters, then facet_wrap the previous plot by quarter. |
| Caribou Locations | 37:15 | Within mutate, use as.numeric(difftime(timestamp, lag(timestamp), units = "hours")) from base R to figure out the gap in time between observations. |
| Caribou Locations | 43:05 | Use distHaversine from the geosphere package to calculate distance in km then convert it to speed in kph. |
| Caribou Locations | 1:00:00 | Summary of dataset. |
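The speed calculation at 37:15–43:05 combines difftime with geosphere::distHaversine; a sketch on two invented GPS fixes (coordinates and times are made up):

```r
library(geosphere)

t1 <- as.POSIXct("2020-01-01 00:00", tz = "UTC")
t2 <- as.POSIXct("2020-01-01 02:00", tz = "UTC")
hours <- as.numeric(difftime(t2, t1, units = "hours"))

# distHaversine takes c(longitude, latitude) pairs and returns metres
km <- distHaversine(c(-122.00, 53.00), c(-122.10, 53.05)) / 1000
kph <- km / hours
```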
| Screencast | Time | Description |
|---|---|---|
| X-Men Comics | 07:25 | Using separate to separate the name from the secret identity in the character column |
| X-Men Comics | 09:55 | Using summarize and across to find the frequency of the action variables and find out how many issues each action was used for each character |
| X-Men Comics | 13:25 | Create a geom_col chart to visualize which character speaks in the most issues |
| X-Men Comics | 18:35 | Create a geom_point chart to visualize each character’s average lines per issue in which the character is depicted |
| X-Men Comics | 22:05 | Create a geom_point chart to visualize each character’s average thoughts per issue in which the character is depicted |
| X-Men Comics | 23:10 | Create a geom_point chart to visualize character’s speech versus thought ratio per issue in which the character is depicted |
| X-Men Comics | 30:05 | Create a geom_point to visualize character’s number of lines while in costume versus not in costume |
| X-Men Comics | 34:30 | Create a geom_point chart to visualize the lines in costume versus lines out of costume ratio |
| X-Men Comics | 39:20 | Create a lollipop graph using geom_point and geom_errorbarh to visualize the lines in costume versus lines out of costume ratio and their distance from 1.0 (1 to 1) |
| X-Men Comics | 45:00 | Use summarize to find the frequency of each location and the total number of unique issues where the location is used |
| X-Men Comics | 46:00 | Use summarize and fct_lump to count how many issues each author has written while lumping together all authors except the most frequent |
| X-Men Comics | 47:25 | Use summarize and fct_lump to see if the authors rates of passing the Bechdel test differ from one another |
| X-Men Comics | 52:45 | Create a geom_line chart to visualize if the rates of passing the Bechdel test changed over time and floor division %/% to generate 20 observations per group |
| X-Men Comics | 54:35 | Create a geom_col to visualize the amount of lines each character has per issue over time giving context to Bechdel test passing rates |
| X-Men Comics | 1:00:00 | Summary of screencast |
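The summarize-with-across step at 09:55 can be sketched with toy data (character names and counts invented):

```r
library(dplyr)

xmen <- tibble(
  character = c("Storm", "Storm", "Wolverine"),
  speech    = c(10, 0, 4),
  thought   = c(2, 1, 0)
)

# One total and one issue count per action column;
# across() names the outputs speech_total, speech_issues, and so on
xmen %>%
  group_by(character) %>%
  summarize(across(c(speech, thought),
                   list(total = sum, issues = ~ sum(.x > 0))))
```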
| Screencast | Time | Description |
|---|---|---|
| Coffee Ratings | 08:15 | Using fct_lump within count and then mutate to lump the variety of coffee together except for the most frequent |
| Coffee Ratings | 08:50 | Create a geom_boxplot to visualize the variety and the distribution of total_cup_points |
| Coffee Ratings | 09:55 | Create a geom_histogram to visualize the variety and the distribution of total_cup_points |
| Coffee Ratings | 11:40 | Using fct_reorder to reorder variety by sorting it along total_cup_points in ascending order |
| Coffee Ratings | 12:35 | Using summarize with across to calculate the percent of missing data (NA) for each rating variable |
| Coffee Ratings | 15:20 | Create a bar chart using geom_col with fct_lump to visualize the frequency of top countries |
| Coffee Ratings | 20:35 | Using pivot_longer to pivot the rating metrics for wide format to long format |
| Coffee Ratings | 21:30 | Create a geom_line chart to see if the sum of the rating categories equals the total_cup_points column |
| Coffee Ratings | 23:10 | Create a geom_density_ridges chart to show the distribution of ratings across each rating metric |
| Coffee Ratings | 24:35 | Using summarize with mean and sd to show the average rating per metric with its standard deviation |
| Coffee Ratings | 26:15 | Using pairwise_cor to find correlations amongst the rating metrics |
| Coffee Ratings | 27:20 | Create a network plot to show the clustering of the rating metrics |
| Coffee Ratings | 29:35 | Using widely_svd to visualize the biggest source of variation with the rating metrics (Singular value decomposition) |
| Coffee Ratings | 37:40 | Create a geom_histogram to visualize the distribution of altitude |
| Coffee Ratings | 40:20 | Using pmin to set a maximum numeric altitude value of 3000 |
| Coffee Ratings | 41:05 | Create a geom_point chart to visualize the correlation between altitude and quality (total_cup_points) |
| Coffee Ratings | 42:00 | Using summarize with cor to show the correlation between altitude and each rating metric |
| Coffee Ratings | 44:25 | Create a linear model lm for each rating metric then visualize the results using a geom_line chart to show how each kilometer of altitude contributes to the score |
| Coffee Ratings | 50:35 | Summary of screencast |
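The per-metric linear models at 44:25 fit one lm per group; a hedged sketch with simulated ratings (the screencast uses the coffee data, not random numbers):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
ratings <- tibble(
  altitude_km = runif(100, 1, 2),
  metric = sample(c("aroma", "flavor"), 100, replace = TRUE),
  value  = rnorm(100, mean = 7.5, sd = 0.3)
)

ratings %>%
  group_by(metric) %>%
  summarize(model = list(lm(value ~ altitude_km))) %>%
  mutate(tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  filter(term == "altitude_km")  # slope: score change per km of altitude
```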
| Screencast | Time | Description |
|---|---|---|
| Australian Animal Outcomes | 1:20 | Using use_tidytemplate to open the project dataset with the package's tidytemplate Rmd |
| Australian Animal Outcomes | 4:30 | Using rename to rename Total column to total |
| Australian Animal Outcomes | 6:20 | Using fct_reorder to reorder stacked barplot with weight = sum |
| Australian Animal Outcomes | 7:00 | Using fct_lump with w = n to lump together outcome factor levels, displaying the most frequent with the rest lumped into Other |
| Australian Animal Outcomes | 9:15 | Using fct_recode to combine the factor level In Stock with Currently In Care |
| Australian Animal Outcomes | 12:10 | Using fct_reorder to reorder facet_wrap panels |
| Australian Animal Outcomes | 13:03 | Using scale_y_continuous with labels = comma to separate digits with comma |
| Australian Animal Outcomes | 14:10 | Using complete to account for missing combinations of data, filling the released column with 0 |
| Australian Animal Outcomes | 16:10 | Using max(year) within filter to subset the data, displaying only the most recent year |
| Australian Animal Outcomes | 19:30 | Using pivot_longer to pivot location variables from wide to long |
| Australian Animal Outcomes | 21:45 | Web scraping a table from Wikipedia with SelectorGadget and rvest |
| Australian Animal Outcomes | 25:45 | Using str_to_upper to upper case the values in the shorthand column |
| Australian Animal Outcomes | 27:13 | Using parse_number to remove commas from population and area columns |
| Australian Animal Outcomes | 28:55 | Using bind_rows to bind the two web-scraped tables from Wikipedia together by row |
| Australian Animal Outcomes | 29:35 | Using inner_join to combine the Wikipedia table with the original data set |
| Australian Animal Outcomes | 29:47 | Using mutate to create new per_capita_million column to show outcome on a per million people basis |
| Australian Animal Outcomes | 37:25 | Using summarize to create new column pct_euthanized showing percent of cats and dogs euthanized over time. Formula accounts for 0 values thus avoiding a resulting empty vector. |
| Australian Animal Outcomes | 39:10 | Using scale_y_continuous with labels = percent to add percentage sign to y-axis values |
| Australian Animal Outcomes | 42:45 | Create a choropleth map of Australia from an Australian states shapefile using the sf and ggplot2 packages |
| Australian Animal Outcomes | 55:45 | Add animation to the map of Australia showing the percent of cats euthanized by region using gganimate |
| Australian Animal Outcomes | 1:01:35 | Summary of screencast |
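The complete() call at 14:10 fills in combinations that never occur in the data; a minimal sketch with invented counts:

```r
library(dplyr)
library(tidyr)

outcomes <- tibble(
  year    = c(2018, 2018, 2019),
  outcome = c("Reclaimed", "Released", "Reclaimed"),
  n       = c(5, 3, 4)
)

# Adds the missing 2019/Released row, with n filled as 0 rather than NA
outcomes %>%
  complete(year, outcome, fill = list(n = 0))
```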
| Screencast | Time | Description |
|---|---|---|
| Palmer Penguins | 11:17 | Create a pivoted histogram plot to visualize the distribution of penguin metrics using pivot_longer, geom_histogram, and facet_wrap |
| Palmer Penguins | 14:40 | Create a pivoted density plot to visualize the distribution of penguin metrics using geom_density and facet_wrap |
| Palmer Penguins | 15:21 | Create a pivoted boxplot plot to visualize the distribution of penguin metrics using geom_boxplot and facet_wrap |
| Palmer Penguins | 17:50 | Create a bar plot to show penguin species changed over time |
| Palmer Penguins | 18:25 | Create a bar plot to show species counts per island |
| Palmer Penguins | 20:00 | Create a logistic regression model to predict if a penguin is Adelie or not using bill length with cross validation of metrics |
| Palmer Penguins | 39:35 | Create second logistic regression model using 4 predictive metrics (bill length, bill depth, flipper length, body mass) and then compare the accuracy of both models |
| Palmer Penguins | 43:25 | Create a k-nearest neighbor model and then compare accuracy against logistic regression models to see which has the highest cross validated accuracy |
| Palmer Penguins | 53:05 | What is the accuracy of the testing holdout data on the k-nearest neighbor model? |
| Palmer Penguins | 1:05:40 | Create a decision tree and then compare accuracy against the previous models to see which has the highest cross validated accuracy + how to extract a decision tree |
| Palmer Penguins | 1:10:45 | Perform multi class regression using multinom_reg |
| Palmer Penguins | 1:19:40 | Summary of screencast |
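The cross-validated logistic regression above can be approximated with tidymodels; a sketch assuming the default glm engine and 5 folds (the screencast's exact resampling setup may differ):

```r
library(tidymodels)
library(palmerpenguins)

# Binary outcome: is the penguin an Adelie?
adelie <- penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  mutate(is_adelie = factor(species == "Adelie"))

set.seed(2020)
folds <- vfold_cv(adelie, v = 5)

wf <- workflow() %>%
  add_model(logistic_reg()) %>%
  add_formula(is_adelie ~ bill_length_mm)

# Cross-validated accuracy and ROC AUC across the folds
fit_resamples(wf, folds) %>%
  collect_metrics()
```
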
| Screencast | Time | Description |
|---|---|---|
| European Energy | 01:50 | Using count to get an overview of categorical data |
| European Energy | 07:25 | Using pivot_longer and gather to pivot date variables from wide to long |
| European Energy | 09:00 | Using as.integer to change year variable from character to integer class |
| European Energy | 10:10 | Using fct_reorder to reorder stacked barplot |
| European Energy | 10:30 | Using scale_y_continuous with labels = comma from scales package to insert a comma every three digits on the y-axis |
| European Energy | 16:35 | Using replace_na and list to replace NA values in country_name column with United Kingdom |
| European Energy | 18:05 | Using fct_lump to lump factor levels together except for the 10 most frequent for each facet panel |
| European Energy | 20:10 | Using reorder_within with fun = sum and scale_y_reordered to reorder the categories within each facet panel |
| European Energy | 24:30 | Using ggflags package to add country flags |
| European Energy | 29:20 | (Unsuccessfully) Using fct_recode to rename the ISO two-digit identifier for the United Kingdom from UK to GB |
| European Energy | 33:20 | Using ifelse to replace the ISO two-digit identifier for the United Kingdom from UK to GB & from EL to GR for Greece |
| European Energy | 40:45 | Using str_to_lower to convert observations in country column to lower case |
| European Energy | 45:00 | Creating a slope graph to show differences in Nuclear production (2016 versus 2018) |
| European Energy | 47:00 | Using scale_x_continuous with breaks = c(2016, 2018) to show only 2016 and 2018 on x-axis |
| European Energy | 48:20 | Extend x-axis limits using scale_x_continuous with limits = c(2015, 2019) and geom_text with an ifelse within hjust to alternate labels for the right and left side of slope graph |
| European Energy | 52:40 | Creating a slopegraph function |
| European Energy | 1:00:00 | Summary of screencast |
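The slopegraph function at 52:40 isn't reproduced here, but its core ingredients — tidy-eval column arguments plus the alternating hjust trick from 48:20 — can be sketched (the column names are assumptions):

```r
library(ggplot2)
library(dplyr)

# A minimal slope-graph helper: {{ }} embraces bare column names, and
# hjust alternates labels between the left and right endpoints.
plot_slope <- function(data, year, value, country) {
  data %>%
    ggplot(aes({{ year }}, {{ value }}, group = {{ country }})) +
    geom_line() +
    geom_point() +
    geom_text(aes(label = {{ country }},
                  hjust = ifelse({{ year }} == min({{ year }}), 1.1, -0.1))) +
    scale_x_continuous(breaks = c(2016, 2018), limits = c(2015, 2019))
}
```
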
| Screencast | Time | Description |
|---|---|---|
| Plants in Danger | 2:00 | Getting an overview of categorical data |
| Plants in Danger | 5:00 | Using fct_relevel to reorder the "Before 1900" level to the first location leaving the other levels in their existing order |
| Plants in Danger | 8:05 | Using n and sum in fct_reorder to reorder factor levels when there are multiple categories in count |
| Plants in Danger | 12:00 | Using reorder_within and scale_y_reordered such that the values are ordered within each facet |
| Plants in Danger | 14:55 | Using axis.text.x to rotate overlapping labels |
| Plants in Danger | 19:05 | Using filter and fct_lump to lump all levels except for the 8 most frequent facet panels |
| Plants in Danger | 26:55 | Using separate to separate the character column binomial_name into multiple columns (genus and species) |
| Plants in Danger | 28:20 | Using fct_lump within count to lump all levels except for the 8 most frequent genus |
| Plants in Danger | 45:30 | Using rvest and SelectorGadget to web scrape list of species |
| Plants in Danger | 49:35 | Using str_trim to remove whitespace from character string |
| Plants in Danger | 50:00 | Using separate to separate character string into genus, species, and rest/citation columns and using extra = "merge" to merge extra pieces into the rest/citation column |
| Plants in Danger | 51:00 | Using rvest and SelectorGadget to web scrape image links |
| Plants in Danger | 57:50 | Summary of screencast |
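The separate() step with extra = "merge" (50:00) works like this; the binomial strings are illustrative:

```r
library(dplyr)
library(tidyr)

# extra = "merge" keeps everything after the second space together in
# one citation column instead of dropping the extra pieces.
tibble(scraped = c("Acalypha dikuluwensis P.A.Duvign. & Dewit",
                   "Basananthe cupricola Malaisse & Bamps")) %>%
  separate(scraped, into = c("genus", "species", "citation"),
           sep = " ", extra = "merge")
```
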
| Screencast | Time | Description |
|---|---|---|
| Chopped | 5:20 | Use geom_histogram to visualize the distribution of episode ratings. |
| Chopped | 6:30 | Use geom_point and geom_line with color = factor(season) to visualize the episode rating for every episode. |
| Chopped | 7:15 | Use group_by and summarize to show the average rating for each season and the number of episodes in each season. |
| Chopped | 7:15 | Use geom_line and geom_point with size = n_episodes to visualize the average rating for each season with point size indicating the total number of episodes (larger = more episodes, smaller = fewer episodes). |
| Chopped | 10:55 | Use fct_reorder to reorder the episode_name factor levels by sorting along the episode_rating variable. |
| Chopped | 10:55 | Use geom_point to visualize the top episodes by rating. Use the 'glue' package to place season number and episode number before episode name on the y axis. |
| Chopped | 15:20 | Use pivot_longer to combine ingredients into one single column. Use separate_rows with sep = ", " to separate out the ingredients with each ingredient getting its own row. |
| Chopped | 18:10 | Use fct_lump to lump ingredients together except for the 10 most frequent. Use fct_reorder to reorder ingredient factor levels by sorting against n. |
| Chopped | 18:10 | Use geom_col to create a stacked bar plot to visualize the most common ingredients by course. |
| Chopped | 19:45 | Use fct_relevel to reorder course factor levels to appetizer, entree, dessert. |
| Chopped | 21:00 | Use fct_rev and scale_fill_discrete with guide = guide_legend(reverse = TRUE) to reorder the segments within the stacked bar plot. |
| Chopped | 23:20 | Use the widyr package and pairwise_cor to find out what ingredients appear together. Mentioned: David Robinson - The widyr Package YouTube Talk at 2020 R Conference |
| Chopped | 26:20 | Use ggraph, geom_edge_link, geom_node_point, and geom_node_text to create an ingredient network diagram to show their makeup and how they interact. |
| Chopped | 28:00 | Use pairwise_count from widyr to count the number of times each pair of items appear together within a group defined by feature. |
| Chopped | 30:15 | Use unite from the tidyr package in order to paste together the episode_course and series_episode columns into one column to figure out if any pairs of ingredients appear together in the same course across episodes. |
| Chopped | 31:55 | Use summarize with min, mean, max, and n() to create the first_season, avg_season, last_season and n_appearances variables. |
| Chopped | 34:35 | Use slice with tail to get the n ingredients that appear in early and late seasons. |
| Chopped | 35:40 | Use geom_boxplot to visualize the distribution of each ingredient across all seasons. |
| Chopped | 36:50 | Fit predictive models (linear regression, random forest, and natural spline) to determine if episode rating is explained by the ingredients or season. Use pivot_wider with values_fill = list(value = 0), with 1 indicating the ingredient was used and 0 indicating it wasn't. |
| Chopped | 1:17:25 | Summary of screencast |
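The ingredient-pairing steps (15:20 and 23:20) combine separate_rows with widyr's pairwise_cor; a toy sketch with invented dishes:

```r
library(dplyr)
library(tidyr)
library(widyr)

# Invented dishes: one comma-separated ingredient string per episode,
# split so each ingredient gets its own row.
dishes <- tibble(
  episode = c(1, 2, 3),
  ingredients = c("onion, garlic",
                  "onion, garlic, thyme",
                  "thyme, saffron")
) %>%
  separate_rows(ingredients, sep = ", ")

# Which ingredients tend to appear in the same episode?
dishes %>%
  pairwise_cor(ingredients, episode, sort = TRUE)
```
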
| Screencast | Time | Description |
|---|---|---|
| Global Crop Yields | 03:35 | Using rename to shorten column name |
| Global Crop Yields | 06:40 | Using rename_all with str_remove and regex to remove characters in column name |
| Global Crop Yields | 07:40 | Using pivot_longer to change data from wide to long |
| Global Crop Yields | 08:25 | Create a faceted geom_line chart |
| Global Crop Yields | 09:40 | Using fct_reorder to reorder facet panels in ascending order |
| Global Crop Yields | 11:50 | Create an interactive Shiny dashboard |
| Global Crop Yields | 33:20 | Create a faceted geom_line chart with add_count and filter(n == max(n)) to subset the data for crops that have observations in every year |
| Global Crop Yields | 36:50 | Create a faceted geom_point chart showing the crop yields at start and end over a 50 year period (1968 start date and 2018 end date) |
| Global Crop Yields | 45:00 | Create a geom_boxplot to visualize the distribution of yield ratios for the different crops to see how efficiency has increased across countries |
| Global Crop Yields | 46:00 | Create a geom_col chart to visualize the median yield ratio for each crop |
| Global Crop Yields | 47:50 | Create a geom_point chart to visualize efficiency improvement for each country for a specific crop (yield start / yield ratio) |
| Global Crop Yields | 50:25 | Using the countrycode package to color geom_point chart by continent names |
| Global Crop Yields | 56:50 | Summary of screencast |
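The rename_all / pivot_longer cleanup (06:40–07:40) looks roughly like this; the two crops and their values are invented:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Invented wide-format yields: strip the verbose column suffix, then
# pivot so each row is one country-crop-yield observation.
tibble(
  entity = c("Afghanistan", "Albania"),
  `Wheat (tonnes per hectare)` = c(2.1, 4.3),
  `Rice (tonnes per hectare)` = c(2.7, 5.6)
) %>%
  rename_all(str_remove, " \\(tonnes per hectare\\)") %>%
  pivot_longer(-entity, names_to = "crop", values_to = "yield")
```
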
| Screencast | Time | Description |
|---|---|---|
| Friends | 7:30 | Use dplyr package's count function to count the unique values of multiple variables. |
| Friends | 9:35 | Use geom_col to show how many lines of dialogue there is for each character. Use fct_reorder to reorder the speaker factor levels by sorting along n. |
| Friends | 12:07 | Use semi_join to join the friends dataset with main_cast with by = "speaker", returning all rows from friends with a match in main_cast. |
| Friends | 12:30 | Use unite to create the episode_number variable which pastes together season and episode with sep = ".". Then, use inner_join to combine above dataset with friends_info with by = c("season", "episode"). Then, use mutate and the glue package instead to combine { season }.{ episode } { title }. Then use fct_reorder(episode_title, season + .001 * episode) to order it by season first then episode. |
| Friends | 15:45 | Use geom_point to visualize episode_title and us_views_millions. Use as.integer to change episode_title to integer class. Add labels to geom_point using geom_text with check_overlap = TRUE so text that overlaps previous text in the same layer will not be plotted. |
| Friends | 19:95 | Run the above plot again using imdb_rating instead of us_views_millions |
| Friends | 21:35 | Ahead of modeling: Use geom_boxplot to visualize the distribution of speaking for main characters. Use the complete function with fill = list(n = 0) to turn implicit missing values in the data set into explicit rows filled with 0. Demonstration of how to account for missing imdb_rating values using the fill function with .direction = "downup" to keep the imdb rating across the same title. |
| Friends | 26:45 | Ahead of modeling: Use summarize with cor(log2(n), imdb_rating) to find the correlation between speaker and imdb rating -- the fact that the correlation is positive for all speakers gives David a suspicion that some episodes are longer than others because they're in 2 parts with higher ratings due to important moments. David addresses this confounding factor by including percentage of lines instead of number of lines. Visualize results with geom_boxplot, geom_point with geom_smooth. |
| Friends | 34:05 | Use a linear model to predict imdb rating based on various variables. |
| Friends | 42:00 | Use the tidytext and tidylo packages to see what words are most common amongst characters, and whether they are said more times than would be expected by chance. Use geom_col to visualize the most overrepresented words per character according to log_odds_weighted. |
| Friends | 54:15 | Use the widyr package and pairwise correlation to determine which characters tend to appear in the same scenes together. Use geom_col to visualize the correlation between characters. |
| Friends | 1:00:25 | Summary of screencast |
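The glue / fct_reorder episode-labeling trick (12:30) can be sketched with invented titles:

```r
library(dplyr)
library(glue)
library(forcats)

tibble(
  season = c(1, 1, 2),
  episode = c(1, 2, 1),
  title = c("Pilot", "Sonogram", "New Girlfriend")
) %>%
  mutate(name = glue("{season}.{episode} {title}"),
         # season + .001 * episode sorts by season first, then episode
         name = fct_reorder(name, season + .001 * episode))
```
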
| Screencast | Time | Description |
|---|---|---|
| Government Spending on Kids | 6:15 | Using geom_line and summarize to visualize education spending over time. First for all states. Then individual states. Then small groups of states using %in%. Then in random groups of size n using %in% and sample with unique. fct_reorder is used to reorder state factor levels by sorting along the inf_adj variable. geom_vline used to add reference to the 2009 financial crisis. |
| Government Spending on Kids | 16:00 | " |
| Government Spending on Kids | 23:35 | Create a function named plot_changed_faceted to make it easier to visualize the many other variables included in the dataset. |
| Government Spending on Kids | 27:25 | Create a function named plot_faceted with a {{ y_axis }} embracing argument. Adding this function creates two stages: one for data transformation and another for plotting. |
| Government Spending on Kids | 37:05 | Use the dir function with pattern and purrr package's map_df function to read in many different .csv files with GDP values for each state. Troubleshooting the "Can't combine &lt;character&gt; and &lt;double&gt; columns" error using function and mutate with across and as.numeric. Extract state name from filename using extract from tidyr and a regular expression. |
| Government Spending on Kids | 50:50 | Unsuccessful attempt at importing state population data from a user-unfriendly census.gov dataset by skipping the first 3 rows of the Excel file. |
| Government Spending on Kids | 54:22 | Use geom_col to see which states spend the most for each child for a single variable and multiple variables using %in%. Use scale_fill_discrete with guide_legend(reverse = TRUE) to change the ordering of the legend. |
| Government Spending on Kids | 57:40 | Use geom_col and pairwise_cor to visualize the correlation between variables across states in 2016 using pairwise correlation. |
| Government Spending on Kids | 1:02:02 | Use geom_point to plot inf_adjust_perchild_PK12ed versus inf_adj_perchild_highered. geom_text used to apply state names to each point. |
| Government Spending on Kids | 1:05:00 | Summary of screencast |
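The dir() + map_df() import at 37:05 follows a standard many-files pattern; the directory name, filename scheme, and regex here are assumptions, not the screencast's exact code:

```r
library(purrr)
library(readr)
library(tidyr)

# Read every CSV in a (hypothetical) gdp_data/ folder, keeping each
# filename so the state can be pulled out with a regular expression.
files <- dir("gdp_data", pattern = "\\.csv$", full.names = TRUE)

gdp <- files %>%
  set_names() %>%
  map_df(read_csv, .id = "filename") %>%
  extract(filename, "state", "gdp_([a-z_]+)\\.csv", remove = FALSE)
```
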
| Screencast | Time | Description |
|---|---|---|
| Himalayan Climbers | 3:00 | Create a geom_col chart to visualize the top 50 tallest mountains. Use fct_reorder to reorder the peak_name factor levels by sorting along the height_metres variable. |
| Himalayan Climbers | 8:50 | Use summarize with across to get the total number of climbs, climbers, deaths, and first year climbed. Use mutate to calculate the percent death rate for members and hired staff. Use inner_join and select to join with peaks dataset by peak_id. |
| Himalayan Climbers | 11:20 | Touching on statistical noise and how it impacts the death rate for mountains with fewer climbs, and how to account for it using various statistical methods including Beta Binomial Regression and Empirical Bayes. |
| Himalayan Climbers | 14:30 | Further description of Empirical Bayes and how to account for not overestimating death rate for mountains with fewer climbers. Recommended reading: Introduction to Empirical Bayes: Examples from Baseball Statistics by David Robinson. |
| Himalayan Climbers | 17:00 | Use the ebbr package (Empirical Bayes for Binomial in R) to create an Empirical Bayes Estimate for each mountain by fitting prior distribution across data and adjusting the death rates down or up based on the prior distributions. Use a geom_point chart to visualize the difference between the raw death rate and new ebbr fitted death rate. |
| Himalayan Climbers | 21:20 | Use geom_point to visualize how deadly each mountain is with geom_errorbarh representing the 95% credible interval between minimum and maximum values. |
| Himalayan Climbers | 26:35 | Use geom_point to visualize the relationship between death rate and height of mountain. There is not a clear relationship, but David does briefly mention how one could use Beta Binomial Regression to further inspect for possible relationships / trends. |
| Himalayan Climbers | 28:00 | Use geom_histogram and geom_boxplot to visualize the distribution of time it took climbers to go from basecamp to the mountain's high point for successful climbs only. Use mutate to calculate the number of days from basecamp to the highpoint. Add a column using case_when and str_detect to relabel strings in termination_reason that contain the word Success as Success; use a vector with %in% to change several termination_reason values to NA and the rest to Failed. Use fct_lump to show the top 10 mountains while lumping the other factor levels (mountains) into other. |
| Himalayan Climbers | 35:30 | For just Mount Everest, use geom_histogram and geom_density with fill = success to visualize the days from basecamp to highpoint for climbs that ended in success, failure or other. |
| Himalayan Climbers | 38:40 | For just Mount Everest, use geom_histogram to see the distribution of climbs per year. |
| Himalayan Climbers | 39:55 | For just Mount Everest, use geom_line and geom_point to visualize pct_death over time by decade. Use mutate with pmax and integer division to create a decade variable that lumps together the data for 1970 and before. |
| Himalayan Climbers | 41:30 | Write a function for summary statistics such as n_climbs, pct_success, first_climb, pct_death, and pct_hired_staff_death. |
| Himalayan Climbers | 46:20 | For just Mount Everest, use geom_line and geom_point to visualize pct_success over time by decade. |
| Himalayan Climbers | 47:10 | For just Mount Everest, use geom_line and geom_point to visualize pct_hired_staff_deaths over time by decade. David decides to visualize the pct_hired_staff_deaths and pct_death charts together on the same plot. |
| Himalayan Climbers | 50:45 | For just Mount Everest, fit a logistic regression model to predict the probability of death with format.pval to calculate the p.value. Use fct_lump to lump together all expedition_role factors except for the n most frequent. |
| Himalayan Climbers | 56:30 | Use group_by with integer division and summarize to calculate n_climbers and pct_death for age bucketed into decades. |
| Himalayan Climbers | 59:45 | Use geom_point and geom_errorbarh to visualize the logistic regression model with confidence intervals. |
| Himalayan Climbers | 1:03:30 | Summary of screencast |
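The Empirical Bayes shrinkage at 17:00 uses the ebbr package (a GitHub-only package, dgrtwo/ebbr); a toy sketch with invented peaks — add_ebb_estimate fits a beta prior across all rows and shrinks each raw rate toward it (a real fit needs far more rows than this):

```r
library(dplyr)
# devtools::install_github("dgrtwo/ebbr")
library(ebbr)

# Invented peaks: a small-sample peak's raw 20% death rate should be
# shrunk much more aggressively than the well-sampled peak's rate.
peaks <- tibble(
  peak     = c("Small sample", "Mid sample", "Well sampled"),
  deaths   = c(1, 5, 100),
  climbers = c(5, 120, 8000)
)

peaks %>%
  add_ebb_estimate(deaths, climbers)
```
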
| Screencast | Time | Description |
|---|---|---|
| Beyonce and Taylor Swift Lyrics | 7:50 | Use fct_reorder from the forcats package to reorder title factor levels by sorting along the sales variable in geom_col plot. |
| Beyonce and Taylor Swift Lyrics | 8:10 | Use labels = dollar from the scales package to format the geom_col x-axis values as currency. |
| Beyonce and Taylor Swift Lyrics | 11:15 | Use rename_all(str_to_lower) to convert variable names to lowercase. |
| Beyonce and Taylor Swift Lyrics | 12:45 | Use unnest_tokens from the tidytext package to split the lyrics into one-lyric-per-row. |
| Beyonce and Taylor Swift Lyrics | 13:00 | Use anti_join from the dplyr package to find the most common words in the lyrics without stop_words. |
| Beyonce and Taylor Swift Lyrics | 15:15 | Use bind_tf_idf from the tidytext package to determine tf - the proportion each word has in each album and idf - how specific each word is to each particular album. |
| Beyonce and Taylor Swift Lyrics | 17:45 | Use reorder_within with scale_y_reordered in order to reorder the bars within each facet panel. David replaces top_n with slice_max from the dplyr package in order to show the top 10 words with with_ties = FALSE. |
| Beyonce and Taylor Swift Lyrics | 20:45 | Use bind_log_odds from the tidylo package to calculate the log odds ratio of album and words, that is how much more common is the word in a specific album than across all the other albums. |
| Beyonce and Taylor Swift Lyrics | 23:10 | Use filter(str_length(word) <= 3) to come up with a list in order to remove common filler words like ah, uh, ha, ey, eeh, and huh. |
| Beyonce and Taylor Swift Lyrics | 27:00 | Use mdy from the lubridate package and str_remove(released, " \\(.*\\)") from the stringr package to parse the dates in the released variable. |
| Beyonce and Taylor Swift Lyrics | 28:15 | Use inner_join from the dplyr package to join taylor_swift_words with release_dates. David ends up having to use fct_recode since the albums reputation and folklore were not lowercase in a previous table, thus excluding them from the inner_join. |
| Beyonce and Taylor Swift Lyrics | 28:30 | Use fct_reorder from the forcats package to reorder album factor levels by sorting along the released variable to be used in the faceted geom_col. |
| Beyonce and Taylor Swift Lyrics | 34:40 | Use bind_rows from the dplyr package to bind ts with beyonce with unnest_tokens from the tidytext package to get one lyric per row per artist. |
| Beyonce and Taylor Swift Lyrics | 38:40 | Use bind_log_odds to figure out which words are more likely to come from a Taylor Swift or Beyonce song. |
| Beyonce and Taylor Swift Lyrics | 41:10 | Use slice_max from the dplyr package to select the top 100 words by num_words_total and then the top 25 by log_odds_weighted. Results are used to create a diverging bar chart showing which words are most common between Beyonce and Taylor Swift songs. |
| Beyonce and Taylor Swift Lyrics | 44:40 | Use scale_x_continuous to make the log_odds_weighted scale more interpretable. |
| Beyonce and Taylor Swift Lyrics | 50:45 | Take the previous plot and turn it into a lollipop graph with geom_point(aes(size = num_words_total, color = direction)) |
| Beyonce and Taylor Swift Lyrics | 53:05 | Use ifelse to change the 1x value on the x-axis to same. |
| Beyonce and Taylor Swift Lyrics | 54:15 | Create a geom_point with geom_abline to show the most popular words they use in common. |
| Beyonce and Taylor Swift Lyrics | 1:01:55 | Summary of screencast |
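The bind_log_odds steps (20:45 and 38:40) take per-group word counts; a toy example with invented counts:

```r
library(dplyr)
library(tidylo)

# Invented word counts per artist; bind_log_odds adds a
# log_odds_weighted column measuring how overrepresented each word
# is in one artist's lyrics relative to the other's.
word_counts <- tibble(
  artist = c("Beyonce", "Beyonce", "Taylor Swift", "Taylor Swift"),
  word   = c("halo", "love", "love", "trouble"),
  n      = c(10, 25, 30, 12)
)

word_counts %>%
  bind_log_odds(artist, word, n) %>%
  arrange(desc(log_odds_weighted))
```
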
| Screencast | Time | Description |
|---|---|---|
| NCAA Women's Basketball | 15:00 | Use fct_relevel from the forcats package to order the factor levels for the tourney_finish variable. |
| NCAA Women's Basketball | 16:35 | Use geom_tile from the ggplot2 package to create a heatmap to show how far a particular seed ends up going in the tournament. |
| NCAA Women's Basketball | 20:35 | Use scale_y_continuous from the ggplot2 package with breaks = seq(1, 16) in order to include all 16 seeds. |
| NCAA Women's Basketball | 20:55 | Use geom_text from the ggplot2 package with label = percent(pct) to apply the percentage to each tile in the heatmap. |
| NCAA Women's Basketball | 21:40 | Use scale_x_discrete and scale_y_continuous both with expand = c(0, 0) to remove the space between the x and y axis and the heatmap tiles. David calls this flattening. |
| NCAA Women's Basketball | 32:15 | Use scale_y_reverse to flip the order of the y-axis from 1-16 to 16-1. |
| NCAA Women's Basketball | 34:45 | Use cor from the stats package to calculate the correlation between seed and tourney_finish. Then plotted to determine if there is a correlation over time. |
| NCAA Women's Basketball | 39:50 | Use geom_smooth with method = "loess" to add a smoothing line with confidence bound to aid in seeing the trend between seed and reg_percent. |
| NCAA Women's Basketball | 42:10 | Use fct_lump from the forcats package to lump together all the conference except for the n most frequent. |
| NCAA Women's Basketball | 42:55 | Use geom_jitter from the ggplot2 package instead of geom_boxplot to avoid overplotting which makes it easier to visualize the points that make up the distribution of the seed variable. |
| NCAA Women's Basketball | 47:05 | Use geom_smooth with method = "lm" to aid in seeing the trend between reg_percent and tourney_w. |
| NCAA Women's Basketball | 54:20 | Create a dot pipe function using . and %>% to avoid duplicating summary statistics with summarize. |
| NCAA Women's Basketball | 56:35 | Use glue from the glue package to concatenate together school and n_entries on the geom_col y-axis. |
| NCAA Women's Basketball | 59:50 | Summary of screencast |
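The heatmap recipe above (geom_tile + percent labels + reversed seed axis + expand = c(0, 0)) in miniature, with invented percentages on a 4-seed grid:

```r
library(ggplot2)
library(dplyr)
library(scales)

# Invented seed-by-finish percentages
grid <- expand.grid(seed = 1:4, finish = c("R64", "R32", "S16")) %>%
  mutate(pct = c(0.1, 0.2, 0.3, 0.4, 0.3, 0.3,
                 0.2, 0.2, 0.6, 0.5, 0.5, 0.4))

ggplot(grid, aes(finish, seed, fill = pct)) +
  geom_tile() +
  geom_text(aes(label = percent(pct))) +
  scale_y_reverse(breaks = seq(1, 4)) +   # seed 1 at the top
  scale_x_discrete(expand = c(0, 0))      # no gap between axis and tiles
```
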
| Screencast | Time | Description |
|---|---|---|
| Great American Beer Festival | 8:20 | Use pivot_wider with values_fill = list(value = 0) from the tidyr package along with mutate(value = 1) to pivot the medal variable from long to wide, adding a 1 for the medal type awarded and 0 for the remaining medal types in the row. |
| Great American Beer Festival | 11:25 | Use fct_lump from the forcats package to lump together all the beers except for the N most frequent. |
| Great American Beer Festival | 12:25 | Use str_to_upper from the stringr package to convert the case of the state variable to uppercase. |
| Great American Beer Festival | 12:25 | Use fct_relevel from the forcats package in order to reorder the medal factor levels. |
| Great American Beer Festival | 13:25 | Use fct_reorder from the forcats package to sort beer_name factor levels by sorting along n. |
| Great American Beer Festival | 14:30 | Use glue from the glue package to concatenate beer_name and brewery on the y-axis. |
| Great American Beer Festival | 15:00 | Use ties.method = "first" within fct_lump to show only the first brewery when a tie exists between them. |
| Great American Beer Festival | 19:25 | Use setdiff from the dplyr package and the state.abb built in vector from the datasets package to check which states are missing from the dataset. |
| Great American Beer Festival | 21:25 | Use summarize from the dplyr package to calculate the number of medals with n_medals = n(), the number of beers with n_distinct, the number of gold medals with sum(), and weighted medal totals using sum(as.integer(medal)) because medal is an ordered factor, so 1 for each bronze, 2 for each silver, and 3 for each gold. |
| Great American Beer Festival | 26:05 | Import Craft Beers Dataset from Kaggle using read_csv from the readr package. |
| Great American Beer Festival | 28:00 | Use inner_join from the dplyr package to join together the 2 datasets from Kaggle. |
| Great American Beer Festival | 29:40 | Use semi_join from the dplyr package to see if the beer names match the Kaggle dataset. Ends up at a dead end with not enough matches between the datasets. |
| Great American Beer Festival | 33:05 | Use bind_log_odds from the tidylo package to show the representation of each beer category for each state compared to the categories across the other states. |
| Great American Beer Festival | 33:35 | Use complete from the tidyr package in order to turn implicit missing values into explicit missing values. |
| Great American Beer Festival | 35:30 | Use reorder_within from the tidytext package and scale_y_reordered from the tidytext package in order to reorder the bars within each facet panel. |
| Great American Beer Festival | 36:40 | Use fct_reorder from the forcats package to reorder the facet panels in descending order. |
| Great American Beer Festival | 39:35 | For the previous plot, use fill = log_odds_weighted > 0 in the ggplot aes argument to highlight the positive and negative values. |
| Great American Beer Festival | 41:45 | Use add_count from the dplyr package to add a year_total variable which shows the total awards for each year. Then use this to calculate each state's share of the year's medals using mutate(pct_year = n / year_total) |
| Great American Beer Festival | 44:40 | Use glm from the stats package to create a logistic regression model to find out if there is a statistical trend in the probability of award success over time. |
| Great American Beer Festival | 47:15 | Expand on the previous model by using the broom package to fit multiple logistic regressions across multiple states instead of doing it for an individual state at a time. |
| Great American Beer Festival | 50:25 | Use conf.int = TRUE to add confidence bounds to the logistic regression output then use it to create a TIE Fighter plot to show which states become more or less frequent medal winners over time. |
| Great American Beer Festival | 53:00 | Use the state.name built-in vector with match from base R to change state abbreviations to state names. |
| Great American Beer Festival | 55:00 | Summary of screencast |
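The many-models step at 47:15 — one logistic regression per state via broom — can be sketched with simulated entries (the column names and data are invented):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Simulated data: one row per competition entry, gold = 1 if it won gold
set.seed(2020)
entries <- tibble(
  state = rep(c("CO", "CA"), each = 30),
  year  = rep(1990:2019, 2),
  gold  = rbinom(60, 1, 0.3)
)

# Nest by state, fit one glm per state, and tidy with conf.int = TRUE
# so the year terms can feed a TIE Fighter plot.
entries %>%
  nest(data = -state) %>%
  mutate(model  = map(data, ~ glm(gold ~ year, data = .x, family = "binomial")),
         tidied = map(model, tidy, conf.int = TRUE)) %>%
  unnest(tidied) %>%
  filter(term == "year")
```
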
| Screencast | Time | Description |
|---|---|---|
| IKEA Furniture | 4:30 | Use fct_reorder from the forcats package to reorder the factor levels for category sorted along n. |
| IKEA Furniture | 6:00 | Brief explanation of why scale_x_log10 is needed given the distribution of category and price with geom_boxplot. |
| IKEA Furniture | 7:00 | Using geom_jitter with geom_boxplot to show how many items are within each category. |
| IKEA Furniture | 8:00 | Use add_count from the dplyr package and glue from the glue package to concatenate the category name with category_total on the geom_boxplot y-axis. |
| IKEA Furniture | 9:00 | Convert from Saudi Riyals to United States Dollars. |
| IKEA Furniture | 11:05 | Create a ridgeplot - AKA joyplot - using ggridges package showing the distribution of price across category. |
| IKEA Furniture | 12:50 | Discussion on distributions and when to use a log scale. |
| IKEA Furniture | 19:20 | Use fct_lump from the forcats package to lump together all the levels in category except for the n most frequent. |
| IKEA Furniture | 21:00 | Use scale_fill_discrete from the ggplot2 package with guide = guide_legend(reverse = TRUE) to reverse the fill legend. |
| IKEA Furniture | 24:20 | Use str_trim from the stringr package to remove whitespace from the short_description variable. David then decides to use str_replace_all instead with the following regular expression "\\s+", " " to replace all whitespace with a single space instead. |
| IKEA Furniture | 25:30 | Use separate from the tidyr package with extra = "merge" and fill = "right" to separate item description from item dimension. |
| IKEA Furniture | 26:45 | Use extract from the tidyr package with the regular expression "([\\d\\-xX]+) cm" to extract the numbers before cm. |
| IKEA Furniture | 29:50 | Use unite from the tidyr package to paste together the category and main_description columns into a new column named category_and_description. |
| IKEA Furniture | 32:45 | Calculate the volume given the depth, height, and width of each item in dataset in liters using depth * height * width / 1000. At 36:15, David decides to change to cubic meters instead using depth * height * width / 1000000. |
| IKEA Furniture | 44:20 | Use str_squish from the stringr package to remove whitespace from the start to the end of the short_description variable. |
| IKEA Furniture | 48:00 | Use lm from the stats package to create a linear model on a log-log scale to predict the price of an item based on volume + category. David then uses fct_relevel to reorder the factor levels for category such that tables & desks is first (the starting point), since it's the most frequent item in the category variable and its price distribution is in the middle. |
| IKEA Furniture | 53:00 | Use the broom package to turn the model output into a coefficient / TIE fighter plot. |
| IKEA Furniture | 56:20 | Use str_remove from the stringr package to remove category from the start of the strings on the y-axis using the regular expression "^category" |
| IKEA Furniture | 57:50 | Summary of screencast |
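The dimension-extraction regex at 26:45 — capture the run of digits, dashes, and x's before " cm" — in isolation, with invented descriptions:

```r
library(dplyr)
library(tidyr)

# The capture group ([\\d\\-xX]+) grabs strings like "140x200" or
# "50x58x123" that sit immediately before " cm".
tibble(short_description = c("Bed frame 140x200 cm",
                             "Wardrobe 50x58x123 cm")) %>%
  extract(short_description, "dimensions", "([\\d\\-xX]+) cm",
          remove = FALSE)
```
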
| Screencast | Time | Description |
|---|---|---|
| Historical Phones | 2:15 | Use bind_rows from the dplyr package to combine the two data sets. |
| Historical Phones | 7:30 | Use group = interaction(type, country) within ggplot aes() to set the interaction type with every single country on one plot. |
| Historical Phones | 9:30 | Use semi_join from the dplyr package to join rows from phones with a match in country_sizes. |
| Historical Phones | 14:00 | Use quantile from the stats package within summarize to show the 25th, and 75th quantiles (interquartile range) on the plot. |
| Historical Phones | 17:50 | Use the WDI package (World Development Indicators from the World Bank) with extra = TRUE in order to get the iso3c code and income level for each country. |
| Historical Phones | 19:45 | Use inner_join from the dplyr package to join the WDI data with the phones data. |
| Historical Phones | 20:35 | Use fct_relevel from the forcats package to reorder income factor levels in ascending order. |
| Historical Phones | 21:05 | Create an anonymous function using . (dot). |
| Historical Phones | 29:30 | Use inner_join from the dplyr package to join the mobile data and landline data together with a geom_abline to see how different the total populations are between the two datasets. |
| Historical Phones | 31:00 | Use geom_hline to add a reference line to the plot showing when each country crossed the 50 per 100 subscription mark. |
| Historical Phones | 35:20 | Use summarize from the dplyr package with min(year[Mobile >= 50]) to find the year in which each country crossed the 50 per 100 subscription mark. |
| Historical Phones | 35:20 | Use summarize from the dplyr package with max(Mobile) to find the peak number of mobile subscriptions per country. |
| Historical Phones | 35:20 | Use na_if from the dplyr package within summarize to change Inf to NA. |
| Historical Phones | 38:20 | Use the WDIsearch function to search the WDI package for an appropriate GDP per capita indicator, settling on the NY.GDP.PCAP.PP.KD indicator. |
| Historical Phones | 39:05 | Add the GDP data from the WDI package to the country_incomes table. |
| Historical Phones | 39:52 | Use the inner_join function from the dplyr package to join the phones table with the country_incomes table, pulling in the gdp_per_capita variable. |
| Historical Phones | 42:25 | Use the WDIsearch function to search the WDI package for an appropriate population indicator, settling on the SP.POP.TOTL indicator. |
| Historical Phones | 50:00 | Create an animated choropleth world map with fill = subscriptions. |
| Historical Phones | 1:00:00 | Summary of screencast |
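The summarize pattern at 35:20 (first year crossing 50 subscriptions per 100 people, peak value, and the Inf-to-NA cleanup) can be sketched with toy data; the table and column names here are assumptions for illustration:

```r
library(dplyr)

# Invented subscriptions-per-100-people data
phones <- tibble(
  country = rep(c("A", "B"), each = 3),
  year = rep(2000:2002, times = 2),
  mobile = c(30, 55, 80, 10, 20, 40)
)

phones %>%
  group_by(country) %>%
  summarize(
    # min() over an empty vector returns Inf (with a warning)
    year_crossed_50 = min(year[mobile >= 50]),
    peak_mobile = max(mobile)
  ) %>%
  # Country B never crosses 50, so replace the resulting Inf with NA
  mutate(year_crossed_50 = na_if(year_crossed_50, Inf))
```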
| Screencast | Time | Description |
|---|---|---|
| Riddler: Simulating a Circular Random Walk | 1:25 | Using sample() and cumsum() to simulate a random walk |
| Riddler: Simulating a Circular Random Walk | 2:30 | Using %% (modulo operator) to "close" the circle (set the number of people in the circle) |
| Riddler: Simulating a Circular Random Walk | 3:40 | Using crossing function to set up "tidy" simulation (gives you all possible combinations of values you provide it) |
| Riddler: Simulating a Circular Random Walk | 5:10 | Using distinct function and its .keep_all argument to get only the first unique set of the variables you give it |
| Riddler: Simulating a Circular Random Walk | 8:15 | Visualizing the number of steps it takes for the sauce to reach people at different seats |
| Riddler: Simulating a Circular Random Walk | 13:40 | Visualizing the distribution of number of steps it takes to reach each seat |
| Riddler: Simulating a Circular Random Walk | 26:30 | Investigating the parabolic shape of average number of steps to reach a given seat |
| Riddler: Simulating a Circular Random Walk | 28:40 | Using lm and I functions to calculate formula of the parabola describing average number of steps |
| Riddler: Simulating a Circular Random Walk | 30:15 | Starting to vary the size of the table |
| Riddler: Simulating a Circular Random Walk | 38:45 | Summary of screencast |
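The core simulation moves above (sample + cumsum for the walk, %% to wrap around the circle, crossing + distinct for a "tidy" simulation) might look roughly like this; the trial and step counts are arbitrary choices for the sketch:

```r
library(dplyr)
library(tidyr)

set.seed(42)
n_seats <- 20

# One walk: the sauce moves left or right at random each step;
# %% wraps positions around the circle (seat 0 is the starting seat)
walk <- cumsum(sample(c(-1, 1), 500, replace = TRUE)) %% n_seats

# Tidy simulation: crossing() lays out every trial/step combination,
# then distinct(..., .keep_all = TRUE) keeps only the first step at
# which each seat was reached within each trial
first_reached <- crossing(trial = 1:1000, step = 1:500) %>%
  group_by(trial) %>%
  mutate(seat = cumsum(sample(c(-1, 1), n(), replace = TRUE)) %% n_seats) %>%
  distinct(seat, .keep_all = TRUE) %>%
  ungroup()

# Average number of steps to reach each seat (the parabolic shape)
first_reached %>%
  group_by(seat) %>%
  summarize(avg_steps = mean(step))
```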
| Screencast | Time | Description |
|---|---|---|
| Ninja Warrior | 2:35 | Inspecting the dataset |
| Ninja Warrior | 6:40 | Using geom_histogram to look at distribution of obstacles in a stage |
| Ninja Warrior | 9:05 | Using str_remove function to clean stage names (remove "(Regional/City)") |
| Ninja Warrior | 10:40 | Asking, "Are there obstacles that are more common in the Finals than Qualifying rounds?" |
| Ninja Warrior | 10:50 | Using bind_log_odds function from tidylo package to calculate log-odds of obstacles within a stage type |
| Ninja Warrior | 16:05 | Using unite function to combine two columns |
| Ninja Warrior | 18:20 | Graphing the average position of different obstacles with many, many tweaks to make it look nice |
| Ninja Warrior | 23:10 | Creating a stacked bar plot of which obstacles appear in which order |
| Ninja Warrior | 30:30 | Turning stacked bar plot visualization into a custom function |
| Ninja Warrior | 37:40 | Asking, "Is there data on how difficult an obstacle is?" |
| Ninja Warrior | 45:30 | Visualizing which obstacles appear in different seasons with geom_tile and a lot of tweaking |
| Ninja Warrior | 50:22 | Reviewing the result of the previous step (obstacles in different seasons) |
| Ninja Warrior | 59:25 | Summary of screencast |
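The tidylo step at 10:50 (log-odds of obstacles within a stage type) can be sketched as below; the obstacle counts are invented for illustration:

```r
library(dplyr)
library(tidylo)

# Invented obstacle counts per round stage, for illustration only
obstacle_counts <- tibble(
  round_stage = c("Qualifying", "Qualifying", "Finals", "Finals"),
  obstacle    = c("Quad Steps", "Warped Wall", "Warped Wall", "Salmon Ladder"),
  n           = c(40, 50, 30, 25)
)

# bind_log_odds adds a log_odds_weighted column: which obstacles are
# disproportionately common within each stage type?
obstacle_counts %>%
  bind_log_odds(round_stage, obstacle, n) %>%
  arrange(desc(log_odds_weighted))
```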