
## Screencast Summary

| Screencast | Date | Notable Topics | Annotated | Link | Data |
| --- | --- | --- | :-: | :-: | :-: |
| College Majors and Income | 2018-10-15 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| Horror Movie Profits | 2018-10-23 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| R Downloads | 2018-10-30 | Data manipulation (especially time series using lubridate package) | ✔️ | 🔗 | 📈 |
| US Wind Turbines | 2018-11-06 | Animated map using gganimate | ✔️ | 🔗 | 📈 |
| Malaria Incidence | 2018-11-12 | Map visualization, Animated map using gganimate package | ✔️ | 🔗 | 📈 |
| Thanksgiving Dinner | 2018-11-21 | Survey data, Network graphing | ✔️ | 🔗 | 📈 |
| Maryland Bridges | 2018-11-27 | Data manipulation, Map visualization | ✔️ | 🔗 | 📈 |
| Medium Articles | 2018-12-04 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| Riddler: Monte Carlo Simulation | 2018-12-04 | Simulation | ✔️ | 🔗 | 📈 |
| NYC Restaurant Inspections | 2018-12-11 | Multiple t-test models using broom package, Principal Component Analysis (PCA) | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Week of Rain | 2018-12-12 | Simulation | ✔️ | 🔗 | 📈 |
| Dolphins | 2018-12-18 | Survival analysis using survival package | ✔️ | 🔗 | 📈 |
| TidyTuesday Tweets | 2019-01-07 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| TV Golden Age | 2019-01-09 | Data manipulation, Logistic regression | ✔️ | 🔗 | 📈 |
| Space Launches | 2019-01-15 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| US Incarceration | 2019-01-25 | Animated map using gganimate package, Dealing with missing data | ✔️ | 🔗 | 📈 |
| US Dairy Consumption | 2019-01-29 | Time series analysis, Forecasting using sweep package | ✔️ | 🔗 | 📈 |
| US PhDs | 2019-02-22 | Tidying very un-tidy data | ✔️ | 🔗 | 📈 |
| French Train Delays | 2019-02-26 | Heat map | ✔️ | 🔗 | 📈 |
| Women in the Workplace | 2019-03-05 | Interactive scatterplot using plotly and shiny packages | ✔️ | 🔗 | 📈 |
| Board Game Reviews | 2019-03-15 | Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| Seattle Pet Names | 2019-03-16 | Hypergeometric hypothesis testing, Adjusting for multiple hypothesis testing | ✔️ | 🔗 | 📈 |
| Seattle Bike Counts | 2019-04-05 | Data manipulation (especially time series using lubridate package) | ✔️ | 🔗 | 📈 |
| Tennis Tournaments | 2019-04-09 | Data manipulation (especially using dplyr for groups within dataframes) | ✔️ | 🔗 | 📈 |
| Bird Collisions | 2019-05-03 | Bootstrapping | ✔️ | 🔗 | 📈 |
| Student Teacher Ratios | 2019-05-10 | WDI package (World Development Indicators) | ✔️ | 🔗 | 📈 |
| Nobel Prize Winners | 2019-05-24 | Data manipulation, Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| Plastic Waste | 2019-05-27 | Choropleth map | ✔️ | 🔗 | 📈 |
| Wine Ratings | 2019-05-31 | Text mining using tidytext package, Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| Ramen Reviews | 2019-06-04 | Web scraping using rvest package | ✔️ | 🔗 | 📈 |
| Media Franchise Revenue | 2019-06-22 | Data manipulation (especially re-ordering factors) | ✔️ | 🔗 | 📈 |
| Women's World Cup | 2019-07-22 | Data manipulation and exploratory graphing | ✔️ | 🔗 | 📈 |
| Bob Ross Paintings | 2019-08-12 | Network graphs, Principal Component Analysis (PCA) | ✔️ | 🔗 | 📈 |
| Simpsons Guest Stars | 2019-08-30 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| Pizza Ratings | 2019-10-01 | Statistical testing with t.test | ✔️ | 🔗 | 📈 |
| Car Fuel Efficiency | 2019-10-15 | Natural splines for regression | ✔️ | 🔗 | 📈 |
| Horror Movies | 2019-10-22 | ANOVA, Text mining using tidytext package, Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| NYC Squirrel Census | 2019-11-01 | Map visualization using ggmap package | ✔️ | 🔗 | 📈 |
| CRAN Package Code | 2019-12-30 | Graphing for EDA (Exploratory Data Analysis) | ✔️ | 🔗 | 📈 |
| Riddler: Spelling Bee Honeycomb | 2020-01-06 | Simulation with matrices | ✔️ | 🔗 | 📈 |
| The Office | 2020-03-16 | Text mining using tidytext package, Lasso regression using glmnet package | ✔️ | 🔗 | 📈 |
| COVID-19 Open Research Dataset (CORD-19) | 2020-03-18 | JSON formatted data | ✔️ | 🔗 | 📈 |
| CORD-19 Data Package | 2020-03-19 | R package development and documentation-writing | ✔️ | 🔗 | 📈 |
| R trick: Creating Pascal's Triangle with accumulate() | 2020-03-29 | accumulate() for recursive formulas | ✔️ | 🔗 | 📈 |
| Riddler: Simulating Replacing Die Sides | 2020-03-30 | accumulate() for simulation | ✔️ | 🔗 | 📈 |
| Beer Production | 2020-04-01 | tidymetrics package demonstrated, Animated map (gganimate package) | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Non-increasing Sequence | 2020-04-06 | Simulation | ✔️ | 🔗 | 📈 |
| Tour de France | 2020-04-07 | Survival analysis, Animated bar graph (gganimate package) | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Branching Process | 2020-04-13 | Simulation, Exponential and Geometric distributions | ✔️ | 🔗 | 📈 |
| GDPR Violations | 2020-04-21 | Data manipulation, Interactive dashboard with shinymetrics and tidymetrics | ✔️ | 🔗 | 📈 |
| Broadway Musicals | 2020-04-28 | Creating an interactive dashboard with shinymetrics and tidymetrics, moving windows, period aggregation | ✔️ | 🔗 | 📈 |
| Riddler: Simulating and Optimizing Coin Flipping | 2020-05-03 | Simulation | ✔️ | 🔗 | 📈 |
| Animal Crossing | 2020-05-05 | Text mining using tidytext package | ✔️ | 🔗 | 📈 |
| Volcano Eruptions | 2020-05-12 | Static map with ggplot2, Interactive map with leaflet, Animated map with gganimate | ✔️ | 🔗 | 📈 |
| Beach Volleyball | 2020-05-19 | Data cleaning, Logistic regression | ✔️ | 🔗 | 📈 |
| Cocktails | 2020-05-26 | Pairwise correlation, Network diagram, Principal component analysis (PCA) | ✔️ | 🔗 | 📈 |
| African-American Achievements | 2020-06-09 | plotly interactive timeline, Wikipedia web scraping | ✔️ | 🔗 | 📈 |
| African-American History | 2020-06-16 | Network diagram, Wordcloud | ✔️ | 🔗 | 📈 |
| Caribou Locations | 2020-06-23 | Maps with ggplot2, Calculating distance and speed with geosphere | ✔️ | 🔗 | 📈 |
| X-Men Comics | 2020-06-30 | Data manipulation, Lollipop graph, floor function | ✔️ | 🔗 | 📈 |
| Coffee Ratings | 2020-07-07 | Ridgeline plot, Pairwise correlation, Network plot, Singular value decomposition (SVD), Linear model | ✔️ | 🔗 | 📈 |
| Australian Animal Outcomes | 2020-07-21 | Data manipulation, Web scraping (rvest package) and SelectorGadget, Animated choropleth map | ✔️ | 🔗 | 📈 |
| Palmer Penguins | 2020-07-08 | Modeling (logistic regression, k-nearest neighbors, decision tree, multiclass logistic regression) with cross-validated accuracy | ✔️ | 🔗 | 📈 |
| European Energy | 2020-08-04 | Data manipulation, Country flags, Slope graph, Function creation | ✔️ | 🔗 | 📈 |
| Plants in Danger | 2020-08-18 | Data manipulation, Web scraping using rvest package | ✔️ | 🔗 | 📈 |
| Chopped | 2020-08-25 | Data manipulation, Modelling (Linear Regression, Random Forest, and Natural Splines) | ✔️ | 🔗 | 📈 |
| Global Crop Yields | 2020-09-01 | Interactive Shiny dashboard | ✔️ | 🔗 | 📈 |
| Friends | 2020-09-08 | Data Manipulation, Linear Modeling, Pairwise Correlation, Text Mining | ✔️ | 🔗 | 📈 |
| Government Spending on Kids | 2020-09-15 | Data Manipulation, Functions, Embracing, Reading in Many .csv Files, Pairwise Correlation | ✔️ | 🔗 | 📈 |
| Himalayan Climbers | 2020-09-22 | Data Manipulation, Empirical Bayes, Logistic Regression Model | ✔️ | 🔗 | 📈 |
| Beyoncé and Taylor Swift Lyrics | 2020-09-29 | Text analysis, tf-idf, Log odds ratio, Diverging bar graph, Lollipop graph | ✔️ | 🔗 | 📈 |
| NCAA Women's Basketball | 2020-10-06 | Heatmap, Correlation analysis | ✔️ | 🔗 | 📈 |
| Great American Beer Festival | 2020-10-20 | Log odds ratio, Logistic regression, TIE Fighter plot | ✔️ | 🔗 | 📈 |
| IKEA Furniture | 2020-11-03 | Linear model, Coefficient/TIE fighter plot, Boxplots, Log scale discussion, Calculating volume | ✔️ | 🔗 | 📈 |
| Historical Phones | 2020-11-10 | Joining tables, Animated world choropleth, Adding IQR to geom_line, World development indicators package | ✔️ | 🔗 | 📈 |
| Riddler: Simulating a Circular Random Walk | 2020-11-23 | Simulation | ✔️ | 🔗 | 📈 |
| Ninja Warrior | 2020-12-15 | Log-odds with tidylo package, Graphing with ggplot2 | ✔️ | 🔗 | 📈 |

## Individual Screencasts

### College Majors and Income

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| College Majors and Income | 1:45 | Using read_csv function to import data directly from GitHub to R (without cloning the repository) |
| College Majors and Income | 7:20 | Creating a histogram (geom_histogram), then a boxplot (geom_boxplot), to explore the distribution of salaries |
| College Majors and Income | 8:55 | Using fct_reorder function to sort boxplot of college majors by salary (see the sketch after this table) |
| College Majors and Income | 9:35 | Using dollar_format function from scales package to convert scientific notation to dollar format (e.g., "4e+04" becomes "$40,000") |
| College Majors and Income | 14:10 | Creating a dotplot (geom_point) of 20 top-earning majors (includes adjusting axis, using the colour aesthetic, and adding error bars) |
| College Majors and Income | 17:45 | Using str_to_title function to convert string from ALL CAPS to Title Case |
| College Majors and Income | 20:45 | Creating a Bland-Altman graph to explore relationship between sample size and median salary |
| College Majors and Income | 21:45 | Using geom_text_repel function from ggrepel package to get text labels on scatter plot points |
| College Majors and Income | 28:30 | Using count function's wt argument to specify what should be counted (default is number of rows) |
| College Majors and Income | 30:00 | Spicing up a dull bar graph by adding a redundant colour aesthetic (trick from Julia Silge) |
| College Majors and Income | 36:20 | Starting to explore relationship between gender and salary |
| College Majors and Income | 37:10 | Creating a stacked bar graph (geom_col) of gender breakdown within majors |
| College Majors and Income | 40:15 | Using summarise_at to aggregate men and women from majors into categories of majors |
| College Majors and Income | 45:30 | Graphing scatterplot (geom_point) of share of women and median salary |
| College Majors and Income | 47:10 | Using geom_smooth function to add a line of best fit to scatterplot above |
| College Majors and Income | 48:40 | Explanation of why not to aggregate first when performing a statistical test (including explanation of Simpson's Paradox) |
| College Majors and Income | 49:55 | Fixing geom_smooth so that we get one overall line while still being able to map to the colour aesthetic |
| College Majors and Income | 51:10 | Predicting median salary from share of women with weighted linear regression (to take sample sizes into account) |
| College Majors and Income | 56:05 | Using nest function and tidy function from the broom package to apply a linear model to many categories at once |
| College Majors and Income | 58:05 | Using p.adjust function to adjust p-values to correct for multiple testing (using FDR, False Discovery Rate) |
| College Majors and Income | 1:04:50 | Showing how to add an appendix to an R Markdown file with code that doesn't run when compiled |
| College Majors and Income | 1:09:00 | Using fct_lump function to aggregate major categories into the top four and an "Other" category |
| College Majors and Income | 1:10:05 | Adding sample size to the size aesthetic within the aes function |
| College Majors and Income | 1:10:50 | Using ggplotly function from plotly package to create an interactive scatterplot (tooltips appear when moused over) |
| College Majors and Income | 1:15:55 | Exploring IQR (Inter-Quartile Range) of salaries by major |
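
A minimal sketch of the fct_reorder and dollar_format steps above (8:55 and 9:35), assuming a `majors` data frame with hypothetical `major_category` and `median_salary` columns:

```r
library(tidyverse)
library(scales)

# Reorder the factor by salary so the boxplot sorts sensibly, then format
# the axis as dollars instead of scientific notation.
majors %>%
  mutate(major_category = fct_reorder(major_category, median_salary)) %>%
  ggplot(aes(major_category, median_salary)) +
  geom_boxplot() +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip()
```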

### Horror Movie Profits

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Horror Movie Profits | 2:50 | Using parse_date function from readr package to convert a date formatted as character to date class (should have used lubridate's mdy function, though) |
| Horror Movie Profits | 7:45 | Using fct_lump function to aggregate distributors into the top 6 (by number of movies) and an "Other" category |
| Horror Movie Profits | 8:50 | Investigating strange numbers in the data and discovering duplication |
| Horror Movie Profits | 12:40 | Using problems function to look at parsing errors when importing data |
| Horror Movie Profits | 14:35 | Using arrange and distinct functions with the .keep_all argument to de-duplicate observations (see the sketch after this table) |
| Horror Movie Profits | 16:10 | Using geom_boxplot function to create a boxplot of budget by distributor |
| Horror Movie Profits | 19:20 | Using floor function to bin release years into decades (e.g., "1970" and "1973" both become "1970") |
| Horror Movie Profits | 21:30 | Using summarise_at function to apply the same function to multiple variables at the same time |
| Horror Movie Profits | 24:10 | Using geom_line to visualize multiple metrics at the same time |
| Horror Movie Profits | 26:00 | Using facet_wrap function to graph small multiples of genre-budget boxplots by distributor |
| Horror Movie Profits | 28:35 | Starting analysis of profit ratio of movies |
| Horror Movie Profits | 32:50 | Using paste0 function in a custom function to show labels of multiples (e.g., "4X" or "6X" to mean "4 times" or "6 times") |
| Horror Movie Profits | 41:20 | Starting analysis of the most common genres over time |
| Horror Movie Profits | 45:55 | Starting analysis of the most profitable individual horror movies |
| Horror Movie Profits | 51:45 | Using paste0 function to add release date of movie to labels in a bar graph |
| Horror Movie Profits | 53:25 | Using geom_text function, along with its check_overlap argument, to add labels to some points on a scatterplot |
| Horror Movie Profits | 58:10 | Using ggplotly function from plotly package to create an interactive scatterplot |
| Horror Movie Profits | 1:00:55 | Reviewing unexplored areas of investigation |
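
A sketch of the de-duplication and decade-binning steps (14:35 and 19:20); the data frame and column names are assumptions, not Dave's exact code:

```r
library(tidyverse)

movies_clean <- movies %>%
  arrange(desc(domestic_gross)) %>%
  # keep one row per movie/release date, retaining all other columns
  distinct(movie, release_date, .keep_all = TRUE) %>%
  # floor() bins years into decades, e.g. 1973 becomes 1970
  mutate(decade = 10 * floor(year / 10))
```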

### R Downloads

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| R Downloads | 5:20 | Using geom_line function to visualize changes over time |
| R Downloads | 7:35 | Starting to decompose time series data into day-of-week trend and overall trend (lots of lubridate package functions) |
| R Downloads | 9:50 | Using floor_date function from lubridate package to round dates down to the week level (see the sketch after this table) |
| R Downloads | 10:05 | Using min function to drop the incomplete/partial week at the start of the dataset |
| R Downloads | 12:20 | Using countrycode function from countrycode package to replace two-letter country codes with full names (e.g., "CA" becomes "Canada") |
| R Downloads | 17:20 | Using fct_lump function to get top N categories within a categorical variable and classify the rest as "Other" |
| R Downloads | 20:30 | Using hour function from lubridate package to pull out integer hour value from a datetime variable |
| R Downloads | 22:20 | Using facet_wrap function to graph small multiples of downloads by country, then changing its scales argument to allow different scales on y-axis |
| R Downloads | 31:00 | Starting analysis of downloads by IP address |
| R Downloads | 35:20 | Using as.POSIXlt to combine separate date and time variables to get a single datetime variable |
| R Downloads | 36:35 | Using lag function to calculate time between downloads (time between events) per IP address (comparable to SQL window function) |
| R Downloads | 38:05 | Using as.numeric function to convert variable from a time interval object to a numeric variable (number in seconds) |
| R Downloads | 38:40 | Explanation of a bimodal log-normal distribution |
| R Downloads | 39:05 | Handy trick for setting easy-to-interpret intervals for time data with the scale_x_log10 function's breaks argument |
| R Downloads | 47:40 | Starting to explore package downloads |
| R Downloads | 52:15 | Adding 1 to the numerator and denominator when calculating a ratio to get around dividing by zero |
| R Downloads | 57:55 | Showing how to look at package download data over time using cran_downloads function from the cranlogs package |
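
A sketch of the floor_date and countrycode steps (9:50 and 12:20), assuming a raw log with hypothetical `date` and two-letter `country` columns:

```r
library(tidyverse)
library(lubridate)
library(countrycode)

downloads <- raw_downloads %>%
  mutate(week = floor_date(date, "week"),               # round down to week
         country_name = countrycode(country, "iso2c",   # "CA" -> "Canada"
                                    "country.name"))
```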

### US Wind Turbines

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| US Wind Turbines | 3:50 | Using count function to explore categorical variables |
| US Wind Turbines | 5:00 | Creating a quick-and-dirty map using geom_point function and latitude and longitude data |
| US Wind Turbines | 6:10 | Explaining need for mapproj package when plotting maps in ggplot2 |
| US Wind Turbines | 7:35 | Using borders function to add US state borders to map |
| US Wind Turbines | 10:45 | Using fct_lump function to get the top 6 project categories and put the rest in a lumped "Other" category |
| US Wind Turbines | 11:30 | Changing data so that certain categories' points appear in front of other categories' points on the map |
| US Wind Turbines | 14:15 | Taking the centroid (average longitude and latitude) of points across a geographic area as a way to aggregate categories to one point |
| US Wind Turbines | 19:40 | Using ifelse function to clean missing data that is coded as "-9999" |
| US Wind Turbines | 26:00 | Asking, "How has turbine capacity changed over time?" |
| US Wind Turbines | 33:15 | Exploring different models of wind turbines |
| US Wind Turbines | 38:00 | Using mutate_if function to find NA values (coded as -9999) in multiple columns and replace them with an actual NA (see the sketch after this table) |
| US Wind Turbines | 45:40 | Reviewing documentation for gganimate package |
| US Wind Turbines | 47:00 | Attempting to set up gganimate map |
| US Wind Turbines | 48:55 | Understanding gganimate package using a "Hello World" / toy example, then trying to debug turbine animation |
| US Wind Turbines | 56:45 | Using is.infinite function to get rid of troublesome Inf values |
| US Wind Turbines | 57:55 | Quick hack for getting cumulative data from a table using crossing function (though it does end up with some duplication) |
| US Wind Turbines | 1:01:45 | Diagnosis of gganimate issue (points between integer years are being interpolated) |
| US Wind Turbines | 1:04:35 | Pseudo-successful gganimate map (cumulative points show up, but some points are missing) |
| US Wind Turbines | 1:05:40 | Summary of screencast |
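
A sketch combining the quick map (5:00-7:35) and the -9999 cleanup (38:00); coord_map() needs the mapproj package installed, and the column names are assumptions:

```r
library(tidyverse)

turbines %>%
  # recode the -9999 sentinel to a real NA in every numeric column
  mutate_if(is.numeric, ~ ifelse(. == -9999, NA, .)) %>%
  ggplot(aes(longitude, latitude)) +
  borders("state") +                    # US state outlines behind the points
  geom_point(size = 0.1, alpha = 0.5) +
  coord_map() +
  theme_void()
```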

### Malaria Incidence

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Malaria Incidence | 2:45 | Importing data using the malariaAtlas package |
| Malaria Incidence | 14:10 | Using geom_line function to visualize malaria prevalence over time |
| Malaria Incidence | 15:10 | Quick map visualization using longitude and latitude coordinates and the geom_point function |
| Malaria Incidence | 18:40 | Using borders function to add Kenyan country borders to map |
| Malaria Incidence | 19:50 | Using scale_colour_gradient2 function to change the colour scale of points on the map |
| Malaria Incidence | 20:40 | Using arrange function to ensure that certain points on a map appear in front of/behind other points |
| Malaria Incidence | 21:50 | Aggregating data into decades using the truncated division operator %/% |
| Malaria Incidence | 24:45 | Starting to look at aggregated malaria data (instead of country-specific data) |
| Malaria Incidence | 26:50 | Using sample and unique functions to randomly select a few countries, which are then graphed |
| Malaria Incidence | 28:30 | Using last function to select the most recent observation from a set of arranged data |
| Malaria Incidence | 32:55 | Creating a Bland-Altman plot to explore relationship between current incidence and change in incidence in past 15 years |
| Malaria Incidence | 35:45 | Using anti_join function to find which countries are not in the malaria dataset |
| Malaria Incidence | 36:40 | Using the iso3166 dataset in the maps package to match three-letter country codes (i.e., ISO 3166 codes) with country names |
| Malaria Incidence | 38:30 | Creating a world map using geom_polygon function (and eventually theme_void and coord_map functions) |
| Malaria Incidence | 39:00 | Getting rid of Antarctica from world map |
| Malaria Incidence | 42:35 | Using facet_wrap function to create small multiples of world map for different time periods |
| Malaria Incidence | 47:30 | Starting to create an animated map of malaria deaths (actual code writing starts at 57:45) |
| Malaria Incidence | 51:25 | Starting with a single year after working through some bugs |
| Malaria Incidence | 52:10 | Using regex_inner_join function from the fuzzyjoin package to join map datasets because one of them has values in regular expressions |
| Malaria Incidence | 55:15 | As an alternative to the fuzzyjoin package in the step above, using str_remove function to get rid of the unwanted regex |
| Malaria Incidence | 57:45 | Starting to turn static map into an animation using gganimate package (see the sketch after this table) |
| Malaria Incidence | 1:02:00 | The actual animated map |
| Malaria Incidence | 1:02:35 | Using countrycode package to filter down to countries in a specific continent (Africa, in this case) |
| Malaria Incidence | 1:03:55 | Summary of screencast |
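
A sketch of the animated-map step (57:45 onward); it assumes a hypothetical `malaria_map_data` frame that already joins world-map polygons to incidence by year, and the colour-scale midpoint is a placeholder:

```r
library(tidyverse)
library(gganimate)

malaria_map_data %>%
  ggplot(aes(long, lat, group = group, fill = incidence)) +
  geom_polygon() +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 100) +
  coord_map() +
  theme_void() +
  transition_manual(year)   # one animation frame per year
```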

### Thanksgiving Dinner

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Thanksgiving Dinner | 4:10 | Exploratory bar chart of age distribution (and gender) of survey respondents |
| Thanksgiving Dinner | 7:40 | Using count function on multiple columns to get detailed counts |
| Thanksgiving Dinner | 11:25 | Parsing numbers from text using parse_number function, then using those numbers to re-level an ordinal factor (income bands) |
| Thanksgiving Dinner | 13:05 | Exploring relationship between income and using homemade (vs. canned) cranberry sauce |
| Thanksgiving Dinner | 14:00 | Adding group = 1 argument to the aes function to properly display a line chart |
| Thanksgiving Dinner | 14:30 | Rotating text for axis labels that overlap |
| Thanksgiving Dinner | 16:50 | Getting confidence intervals for proportions using Jeffreys interval (using beta distribution with an uninformative prior) |
| Thanksgiving Dinner | 17:55 | Explanation of Clopper-Pearson approach as alternative to Jeffreys interval |
| Thanksgiving Dinner | 18:30 | Using geom_ribbon function to add a shaded region to the line chart that shows confidence intervals |
| Thanksgiving Dinner | 21:55 | Using starts_with function to select fields with names that start with a certain string (e.g., using "pie" selects "pie1" and "pie2") |
| Thanksgiving Dinner | 22:55 | Using gather function to get wide-format data to tidy (tall) format |
| Thanksgiving Dinner | 23:45 | Using str_remove and regex to remove digits from field values (e.g., "dessert1" and "dessert2" get turned into "dessert") |
| Thanksgiving Dinner | 27:00 | "What are people eating?" Graphing pies, sides, and desserts |
| Thanksgiving Dinner | 28:00 | Using fct_reorder function to reorder foods based on how popular they are |
| Thanksgiving Dinner | 28:45 | Using n_distinct function to count the number of unique respondents |
| Thanksgiving Dinner | 30:25 | Using facet_wrap function to facet food types into their own graphs |
| Thanksgiving Dinner | 32:50 | Using parse_number function to convert age ranges as character strings into a numeric field |
| Thanksgiving Dinner | 35:35 | Exploring relationship between US region and food types |
| Thanksgiving Dinner | 36:15 | Using group_by, then mutate, then count to calculate a complicated summary |
| Thanksgiving Dinner | 40:35 | Exploring relationship between praying at Thanksgiving (yes/no) and food types |
| Thanksgiving Dinner | 42:30 | Empirical Bayes binomial estimation for calculating binomial confidence intervals (see Dave's book on Empirical Bayes) |
| Thanksgiving Dinner | 45:30 | Asking, "What sides/desserts/pies are eaten together?" |
| Thanksgiving Dinner | 46:20 | Calculating pairwise correlation of food types (see the sketch after this table) |
| Thanksgiving Dinner | 49:05 | Network graph of pairwise correlation |
| Thanksgiving Dinner | 51:40 | Adding text labels to nodes using geom_node_text function |
| Thanksgiving Dinner | 53:00 | Getting rid of unnecessary graph elements (e.g., axes, gridlines) with theme_void function |
| Thanksgiving Dinner | 53:25 | Explanation of network graph relationships |
| Thanksgiving Dinner | 55:05 | Adding dimension to network graph (node colour) to represent the type of food |
| Thanksgiving Dinner | 57:45 | Fixing overlapping text labels using the geom_node_text function's repel argument |
| Thanksgiving Dinner | 58:55 | Tweaking display of percentage legend to be in more readable format (e.g., "40%" instead of "0.4") |
| Thanksgiving Dinner | 1:00:05 | Summary of screencast |
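
A sketch of the pairwise-correlation network (46:20-51:40), assuming a tidy `food_long` frame with one food per respondent per row (both names hypothetical):

```r
library(tidyverse)
library(widyr)
library(igraph)
library(ggraph)

# correlation of each pair of foods across respondents
food_cors <- food_long %>%
  pairwise_cor(food, respondent_id, sort = TRUE)

food_cors %>%
  head(75) %>%                 # keep only the strongest pairs
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```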

### Maryland Bridges

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Maryland Bridges | 9:15 | Using geom_line to create an exploratory line graph |
| Maryland Bridges | 10:10 | Using %/% operator (truncated division) to bin years into decades (e.g., 1980, 1984, and 1987 would all become "1980") |
| Maryland Bridges | 12:30 | Converting two-digit year to four-digit year (e.g., "16" becomes "2016") by adding 2000 to each one |
| Maryland Bridges | 15:40 | Using percent_format function from scales package to get nice-looking axis labels |
| Maryland Bridges | 19:55 | Using geom_col to create a nicely ordered bar/column graph |
| Maryland Bridges | 21:35 | Using replace_na to replace NA values with "Other" |
| Maryland Bridges | 27:15 | Starting exploration of average daily traffic |
| Maryland Bridges | 29:05 | Using comma_format function from scales package to get more readable axis labels (e.g., "1e+05" becomes "100,000") |
| Maryland Bridges | 31:15 | Using cut function to bin continuous variable into customized breaks (also does a mutate within a group_by!) |
| Maryland Bridges | 34:30 | Starting to make a map |
| Maryland Bridges | 37:00 | Encoding a continuous variable to colour, then using scale_colour_gradient2 function to specify colours and midpoint |
| Maryland Bridges | 38:20 | Specifying the trans argument (transformation) of the scale_colour_gradient2 function to get a logarithmic scale |
| Maryland Bridges | 45:55 | Using str_to_title function to get values to Title Case (first letter of each word capitalized) |
| Maryland Bridges | 48:35 | Predicting whether bridges are in "Good" condition using logistic regression (remember to specify the family argument! Dave fixes this at 52:54; see the sketch after this table) |
| Maryland Bridges | 50:30 | Explanation of why we should NOT be using an OLS linear regression |
| Maryland Bridges | 51:10 | Using the augment function from the broom package to illustrate why a linear model is not a good fit |
| Maryland Bridges | 52:05 | Specifying the type.predict argument in the augment function so that we get the actual predicted probability |
| Maryland Bridges | 54:40 | Explanation of why the sigmoidal shape of logistic regression can be a drawback |
| Maryland Bridges | 55:05 | Using a cubic spline model (a type of GAM, Generalized Additive Model) as an alternative to logistic regression |
| Maryland Bridges | 56:00 | Explanation of the shape that a cubic spline model can take (which logistic regression cannot) |
| Maryland Bridges | 1:02:15 | Visualizing the model in a different way, using a coefficient plot |
| Maryland Bridges | 1:04:35 | Using geom_vline function to add a red reference line to a graph |
| Maryland Bridges | 1:04:50 | Adding confidence intervals to the coefficient plot by specifying conf.int argument of tidy function and graphing using the geom_errorbarh function |
| Maryland Bridges | 1:05:35 | Brief explanation of log-odds coefficients |
| Maryland Bridges | 1:09:10 | Summary of screencast |
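
A sketch of the logistic-regression steps (48:35-52:05); the predictors are assumptions, but the two details the table flags are real: glm() needs `family = "binomial"`, and augment() needs `type.predict = "response"` to return probabilities:

```r
library(broom)

# logistic regression; hypothetical outcome and predictor columns
model <- glm(good_condition ~ responsibility + yr_built,
             data = bridges, family = "binomial")   # don't forget family!

augment(model, type.predict = "response")   # .fitted is now a probability
```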

### Medium Articles

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Medium Articles | 5:40 | Using summarise_at and starts_with functions to quickly sum up all variables starting with "tag_" |
| Medium Articles | 6:55 | Using gather function (now pivot_longer) to convert topic tag variables from wide to tall (tidy) format |
| Medium Articles | 8:10 | Explanation of how gathering step above will let us find the most/least common tags |
| Medium Articles | 9:00 | Explanation of using median (instead of mean) as measure of central tendency for number of claps an article got |
| Medium Articles | 9:50 | Visualizing log-normal (ish) distribution of number of claps an article gets |
| Medium Articles | 12:05 | Using pmin function to bin reading times of 10 minutes or more to cap out at 10 minutes |
| Medium Articles | 12:35 | Changing scale_x_continuous function's breaks argument to get custom labels and tick marks on a histogram |
| Medium Articles | 14:35 | Discussion of using mean vs. median as measure of central tendency for reading time (he decides on mean) |
| Medium Articles | 16:00 | Starting text mining analysis |
| Medium Articles | 16:40 | Using unnest_tokens function from tidytext package to split character string into individual words |
| Medium Articles | 17:50 | Explanation of stop words and using anti_join function to get rid of them |
| Medium Articles | 20:20 | Using str_detect function to filter out "words" that are just numbers (e.g., "2", "35") |
| Medium Articles | 22:35 | Quick analysis of which individual words are associated with more/fewer claps ("What are the hype words?") |
| Medium Articles | 25:15 | Using geometric mean as alternative to median to get more distinction between words (note 27:33 where he makes a quick fix) |
| Medium Articles | 28:10 | Starting analysis of clusters of related words (e.g., "neural" is linked to "network") |
| Medium Articles | 30:30 | Finding correlations between pairs of words using pairwise_cor function from widyr package |
| Medium Articles | 34:00 | Using ggraph and igraph packages to make network plot of correlated pairs of words |
| Medium Articles | 35:00 | Using geom_node_text to add labels for points (vertices) in the network plot |
| Medium Articles | 38:40 | Filtering original data to only include words that appear in the network plot (the 150 most-correlated word pairs) |
| Medium Articles | 40:10 | Adding colour as a dimension to the network plot, representing geometric mean of claps |
| Medium Articles | 40:50 | Changing default colour scale to one with blue = low and red = high with scale_colour_gradient2 function |
| Medium Articles | 43:15 | Adding dark outlines to points on network plot with a hack |
| Medium Articles | 44:45 | Starting to predict number of claps based on title tag (Lasso regression) |
| Medium Articles | 45:50 | Explanation of data format needed to conduct Lasso regression (and using cast_sparse function to get sparse matrix) |
| Medium Articles | 47:45 | Bringing in number of claps to the sparse matrix (un-tidy methods) |
| Medium Articles | 49:00 | Using cv.glmnet function (cv = cross-validated) from glmnet package to run Lasso regression (see the sketch after this table) |
| Medium Articles | 49:55 | Finding and fixing mistake in defining Lasso model |
| Medium Articles | 51:05 | Explanation of Lasso model |
| Medium Articles | 52:35 | Using tidy function from the broom package to tidy up the Lasso model |
| Medium Articles | 54:35 | Visualizing how specific words affect the prediction of claps as lambda (Lasso's penalty parameter) changes |
| Medium Articles | 1:00:20 | Summary of screencast |
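
A sketch of the Lasso setup (45:50-49:00), assuming hypothetical tidy word counts in `article_words` (post_id, word, n) and a per-article outcome `log_claps`:

```r
library(tidyverse)
library(tidytext)
library(glmnet)

# one row per article, one column per word, cells holding counts
word_matrix <- article_words %>%
  cast_sparse(post_id, word, n)

# align the outcome vector with the matrix's row order
y <- articles$log_claps[match(rownames(word_matrix), articles$post_id)]

lasso_model <- cv.glmnet(word_matrix, y)   # cross-validated Lasso
```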

### Riddler: Monte Carlo Simulation

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Riddler: Monte Carlo Simulation | 3:10 | Using crossing function to set up structure of simulation (1,000 trials, each with 12 chess games; see the sketch after this table) |
| Riddler: Monte Carlo Simulation | 4:00 | Adding result to the tidy simulation dataset |
| Riddler: Monte Carlo Simulation | 6:45 | Using sample function to simulate win/loss/draw for each game (good explanation of individual arguments within sample) |
| Riddler: Monte Carlo Simulation | 7:05 | Using group_by and summarise to get total points for each trial |
| Riddler: Monte Carlo Simulation | 8:10 | Adding red vertical reference line to histogram to know when a player wins a matchup |
| Riddler: Monte Carlo Simulation | 10:00 | Answering second piece of riddle (how many games would need to be played for better player to win 90% or 99% of the time?) |
| Riddler: Monte Carlo Simulation | 10:50 | Using unnest and seq_len functions to create groups of number of games (20, 40, …, 100), each with one game per row |
| Riddler: Monte Carlo Simulation | 12:15 | Creating a win field based on the simulated data, then summarising win percentage for each group of number of games (20, 40, …, 100) |
| Riddler: Monte Carlo Simulation | 13:55 | Using seq function to create groups of number of games programmatically |
| Riddler: Monte Carlo Simulation | 15:05 | Explanation of using logarithmic scale for this riddle |
| Riddler: Monte Carlo Simulation | 15:45 | Changing spacing of number of games from even spacing (20, 40, …, 100) to exponential (doubles every time, 12, 24, 48, …, 1536) |
| Riddler: Monte Carlo Simulation | 18:00 | Changing spacing of number of games to be finer |
| Riddler: Monte Carlo Simulation | 19:00 | Introduction of interpolation as the last step we will do |
| Riddler: Monte Carlo Simulation | 19:30 | Introducing approx function as method to linearly interpolate data |
| Riddler: Monte Carlo Simulation | 22:35 | Break point for the next riddle |
| Riddler: Monte Carlo Simulation | 24:30 | Starting recursive approach to this riddle |
| Riddler: Monte Carlo Simulation | 25:35 | Setting up a N x N matrix (N = 4 to start) |
| Riddler: Monte Carlo Simulation | 25:55 | Explanation of approach (random ball goes into random cup, represented by matrix) |
| Riddler: Monte Carlo Simulation | 26:25 | Using sample function to pick a random element of the matrix |
| Riddler: Monte Carlo Simulation | 27:15 | Using for loop to iterate random selection 100 times |
| Riddler: Monte Carlo Simulation | 28:25 | Converting for loop to while loop, using colSums to keep track of number of balls in cups |
| Riddler: Monte Carlo Simulation | 30:05 | Starting to code the pruning phase |
| Riddler: Monte Carlo Simulation | 30:15 | Using diag function to pick matching matrix elements (e.g., the element in the 4th row and 4th column) |
| Riddler: Monte Carlo Simulation | 31:50 | Turning code up to this point into a custom simulate_round function |
| Riddler: Monte Carlo Simulation | 32:25 | Using custom simulate_round function to simulate 100 rounds |
| Riddler: Monte Carlo Simulation | 33:30 | Using all function to perform logic check on whether all cups in a round are not empty |
| Riddler: Monte Carlo Simulation | 34:05 | Converting loop approach to tidy approach |
| Riddler: Monte Carlo Simulation | 35:10 | Using rerun and map_lgl functions from purrr to simulate a round for each row in a dataframe |
| Riddler: Monte Carlo Simulation | 36:20 | Explanation of the tidy approach |
| Riddler: Monte Carlo Simulation | 37:05 | Using cumsum and lag functions to keep track of the number of rounds until you win a "game" |
| Riddler: Monte Carlo Simulation | 39:45 | Creating histogram of number of rounds until winning a game |
| Riddler: Monte Carlo Simulation | 40:10 | Setting boundary argument of geom_histogram function to include count of zeros |
| Riddler: Monte Carlo Simulation | 40:30 | Brief explanation of geometric distribution |
| Riddler: Monte Carlo Simulation | 41:25 | Extending custom simulate_round function to include number of balls thrown to win (in addition to whether we won a round) |
| Riddler: Monte Carlo Simulation | 46:10 | Extending to two values of N (N = 3 or N = 4) |
| Riddler: Monte Carlo Simulation | 49:50 | Reviewing results of N = 3 and N = 4 |
| Riddler: Monte Carlo Simulation | 52:20 | Extending to N = 5 |
| Riddler: Monte Carlo Simulation | 53:55 | Checking results of chess riddle with Riddler solution |
| Riddler: Monte Carlo Simulation | 55:10 | Checking results of ball-cup riddle with Riddler solution (Dave slightly misinterpreted what the riddle was asking) |
| Riddler: Monte Carlo Simulation | 56:35 | Changing simulation code to correct the misinterpretation |
| Riddler: Monte Carlo Simulation | 1:01:40 | Reviewing results of corrected simulation |
| Riddler: Monte Carlo Simulation | 1:03:30 | Checking results of ball-cup riddle with corrected simulation against Riddler solutions |
| Riddler: Monte Carlo Simulation | 1:06:00 | Visualizing number of balls thrown and rounds played |
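
A sketch of the chess-simulation scaffolding (3:10-8:10); the win/draw/loss probabilities below are placeholders, not the riddle's actual values:

```r
library(tidyverse)

sim <- crossing(trial = 1:1000, game = 1:12) %>%             # 1,000 x 12 rows
  mutate(result = sample(c(1, 0.5, 0), n(), replace = TRUE,  # win/draw/loss
                         prob = c(0.25, 0.5, 0.25))) %>%
  group_by(trial) %>%
  summarise(points = sum(result))   # total score per 12-game match
```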

### NYC Restaurant Inspections

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| NYC Restaurant Inspections | 18:45 | Separating column using separate function |
| NYC Restaurant Inspections | 21:15 | Taking distinct observations, but keeping the remaining variables, using distinct function with .keep_all argument |
| NYC Restaurant Inspections | 25:00 | Using broom package and nest function to perform multiple t-tests at the same time (see the sketch after this table) |
| NYC Restaurant Inspections | 26:20 | Tidying nested t-test models using broom |
| NYC Restaurant Inspections | 27:00 | Creating TIE fighter plot of estimates of means and their confidence intervals |
| NYC Restaurant Inspections | 28:45 | Recoding long description using regex to remove everything after a parenthesis |
| NYC Restaurant Inspections | 33:45 | Using cut function to manually bin data along user-specified intervals |
| NYC Restaurant Inspections | 42:00 | Asking, "What type of violations tend to occur more in some cuisines than others?" |
| NYC Restaurant Inspections | 42:45 | Using semi_join function to get the most recent inspection of all the restaurants |
| NYC Restaurant Inspections | 52:00 | Asking, "What violations tend to occur together?" |
| NYC Restaurant Inspections | 53:00 | Using widyr package function pairwise_cor (pairwise correlation) to find co-occurrence of violation types |
| NYC Restaurant Inspections | 55:30 | Beginning of PCA (Principal Component Analysis) using widely_svd function |
| NYC Restaurant Inspections | 58:00 | Actually typing in the widely_svd function |
| NYC Restaurant Inspections | 58:15 | Reviewing and explaining output of widely_svd function |
| NYC Restaurant Inspections | 1:01:30 | Creating graph of opposing elements of a PCA dimension |
| NYC Restaurant Inspections | 1:02:00 | Shortening string using str_sub function |
| NYC Restaurant Inspections | 1:04:00 | Reference to Julia Silge's PCA walkthrough using StackOverflow data |
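
A sketch of the many-models pattern (25:00-27:00): nest the data by cuisine, fit one t-test per group, then tidy everything back into a single frame (data and column names assumed):

```r
library(tidyverse)
library(broom)

inspections %>%
  group_by(cuisine) %>%
  nest() %>%
  mutate(model  = map(data, ~ t.test(.$score)),  # one t-test per cuisine
         tidied = map(model, tidy)) %>%
  unnest(tidied)   # estimate + conf.low/conf.high, ready for a TIE fighter plot
```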

### Riddler: Simulating a Week of Rain

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Riddler: Simulating a Week of Rain | 1:20 | Using crossing function to get all combinations of specified variables (100 trials of 5 days) |
| Riddler: Simulating a Week of Rain | 2:35 | Using rbinom function to simulate whether it rains or not |
| Riddler: Simulating a Week of Rain | 3:15 | Using ifelse function to set starting number of umbrellas at beginning of week |
| Riddler: Simulating a Week of Rain | 4:20 | Explanation of structure of simulation and approach to determining number of umbrellas in each location |
| Riddler: Simulating a Week of Rain | 5:30 | Changing structure so that we have a row for each day's morning or evening |
| Riddler: Simulating a Week of Rain | 7:10 | Using group_by, ifelse, and row_number functions to set starting number of umbrellas for each trial |
| Riddler: Simulating a Week of Rain | 8:45 | Using case_when function to return different values for multiple logical checks (allows for more outputs than ifelse) |
| Riddler: Simulating a Week of Rain | 10:20 | Using cumsum function to create a running tally of number of umbrellas in each location |
| Riddler: Simulating a Week of Rain | 11:25 | Explanation of output of simulated data |
| Riddler: Simulating a Week of Rain | 12:30 | Using any function to check if any day had a negative "umbrella count" (indicating there wasn't an umbrella available when raining) |
| Riddler: Simulating a Week of Rain | 15:40 | Asking, "When was the first time Louie got wet?" |
| Riddler: Simulating a Week of Rain | 17:10 | Creating a custom vector to convert an integer to a weekday (e.g., 2 = Tue) |

### Dolphins

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Dolphins | 6:25 | Using year function from lubridate package to simplify calculating age of dolphins |
| Dolphins | 8:30 | Combining count and fct_lump functions to get counts of top 5 species (with other species lumped in "Other") |
| Dolphins | 9:55 | Creating boxplot of species and age |
| Dolphins | 11:50 | Dealing with different types of NA (double, logical) (he doesn't get it in this case, but it's still useful) |
| Dolphins | 15:30 | Adding acquisition type as colour dimension to histogram |
| Dolphins | 16:00 | Creating a spinogram of acquisition type over time (alternative to histogram) using geom_area |
| Dolphins | 17:25 | Binning year into decade using truncated division operator %/% |
| Dolphins | 19:10 | Fixing annoying triangular gaps in spinogram using complete function to fill in gaps in data |
| Dolphins | 21:15 | Using fct_reorder function to reorder acquisition type (bigger categories are placed on the bottom of the spinogram) |
| Dolphins | 23:25 | Adding vertical dashed reference line using geom_vline function |
| Dolphins | 24:05 | Starting analysis of acquisition location |
| Dolphins | 27:05 | Using regex with the fuzzyjoin package to match messy text data and aggregate it into a few categories |
| Dolphins | 31:30 | Using distinct function's .keep_all argument to keep only one row per animal ID |
| Dolphins | 33:10 | Using coalesce function to conditionally replace NAs (same functionality as the SQL verb) |
| Dolphins | 40:00 | Starting survival analysis |
| Dolphins | 46:25 | Using survfit function from survival package to get a baseline survival curve (i.e., not regressed on any independent variables; see the sketch after this table) |
| Dolphins | 47:30 | Fixing cases where death year is before birth year |
| Dolphins | 48:30 | Fixing specification of survfit model to better fit the format of our data (right-censored data) |
| Dolphins | 50:10 | Built-in plot of baseline survival model (estimation of percentage survival at a given age) |
| Dolphins | 50:30 | Using broom package to tidy the survival model data (which is better for ggplot2 plotting) |
| Dolphins | 52:20 | Fitting survival curve based on sex |
| Dolphins | 54:25 | Cox proportional hazards model (to investigate association of survival time and one or more predictors) |
| Dolphins | 55:50 | Explanation of why dolphins with unknown sex likely have a systematic bias in their data |
| Dolphins | 57:25 | Investigating whether being born in captivity is associated with different survival rates |
| Dolphins | 1:00:10 | Summary of screencast |
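
A sketch of the baseline survival curve (46:25-50:30), assuming `age` is age at death or at last observation and `status` is 1 when the death was observed (right-censored otherwise); both column names are assumptions:

```r
library(tidyverse)
library(survival)
library(broom)

fit <- survfit(Surv(age, status) ~ 1, data = dolphins)

tidy(fit) %>%                 # time/estimate columns, ggplot2-friendly
  ggplot(aes(time, estimate)) +
  geom_line() +
  labs(x = "Age", y = "Estimated share surviving")
```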

### TidyTuesday Tweets

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| TidyTuesday Tweets | 1:20 | Importing an rds file using read_rds function |
| TidyTuesday Tweets | 2:55 | Using floor_date function from lubridate package to round dates down (that's what the floor part does) to the month level |
| TidyTuesday Tweets | 5:25 | Asking, "Which tweets get the most re-tweets?" |
| TidyTuesday Tweets | 5:50 | Using contains function to select only columns that contain a certain string ("retweet" in this case) |
| TidyTuesday Tweets | 8:05 | Exploring likes/re-tweets ratio, including dealing with one or the other being 0 (which would cause a divide-by-zero error) |
| TidyTuesday Tweets | 11:00 | Starting exploration of actual text of tweets |
| TidyTuesday Tweets | 11:35 | Using unnest_tokens function from tidytext package to break tweets into individual words (using token argument specifically for tweet-style text; see the sketch after this table) |
| TidyTuesday Tweets | 12:55 | Using anti_join function to filter out stop words (e.g., "and", "or", "the") from tokenized data frame |
| TidyTuesday Tweets | 14:45 | Calculating summary statistics per word (average retweets and likes), then looking at distributions |
| TidyTuesday Tweets | 16:00 | Explanation of Poisson log normal distribution (number of retweets fits this distribution) |
| TidyTuesday Tweets | 17:45 | Additional example of Poisson log normal distribution (number of likes) |
| TidyTuesday Tweets | 18:20 | Explanation of geometric mean as better summary statistic than median or arithmetic mean |
| TidyTuesday Tweets | 25:20 | Using floor_date function from lubridate package to floor dates to the week level and tweaking so that a week starts on Monday (default is Sunday) |
| TidyTuesday Tweets | 30:20 | Asking, "What topic is each week about?" using just the tweet text |
| TidyTuesday Tweets | 31:30 | Calculating TF-IDF of tweets, with week as the "document" |
| TidyTuesday Tweets | 33:45 | Using top_n and group_by functions to select the top tf-idf score for each week |
| TidyTuesday Tweets | 37:55 | Using str_detect function to filter out "words" that are just numbers (e.g., 16, 36) |
| TidyTuesday Tweets | 41:00 | Using distinct function with .keep_all argument to ensure only the top 1 result, as alternative to top_n function (which includes ties) |
| TidyTuesday Tweets | 42:30 | Making Jenny Bryan disappointed |
| TidyTuesday Tweets | 42:55 | Using geom_text function to add text labels to graph to show the word associated with each week |
| TidyTuesday Tweets | 44:10 | Using geom_text_repel function from ggrepel package as an alternative to geom_text function for adding text labels to graph |
| TidyTuesday Tweets | 46:30 | Using rvest package to scrape web data from a table in Tidy Tuesday README |
| TidyTuesday Tweets | 51:00 | Starting to look at #rstats tweets |
| TidyTuesday Tweets | 56:35 | Spotting signs of fake accounts with purchased followers (lots of hashtags) |
| TidyTuesday Tweets | 59:15 | Explanation of spotting fake accounts |
| TidyTuesday Tweets | 1:00:45 | Using str_detect to filter out web URLs |
| TidyTuesday Tweets | 1:03:55 | Using str_count function and some regex to count how many hashtags a tweet has |
| TidyTuesday Tweets | 1:07:25 | Creating a Bland-Altman plot (total on x-axis, variable of interest on y-axis) |
| TidyTuesday Tweets | 1:08:45 | Using geom_text function with check_overlap argument to add labels to scatterplot |
| TidyTuesday Tweets | 1:12:20 | Asking, "Who are the most active #rstats tweeters?" |
| TidyTuesday Tweets | 1:15:00 | Summary of screencast |
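
A sketch of the tokenizing steps (11:35-12:55); `token = "tweets"` keeps hashtags and @-mentions intact, and the `tweets` data frame with a `text` column is an assumed input:

```r
library(tidyverse)
library(tidytext)

tweet_words <- tweets %>%
  unnest_tokens(word, text, token = "tweets") %>%  # tweet-aware tokenizer
  anti_join(stop_words, by = "word")               # drop "and", "or", "the", ...
```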

### TV Golden Age

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| TV Golden Age | 2:25 | Quick tip on how to start exploring a new dataset |
| TV Golden Age | 7:30 | Investigating inconsistency of shows having a count of seasons that is different from the number of seasons given in the data |
| TV Golden Age | 10:10 | Using %in% operator and all function to only get shows that have a first season and don't have skipped seasons in the data |
| TV Golden Age | 15:30 | Asking, "Which seasons have the most variation in ratings?" |
| TV Golden Age | 20:25 | Using facet_wrap function to separate different shows on a line graph into multiple small graphs |
| TV Golden Age | 20:50 | Writing custom embedded function to get width of breaks on the x-axis to always be even (e.g., season 2, 4, 6, etc.) |
| TV Golden Age | 23:50 | Committing, finding, and explaining a common error of using the same variable name when summarizing multiple things |
| TV Golden Age | 28:20 | Using truncated division operator %/% to bin data into two-year bins instead of annual (e.g., 1990 and 1991 get binned to 1990) |
| TV Golden Age | 31:30 | Using subsetting (with square brackets) within the mutate function to calculate mean on only a subset of data (without needing to filter) |
| TV Golden Age | 33:50 | Using gather function (now pivot_longer) to get metrics as columns into tidy format, in order to graph them all at once with a facet_wrap |
| TV Golden Age | 36:30 | Using pmin function to lump all seasons after 4 into one row (it still shows "4", but it represents "4+") |
| TV Golden Age | 39:00 | Asking, "If season 1 is good, do you get a second season?" (show survival) |
| TV Golden Age | 40:35 | Using paste0 and spread functions to get season 1-3 ratings into three columns, one for each season |
| TV Golden Age | 42:05 | Using distinct function with .keep_all argument to remove duplicates by only keeping the first one that appears |
| TV Golden Age | 45:50 | Using logistic regression to answer, "Does season 1 rating affect the probability of getting a second season?" (note he forgets to specify the family argument, fixed at 57:25) |
| TV Golden Age | 48:35 | Using ntile function to divide data into N bins (5 in this case), then eventually using cut function instead |
| TV Golden Age | 57:00 | Adding year as an independent variable to the logistic regression model |
| TV Golden Age | 58:50 | Adding an interaction term (season 1 interacting with year) to the logistic regression model |
| TV Golden Age | 59:55 | Using augment function as a method of visualizing and interpreting coefficients of regression model |
| TV Golden Age | 1:00:30 | Using crossing function to create new data to test the logistic regression model on and interpret model coefficients |
| TV Golden Age | 1:03:40 | Fitting natural splines using the splines package, which would capture a non-linear relationship |
| TV Golden Age | 1:06:15 | Summary of screencast |

### Space Launches

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Space Launches | 4:40 | Using str_detect function to find missions with "Apollo" in their name |
| Space Launches | 6:20 | Starting EDA (exploratory data analysis) |
| Space Launches | 15:10 | Using fct_collapse function to recode factors (similar to case_when function) |
| Space Launches | 16:45 | Using countrycode function from countrycode package to get full country names from country codes (e.g., "RU" becomes "Russia") |
| Space Launches | 18:15 | Using replace_na function to convert NA (missing) observations to "Other" |
| Space Launches | 19:10 | Creating a line graph using geom_line function with different colours for different categories |
| Space Launches | 21:05 | Using fct_reorder function to reorder factors in line graph above, in order to make legend more readable |
| Space Launches | 32:00 | Creating a bar graph, using geom_col function, of most active (by number of launches) private or startup agencies |
| Space Launches | 35:05 | Using truncated division operator %/% to bin data into decades |
| Space Launches | 35:35 | Using complete function to turn implicit zeros into explicit zeros (makes for a cleaner line graph; see the sketch after this table) |
| Space Launches | 37:15 | Using facet_wrap function to create small multiples of a line graph, then proceeding to tweak the graph |
| Space Launches | 42:50 | Using semi_join function as a filtering step |
| Space Launches | 43:15 | Using geom_point to create a timeline of launches by vehicle type |
| Space Launches | 47:20 | Explanation of why boxplots over time might not be a good visualization choice |
| Space Launches | 48:00 | Using geom_jitter function to tweak the timeline graph to be more readable |
| Space Launches | 51:30 | Creating a second timeline graph for US vehicles and launches |
| Space Launches | 56:35 | Summary of screencast |
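
A sketch of the complete() step (35:35), which inserts explicit zero rows for agency/year combinations that have no launches (data and column names assumed):

```r
library(tidyverse)

launches_per_year %>%
  complete(agency, year, fill = list(n_launches = 0))
```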

### US Incarceration

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| US Incarceration | 4:30 | Creating a facetted (small multiples) line graph of incarceration rate by urbanicity and race over time |
| US Incarceration | 7:45 | Discussion of statistical testing of incarceration rates by urbanicity (e.g., rural, suburban) |
| US Incarceration | 11:25 | Exploring the extent of missing data on prison population |
| US Incarceration | 14:15 | Using any function to filter down to states that have at least one (hence the any function) row of non-missing data |
| US Incarceration | 18:40 | Using cut function to manually bin data along user-specified intervals |
| US Incarceration | 24:15 | Starting to create a choropleth map of incarceration rate by state |
| US Incarceration | 26:20 | Using match function to match two-letter state abbreviation to full state name, in order to get data needed to create a map |
| US Incarceration | 28:00 | Actually typing the code (now that we have the necessary data) to create a choropleth map |
| US Incarceration | 33:05 | Using str_remove function and regex to chop off the end of county names (e.g., "Allen Parish" becomes "Allen") |
| US Incarceration | 33:30 | Making choropleth more specific by drilling down to county-level data |
| US Incarceration | 41:10 | Starting to make an animated choropleth map using gganimate package |
| US Incarceration | 42:20 | Using modulo operator %% to choose every 5th year |
| US Incarceration | 43:45 | Using scale_fill_gradient2 function's limits argument to exclude unusually high values that were blowing out the scale |
| US Incarceration | 48:15 | Using summarise_at function to apply the same function to multiple fields at the same time |
| US Incarceration | 50:10 | Starting to investigate missing data (how much is missing, where is it missing, etc.) |
| US Incarceration | 54:50 | Creating a line graph that excludes counties with missing data |
| US Incarceration | 57:05 | Summary of screencast |

### US Dairy Consumption

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| US Dairy Consumption | 2:50 | Identifying the need for a gather step |
| US Dairy Consumption | 4:40 | Changing snake case to title case using str_to_title and str_replace_all functions |
| US Dairy Consumption | 6:20 | Identifying need for separating categories into major and minor categories (e.g., "Cheese Other" can be divided into "Cheese" and "Other") |
| US Dairy Consumption | 7:10 | Using separate function to split categories into major and minor categories (good explanation of "extra" argument, which merges additional separations into one field) |
| US Dairy Consumption | 8:20 | Using coalesce function to deal with NAs resulting from above step |
| US Dairy Consumption | 10:30 | Dealing with graph of minor category that is linked to multiple major categories ("Other" linked to "Cheese" and "Frozen") |
| US Dairy Consumption | 13:10 | Introducing fct_lump function as an approach to work with many categories |
| US Dairy Consumption | 14:50 | Introducing facetting (facet_wrap function) as second alternative to working with many categories |
| US Dairy Consumption | 15:50 | Dealing with "Other" category having two parts to it by using ifelse function in the cleaning step (e.g., go from "Other" to "Other Cheese") |
| US Dairy Consumption | 19:45 | Looking at page for the sweep package |
| US Dairy Consumption | 21:20 | Using tk_ts function to coerce a tibble to a timeseries |
| US Dairy Consumption | 22:10 | Turning year column (numeric) into a date by adding number of years to Jan 1, 0001 |
| US Dairy Consumption | 26:00 | Nesting time series object into each combination of category and product |
| US Dairy Consumption | 27:50 | Applying ETS (Error, Trend, Seasonal) model to each time series |
| US Dairy Consumption | 28:10 | Using sw_glance function (sweep package's version of glance function) to pull out model parameters from model field created in above step |
| US Dairy Consumption | 29:45 | Using sw_augment function to append fitted values and residuals from the model to the original data |
| US Dairy Consumption | 30:50 | Visualizing actual and fitted values on the same graph to get a look at the ETS model |
| US Dairy Consumption | 32:10 | Using Arima function (note the capital A) as alternative to ETS (not sure what the difference is between arima and Arima) |
| US Dairy Consumption | 35:00 | Forecasting into the future using an ETS model using various functions: unnest, sw_sweep, forecast (see the sketch after this table) |
| US Dairy Consumption | 37:45 | Using geom_ribbon function to add confidence bounds to forecast |
| US Dairy Consumption | 40:20 | Forecasting using auto-ARIMA (instead of ETS) |
| US Dairy Consumption | 40:55 | Applying two forecasting methods at the same time (auto-ARIMA and ETS) using the crossing function |
| US Dairy Consumption | 41:55 | Quick test of how invoke function works (used to call a function easily, e.g., when it is a character string instead of called directly) |
| US Dairy Consumption | 47:35 | Removing only one part of legend (line type of solid or dashed) using scale_linetype_discrete function |
| US Dairy Consumption | 51:25 | Using gather function to clean up new dataset |
| US Dairy Consumption | 52:05 | Using fct_recode to fix a typo in a categorical variable |
| US Dairy Consumption | 56:00 | Copy-pasting previous forecasting code to cheese and reviewing any changes needed |
| US Dairy Consumption | 57:20 | Discussing alternative approach: creating interactive visualisation using shiny package to do direct comparisons |
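
The sweep workflow above wraps the forecast package; as a stand-alone sketch of the ETS forecasting idea (35:00), fitted directly with forecast rather than sweep's nested pipeline, under assumed data names:

```r
library(forecast)

milk_ts <- ts(dairy$lbs_per_person,
              start = min(dairy$year), frequency = 1)  # yearly series

fc <- forecast(ets(milk_ts), h = 5)   # ETS model, 5 years ahead
plot(fc)                              # point forecast plus confidence bounds
```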

### US PhDs

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| US PhDs | 3:15 | Using read_xlsx function to read in Excel spreadsheet, including skipping first few rows that don't have data |
| US PhDs | 7:25 | Overview of starting very messy data |
| US PhDs | 8:20 | Using gather function to clean up wide dataset |
| US PhDs | 9:20 | Using fill function to fill in NA values with entries from the previous observation (see the sketch after this table) |
| US PhDs | 10:10 | Cleaning variable that has number and percent stacked on top of one another, using a combination of ifelse and fill functions |
| US PhDs | 12:00 | Using spread function on cleaned data to separate number and percent by year |
| US PhDs | 13:50 | Spotting a mistake where he had the wrong string in str_detect function |
| US PhDs | 16:50 | Using sample function to get 6 random fields of study to graph |
| US PhDs | 18:50 | Cleaning another dataset, which is much easier to clean |
| US PhDs | 19:05 | Renaming the first field, even without knowing the exact name |
| US PhDs | 21:55 | Cleaning another dataset |
| US PhDs | 23:10 | Discussing challenge of when indentation is used in original dataset (for group / sub-group distinction) |
| US PhDs | 25:20 | Starting to separate out data that is appended to one another in the original dataset (all, male, female) |
| US PhDs | 27:30 | Removing a field with a long name using contains function |
| US PhDs | 28:10 | Using fct_recode function to rename an oddly-named category in a categorical variable (ifelse function is probably a better alternative) |
| US PhDs | 35:30 | Discussing a solution for broad vs. fine major field descriptions (meaningfully indented in the original data) |
| US PhDs | 39:40 | Using setdiff function to separate broad and fine major fields |
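
A sketch of the fill-then-gather pattern for messy spreadsheets (8:20-9:20), where merged cells leave NAs that should repeat the value above them (data and column names assumed):

```r
library(tidyverse)

phds_tidy <- phds_raw %>%
  fill(field) %>%               # carry field names down over NA cells
  gather(year, n_phds, -field)  # wide year columns -> tidy rows
```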

### French Train Delays

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| French Train Delays | 10:20 | Boxplots of departure stations using fct_lump function |
| French Train Delays | 14:25 | Creating heat map of departure and arrival delays, then cleaning up a sparse heat map |
| French Train Delays | 15:30 | Using fct_reorder function and length function to reorder stations based on how frequently they appear |
| French Train Delays | 16:30 | Using fct_infreq to reorder stations by how frequently they appear (same as above, but without needing a trick) |
| French Train Delays | 17:45 | Using fct_lump function to lump based on proportion instead of number of top categories desired |
| French Train Delays | 18:45 | Using scale_fill_gradient2 function to specify diverging colour scale |
| French Train Delays | 26:00 | Checking another person's take on the data, which is a heatmap over time |
| French Train Delays | 28:40 | Converting year and month (as digits) into date-class variable using sprintf function and padding the month number with an extra zero when necessary |
| French Train Delays | 34:50 | Using summarise_at function to quickly sum multiple columns |
| French Train Delays | 39:35 | Creating heatmap using geom_tile function for percentage of late trains by station over time (see the sketch after this table) |
| French Train Delays | 45:05 | Using fill function to fill in missing NA values with data from previous observations |
| French Train Delays | 50:35 | Grouping multiple variables into a single category using paste0 function |
| French Train Delays | 51:40 | Grouping heatmap into International / National chunks with a weird hack |
| French Train Delays | 52:20 | Further separating International / National visually |
| French Train Delays | 53:30 | Less hacky way of separating International / National (compared to previous two rows) |
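
A sketch of the heatmap (39:35) with a diverging colour scale (18:45); the column names and the midpoint are assumptions:

```r
library(tidyverse)

trains %>%
  ggplot(aes(month, departure_station, fill = pct_late)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 0.15,
                       labels = scales::percent_format())
```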

### Women in the Workplace

[Back to summary](#screencast-summary)

| Screencast | Time | Description |
| --- | --- | --- |
| Women in the Workplace | 5:50 | Writing a custom function that summarizes variables based on their names (then abandoning the idea) |
| Women in the Workplace | 9:15 | Using complete.cases function to find observations that have an NA value in any variable |
| Women in the Workplace | 9:50 | Using subsetting within a summarise function to calculate a weighted mean when dealing with 0 or NA values in some observations |
| Women in the Workplace | 12:20 | Debugging what is causing NA values to appear in the summarise output (finds the error at 13:25) |
| Women in the Workplace | 17:50 | Hypothesizing about one sector illustrating a variation of Simpson's Paradox |
| Women in the Workplace | 25:25 | Creating a scatterplot with a logarithmic scale and using scale_colour_gradient2 function to encode data to point colour |
| Women in the Workplace | 30:00 | Creating an interactive plot (tooltips show up on hover) using ggplotly function from plotly package (see the sketch after this table) |
| Women in the Workplace | 33:20 | Fiddling with scale_size_continuous function's range argument to specify point size on a scatterplot (which is encoded to total workers) |
| Women in the Workplace | 34:50 | Explanation of why healthcare sector is a good example of Simpson's Paradox |
| Women in the Workplace | 43:15 | Starting to create a shiny app with "occupation" as only input (many tweaks in subsequent minutes to make it work) |
| Women in the Workplace | 47:55 | Tweaking size (height) of graph in shiny app |
| Women in the Workplace | 54:05 | Summary of screencast |
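
A sketch of the ggplotly step (30:00); the `label` aesthetic is what feeds the hover tooltip, and the data and column names are assumptions:

```r
library(tidyverse)
library(plotly)

p <- jobs_gender %>%
  ggplot(aes(total_earnings, wage_percent_of_male,
             size = total_workers, label = occupation)) +
  geom_point() +
  scale_x_log10()

ggplotly(p)   # tooltips appear on hover
```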

Board Game Reviews

Back to summary

Screencast Time Description
Board Game Reviews 2:50 Starting EDA (exploratory data analysis) with counts of categorical variables
Board Game Reviews 7:25 Specifying scale_x_log10 function's breaks argument to get sensible tick marks for time on histogram
Board Game Reviews 8:45 Tweaking geom_histogram function's binwidth argument to get something that makes sense for log scale
Board Game Reviews 10:10 Using separate_rows to break down comma-separated values for three different categorical variables
Board Game Reviews 15:55 Using top_n to get top 20 observations from each of several categories (not quite right, fixed at 17:47)
Board Game Reviews 16:15 Troubleshooting various issues with facetted graph (e.g., ordering, values appearing in multiple categories)
Board Game Reviews 19:55 Starting prediction of average rating with a linear model
Board Game Reviews 20:50 Splitting data into train/test sets (training/holdout)
Board Game Reviews 22:55 Investigating relationship between max number of players and average rating (to determine if it should be in linear model)
Board Game Reviews 25:05 Exploring average rating over time ("Do newer games tend to be rated higher/lower?")
Board Game Reviews 27:35 Discussing necessity of controlling for year a game was published in the linear model
Board Game Reviews 28:30 Non-model approach to exploring the effect of game features (e.g., card game, made in Germany) on average rating
Board Game Reviews 30:50 Using geom_boxplot function to create boxplot of average ratings for most common game features
Board Game Reviews 34:05 Using unite function to combine multiple variables into one
Board Game Reviews 37:25 Introducing Lasso regression as good option when you have many features likely to be correlated with one another
Board Game Reviews 38:15 Writing code to set up Lasso regression using glmnet and tidytext packages
Board Game Reviews 40:05 Adding average rating to the feature matrix (warning: method is messy)
Board Game Reviews 41:40 Using setdiff function to find games that are in one set, but not in another (while setting up matrix for Lasso regression)
Board Game Reviews 44:15 Spotting the error stemming from the step above (calling row names from the wrong data)
Board Game Reviews 45:45 Explaining what a Lasso regression does, including the penalty parameter lambda
Board Game Reviews 48:35 Using a cross-validated Lasso model to choose the level of the penalty parameter lambda (see sketch below)
Board Game Reviews 51:35 Adding non-categorical variables to the Lasso model to control for them (e.g., max number of players)
Board Game Reviews 55:15 Using unite function to combine multiple variables into one, separated by a colon
Board Game Reviews 58:45 Graphing the top 20 coefficients in the Lasso model that have the biggest effect on predicted average rating
Board Game Reviews 1:00:55 Mentioning the yardstick package as a way to evaluate the model's performance
Board Game Reviews 1:01:15 Discussing drawbacks of linear models like Lasso (can't do non-linear relationships or interaction effects)
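
A minimal sketch of cross-validated Lasso with cv.glmnet, on a simulated feature matrix rather than the actual game features:

```r
library(glmnet)

# Toy problem: 100 games x 20 binary features (hypothetical stand-in
# for the game-feature matrix)
set.seed(2019)
x <- matrix(rbinom(100 * 20, 1, 0.3), nrow = 100)
y <- rnorm(100, mean = 6.5 + 0.5 * x[, 1], sd = 0.5)

# cv.glmnet fits the whole lasso path and cross-validates lambda
cv_fit <- cv.glmnet(x, y)
cv_fit$lambda.1se              # largest lambda within 1 SE of min CV error
coef(cv_fit, s = "lambda.1se") # sparse coefficients at that lambda
```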

Seattle Pet Names

Back to summary

Screencast Time Description
Seattle Pet Names 2:40 Using mdy function from lubridate package to convert character-formatted date to date-class
Seattle Pet Names 4:20 Exploratory bar graph showing top species of cats, using geom_col function
Seattle Pet Names 6:30 Specifying facet_wrap function's ncol argument to get graphs stacked vertically (instead of side-by-side)
Seattle Pet Names 9:55 Asking, "Are some animal names associated with particular dog breeds?"
Seattle Pet Names 11:15 Explanation of add_count function
Seattle Pet Names 12:35 Adding up various metrics (e.g., number of names overall, number of breeds overall), but note a mistake that gets fixed at 17:05
Seattle Pet Names 16:10 Calculating a ratio for names that appear over-represented within a breed, then explaining how small samples can be misleading
Seattle Pet Names 17:05 Spotting and fixing an aggregation mistake
Seattle Pet Names 17:55 Explanation of how to investigate which names might be over-represented within a breed
Seattle Pet Names 18:55 Explanation of how to use hypergeometric distribution to test for name over-representation
Seattle Pet Names 20:40 Using phyper function to calculate p-values for a one-sided hypergeometric test (see sketch below)
Seattle Pet Names 23:30 Additional explanation of hypergeometric distribution
Seattle Pet Names 24:00 First investigation of why and how to interpret a p-value histogram (second at 29:45, third at 37:45, and answer at 39:30)
Seattle Pet Names 25:15 Noticing that we are missing zeros (i.e., having a breed/name combination with 0 dogs), which is important for the hypergeometric test
Seattle Pet Names 27:10 Using complete function to turn implicit zeros (for breed/name combination) into explicit zeros
Seattle Pet Names 29:45 Second investigation of p-value histogram (after adding in implicit zeros)
Seattle Pet Names 31:55 Explanation of multiple hypothesis testing and correction methods (e.g., Bonferroni, Holm), and applying using p.adjust function
Seattle Pet Names 34:25 Explanation of False Discovery Rate (FDR) control as a method for correcting for multiple hypothesis testing, and applying using p.adjust function
Seattle Pet Names 37:45 Third investigation of p-value histogram, to hunt for under-represented names
Seattle Pet Names 39:30 Answer to why the p-value distribution is not well-behaved
Seattle Pet Names 42:40 Using crossing function to create a simulated dataset to explore how different values affect the p-value
Seattle Pet Names 44:55 Explanation of how total number of names and total number of breeds affects p-value
Seattle Pet Names 46:00 More general explanation of what different shapes of p-value histogram might indicate
Seattle Pet Names 47:30 Renaming variables within a transmute function, using backticks to get names with spaces in them
Seattle Pet Names 49:20 Using kable function from the knitr package to create a nice-looking table
Seattle Pet Names 50:00 Explanation of one-sided p-value (as opposed to two-sided p-value)
Seattle Pet Names 53:55 Summary of screencast
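
A minimal sketch of the one-sided hypergeometric test with phyper, using made-up name/breed counts:

```r
# Hypothetical counts: of 10,000 dogs, 150 are named "Lucy"; among 400
# Golden Retrievers, 12 are named "Lucy". Under the hypergeometric null
# of no name/breed association, the one-sided p-value is P(X >= 12):
phyper(12 - 1, m = 150, n = 10000 - 150, k = 400, lower.tail = FALSE)
```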

Seattle Bike Counts

Back to summary

Screencast Time Description
Seattle Bike Counts 6:15 Using summarise_all and summarise_at functions to aggregate multiple variables at the same time
Seattle Bike Counts 8:15 Using magnitude instead of absolute numbers to see trends in time of day
Seattle Bike Counts 12:00 Dividing time into categories (four categories for times of day, e.g., morning commute, night) using between function (see sketch below)
Seattle Bike Counts 15:00 Looking for systematically missing data (which would bias the results of the analysis)
Seattle Bike Counts 19:45 Summarising using a filter in the arguments based on whether the time window is during a commute time
Seattle Bike Counts 22:45 Combining day of week and hour using functions in the lubridate package and as.difftime function (but then he uses facetting as an easier method)
Seattle Bike Counts 26:30 Normalizing day of week data to percent of weekly traffic
Seattle Bike Counts 42:00 Starting analysis of directions of travel by time of day (commute vs. reverse-commute)
Seattle Bike Counts 43:45 Filtering out weekend days using wday function from lubridate package
Seattle Bike Counts 45:30 Using spread function to create new variable of ratio of bike counts at different commute times
Seattle Bike Counts 47:30 Visualizing ratio of bike counts by time of day
Seattle Bike Counts 50:15 Visualizing ratio by hour instead of time of day
Seattle Bike Counts 52:50 Ordering crossings in the graph by when the average trip happens, using the mean hour weighted by bike count
Seattle Bike Counts 54:50 Quick and dirty filter when creating a new variable within a mutate function
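
A minimal sketch of bucketing hours with between (the cut-offs here are assumptions, not the ones from the screencast):

```r
library(dplyr)

bikes <- tibble(hour = c(2, 8, 13, 18, 23))

# between() is an inclusive range check, convenient for bucketing hours
bikes %>%
  mutate(time_window = case_when(
    between(hour, 7, 10)  ~ "Morning commute",
    between(hour, 11, 15) ~ "Midday",
    between(hour, 16, 19) ~ "Evening commute",
    TRUE                  ~ "Night"
  ))
```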

Tennis Tournaments

Back to summary

Screencast Time Description
Tennis Tournaments 5:00 Identifying duplicated rows and fixing them
Tennis Tournaments 11:15 Using add_count and fct_reorder functions to order categories that are broken down into sub-categories for graphing
Tennis Tournaments 13:00 Tidying graph titles (e.g., replacing underscores with spaces) using str_to_title and str_replace functions
Tennis Tournaments 15:00 Using inner_join function to merge datasets
Tennis Tournaments 15:30 Calculating age from date of birth using difftime and as.numeric functions
Tennis Tournaments 16:35 Adding simple calculations like mean and median into the text portion of markdown document
Tennis Tournaments 17:45 Looking at distribution of wins by sex using overlapping histograms
Tennis Tournaments 18:55 Binning years into decades using truncated division %/%
Tennis Tournaments 20:15 Splitting up boxplots so that they are separated into pairs (M/F) across a different group (decade) using interaction function
Tennis Tournaments 20:30 Analyzing distribution of ages across decades, looking specifically at the effect of Serena Williams (one individual having a disproportionate effect on the data, making it look like there's a trend)
Tennis Tournaments 24:30 Avoiding double-counting of individuals by counting their average age instead of their age at each win
Tennis Tournaments 30:20 Starting analysis to predict winner of Grand Slam tournaments
Tennis Tournaments 35:00 Creating rolling count using row_number function to make a count of previous tournament experience
Tennis Tournaments 39:45 Creating rolling win count using cumsum function
Tennis Tournaments 41:00 Lagging rolling win count using lag function, so that for prediction purposes each row only reflects wins from before that tournament (see sketch below)
Tennis Tournaments 43:30 Asking, "When someone is a finalist, what is their probability of winning as a function of previous tournaments won?"
Tennis Tournaments 48:00 Asking, "How does the number of wins a finalist has affect their chance of winning?"
Tennis Tournaments 49:00 Backtesting simple classifier where person with more tournament wins is predicted to win the given tournament
Tennis Tournaments 51:45 Creating classifier that gives points based on how far a player got in previous tournaments
Tennis Tournaments 52:55 Using match function to turn name of round reached (1st round, 2nd round, …) into a number score (1, 2, …)
Tennis Tournaments 54:20 Using cummean function to get score of average past performance (instead of cumsum function)
Tennis Tournaments 1:04:10 Pulling names of rounds (1st round, 2nd round, … ) based on the rounded numeric score of previous performance
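
A minimal sketch of the lagged rolling win count, with a toy results table:

```r
library(dplyr)

results <- tibble(
  player = c("A", "A", "A", "B", "B"),
  won    = c(1, 0, 1, 1, 1)
)

# cumsum() builds a rolling win count; lag() shifts it so each row only
# "knows" about wins from strictly earlier tournaments (no leakage)
results %>%
  group_by(player) %>%
  mutate(previous_wins = lag(cumsum(won), default = 0))
```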

Bird Collisions

Back to summary

Screencast Time Description
Bird Collisions 2:45 Analyzing when NAs appear in a dimension
Bird Collisions 7:30 Looking at multiple categorical variables at the same time by gathering them into one column and eventually graphing each as a different facet
Bird Collisions 9:30 Re-ordering facet graphs from those with the fewest categories to those with the most
Bird Collisions 20:45 Geometric mean for estimating counts when there are a lot of low values (1-3 bird collisions, in this case)
Bird Collisions 23:15 Filling in "blank" observations where there were no observations made
Bird Collisions 27:00 Using log+1 to convert a dimension with values of 0 into a log scale
Bird Collisions 29:00 Adding confidence bounds for data using a geometric mean (where he first gets the idea of bootstrapping)
Bird Collisions 32:00 Actual coding of bootstrap starts
Bird Collisions 38:30 Adding confidence bounds using bootstrap data (see sketch below)
Bird Collisions 42:00 Investigating potential confounding variables
Bird Collisions 44:15 Discussing approaches to dealing with confounding variables
Bird Collisions 46:45 Using complete function to get explicit NA values
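
A minimal sketch of bootstrapped confidence bounds for a geometric mean, on simulated counts rather than the collision data:

```r
# Toy skewed counts standing in for per-species collision counts
set.seed(42)
counts <- rpois(200, lambda = 2) + 1

# Geometric mean, plus a percentile bootstrap for its confidence bounds
geom_mean  <- function(x) exp(mean(log(x)))
boot_means <- replicate(2000, geom_mean(sample(counts, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))
```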

Student Teacher Ratios

Back to summary

Screencast Time Description
Student-Teacher Ratios 7:30 Using slice function to select 10 highest and 10 lowest student-teacher ratios (like a filter using row numbers)
Student-Teacher Ratios 12:35 Adding GDP per capita to a dataset using WDI package (see sketch below)
Student-Teacher Ratios 17:40 Using geom_text to add labels to points on a scatterplot
Student-Teacher Ratios 19:00 Using WDIsearch function from WDI package to search for country population data
Student-Teacher Ratios 23:20 Explanation of trick with geom_text function's check_overlap argument to get label for US to appear by rearranging row order
Student-Teacher Ratios 25:45 Using comma_format function from scales package to get a more readable numeric legend (e.g., "500,000,000" instead of "5e+08")
Student-Teacher Ratios 27:55 Exploring different education-related indicators in the WDI package
Student-Teacher Ratios 31:55 Using spread function (now pivot_wider) to turn data from tidy to wide format
Student-Teacher Ratios 32:15 Using to_snake_case function from snakecase package to convert field names to snake_case
Student-Teacher Ratios 48:30 Exploring female/male secondary school enrollment
Student-Teacher Ratios 51:50 Note of caution on keeping confounders in mind when interpreting scatterplots
Student-Teacher Ratios 52:30 Creating a linear regression of secondary school enrollment to explore confounders
Student-Teacher Ratios 54:30 Discussing the actual confounder (GDP per capita) in the linear regression above
Student-Teacher Ratios 57:20 Adding world region as another potential confounder
Student-Teacher Ratios 58:00 Using aov function (ANOVA) to explore confounders further
Student-Teacher Ratios 1:06:50 Reviewing and interpreting the final linear regression model
Student-Teacher Ratios 1:08:00 Using cor function (correlation) to get correlation matrix for three variables (and brief explanation of multi-collinearity)
Student-Teacher Ratios 1:10:10 Summary of screencast
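
A minimal sketch of pulling an indicator with the WDI package (requires an internet connection; "NY.GDP.PCAP.CD" is the World Bank's GDP-per-capita series):

```r
library(WDI)

# Downloads GDP per capita (current US$) for all countries in 2015
gdp <- WDI(indicator = "NY.GDP.PCAP.CD", start = 2015, end = 2015)
head(gdp)
```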

Nobel Prize Winners

Back to summary

Screencast Time Description
Nobel Prize Winners 2:00 Creating a stacked bar plot using geom_col and the aes function's fill argument (also bins years into decades with truncated division operator %/%; see sketch below)
Nobel Prize Winners 3:30 Using n_distinct function to quickly count unique years in a group
Nobel Prize Winners 9:00 Using distinct function and its .keep_all argument to de-duplicate data
Nobel Prize Winners 10:50 Using coalesce function to replace NAs in a variable (similar to SQL COALESCE verb)
Nobel Prize Winners 16:10 Using year function from lubridate package to calculate (approx.) age of laureates at time of award
Nobel Prize Winners 16:50 Using fct_reorder function to arrange boxplot graph by the median age of winners
Nobel Prize Winners 22:50 Defining a new variable within the count function (like doing a mutate in the count function)
Nobel Prize Winners 23:40 Creating a small multiples bar plot using geom_col and facet_wrap functions
Nobel Prize Winners 26:15 Importing income data from WDI package to explore relationship between high/low income countries and winners
Nobel Prize Winners 33:45 Using fct_relevel to change the levels of a categorical income variable (e.g., "Upper middle income") so that the ordering makes sense
Nobel Prize Winners 36:25 Starting to explore new dataset of nobel laureate publications
Nobel Prize Winners 44:25 Taking the mean of a subset of data without needing to fully filter the data beforehand
Nobel Prize Winners 49:15 Using rank function and its ties.method argument to add the ordinal number of a laureate's publication (e.g., 1st paper, 2nd paper)
Nobel Prize Winners 1:05:10 Lots of playing around with exploratory histograms (geom_histogram)
Nobel Prize Winners 1:06:45 Discussion of right-censoring as an issue (people winning the Nobel prize but still having active careers)
Nobel Prize Winners 1:10:20 Summary of screencast
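
A minimal sketch of binning years into decades with truncated division, on toy prize years:

```r
library(dplyr)

nobel <- tibble(prize_year = c(1903, 1911, 1958, 1964, 2011))

# Integer-divide by 10 and multiply back to floor each year to its decade
nobel %>%
  mutate(decade = 10 * (prize_year %/% 10)) %>%
  count(decade)
```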

Plastic Waste

Back to summary

Screencast Time Description
Plastic Waste 1:45 Using summarise_all to get proportion of NA values across many variables
Plastic Waste 16:50 Adding text labels to scatter plot for some points using check_overlap argument
Plastic Waste 21:45 Using pmin function to get the lower of two possible numbers for a percentage variable that was showing > 100%
Plastic Waste 29:00 Starting to make a choropleth map
Plastic Waste 29:30 Connecting ISO country names (used in mapping code) to country names given in the dataset
Plastic Waste 32:00 Actual code to create the map using given longitude and latitude
Plastic Waste 33:45 Using fuzzyjoin package to join datasets on regular-expression matches instead of exact strings (using regex_right_join / regex_left_join functions; see sketch below)
Plastic Waste 36:15 Using coord_fixed function as a hack to get proper ratios for maps
Plastic Waste 39:30 Bringing in additional data using WDI package
Plastic Waste 47:30 Using patchwork package to show multiple graphs in the same plot
Plastic Waste 53:00 Importing and renaming multiple indicators from the WDI package at the same time
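
A minimal sketch of a regex join with fuzzyjoin, using made-up country names and patterns:

```r
library(dplyr)
library(fuzzyjoin)

countries <- tibble(country = c("United States of America", "South Korea"))
patterns  <- tibble(regex = c("United States", "Korea"),
                    code  = c("US", "KR"))

# regex_left_join keeps a row when `country` matches the `regex` pattern
countries %>%
  regex_left_join(patterns, by = c(country = "regex"))
```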

Wine Ratings

Back to summary

Screencast Time Description
Wine Ratings 3:15 Using extract function from tidyr package to pull out year from text field
Wine Ratings 9:15 Changing extract function to pull out year column more accurately
Wine Ratings 13:00 Starting to explore prediction of points
Wine Ratings 17:00 Using fct_lump on country variable to collapse countries into an "Other" category, then fct_relevel to set the baseline category for a linear model
Wine Ratings 21:30 Investigating year as a potential confounding variable
Wine Ratings 24:45 Investigating "taster_name" as a potential confounding variable
Wine Ratings 27:45 Coefficient (TIE fighter) plot to see effect size of terms in a linear model, using tidy function from broom package
Wine Ratings 30:45 Polishing category names for presentation in graph using str_replace function
Wine Ratings 32:15 Using augment function to add predictions of linear model to original data
Wine Ratings 33:30 Plotting predicted points vs. actual points
Wine Ratings 34:45 Using ANOVA to determine the amount of variation that is explained by different terms
Wine Ratings 36:45 Using tidytext package to set up wine review text for Lasso regression
Wine Ratings 40:00 Setting up and using pairwise_cor function to look at words that appear in reviews together
Wine Ratings 45:00 Creating sparse matrix using cast_sparse function from tidytext package (see sketch below); used to perform a regression on positive/negative words
Wine Ratings 46:45 Checking if row names of sparse matrix correspond to the wine_id values they represent
Wine Ratings 47:00 Setting up sparse matrix for using glmnet package to do sparse regression using Lasso method
Wine Ratings 48:15 Actually writing code for doing Lasso regression
Wine Ratings 49:45 Basic explanation of Lasso regression
Wine Ratings 51:00 Putting Lasso model into tidy format
Wine Ratings 53:15 Explaining how the number of terms increases as lambda (penalty parameter) decreases
Wine Ratings 54:00 Answering how we choose a lambda value (penalty parameter) for Lasso regression
Wine Ratings 56:45 Using parallelization for intensive computations
Wine Ratings 58:30 Adding price (from original linear model) to Lasso regression
Wine Ratings 1:02:15 Shows glmnet.fit piece of a Lasso model (using glmnet package)
Wine Ratings 1:03:30 Picking a lambda value (penalty parameter) and explaining which one to pick
Wine Ratings 1:08:15 Taking most extreme coefficients (positive and negative) by grouping them by direction
Wine Ratings 1:10:30 Demonstrating tidytext package's sentiment lexicon, then looking at individual reviews to demonstrate the model
Wine Ratings 1:17:30 Visualizing each coefficient's effect on a single review
Wine Ratings 1:20:30 Using str_trunc to truncate character strings
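
A minimal sketch of cast_sparse, with toy review words in place of the wine data:

```r
library(dplyr)
library(tidytext)

review_words <- tibble(
  wine_id = c(1, 1, 2, 2, 2),
  word    = c("cherry", "oak", "citrus", "oak", "crisp")
)

# cast_sparse builds a sparse document-term matrix:
# rows = wine_id, columns = word, values = n
m <- review_words %>%
  count(wine_id, word) %>%
  cast_sparse(wine_id, word, n)
dim(m)
```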

Ramen Reviews

Back to summary

Screencast Time Description
Ramen Reviews 1:45 Looking at the website the data came from
Ramen Reviews 2:55 Using gather function (now pivot_longer) to convert wide data to long (tidy) format
Ramen Reviews 4:15 Graphing counts of all categorical variables at once, then exploring them
Ramen Reviews 5:35 Using fct_lump function to lump three categorical variables to the top N categories and "Other"
Ramen Reviews 7:45 Using reorder_within function to re-order factors that have the same name across multiple facets
Ramen Reviews 9:10 Using lm function (linear model) to predict star rating
Ramen Reviews 9:50 Visualising effects (and 95% CI) of independent variables in linear model with a coefficient plot (TIE fighter plot)
Ramen Reviews 11:30 Using fct_relevel function to get "Other" as the base reference level for categorical independent variables in a linear model
Ramen Reviews 13:05 Using extract function and regex to split a camelCase variable into two separate variables
Ramen Reviews 14:45 Using facet_wrap function to split coefficient / TIE fighter plot into three separate plots, based on type of coefficient
Ramen Reviews 15:40 Using geom_vline function to add reference line to graph
Ramen Reviews 17:20 Using unnest_tokens function from tidytext package to explore the relationship between variety (a sparse categorical variable) and star rating
Ramen Reviews 18:55 Explanation of how he would approach variety variable with Lasso regression
Ramen Reviews 19:35 Web scraping using the rvest package and SelectorGadget (Chrome Extension CSS selector)
Ramen Reviews 21:20 Actually writing code for web scraping, using read_html, html_node, and html_table functions (see sketch below)
Ramen Reviews 22:25 Using clean_names function from janitor package to clean up names of variables
Ramen Reviews 23:05 Explanation of web scraping task: get full review text using the links from the review summary table scraped above
Ramen Reviews 25:40 Using parse_number function as alternative to as.integer function to cleverly drop extra weird text in review number
Ramen Reviews 26:45 Using SelectorGadget (Chrome Extension CSS selector) to identify part of page that contains review text
Ramen Reviews 27:35 Using html_nodes, html_text, and str_subset functions to write custom function to scrape review text identified in step above
Ramen Reviews 29:15 Adding message function to custom scraping function to display URLs as they are being scraped
Ramen Reviews 30:15 Using unnest_tokens and anti_join functions to split review text into individual words and remove stop words (e.g., "the", "or", "and")
Ramen Reviews 31:05 Catching a mistake in the custom function causing it to read the same URL every time
Ramen Reviews 31:55 Using str_detect function to filter out review paragraphs without a keyword in it
Ramen Reviews 32:40 Using str_remove function and regex to get rid of string that follows a specific pattern
Ramen Reviews 34:10 Explanation of possibly and safely functions in purrr package
Ramen Reviews 37:45 Reviewing output of the URL that failed to scrape, including using character(0) as a default null value
Ramen Reviews 48:00 Using pairwise_cor function from widyr package to see which words tend to appear in reviews together
Ramen Reviews 51:05 Using igraph and ggraph packages to make network plot of word correlations
Ramen Reviews 51:55 Using geom_node_text function to add labels to network plot
Ramen Reviews 52:35 Including all words (not just those connected to others) as vertices in the network plot
Ramen Reviews 54:40 Tweaking and refining network plot aesthetics (vertex size and colour)
Ramen Reviews 56:00 Weird hack for getting a dark outline on hard-to-see vertex points
Ramen Reviews 59:15 Summary of screencast
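
A minimal sketch of the table-scraping step with rvest (the URL points at The Ramen Rater's list page and may have changed since the screencast; requires an internet connection):

```r
library(rvest)

# read_html fetches the page; html_table() parses the first <table>
# into a data frame
page    <- read_html("https://www.theramenrater.com/resources-2/the-list/")
reviews <- page %>%
  html_node("table") %>%
  html_table()
head(reviews)
```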

Media Franchise Revenue

Back to summary

Screencast Time Description
Media Franchise Revenue 9:15 Explaining use of semi_join function to aggregate and filter groups
Media Franchise Revenue 11:00 Putting the largest categories on the bottom of a stacked bar chart
Media Franchise Revenue 14:30 Using glue function as alternative to paste for combining text, plus good explanation of it (see sketch below)
Media Franchise Revenue 19:30 Multiple re-ordering using fct_reorder function of facetted graph (he works through several obstacles)
Media Franchise Revenue 20:40 Re-ordering the position of facetted graphs so that highest total revenue is at top left
Media Franchise Revenue 26:00 Investigating relationship between year created and revenue
Media Franchise Revenue 26:40 Creating scatter plot with points scaled by size and labelled points (geom_text function)
Media Franchise Revenue 29:30 Summary of screencast up to this point
Media Franchise Revenue 29:50 Starting analysis of each franchise's original media (e.g., novel, video game, animated film) and revenue type (e.g., box office, merchandise)
Media Franchise Revenue 33:35 Graphing original media and revenue category as facetted bar plot with lots of reordering (ends at around 38:40)
Media Franchise Revenue 40:30 Alternative visualization of original media/revenue category using heat map
Media Franchise Revenue 41:20 Using scale_fill_gradient2 function to specify custom colour scale
Media Franchise Revenue 42:05 Getting rid of gridlines in graph using theme function's panel.grid argument
Media Franchise Revenue 44:05 Using fct_rev function to reverse levels of factors
Media Franchise Revenue 44:35 Fixing overlapping axis text with tweaks to theme function's axis.text argument
Media Franchise Revenue 46:05 Reviewing visualization that inspired this dataset
Media Franchise Revenue 47:25 Adding text of total revenue to the end of each bar in a previous graph
Media Franchise Revenue 50:20 Using paste0 function to add a "B" (for "billions") to the end of text labels on graph
Media Franchise Revenue 51:35 Using expand_limits function to add space so that text labels don't get cut off
Media Franchise Revenue 53:45 Summary of screencast
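
A minimal sketch of glue interpolation, with made-up values:

```r
library(glue)

franchise <- "Pokemon"
revenue_b <- 92

# glue() interpolates R expressions inside curly braces
glue("{franchise} has earned about ${revenue_b} billion in revenue")
```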

Women's World Cup

Back to summary

Screencast Time Description
Women's World Cup 2:15 Adding country names using countrycode package
Women's World Cup 3:45 Web scraping country codes from Wikipedia
Women's World Cup 6:00 Combining tables that are separate lists into one dataframe
Women's World Cup 14:00 Using rev function (reverse) to turn multiple rows of soccer match scores into one row (base team and opposing team; see sketch below)
Women's World Cup 26:30 Applying a geom_smooth linear model line to a scatter plot, then facetting it
Women's World Cup 28:30 Adding a line with a slope of 1 (x = y) using geom_abline
Women's World Cup 40:00 Pulling out elements of a list that is embedded in a dataframe
Women's World Cup 1:09:45 Using glue function to add context to facet titles
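
A minimal sketch of the rev trick for pairing teams with their opponents, on toy match data:

```r
library(dplyr)

# Two rows per match (one per team); within each match, rev() pairs
# every team with its opponent's values on the same row
matches <- tibble(
  match_id = c(1, 1, 2, 2),
  team     = c("USA", "THA", "GER", "CHN"),
  score    = c(13, 0, 1, 0)
)

matches %>%
  group_by(match_id) %>%
  mutate(opposing_team = rev(team), opposing_score = rev(score)) %>%
  ungroup()
```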

Bob Ross Paintings

Back to summary

Screencast Time Description
Bob Ross Paintings 1:40 Using clean_names function in janitor package to get field names to snake_case
Bob Ross Paintings 1:50 Using gather function (now pivot_longer) to get wide elements into tall (tidy) format
Bob Ross Paintings 2:35 Cleaning text (str_to_title, str_replace) to get into nicer-to-read format
Bob Ross Paintings 3:30 Using str_remove_all function to trim quotation marks and backslashes
Bob Ross Paintings 4:40 Using extract function to extract the season number and episode number from episode field; uses regex capturing groups
Bob Ross Paintings 14:00 Using add_count function's name argument to specify field's name
Bob Ross Paintings 15:35 Getting into whether the elements of Ross's paintings changed over time (e.g., are mountains more/less common over time?)
Bob Ross Paintings 20:00 Quick point: could have used logistic regression to see change over time of elements
Bob Ross Paintings 21:10 Asking, "What elements tend to appear together?" prompting clustering analysis
Bob Ross Paintings 22:15 Using pairwise_cor to see which elements tend to appear together
Bob Ross Paintings 22:50 Discussion of a blind spot of pairwise correlation (high or perfect correlation on elements that only appear once or twice)
Bob Ross Paintings 28:05 Asking, "What are clusters of elements that belong together?"
Bob Ross Paintings 28:30 Creating network plot using ggraph and igraph packages
Bob Ross Paintings 30:15 Reviewing network plot for interesting clusters (e.g., beach cluster, mountain cluster, structure cluster)
Bob Ross Paintings 31:55 Explanation of Principal Component Analysis (PCA)
Bob Ross Paintings 34:35 Start of actual PCA coding
Bob Ross Paintings 34:50 Using acast function to create matrix of painting titles x painting elements (initially wrong, corrected at 36:30)
Bob Ross Paintings 36:55 Centering the matrix data using t function (transpose of matrix), colSums function, and colMeans functions
Bob Ross Paintings 38:15 Using svd function to perform singular value decomposition, then tidying with broom package (see sketch below)
Bob Ross Paintings 39:55 Exploring one principal component to get a better feel for what PCA is doing
Bob Ross Paintings 43:20 Using reorder_within function to re-order factors within a grouping
Bob Ross Paintings 48:00 Exploring different matrix names in PCA (u, v, d)
Bob Ross Paintings 56:50 Looking at top 6 principal components of painting elements
Bob Ross Paintings 57:45 Showing percentage of variation that each principal component is responsible for
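
A minimal sketch of the center-then-SVD approach on a simulated paintings-by-elements matrix (tidying via broom's svd tidiers):

```r
library(broom)

# Toy binary matrix: 50 paintings x 8 elements (hypothetical)
set.seed(1)
m <- matrix(rbinom(50 * 8, 1, 0.4), nrow = 50,
            dimnames = list(NULL, paste0("element_", 1:8)))

# Column-center with the t()/colMeans() trick, then SVD (equivalent to PCA)
centered <- t(t(m) - colMeans(m))
s <- svd(centered)
tidy(s, matrix = "v")  # loadings of each element on each component
```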

Simpsons Guest Stars

Back to summary

Screencast Time Description
Simpsons Guest Stars 4:15 Using str_detect function to find guests that played themselves
Simpsons Guest Stars 7:55 Using separate_rows function and regex to get delimited values onto different rows (e.g., "Edna Krabappel; Ms. Melon" gets split into two rows)
Simpsons Guest Stars 9:55 Using parse_number function to convert a numeric variable coded as character to a proper numeric variable
Simpsons Guest Stars 14:45 Downloading and importing supplementary dataset of dialogue
Simpsons Guest Stars 16:10 Using semi_join function to filter dataframe based on values that appear in another dataframe
Simpsons Guest Stars 18:05 Using anti_join function to check which values in a dataframe do not appear in another dataframe
Simpsons Guest Stars 20:50 Using ifelse function to recode a single value with another (i.e., "Edna Krapabbel" becomes "Edna Krabappel-Flanders")
Simpsons Guest Stars 26:20 Explaining the goal of all the data cleaning steps
Simpsons Guest Stars 31:25 Using sample function to get an example line for each character
Simpsons Guest Stars 33:20 Setting geom_histogram function's binwidth and center arguments to get specific bin sizes
Simpsons Guest Stars 37:25 Using unnest_tokens and anti_join functions from tidytext package to split dialogue into individual words and remove stop words (e.g., "the", "or", "and")
Simpsons Guest Stars 38:55 Using bind_tf_idf function from tidytext package to get the TF-IDF (term frequency-inverse document frequency) of individual words (see sketch below)
Simpsons Guest Stars 42:50 Using top_n function to get the top 1 TF-IDF value for each role
Simpsons Guest Stars 44:05 Using paste0 function to combine two character variables (e.g., "Groundskeeper Willie" and "ach" (separate variables) become "Groundskeeper Willie: ach")
Simpsons Guest Stars 48:10 Explanation of what TF-IDF (term frequency-inverse document frequency) tells us and how it is a "catchphrase detector"
Simpsons Guest Stars 56:40 Summary of screencast
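
A minimal sketch of bind_tf_idf, with made-up word counts per role:

```r
library(dplyr)
library(tidytext)

word_counts <- tibble(
  role = c("Willie", "Willie", "Edna", "Edna"),
  word = c("ach", "lad", "class", "lad"),
  n    = c(12, 3, 8, 2)
)

# TF-IDF upweights words that are frequent in one role's dialogue but
# rare across roles -- a rough "catchphrase detector"
word_counts %>%
  bind_tf_idf(word, role, n) %>%
  arrange(desc(tf_idf))
```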

Pizza Ratings

Back to summary

Screencast Time Description
Pizza Ratings 4:45 Transforming time into something more readable (from time value of seconds since Unix epoch 1970-01-01), then converting it into a date
Pizza Ratings 9:05 Formatting x-axis text so that it is rotated and readable, then re-ordering using fct_relevel function so that it is in its proper ordinal order
Pizza Ratings 11:00 Converting string answers to integer counterparts to get an overall numeric value for how good each place is
Pizza Ratings 12:30 Commentary on speed of a mutate calculation with and without grouping (non-grouped is slightly faster)
Pizza Ratings 15:30 Re-ordering groups by total votes using fct_reorder function, while still maintaining the groups themselves
Pizza Ratings 19:15 Using glue package to combine place name and total respondents
Pizza Ratings 20:30 Using statistical test to give confidence intervals on average score
Pizza Ratings 22:15 Actually using the t.test function with toy example
Pizza Ratings 23:15 Using weighted linear model instead (which doesn't end up working)
Pizza Ratings 26:00 Using custom function with rep function to get vector of repeated scores (sneaky way of weighting) so that we can perform a proper t-test (see sketch below)
Pizza Ratings 27:30 Summarizing t.test function into a list (alternative to nesting)
Pizza Ratings 31:20 Adding error bars using geom_errorbarh to make a TIE fighter plot that shows confidence intervals
Pizza Ratings 36:30 Bringing in additional data from Barstool ratings (to supplement survey of Open R meetup NY)
Pizza Ratings 39:45 Getting survey data to the place level so that we can add an additional dataset
Pizza Ratings 41:15 Checking for duplicates in the joined data
Pizza Ratings 42:15 Calling off the planned analysis due to low sample sizes (too much noise, not enough overlap between datasets)
Pizza Ratings 45:15 Looking at Barstool data on its own
Pizza Ratings 55:15 Renaming all variables with a certain string pattern in them
Pizza Ratings 58:00 Comparing Dave's reviews with all other critics
Pizza Ratings 59:15 Adding geom_abline showing x = y as comparison for geom_smooth linear model line
Pizza Ratings 1:02:30 Changing the location of the aes function to change what the legend icons look like for size aesthetic
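
A minimal sketch of the rep-based weighting trick, with made-up vote counts:

```r
# Hypothetical vote counts for one pizza place on a 1-5 answer scale
scores <- c(2, 3, 4, 5)
votes  <- c(1, 4, 10, 6)

# rep() expands each score by its vote count, so t.test() sees one
# observation per vote -- a sneaky way to weight the test
t.test(rep(scores, votes))$conf.int
```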

Car Fuel Efficiency

Back to summary

Screencast Time Description
Car Fuel Efficiency 3:20 Using select, sort, and colnames functions to sort variables in alphabetical order
Car Fuel Efficiency 10:00 Adding geom_abline for y = x to a scatter plot for comparison
Car Fuel Efficiency 18:00 Visualising using geom_boxplot for mpg by vehicle class (size of car)
Car Fuel Efficiency 24:45 Start of explanation of prediction goals
Car Fuel Efficiency 27:00 Creating train and test sets, along with trick using sample_frac function to randomly re-arrange all rows in a dataset
Car Fuel Efficiency 28:35 First step of developing linear model: visually adding geom_smooth
Car Fuel Efficiency 30:00 Using augment function to add extra variables from model to original dataset (fitted values and residuals, especially)
Car Fuel Efficiency 30:45 Creating residuals plot and explaining what you want and don't want to see
Car Fuel Efficiency 31:50 Explanation of splines
Car Fuel Efficiency 33:30 Visualising effect of regressing using natural splines (see sketch below)
Car Fuel Efficiency 35:10 Creating a tibble to test different degrees of freedom (1:10) for natural splines
Car Fuel Efficiency 36:30 Using unnest function to get tidy versions of different models
Car Fuel Efficiency 37:55 Visualising fitted values of all 6 different models at the same time
Car Fuel Efficiency 42:10 Investigating whether the model got "better" as we added degrees of freedom to the natural splines, using the glance function
Car Fuel Efficiency 47:45 Using ANOVA to perform a statistical test on whether natural splines as a group explain variation in MPG
Car Fuel Efficiency 48:30 Exploring collinearity of independent variables (displacement and cylinders)
Car Fuel Efficiency 55:10 Binning years into every two years using floor function
Car Fuel Efficiency 56:40 Using summarise_at function to do quick averaging of multiple variables
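
A minimal sketch of a natural-spline regression, using the built-in mtcars data in place of the fuel-economy dataset:

```r
library(splines)

# ns() fits a natural cubic spline of displacement with 3 degrees of
# freedom inside an ordinary lm() call
fit <- lm(mpg ~ ns(disp, df = 3), data = mtcars)
summary(fit)
```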

Horror Movies

Back to summary

Screencast Time Description
Horror Movies 4:15 Extracting digits (release year) from character string using regex, along with good explanation of extract function
Horror Movies 8:00 Quick check on why parse_number is unable to parse some values -- is it because they are NA or some other reason?
Horror Movies 9:45 Visually investigating correlation between budget and rating
Horror Movies 11:50 Investigating correlation between MPAA rating (PG-13, R, etc.) and rating using boxplots
Horror Movies 12:50 Using pull function to quickly check levels of a factor
Horror Movies 13:30 Using ANOVA to check difference of variation within groups (MPAA rating) than between groups
Horror Movies 15:40 Separating genre using separate_rows function (instead of str_split and unnest)
Horror Movies 18:00 Removing boilerplate "Directed by..." and "With..." part of plot variable and isolating plot, first using regex, then by using separate function with periods as separator
Horror Movies 20:40 Unnesting word tokens, removing stop words, and counting appearances
Horror Movies 21:20 Aggregating by word to find words that appear in high- or low-rated movies
Horror Movies 23:00 Discussing potential confounding factors for ratings associated with specific words
Horror Movies 24:50 Searching for duplicated movie titles
Horror Movies 25:50 De-duping using distinct function
Horror Movies 26:55 Loading in and explaining glmnet package
Horror Movies 28:00 Using movie titles to pull out ratings, using rownames and match functions to create an index of which rating to pull out of the original dataset (see sketch below)
Horror Movies 29:10 Actually using glmnet function to create lasso model
Horror Movies 34:05 Showing built-in plot of lasso lambda against mean-squared error
Horror Movies 37:05 Explaining when certain terms appeared in the lasso model as the lambda value dropped
Horror Movies 41:10 Gathering all variables except for title, so that the dataset is very tall
Horror Movies 42:35 Using unite function to combine two variables (better alternative to paste)
Horror Movies 45:45 Creating a new lasso with tons of new variables other than plot words
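
A minimal sketch of the rownames/match indexing idea, with toy titles and ratings:

```r
# Toy stand-ins: ratings live in the original dataset, ordered by title;
# match() finds each matrix row name's position in that title vector
titles      <- c("Halloween", "The Ring", "It")
ratings     <- c(7.7, 7.1, 7.3)
matrix_rows <- c("It", "Halloween")  # row names of the feature matrix

ratings[match(matrix_rows, titles)]  # ratings aligned to the matrix rows
```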

NYC Squirrel Census

Back to summary

Screencast Time Description
NYC Squirrel Census 5:45 Starter EDA of latitude and longitude using geom_point
NYC Squirrel Census 6:45 Aggregating squirrel counts by hectare to get a "binned" map
NYC Squirrel Census 9:00 Investigating colour notes
NYC Squirrel Census 10:30 Asking question, "Are there areas of the parks where we see certain-coloured squirrels?"
NYC Squirrel Census 12:45 Plotting latitude and percentage of gray squirrels to answer, "Do we get a lower proportion of gray squirrels as we go farther north?"
NYC Squirrel Census 13:30 Using logistic regression to test whether the proportion of gray squirrels changes as we go farther north
NYC Squirrel Census 16:30 Noting that he could have used original data sets as input for logistic regression function
NYC Squirrel Census 19:30 "Does a squirrel run away?" based on location in the park (latitude), using logistic regression
NYC Squirrel Census 20:45 Using summarise_at function to apply same function to multiple variables
NYC Squirrel Census 25:25 Loading ggmap package
NYC Squirrel Census 27:00 Start using ggmap, with the get_map function
NYC Squirrel Census 28:20 Decision to not set up Google API key to use ggmap properly
NYC Squirrel Census 30:15 Using the sf package to read in a shapefile of Central Park
NYC Squirrel Census 30:40 Using read_sf function from sf package to import a shapefile into R (see sketch below)
NYC Squirrel Census 31:30 Using geom_sf function from sf package to visualise the imported shapefile
NYC Squirrel Census 32:45 Combining shapefile "background" with relevant squirrel data in one plot
NYC Squirrel Census 34:40 Visualising pathways (footpaths, bicycle paths) in the shapefile
NYC Squirrel Census 37:55 Finishing visualisation and moving on to analysing activity types
NYC Squirrel Census 38:45 Selecting fields based on whether they end with "ing", then gathering those fields into tidy format
NYC Squirrel Census 39:50 Decision to create a shiny visualisation
NYC Squirrel Census 41:30 Setting shiny app settings (e.g., slider for minimum number of squirrels)
NYC Squirrel Census 42:15 Setting up shiny app options / variables
NYC Squirrel Census 43:50 Explanation of why setting up options in shiny app the way he did
NYC Squirrel Census 46:00 Solving error "Discrete value supplied to continuous scale"
NYC Squirrel Census 46:50 First draft of shiny app
NYC Squirrel Census 48:35 Creating a dynamic midpoint for the two-gradient scale in the shiny app
NYC Squirrel Census 51:30 Adding additional variables of more behaviours to shiny app (kuks, moans, runs from, etc.)
NYC Squirrel Census 53:10 "What are the distributions of some of these behaviours?"
NYC Squirrel Census 56:50 Adding ground location (above ground, ground plane) to shiny app
NYC Squirrel Census 58:20 Summary of screencast
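
A minimal sketch of reading and plotting a shapefile with sf (the file path is hypothetical):

```r
library(sf)
library(ggplot2)

# Hypothetical path to the downloaded Central Park shapefile
central_park <- read_sf("data/central-park/CentralPark.shp")

# geom_sf draws the geometry column directly, handling projection for us
ggplot(central_park) +
  geom_sf(colour = "grey70")
```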

CRAN Package Code

Back to summary

Screencast Time Description
CRAN Package Code 4:30 Summarizing many things by language (e.g., lines of code, comment/code ratio)
CRAN Package Code 9:35 Using gather function (now pivot_longer) to consolidate multiple metrics into one dimension, then visualizing by facetting by metric
CRAN Package Code 11:20 Setting ncol = 1 within facet_wrap function to get facetted graphs to stack vertically
CRAN Package Code 11:30 Using reorder_within function from tidytext package to properly reorder factors within each facet (see sketch below)
CRAN Package Code 16:00 Using geom_text label to add language name as label to scatter points
CRAN Package Code 20:00 Completing preliminary overview and looking at distribution of R code in packages
CRAN Package Code 26:15 Using str_extract to extract only letters and names from character vector (using regex)
CRAN Package Code 34:00 Re-ordering the order of categorical variables in the legend using guides function
CRAN Package Code 36:00 Investigating comment/code ratio
CRAN Package Code 43:05 Importing additional package data (looking around for a bit, then starting to actually import ~46:00)
CRAN Package Code 54:40 Importing even more additional data (available packages)
CRAN Package Code 57:50 Using separate_rows function to separate delimited values
CRAN Package Code 58:45 Using extract function and regex to pull out specific types of characters from a string
CRAN Package Code 1:05:35 Summary of screencast
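
A minimal sketch of reorder_within with toy language metrics:

```r
library(ggplot2)
library(tidytext)

df <- data.frame(
  language = c("R", "C", "R", "C"),
  metric   = c("code", "code", "comments", "comments"),
  value    = c(100, 80, 40, 10)
)

# reorder_within orders `language` separately inside each facet;
# scale_x_reordered() strips the suffix it appends to the labels
ggplot(df, aes(reorder_within(language, value, metric), value)) +
  geom_col() +
  scale_x_reordered() +
  facet_wrap(~ metric, scales = "free_x")
```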

Riddler: Spelling Bee Honeycomb

Back to summary

Screencast Time Description
Riddler: Spelling Bee Honeycomb 2:00 Using read_lines function to import a plain text file (.txt)
Riddler: Spelling Bee Honeycomb 2:35 Using str_detect function to filter out words that do not contain the letter "g"
Riddler: Spelling Bee Honeycomb 3:25 Using str_split function to get a list of a word's individual letters
Riddler: Spelling Bee Honeycomb 3:55 Using setdiff function to find words with invalid letters (letters that are not in the puzzle honeycomb) -- also needs map function (at 4:35)
Riddler: Spelling Bee Honeycomb 10:45 Changing existing code to make a function that will calculate scores for letter combinations
Riddler: Spelling Bee Honeycomb 14:10 Noticing the rule about bonus points for pangrams and using n_distinct function to determine if a word gets those points
Riddler: Spelling Bee Honeycomb 17:25 Using map function to eliminate duplicate letters from each word's list of component letters
Riddler: Spelling Bee Honeycomb 25:55 Using acast function from reshape2 package to create a matrix of words by letters
Riddler: Spelling Bee Honeycomb 27:50 Using the words/letters matrix to find valid words for a given letter combination
Riddler: Spelling Bee Honeycomb 29:55 Using the matrix multiplication operator %*% to find the number of "forbidden" letters for each word (see sketch below)
Riddler: Spelling Bee Honeycomb 42:05 Using microbenchmark function from microbenchmark package to test how long it takes to run a function
Riddler: Spelling Bee Honeycomb 43:35 Using combn function to get the actual combinations of 6 letters (not just the count)
Riddler: Spelling Bee Honeycomb 45:15 Using map function to get scores for different combinations of letters created above
Riddler: Spelling Bee Honeycomb 47:30 Using which.max function to find the position of the max value in a vector
Riddler: Spelling Bee Honeycomb 1:05:10 Using t function to transpose a matrix
Riddler: Spelling Bee Honeycomb 1:19:15 Summary of screencast
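
A minimal sketch of counting forbidden letters with matrix multiplication, on a toy three-word honeycomb:

```r
# Toy 0/1 matrix of which letters (a-g) each word uses
words <- c("bag", "cab", "egg")
letter_matrix <- sapply(letters[1:7], function(l) as.integer(grepl(l, words)))
rownames(letter_matrix) <- words

# 0/1 vector of letters NOT in the honeycomb; the matrix product counts
# forbidden letters per word, so valid words score exactly 0
forbidden <- as.integer(!letters[1:7] %in% c("a", "b", "g"))
letter_matrix %*% forbidden
```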

The Office

Back to summary

Screencast Time Description
The Office 1:45 Overview of transcripts data
The Office 2:25 Overview of ratings data
The Office 4:10 Using fct_inorder function to create a factor with levels based on when they appear in the dataframe
The Office 4:50 Using theme and element_text functions to turn axis labels 90 degrees
The Office 5:55 Creating a line graph with points at each observation (using geom_line and geom_point)
The Office 7:10 Adding text labels to very high and very low-rated episodes
The Office 8:50 Using theme function's panel.grid.major argument to get rid of some extraneous gridlines, using element_blank function
The Office 10:15 Using geom_text_repel from ggrepel package to experiment with different labelling (before abandoning this approach)
The Office 12:45 Using row_number function to add episode_number field to make graphing easier
The Office 14:05 Explanation of why number of ratings (votes) is relevant to interpreting the graph
The Office 19:10 Using unnest_tokens function from tidytext package to split full-sentence text field to individual words
The Office 20:10 Using anti_join function to filter out stop words (e.g., and, or, the)
The Office 22:25 Using str_remove_all function to get rid of quotation marks from character names (quirks that might pop up when parsing)
The Office 25:40 Asking, "Are there words that are specific to certain characters?" (using bind_tf_idf function)
The Office 32:25 Using reorder_within function to re-order factors within a grouping (when a term appears in multiple groups) and scale_x_reordered function to graph
The Office 37:05 Asking, "What affects the popularity of an episode?"
The Office 37:55 Dealing with inconsistent episode names between datasets
The Office 41:25 Using str_remove function and some regex to remove "(Parts 1&2)" from some episode names
The Office 42:45 Using str_to_lower function to further align episode names (addresses inconsistent capitalization)
The Office 52:20 Setting up dataframe of features for a LASSO regression, with director and writer each being a feature with its own line
The Office 52:55 Using separate_rows function to separate episodes with multiple writers so that each writer has their own row (see sketch below)
The Office 58:25 Using log2 function to transform the number-of-lines field into something more usable (since it is log-normally distributed)
The Office 1:00:20 Using cast_sparse function from tidytext package to create a sparse matrix of features by episode
The Office 1:01:55 Using semi_join function as a "filtering join"
The Office 1:02:30 Setting up dataframes (after we have our features) to run LASSO regression
The Office 1:03:50 Using cv.glmnet function from glmnet package to run a cross-validated LASSO regression
The Office 1:05:35 Explanation of how to pick a lambda penalty parameter
The Office 1:05:55 Explanation of output of LASSO model
The Office 1:09:25 Outline of why David likes regularized linear models (which is what LASSO is)
The Office 1:10:55 Summary of screencast
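
A minimal sketch of separate_rows on toy episode/writer data:

```r
library(tidyr)

episodes <- data.frame(
  episode = c("Pilot", "Diversity Day"),
  writer  = c("Ricky Gervais;Stephen Merchant;Greg Daniels", "B.J. Novak")
)

# separate_rows() gives each semicolon-delimited writer their own row
separate_rows(episodes, writer, sep = ";")
```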

COVID-19 Open Research Dataset (CORD-19)

Back to summary

Screencast Time Description
COVID-19 Open Research Dataset (CORD-19) 0:55 Disclaimer that David's not an epidemiologist
COVID-19 Open Research Dataset (CORD-19) 2:55 Overview of dataset
COVID-19 Open Research Dataset (CORD-19) 7:50 Using dir function with its full.names argument to get file paths for all files in a folder
COVID-19 Open Research Dataset (CORD-19) 9:45 Inspecting JSON-formatted data
COVID-19 Open Research Dataset (CORD-19) 10:40 Introducing hoist function as a way to deal with nested lists (typical for JSON data; see sketch below)
COVID-19 Open Research Dataset (CORD-19) 11:40 Continuing to use the hoist function
COVID-19 Open Research Dataset (CORD-19) 13:10 Brief explanation of pluck specification
COVID-19 Open Research Dataset (CORD-19) 16:35 Using object.size function to check size of JSON data
COVID-19 Open Research Dataset (CORD-19) 17:40 Using map_chr and str_c functions together to combine paragraphs of text in a list into a single character string
COVID-19 Open Research Dataset (CORD-19) 20:00 Using unnest_tokens function from tidytext package to split full paragraphs into individual words
COVID-19 Open Research Dataset (CORD-19) 22:50 Overview of scispaCy package for Python, which has named entity recognition features
COVID-19 Open Research Dataset (CORD-19) 24:40 Introducing spacyr package, which is an R wrapper around the Python scispaCy package
COVID-19 Open Research Dataset (CORD-19) 28:50 Showing how tidytext can use a custom tokenization function (David uses spacyr package's named entity recognition)
COVID-19 Open Research Dataset (CORD-19) 32:20 Demonstrating the tokenize_words function from the tokenizers package
COVID-19 Open Research Dataset (CORD-19) 37:00 Actually using a custom tokenizer in unnest_tokens function
COVID-19 Open Research Dataset (CORD-19) 39:45 Using sample_n function to get a random sample of n rows
COVID-19 Open Research Dataset (CORD-19) 43:25 Asking, "What are groups of words that tend to occur together?"
COVID-19 Open Research Dataset (CORD-19) 44:30 Using pairwise_cor from widyr package to find correlation between named entities
COVID-19 Open Research Dataset (CORD-19) 45:40 Using ggraph and igraph packages to create a network plot
COVID-19 Open Research Dataset (CORD-19) 52:05 Starting to look at papers' references
COVID-19 Open Research Dataset (CORD-19) 53:30 Using unnest_longer then unnest_wider function to convert lists into a tibble
COVID-19 Open Research Dataset (CORD-19) 59:30 Using str_trunc function to truncate long character strings to a certain number of characters
COVID-19 Open Research Dataset (CORD-19) 1:06:25 Using glue function for easy combination of strings and R code
COVID-19 Open Research Dataset (CORD-19) 1:19:15 Summary of screencast
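
A minimal sketch of hoist on a toy nested list-column (the field names are made up, not the CORD-19 schema):

```r
library(tidyr)

papers <- tibble(
  metadata = list(
    list(title = "Paper A", authors = list("Xu", "Yang")),
    list(title = "Paper B", authors = list("Zhou"))
  )
)

# hoist() pulls named pieces out of a list-column into regular columns;
# list("authors", 1) is a pluck spec for the first author
papers %>%
  hoist(metadata, title = "title", first_author = list("authors", 1))
```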

CORD-19 Data Package

Back to summary

Screencast Time Description
CORD-19 Data Package 1:10 Overview of JSON files with the data David will make a package of
CORD-19 Data Package 3:05 Starting to create a new package with "New Project" in RStudio
CORD-19 Data Package 5:40 Creating a file to reference the license for the dataset
CORD-19 Data Package 7:25 Using use_data_raw function from usethis package to set up a folder structure and preliminary script for raw data (see sketch below)
CORD-19 Data Package 8:30 Explanation that we want to limit the number of packages we load when building a package (e.g., no library(tidyverse))
CORD-19 Data Package 9:00 Using use_package function from usethis package to add "Suggested packages"
CORD-19 Data Package 10:15 Reviewing import and cleaning code already completed
CORD-19 Data Package 14:55 Using roxygen2 package to write documentation
CORD-19 Data Package 19:35 More documentation writing
CORD-19 Data Package 24:50 Using use_data function from usethis package to create a folder structure and datafile for (finished/cleaned) data
CORD-19 Data Package 26:10 Making a mistake clicking "Install and Restart" button on the "Build" tab (because of huge objects in the environment) (see 26:50 for alternative)
CORD-19 Data Package 26:50 Using load_all function from devtools package as an alternative to "Install and Restart" from above step
CORD-19 Data Package 27:35 Using document function from devtools package to process written documentation
CORD-19 Data Package 32:20 De-duplicating paper data in a way that keeps records with fewer missing values than other records for the same paper
CORD-19 Data Package 39:50 Using use_data function with its overwrite argument to overwrite existing data
CORD-19 Data Package 47:30 Writing documentation for paragraphs data
CORD-19 Data Package 57:55 Testing an install of the package
CORD-19 Data Package 59:30 Adding link to code in documentation
CORD-19 Data Package 1:03:00 Writing examples of how to use the package (in documentation)
CORD-19 Data Package 1:08:45 Discussion of outstanding items that David hasn't done yet (e.g., readme, vignettes, tests)
CORD-19 Data Package 1:09:20 Creating a simple readme, including examples, with use_readme_rmd function from usethis package
CORD-19 Data Package 1:16:10 Using knit function from the knitr package to knit the readme into a markdown file
CORD-19 Data Package 1:17:10 Creating a GitHub repository to host the package (includes how to commit to a GitHub repo using RStudio's GUI)
CORD-19 Data Package 1:18:15 Explanation that version 0.0.0.9000 means that the package is in early development
CORD-19 Data Package 1:20:30 Actually creating the GitHub repository
CORD-19 Data Package 1:22:25 Overview of remaining tasks
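
A minimal sketch of the usethis scaffolding calls (the package and object names are hypothetical):

```r
library(usethis)

# Run inside the package project:
use_data_raw("cord19_papers")            # scaffolds data-raw/cord19_papers.R
use_package("dplyr", type = "Suggests")  # adds dplyr to Suggests in DESCRIPTION

# ...then, once the object is cleaned:
# use_data(cord19_papers, overwrite = TRUE)  # saves it under data/
```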

R Trick: Creating Pascal's Triangle with accumulate()

Back to summary

Screencast Time Description
R trick: Creating Pascal's Triangle with accumulate() 1:10 Simple explanation of accumulate function
R trick: Creating Pascal's Triangle with accumulate() 1:30 Example using letters
R trick: Creating Pascal's Triangle with accumulate() 2:55 Using tilde ~ to create an anonymous function
R trick: Creating Pascal's Triangle with accumulate() 4:35 Introducing Pascal's Triangle
R trick: Creating Pascal's Triangle with accumulate() 6:25 Starting to create Pascal's triangle in R
R trick: Creating Pascal's Triangle with accumulate() 8:05 Converting the conceptual solution into an accumulate function (see sketch below)
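
A minimal sketch of the accumulate approach (the classic zero-padding formulation; the screencast's exact code may differ):

```r
library(purrr)

# Each row is the previous row padded with a 0 on each side and summed
# pairwise; accumulate() carries the row forward, starting from c(1)
accumulate(1:5, ~ c(0, .x) + c(.x, 0), .init = 1)
```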

Riddler: Simulating Replacing Die Sides

Back to summary

Screencast Time Description
Riddler: Simulating Replacing Die Sides 0:45 Explaining why the recursive nature of this problem is well-suited to simulation
Riddler: Simulating Replacing Die Sides 2:05 Introducing the accumulate function as a tool for simulation
Riddler: Simulating Replacing Die Sides 3:50 Creating a condition to call the done function
Riddler: Simulating Replacing Die Sides 7:00 After creating a function to simulate one round of the problem, using replicate function to run the simulation many times (see sketch below)
Riddler: Simulating Replacing Die Sides 7:15 Using qplot function to quickly create a histogram of simulations
Riddler: Simulating Replacing Die Sides 7:40 Making observations on the distribution of simulations (looks kind of like a gamma distribution)
Riddler: Simulating Replacing Die Sides 10:05 Observing that the distribution is kind of log-normal (but that doesn't really apply because we're using integers)
Riddler: Simulating Replacing Die Sides 10:35 Using table and sort functions to find the most common number of rolls
Riddler: Simulating Replacing Die Sides 11:20 Starting the Extra Credit portion of the problem (N-sided die)
Riddler: Simulating Replacing Die Sides 11:40 Using the crossing function to set up a tibble to run simulations
Riddler: Simulating Replacing Die Sides 12:35 Using map_dbl function to apply a set of simulations to each possibility of N sides
Riddler: Simulating Replacing Die Sides 13:30 Spotting an error in the formula for simulating one round (6-sided die was hard-coded)
Riddler: Simulating Replacing Die Sides 16:40 Using simple linear regression with the lm function to find the relationship between number of sides and average number of rolls
Riddler: Simulating Replacing Die Sides 17:20 Reviewing distributions for different N-sided dice
Riddler: Simulating Replacing Die Sides 18:00 Calculating variance, standard deviation, and coefficient of variation to get hints on the distribution (and ruling out Poisson)
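
A minimal sketch of the simulation loop under simplified assumptions (it re-rolls all sides at once and counts re-roll batches; the screencast's exact setup may differ):

```r
# One round: roll the die once per side, relabel the sides with the
# results, and repeat until every face shows the same number
simulate_round <- function(sides = 6) {
  die <- seq_len(sides)
  batches <- 0
  while (length(unique(die)) > 1) {
    die <- sample(die, sides, replace = TRUE)
    batches <- batches + 1
  }
  batches
}

set.seed(538)
sims <- replicate(10000, simulate_round())
mean(sims)
hist(sims)  # qplot(sims) gives a quick ggplot2 histogram instead
```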

Beer Production

Back to summary

Screencast Time Description
Beer Production 4:25 Asking, "What ingredients are used in beer?"
Beer Production 4:40 Using filter and max functions to look at the most recent period of time
Beer Production 7:25 Using paste and ymd functions (ymd is from lubridate package) to convert year-month field into a date-formatted field
Beer Production 9:20 Spotting potential missing or mis-parsed data
Beer Production 13:50 Introducing the tidymetrics framework
Beer Production 14:45 Using install_github function to install tidymetrics from GitHub
Beer Production 15:25 Using cross_by_dimensions function from tidymetrics package to get aggregations at different levels of multiple dimensions
Beer Production 18:10 Using cross_by_periods function from tidymetrics package to also get aggregations for different intervals (e.g., month, quarter, year)
Beer Production 22:00 Using use_metrics_scaffold function from tidymetrics package to create framework for documenting dimensions in RMarkdown YAML header
Beer Production 24:00 Using create_metrics function from tidymetrics package to save data as a tibble with useful metadata (good for visualizing interactively)
Beer Production 25:15 Using preview_metric function from shinymetrics package (still under development as of 2020-04-24) to demonstrate shinymetrics
Beer Production 27:35 Successfully getting shinymetrics to work
Beer Production 28:25 Explanation of the shinymetrics bug David ran into
Beer Production 34:10 Changing order of ordinal variable (e.g., "1,000 to 10,000" and "10,000 to 20,000") using the parse_number, fct_lump, and coalesce functions
Beer Production 41:25 Asking, "Where is beer produced?"
Beer Production 46:45 Looking up sf package documentation to refresh memory on how to draw state borders for a map
Beer Production 48:55 Using match function and base R's built-in state.abb vector (state abbreviations) to perform a lookup of state names
Beer Production 51:05 Using geom_sf function (and working through some hiccoughs) to create a choropleth map
Beer Production 52:30 Using theme_map function from ggthemes package to get more appropriate styling for maps
Beer Production 55:40 Experimenting with how to get the legend to display in the bottom right corner
Beer Production 58:25 Starting to build an animation of consumption patterns over time using gganimate package
Beer Production 1:03:40 Getting the year being animated to show up in the title of a gganimate map
Beer Production 1:05:40 Summary of screencast
Beer Production 1:06:50 Spotting a mistake in a group_by call causing the percentages not to add up properly
Beer Production 1:09:10 Brief extra overview of tidymetrics code
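
The 48:55 lookup needs no extra package: state.abb and state.name are built-in base R constants. A one-line sketch with a hypothetical beer_states table holding two-letter abbreviations:

```r
# match() finds each abbreviation's position in state.abb, then state.name
# returns the full name at that position
beer_states$state_name <- state.name[match(beer_states$state, state.abb)]
```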

Riddler: Simulating a Non-increasing Sequence

Back to summary

Screencast Time Description
Riddler: Simulating a Non-increasing Sequence 2:20 Introducing accumulate function as a possible solution (but not used here)
Riddler: Simulating a Non-increasing Sequence 3:20 Using sample function to simulate 1000 rolls of a 10-sided die
Riddler: Simulating a Non-increasing Sequence 3:40 Explanation of dividing sample rolls into streaks (instead of using logic similar to a while loop)
Riddler: Simulating a Non-increasing Sequence 4:55 Using cumsum function to separate 1000 rolls into individual sequences (which end when a 0 is rolled)
Riddler: Simulating a Non-increasing Sequence 5:50 Using lag function to "shift" sequence numbering down by one row
Riddler: Simulating a Non-increasing Sequence 7:35 Using cummax and lag functions to check whether a roll is less than the highest value rolled previously in the sequence
Riddler: Simulating a Non-increasing Sequence 9:30 Fixing previous step with cummin function (instead of cummax) and dropping the lag function (see the sketch after this table)
Riddler: Simulating a Non-increasing Sequence 13:05 Finished simulation code and starting to calculate scores
Riddler: Simulating a Non-increasing Sequence 13:10 Using -row_number function (note the minus sign!) to calculate decimal position of number in the score
Riddler: Simulating a Non-increasing Sequence 15:30 Investigating the distribution of scores
Riddler: Simulating a Non-increasing Sequence 16:25 Using seq function in the breaks argument of scale_x_continuous to set custom, evenly-spaced axis ticks and labels
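
A short sketch of the two tricks from this table, cumsum to number the sequences and cummin to test for non-increasing rolls; the tibble and column names are illustrative, not the screencast's exact code:

```r
library(dplyr)

set.seed(538)

spins <- tibble(roll = sample(0:9, 1000, replace = TRUE)) %>%
  # a new sequence starts after each 0
  mutate(sequence = lag(cumsum(roll == 0), default = 0)) %>%
  group_by(sequence) %>%
  # a roll keeps the sequence non-increasing only if it is a running minimum
  mutate(still_decreasing = roll == cummin(roll)) %>%
  ungroup()
```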

Tour de France

Back to summary

Screencast Time Description
Tour de France 3:55 Getting an overview of the data
Tour de France 8:55 Aggregating data into decades using the truncated division operator %/%
Tour de France 21:50 Noting that death data is right-censored (i.e., some winners are still alive)
Tour de France 24:05 Using transmute function, which combines functionality of mutate (to create new variables) and select (to choose variables to keep)
Tour de France 25:30 Using survfit function from survival package to conduct survival analysis (see the sketch after this table)
Tour de France 27:30 Using glance function from broom package to get a one-row model summary of the survival model
Tour de France 31:00 Using extract function to pull out a string matching a regular expression from a variable (stage number in this case)
Tour de France 34:30 Theorizing that there is a parsing issue with the original data's time field
Tour de France 41:15 Using group_by function's built-in "peeling" feature, where a summarise call will "peel away" one group but leave other groupings intact
Tour de France 42:05 Using rank function, then upgrading to percent_rank function to give percentile rankings (between 0 and 1)
Tour de France 47:50 Using geom_smooth function with method argument as "lm" to plot a linear regression
Tour de France 48:10 Using cut function to bin numbers (percentiles in this case) into categories
Tour de France 50:25 Reviewing boxplots exploring relationship between first-stage performance and overall Tour performance
Tour de France 51:30 Starting to create an animation using gganimate package
Tour de France 56:00 Actually writing the code to create the animation
Tour de France 58:20 Using reorder_within function from tidytext package to re-order factors that have the same name across multiple groups
Tour de France 1:02:40 Summary of screencast
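
A minimal right-censored survival sketch in the spirit of the 25:30 step, with hypothetical column names (age_at_death is NA for winners who are still alive):

```r
library(dplyr)
library(survival)
library(broom)

winners <- winners %>%
  mutate(dead = as.integer(!is.na(age_at_death)),
         age = coalesce(age_at_death, current_age))

surv_model <- survfit(Surv(age, dead) ~ 1, data = winners)
glance(surv_model)  # one-row model summary, as at 27:30
```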

Riddler: Simulating a Branching Process

Back to summary

Screencast Time Description
Riddler: Simulating a Branching Process 0:35 Explanation of a Poisson process
Riddler: Simulating a Branching Process 2:40 Asking "How long do you have to wait for X to happen?", which the Exponential distribution can answer
Riddler: Simulating a Branching Process 4:20 Using rexp function to generate numbers from the Exponential distribution
Riddler: Simulating a Branching Process 5:25 Using a vector of rates inside the rexp function (to explore consecutive waiting times)
Riddler: Simulating a Branching Process 7:05 Using cumsum function to calculate total waiting time until hitting a specific number in the Poisson process
Riddler: Simulating a Branching Process 7:35 Using which function to determine the first instance > 3 in a vector
Riddler: Simulating a Branching Process 9:20 Using replicate function to do a quick simulation of the function just written (see the sketch after this table)
Riddler: Simulating a Branching Process 10:55 Discussing methods of making the simulation function faster
Riddler: Simulating a Branching Process 12:00 Using crossing function to set up "tidy" simulation (gives you all possible combinations of values you provide it)
Riddler: Simulating a Branching Process 13:15 Noting how the consecutive waiting times seem to follow the harmonic series
Riddler: Simulating a Branching Process 17:10 Noticing that we are missing trials with 0 comments and fixing
Riddler: Simulating a Branching Process 20:25 Using nls function (non-linear least squares) to test how well the data fits with an exponential curve
Riddler: Simulating a Branching Process 23:05 Visualizing fit between data and the exponential curve calculated with nls in previous step
Riddler: Simulating a Branching Process 23:50 Using augment function to add fitted values from the nls model
Riddler: Simulating a Branching Process 26:00 Exploring whether the data actually follows a Geometric distribution
Riddler: Simulating a Branching Process 30:55 Explanation of the Geometric distribution as it applies to this question
Riddler: Simulating a Branching Process 34:05 Generalizing the question to ask how long it takes to get to multiple comments (not just 3)
Riddler: Simulating a Branching Process 38:45 Explanation of why we subtract 1 when fitting an exponential curve
Riddler: Simulating a Branching Process 46:00 Summary of screencast
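
A sketch of the waiting-time simulation under my reading of the setup (the k-th event arrives at rate k), not a transcript of the screencast code:

```r
# Total waiting time until the 3rd event; rexp() is vectorized over rate
wait_for <- function(n = 3) sum(rexp(n, rate = 1:n))

waits <- replicate(1e5, wait_for())
mean(waits)  # compare with 1 + 1/2 + 1/3, the harmonic-series pattern at 13:15
```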

GDPR Violations

Back to summary

Screencast Time Description
GDPR Violations 4:05 Use the mdy function from the lubridate package to change the date variable from character class to date class.
GDPR Violations 5:35 Use the rename function from the dplyr package to rename a variable in the dataset.
GDPR Violations 6:15 Use the fct_reorder function from the forcats package to sort the geom_col in descending order.
GDPR Violations 6:30 Use the fct_lump function from the forcats package within count to lump together country names except for the 6 most frequent.
GDPR Violations 7:05 Use the scale_x_continuous function from ggplot2 with the scales package to change the x-axis values to dollar format.
GDPR Violations 8:15 Use the month and floor_date functions from the lubridate package to get the month component from the date variable to count the total fines per month.
GDPR Violations 8:55 Use the na_if function from the dplyr package to convert specific date value to NA.
GDPR Violations 11:05 Use the fct_reorder function from the forcats package to sort the stacked geom_col and legend labels in descending order.
GDPR Violations 15:15 Use the dollar function from the scales package to convert the price variable into dollar format.
GDPR Violations 15:40 Use the str_trunc function from the stringr package to shorten the summary string values to 140 characters.
GDPR Violations 17:35 Use the separate_rows function from the tidyr package with a regular expression to separate the values in the article_violated variable with each matching group placed in its own row (see the sketch after this table).
GDPR Violations 19:30 Use the extract function from the tidyr package with a regular expression to turn each matching group into a new column.
GDPR Violations 27:30 Use the geom_jitter function from the ggplot2 package to add points to the horizontal box plot.
GDPR Violations 31:55 Use the inner_join function from the dplyr package to join together article_titles and separated_articles tables.
GDPR Violations 32:55 Use the paste0 function from base R to concatenate article and article_title.
GDPR Violations 38:48 Use the str_detect function from the stringr package to detect the presence of a pattern in a string.
GDPR Violations 40:25 Use the group_by and summarize functions from the dplyr package to aggregate fines that were issued to the same country on the same day allowing for size to be used in geom_point plot.
GDPR Violations 41:14 Use the scale_size_continuous function from the ggplot2 package to remove the size legend.
GDPR Violations 42:55 Create an interactive dashboard using the shinymetrics and tidymetrics packages, a tidy approach to business intelligence.
GDPR Violations 47:25 Use the cross_by_dimensions and cross_by_periods functions from the tidymetrics package. cross_by_dimensions stacks an extra copy of the table for each dimension passed as an argument (country, article_title, type), replacing that column's value with the word All, and then groups by all the columns. It acts as an extended group_by that allows complete summaries across each individual dimension and every possible combination.
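
A sketch of the 17:35 and 19:30 parsing steps; the separator and regex here are illustrative guesses, not the exact patterns from the screencast:

```r
library(dplyr)
library(tidyr)

gdpr_violations %>%
  # one row per violated article
  separate_rows(article_violated, sep = "\\|") %>%
  # pull the article number out into its own (integer) column
  extract(article_violated, "article_number", "Art\\.? ?(\\d+)",
          convert = TRUE, remove = FALSE)
```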

Broadway Musicals

Back to summary

Screencast Time Description
Broadway Musicals 8:15 Use the cross_by_periods function from the tidymetrics package to aggregate data over time (month, quarter, and year) then visualize with geom_line.
Broadway Musicals 14:00 Use the cross_by_periods function from the tidymetrics package with windows = c(28) to create a 4-week rolling average across month, quarter, and year (see the sketch after this table).
Broadway Musicals 21:50 Create an interactive dashboard using the shinymetrics and tidymetrics packages.
Broadway Musicals 25:00 Use the str_remove function from the stringr package to remove matched pattern in a string.
Broadway Musicals 25:20 Use the cross_by_dimensions function from the tidymetrics package which acts as an extended group_by that allows complete summaries across each individual dimension and possible combinations.
Broadway Musicals 41:25 Use the shinybones package to create an interactive dashboard to visualize all 3 metrics at the same time.
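
The 14:00 step uses tidymetrics' cross_by_periods; as a plain-dplyr stand-in, a trailing average over the previous 28 rows might look like this (table and column names are hypothetical):

```r
library(dplyr)

rolling_mean <- function(x, window = 28) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - window + 1):i]))
}

grosses %>%
  arrange(week_ending) %>%
  mutate(rolling_gross = rolling_mean(weekly_gross))
```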

Riddler: Simulating and Optimizing Coin Flipping

Back to summary

Screencast Time Description
Riddler: Simulating and Optimizing Coin Flipping 2:15 Using crossing function to set up "tidy" simulation (gives you all possible combinations of values you provide it)
Riddler: Simulating and Optimizing Coin Flipping 3:00 Using rbinom function to simulate the number of prisoners who choose to flip, then using rbinom again to simulate number of tails
Riddler: Simulating and Optimizing Coin Flipping 7:20 Using dbinom function (probability mass function) to see probabilities of any given number of prisoners choosing to flip
Riddler: Simulating and Optimizing Coin Flipping 10:15 Using map_dbl function to iterate a function, making sure to return a dbl-class object
Riddler: Simulating and Optimizing Coin Flipping 11:25 Using seq_len(n) instead of 1:n to be slightly more efficient
Riddler: Simulating and Optimizing Coin Flipping 12:20 Using optimise function to conduct single-dimension optimisation (for analytical solution to this question) (see the sketch after this table)
Riddler: Simulating and Optimizing Coin Flipping 14:15 Using backticks for inline R code in RMarkdown
Riddler: Simulating and Optimizing Coin Flipping 15:15 Starting the Extra Credit portion of the problem (N prisoners instead of 4)
Riddler: Simulating and Optimizing Coin Flipping 16:30 Using map2_dbl function to iterate a function that requires two inputs (and make sure it returns a dbl-class object)
Riddler: Simulating and Optimizing Coin Flipping 20:05 Reviewing visualisation of probabilities with varying numbers of prisoners
Riddler: Simulating and Optimizing Coin Flipping 21:30 Tweaking graph to look nicer
Riddler: Simulating and Optimizing Coin Flipping 22:00 Get the exact optimal probability value for each number of prisoners
Riddler: Simulating and Optimizing Coin Flipping 22:45 Troubleshooting optimise function to work when iterated over different numbers of prisoners
Riddler: Simulating and Optimizing Coin Flipping 23:45 Using unnest_wider function to disaggregate a list, putting different elements in separate columns (not separate rows, which unnest does)
Riddler: Simulating and Optimizing Coin Flipping 25:30 Explanation of what happens to probabilities as number of prisoners increases
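
A sketch of the analytical side, under my reading of the puzzle (each of n prisoners flips with probability p; they win if at least one coin is flipped and every flipped coin lands tails):

```r
# dbinom() gives the probability that exactly k of n prisoners flip;
# each flipped coin must then land tails (probability 1/2 each)
p_win <- function(p, n = 4) {
  k <- 1:n
  sum(dbinom(k, n, p) * 0.5 ^ k)
}

optimise(p_win, c(0, 1), maximum = TRUE)  # single-dimension optimisation, as at 12:20
```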

Animal Crossing

Back to summary

Screencast Time Description
Animal Crossing 5:05 Starting text analysis of critic reviews of Animal Crossing
Animal Crossing 7:50 Using floor_date function from lubridate package to round dates down to nearest month (then week)
Animal Crossing 9:00 Using unnest_tokens function from tidytext package and anti_join function from dplyr to break reviews into individual words and remove stop words
Animal Crossing 10:35 Taking the average rating associated with individual words (a simple approach to gauge sentiment; see the sketch after this table)
Animal Crossing 12:30 Using geom_line and geom_point to graph ratings over time
Animal Crossing 14:40 Using mean function and logical statement to calculate percentages that meet a certain condition
Animal Crossing 22:30 Using geom_text to visualize what words are associated with positive/negative reviews
Animal Crossing 27:00 Disclaimer that this exploration is not text regression -- wine ratings screencast is a good resource for that
Animal Crossing 28:30 Starting to do topic modelling
Animal Crossing 30:45 Explanation of stm function from stm package
Animal Crossing 34:30 Explanation of stm function's output (topic modelling output)
Animal Crossing 36:55 Changing the number of topics from 4 to 6
Animal Crossing 37:40 Explanation of how topic modelling works conceptually
Animal Crossing 40:55 Using tidy function from broom package to find which "documents" (reviews) were the "strongest" representation of each topic
Animal Crossing 44:50 Noting that there might be a scraping issue resulting in review text being repeated
Animal Crossing 46:05 (Unsuccessfully) Using str_sub function to help fix repeated review text by locating where the review text starts repeating
Animal Crossing 48:20 (Unsuccessfully) Using str_replace and map2_chr functions, as well as regex capturing groups, to fix repeated text
Animal Crossing 52:00 Looking at the association between review grade and gamma of the topic model (how "strong" a review represents a topic)
Animal Crossing 53:55 Using cor function with method = "spearman" to calculate correlation based on rank instead of actual values
Animal Crossing 57:35 Summary of screencast
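
A minimal sketch of the word-level sentiment proxy from 9:00-10:35, assuming a hypothetical critic table with one review per row (text plus grade):

```r
library(dplyr)
library(tidytext)

word_grades <- critic %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # stop_words ships with tidytext
  group_by(word) %>%
  summarize(n = n(), avg_grade = mean(grade)) %>%
  filter(n >= 20)                          # keep reasonably common words
```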

Volcano Eruptions

Back to summary

Screencast Time Description
Volcano Eruptions 7:00 Change the last_eruption_year into years_ago by using mutate from the dplyr package with years_ago = 2020 - as.numeric(last_eruption_year). In the plot David includes +1 to account for 0 values in the years_ago variable.
Volcano Eruptions 9:50 Use str_detect from the stringr package to search the volcano_name variable for Vesuvius when not sure if spelling is correct.
Volcano Eruptions 12:50 Use the longitude and latitude to create a world map showing where the volcanoes are located.
Volcano Eruptions 15:30 Use fct_lump from the forcats package to lump together all primary_volcano_type factor levels except for the n most frequent.
Volcano Eruptions 16:25 Use str_remove from the stringr package with the regular expression "\\(.\\)" to remove the parentheses.
Volcano Eruptions 18:30 Use the leaflet package to create an interactive map with popup information about each volcano.
Volcano Eruptions 24:10 Use glue from the glue package to create an HTML string by concatenating volcano_name and primary_volcano_type between HTML <p></p> tags (see the sketch after this table).
Volcano Eruptions 27:15 Use the DT package to turn the leaflet popup information into a datatable.
Volcano Eruptions 31:40 Use str_replace_all from the stringr package to replace all the underscores _ in volcano_name with spaces. Then use str_to_title from the stringr package to convert the volcano_name variable to title case.
Volcano Eruptions 32:05 Use kable with format = "html" from the knitr package instead of DT to make turning the data into HTML much easier.
Volcano Eruptions 34:05 Use paste0 from base R to bold the Volcano Name, Primary Volcano Type, and Last Eruption Year in the leaflet popup.
Volcano Eruptions 34:50 Use replace_na from the tidyr package to replace unknown with NA.
Volcano Eruptions 37:15 Use addMeasure from the leaflet package to add a tool to the map that allows for the measuring of distance between points.
Volcano Eruptions 39:30 Use colorNumeric from the leaflet package to color the points based on their population within 5km. To accomplish this, David creates 2 new variables: 1) transformed_pop to get the population on a log2 scale & 2) pop_color which uses the colorNumeric function to generate the color hex values based on transformed_pop.
Volcano Eruptions 46:30 Use the gganimate package to create an animated map.
Volcano Eruptions 48:45 Use geom_point from the ggplot2 package with size = .00001 * 10 ^ vei so the size of the points are then proportional to the volume metrics provided in the Volcano Eruption Index. The metrics are in Km^3.
Volcano Eruptions 50:20 Use scale_size_continuous from the ggplot2 package with range = c(.1, 6) to make the smaller points smaller and larger points larger.
Volcano Eruptions 50:55 Use scale_color_gradient2 from the ggplot2 package to apply color gradient to each point based on the volcano size and whether it's low or high.
Volcano Eruptions 59:40 Summary of screencast while waiting for gganimate map to render. Also, brief discussion on using transition_reveal instead of transition_time to keep the point on the map instead of replacing them in each frame.
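
A minimal leaflet-plus-glue sketch of the 18:30-24:10 steps, with hypothetical table and column names:

```r
library(leaflet)
library(glue)

volcano_data %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude, radius = 3,
    popup = ~glue("<p><b>{volcano_name}</b></p><p>{primary_volcano_type}</p>")
  )
```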

Beach Volleyball

Back to summary

Screencast Time Description
Beach Volleyball 5:30 Use pivot_longer from the tidyr package to pivot the data set from wide to long (see the sketch after this table).
Beach Volleyball 7:20 Use mutate_at from the dplyr package with starts_with to change the class to character for all columns that start with w_ and l_.
Beach Volleyball 8:00 Use separate from the tidyr package to separate the name variable into three columns with extra = merge and fill = right.
Beach Volleyball 10:35 Use rename from the dplyr package to rename w_player1, w_player2, l_player1, and l_player2.
Beach Volleyball 12:50 Use pivot_wider from the tidyr package to pivot the name variable from long to wide.
Beach Volleyball 15:15 Use str_to_upper to convert the winner_loser w and l values to uppercase.
Beach Volleyball 20:25 Add unique row numbers for each match using mutate with row_number from the dplyr package.
Beach Volleyball 21:20 Separate the score values into multiple rows using separate_rows from the tidyr package.
Beach Volleyball 22:45 Use separate from the tidyr package to split the actual scores into two columns, one for the winner's score w_score and another for the loser's score l_score.
Beach Volleyball 23:45 Use na_if from the dplyr package to change the Forfeit or other value from the score variable to NA.
Beach Volleyball 24:35 Use str_remove from the stringr package to remove scores that include retired.
Beach Volleyball 25:25 Determine how many times the winner's score w_score is greater than the loser's score l_score at least 1/3 of the time.
Beach Volleyball 28:30 Use summarize from the dplyr package to create the summary statistics including the number of matches, winning percentage, date of first match, and date of most recent match.
Beach Volleyball 34:15 Use type_convert from the readr package to convert character class variables to numeric.
Beach Volleyball 35:00 Use summarize_all from the dplyr package to calculate which fraction of the data is not NA.
Beach Volleyball 42:00 Use summarize from the dplyr package to determine each player's number of matches, winning percentage, average attacks, average errors, average kills, average aces, average serve errors, and total rows with data for years prior to 2019. The summary statistics are then used to explore how we could predict whether a player will win in 2019 using geom_point and logistic regression. Initially, David wanted to predict performance based on players' first-year performance. (NOTE - David mistakenly grouped by year and age. He catches this around 1:02:00.)
Beach Volleyball 49:25 Use year from the lubridate package within a group_by to determine the age for each player given their birthdate.
Beach Volleyball 54:30 Turn the summary statistics at timestamp 42:00 into a dot-pipe (. %>%) function.
Beach Volleyball 1:04:30 Summary of screencast
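
A sketch of the 5:30-8:00 reshaping, assuming a hypothetical wide table with w_player1 through l_player2 columns:

```r
library(tidyr)
library(dplyr)

matches %>%
  pivot_longer(c(w_player1, w_player2, l_player1, l_player2),
               names_to = "name", values_to = "player") %>%
  # "w_player1" splits into winner_loser = "w" and player_num = "player1"
  separate(name, into = c("winner_loser", "player_num"), sep = "_")
```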

Cocktails

Back to summary

Screencast Time Description
Cocktails 6:20 Use fct_reorder from the forcats package to reorder the ingredient factor levels along n.
Cocktails 7:40 Use fct_lump from the forcats package to lump together all the levels except the n most frequent in the category and ingredient variables.
Cocktails 11:30 Use pairwise_cor from the widyr package to find the correlation between the ingredients (see the sketch after this table).
Cocktails 16:00 Use reorder_within from the tidytext package with scale_x_reordered to reorder the columns in each facet.
Cocktails 19:45 Use the ggraph and igraph packages to create a network diagram.
Cocktails 25:15 Use extract from the tidyr package with the regex "(.*) oz" to create a new variable amount which doesn't include the oz.
Cocktails 26:40 Use extract with regex to turn the strings in the new amount variable into separate columns for the ones, numerator, and denominator.
Cocktails 28:53 Use replace_na from the tidyr package to replace NA with zeros in the ones, numerator, and denominator columns. David ends up replacing the zeros in the denominator column with ones in order for the calculation to work.
Cocktails 31:49 Use geom_text_repel from the ggrepel package to add ingredient labels to the geom_point plot.
Cocktails 32:30 Use na_if from the dplyr package to replace zeros with NA.
Cocktails 34:25 Use scale_size_continuous with labels = percent_format() to convert size legend values to percent.
Cocktails 36:35 Change the size of the points in the network diagram proportional to n using vertices = ingredient_info within graph_from_data_frame and aes(size = n) within geom_node_point.
Cocktails 48:05 Use widely_svd from the widyr package to perform principal component analysis on the ingredients.
Cocktails 52:32 Use paste0 to concatenate PC and dimension in the facet panel titles.
Cocktails 57:00 Summary of screencast
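
A minimal widyr sketch of the 11:30 step, assuming a tidy table with one row per (drink, ingredient) pair:

```r
library(dplyr)
library(widyr)

ingredient_cors <- cocktails %>%
  add_count(ingredient) %>%
  filter(n >= 10) %>%                           # drop rare ingredients
  pairwise_cor(ingredient, drink, sort = TRUE)  # correlation for each pair
```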

African-American Achievements

Back to summary

Screencast Time Description
African-American Achievements 8:20 Use fct_reorder from the forcats package to reorder the category factor levels by sorting along n.
African-American Achievements 11:35 Use str_remove from the stringr package to remove anything after a bracket or parenthesis from the person variable with the regular expression "[\\[\\(].*" David then discusses how web scraping may be a better option than parsing the strings.
African-American Achievements 12:25 Use str_trim from the stringr package to remove the whitespace from the person variable. David then discusses how web scraping may be a better option than parsing the strings.
African-American Achievements 15:50 Create an interactive plotly timeline.
African-American Achievements 18:20 Use ylim(c(-.1, 1)) to set scale limits moving the geom_point to the bottom of the graph.
African-American Achievements 19:30 Use paste0 from base R to concatenate the accomplishment and person with ": " in between the two displayed in the timeline hover label.
African-American Achievements 20:30 Set y to category in ggplot aesthetics to get 8 separate timelines on one plot, one for each category. Doing this allows David to remove the ylim mentioned above.
African-American Achievements 22:25 Use the plotly tooltip = "text" parameter to get just a single line of text in the plotly hover labels.
African-American Achievements 26:05 Use glue from the glue package to reformat text with \n included so that the single line of text can now be broken up into 2 separate lines in the hover labels.
African-American Achievements 33:55 Use separate_rows from the tidyr package to separate the occupation_s variable from the science dataset into multiple rows, delimited by a semicolon, with sep = "; "
African-American Achievements 34:25 Use str_to_title from the stringr package to convert the occupation_s variable to title case.
African-American Achievements 35:15 Use str_detect from the stringr package to detect the presence of statistician from within the occupation_s variable with regex("statistician", ignore_case = TRUE) to perform a case-insensitive search.
African-American Achievements 41:55 Use the rvest package with Selector Gadget to scrape additional information about the individual from their Wikipedia infobox.
African-American Achievements 49:15 Use map and possibly from the purrr package to separate out the downloading of data from parsing the useful information (see the sketch after this table). David then turns the infobox extraction step into an anonymous function using the . %>% dot-pipe shorthand.
African-American Achievements 58:40 Summary of screencast
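
A sketch of the 49:15 pattern, separating downloading from parsing; urls is a hypothetical character vector of Wikipedia links:

```r
library(rvest)
library(purrr)

# possibly() returns NULL instead of erroring when a single page fails,
# so one bad URL doesn't stop the whole loop
safe_read <- possibly(read_html, otherwise = NULL)
pages <- map(urls, safe_read)
```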

African-American History

Back to summary

Screencast Time Description
African-American History 6:55 Use fct_lump from the forcats package to lump together all the factor levels in ship_name except the n most frequent. It is used within filter with != "Other" to remove the Other level.
African-American History 8:00 Use fct_reorder from the forcats package to reorder the ship_name factor levels by sorting along the n_slaves_arrived variable.
African-American History 10:20 Add geom_vline to geom_histogram to annotate the plot with a vertical line indicating the Revolutionary War and the Civil War.
African-American History 13:00 Use truncated division within count to create a new decade variable equal to 10 * (year_arrival %/% 10)
African-American History 17:20 Use str_trunc from the stringr package to truncate the titles in each facet panel accounting for the slave ports with really long names.
African-American History 18:05 Another option for accounting for long titles in the facet panels is to use strip.text within theme with element_text(size = 6)
African-American History 26:55 Use the ggraph package to create a network diagram using port_origin and port_arrival (see the sketch after this table).
African-American History 29:05 Use arrow from the grid package to add directional arrows to the points in the network diagram.
African-American History 29:40 Use scale_width_size_continuous from the ggraph package to adjust the size of the points in the network diagram.
African-American History 35:25 Within summarize, use mean(n_slaves_arrived, na.rm = TRUE) * n() to come up with an estimated total number of slaves since 49% of the data is missing.
African-American History 48:20 Create a faceted stacked percent barplot (spinogram) showing the percentage of black_free, black_slaves, white, and other for each region.
African-American History 51:00 Use the wordcloud package to create a wordcloud with the african_names dataset. David has issues with the wordcloud package and opts to use ggwordcloud instead. Also mentions the wordcloud2 package.
African-American History 55:20 Use fct_recode from the forcats package to change the factor levels for the gender variable while renaming Man = "Boy" and Woman = "Girl"
African-American History 57:20 Use reorder_within from the tidytext package to reorder the geom_col by n within gender variable for each facet panel.
African-American History 59:00 Summary of screencast
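
A minimal ggraph sketch of the 26:55-29:05 network diagram, assuming a hypothetical edge list (port_origin, port_arrival, n voyages):

```r
library(dplyr)
library(igraph)
library(ggraph)

routes %>%
  graph_from_data_frame() %>%   # first two columns become the edges
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n),
                 arrow = arrow(length = unit(2, "mm"))) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```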

Caribou Locations

Back to summary

Screencast Time Description
Caribou Locations 4:00 Use summarize and across to calculate the proportion of NA values in the individuals dataset. Note, you do not need to use list().
Caribou Locations 9:00 Use ggplot and borders from the ggplot2 package to create a map of Canada with deploy_on_longitude and deploy_on_latitude from the individuals dataset.
Caribou Locations 13:50 Import Canada province shapefile using the sf package. [Unsuccessful]
Caribou Locations 25:00 Use min and max from base R within summarize to find out the start and end dates for each caribou in the locations dataset.
Caribou Locations 27:15 Use sample from base R to pick one single caribou at a time, then use that subset with geom_path from ggplot2 to track the path that caribou takes over time. color = factor(floor_date(timestamp, "quarter")) is used to color the path according to which quarter the observation occurred in.
Caribou Locations 35:15 Use as.Date from base R and floor_date from the lubridate package to convert the timestamp variable into quarters, then facet_wrap the previous plot by quarter.
Caribou Locations 37:15 Within mutate, use as.numeric(difftime(timestamp, lag(timestamp), units = "hours")) from base R to figure out the gap in time between observations (see the sketch after this table).
Caribou Locations 43:05 Use distHaversine from the geosphere package to calculate distance in km then convert it to speed in kph.
Caribou Locations 1:00:00 Summary of dataset.
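
A sketch combining the 37:15 and 43:05 steps, with hypothetical column names; distHaversine returns meters, hence the division by 1000:

```r
library(dplyr)
library(geosphere)

locations %>%
  arrange(animal_id, timestamp) %>%
  group_by(animal_id) %>%
  mutate(
    hours = as.numeric(difftime(timestamp, lag(timestamp), units = "hours")),
    km = distHaversine(
      cbind(longitude, latitude),
      cbind(lag(longitude, default = first(longitude)),
            lag(latitude, default = first(latitude)))) / 1000,
    kph = km / hours
  ) %>%
  ungroup()
```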

X-Men Comics

Back to summary

Screencast Time Description
X-Men Comics 07:25 Using separate to separate the name from the secret identity in the character column
X-Men Comics 09:55 Using summarize and across to find the frequency of the action variables and how many issues each action was used in for each character (see the sketch after this table)
X-Men Comics 13:25 Create a geom_col chart to visualize which character speaks in the most issues
X-Men Comics 18:35 Create a geom_point chart to visualize each character’s average lines per issue in which the character is depicted
X-Men Comics 22:05 Create a geom_point chart to visualize each character’s average thoughts per issue in which the character is depicted
X-Men Comics 23:10 Create a geom_point chart to visualize character’s speech versus thought ratio per issue in which the character is depicted
X-Men Comics 30:05 Create a geom_point to visualize character’s number of lines while in costume versus not in costume
X-Men Comics 34:30 Create a geom_point chart to visualize the lines in costume versus lines out of costume ratio
X-Men Comics 39:20 Create a lollipop graph using geom_point and geom_errorbarh to visualize the lines in costume versus lines out of costume ratio and their distance from 1.0 (1 to 1)
X-Men Comics 45:00 Use summarize to find the frequency of each location and the total number of unique issues where the location is used
X-Men Comics 46:00 Use summarize and fct_lump to count how many issues each author has written while lumping together all authors except the most frequent
X-Men Comics 47:25 Use summarize and fct_lump to see if the authors rates of passing the Bechdel test differ from one another
X-Men Comics 52:45 Create a geom_line chart to visualize if the rates of passing the Bechdel test changed over time and floor division %/% to generate 20 observations per group
X-Men Comics 54:35 Create a geom_col to visualize the amount of lines each character has per issue over time giving context to Bechdel test passing rates
X-Men Comics 1:00:00 Summary of screencast
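
A minimal across() sketch of the 09:55 step, assuming a hypothetical per-character, per-issue table of line counts:

```r
library(dplyr)

character_stats <- characters %>%
  group_by(character) %>%
  # for each action column: total count, and number of issues it appears in
  summarize(across(c(speech, thought, narrative, depicted),
                   list(total = sum, issues = ~ sum(.x > 0))))
```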

Coffee Ratings

Back to summary

Screencast Time Description
Coffee Ratings 08:15 Using fct_lump within count and then mutate to lump the variety of coffee together except for the most frequent
Coffee Ratings 08:50 Create a geom_boxplot to visualize the variety and the distribution of total_cup_points
Coffee Ratings 09:55 Create a geom_histogram to visualize the variety and the distribution of total_cup_points
Coffee Ratings 11:40 Using fct_reorder to reorder variety by sorting it along total_cup_points in ascending order
Coffee Ratings 12:35 Using summarize with across to calculate the percent of missing data (NA) for each rating variable
Coffee Ratings 15:20 Create a bar chart using geom_col with fct_lump to visualize the frequency of top countries
Coffee Ratings 20:35 Using pivot_longer to pivot the rating metrics from wide format to long format
Coffee Ratings 21:30 Create a geom_line chart to see if the sum of the rating categories equals the total_cup_points column
Coffee Ratings 23:10 Create a geom_density_ridges chart to show the distribution of ratings across each rating metric
Coffee Ratings 24:35 Using summarize with mean and sd to show the average rating per metric with its standard deviation
Coffee Ratings 26:15 Using pairwise_cor to find correlations amongst the rating metrics
Coffee Ratings 27:20 Create a network plot to show the clustering of the rating metrics
Coffee Ratings 29:35 Using widely_svd to visualize the biggest source of variation with the rating metrics (Singular value decomposition)
Coffee Ratings 37:40 Create a geom_histogram to visualize the distribution of altitude
Coffee Ratings 40:20 Using pmin to set a maximum numeric altitude value of 3000
Coffee Ratings 41:05 Create a geom_point chart to visualize the correlation between altitude and quality (total_cup_points)
Coffee Ratings 42:00 Using summarize with cor to show the correlation between altitude and each rating metric
Coffee Ratings 44:25 Create a linear model lm for each rating metric, then visualize the results using a geom_line chart to show how each kilometer of altitude contributes to the score (see the sketch after this table)
Coffee Ratings 50:35 Summary of screencast
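
A per-metric regression sketch in the spirit of 44:25, assuming a hypothetical long table with metric, value, and km (altitude in kilometers):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

coffee_metrics %>%
  nest(data = -metric) %>%
  mutate(model = map(data, ~ lm(value ~ km, data = .x)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  filter(term == "km")   # the per-kilometer contribution to each metric
```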

Australian Animal Outcomes

Back to summary

Screencast Time Description
Australian Animal Outcomes 1:20 Using use_tidytemplate from the tidytuesdayR package to open the project dataset with its tidytemplate Rmd
Australian Animal Outcomes 4:30 Using rename to rename Total column to total
Australian Animal Outcomes 6:20 Using fct_reorder to reorder stacked barplot with weight = sum
Australian Animal Outcomes 7:00 Using fct_lump with w = n to lump together outcome factor levels, displaying the most frequent with the rest lumped into Other
Australian Animal Outcomes 9:15 Using fct_recode to combine the factor level In Stock with Currently In Care
Australian Animal Outcomes 12:10 Using fct_reorder to reorder facet_wrap panels
Australian Animal Outcomes 13:03 Using scale_y_continuous with labels = comma to separate digits with comma
Australian Animal Outcomes 14:10 Using complete to account for missing combinations of data, filling the released column with 0 (see the sketch after this table)
Australian Animal Outcomes 16:10 Using max(year) within filter to subset the data, displaying only the most recent year
Australian Animal Outcomes 19:30 Using pivot_longer to pivot location variables from wide to long
Australian Animal Outcomes 21:45 Web scraping a table from Wikipedia with SelectorGadget and rvest
Australian Animal Outcomes 25:45 Using str_to_upper to upper case the values in the shorthand column
Australian Animal Outcomes 27:13 Using parse_number to remove commas from population and area columns
Australian Animal Outcomes 28:55 Using bind_rows to bind the two web-scraped Wikipedia tables together by row
Australian Animal Outcomes 29:35 Using inner_join to combine the Wikipedia table with the original data set
Australian Animal Outcomes 29:47 Using mutate to create new per_capita_million column to show outcome on a per million people basis
Australian Animal Outcomes 37:25 Using summarize to create new column pct_euthanized showing percent of cats and dogs euthanized over time. The formula accounts for 0 values, avoiding an empty vector.
Australian Animal Outcomes 39:10 Using scale_y_continuous with labels = percent to add percentage sign to y-axis values
Australian Animal Outcomes 42:45 Create a choropleth map of Australia from an Australian states shapefile using the sf and ggplot2 packages
Australian Animal Outcomes 55:45 Add animation to the map of Australia showing the percent of cats euthanized by region using gganimate
Australian Animal Outcomes 1:01:35 Summary of screencast
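
A one-step sketch of the 14:10 complete() call, with hypothetical column names:

```r
library(tidyr)

# adds the missing year/animal_type combinations as explicit rows,
# filling released with 0
animal_outcomes %>%
  complete(year, animal_type, fill = list(released = 0))
```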

Palmer Penguins

Back to summary

Screencast Time Description
Palmer Penguins 11:17 Create a pivoted histogram plot to visualize the distribution of penguin metrics using pivot_longer, geom_histogram, and facet_wrap
Palmer Penguins 14:40 Create a pivoted density plot to visualize the distribution of penguin metrics using geom_density and facet_wrap
Palmer Penguins 15:21 Create a pivoted boxplot plot to visualize the distribution of penguin metrics using geom_boxplot and facet_wrap
Palmer Penguins 17:50 Create a bar plot to show penguin species changed over time
Palmer Penguins 18:25 Create a bar plot to show species counts per island
Palmer Penguins 20:00 Create a logistic regression model to predict if a penguin is Adelie or not using bill length, with cross validation of metrics (see the sketch after this table)
Palmer Penguins 39:35 Create second logistic regression model using 4 predictive metrics (bill length, bill depth, flipper length, body mass) and then compare the accuracy of both models
Palmer Penguins 43:25 Create a k-nearest neighbor model and then compare accuracy against logistic regression models to see which has the highest cross validated accuracy
Palmer Penguins 53:05 What is the accuracy of the testing holdout data on the k-nearest neighbor model?
Palmer Penguins 1:05:40 Create a decision tree and then compare accuracy against the previous models to see which has the highest cross validated accuracy + how to extract a decision tree
Palmer Penguins 1:10:45 Perform multi class regression using multinom_reg
Palmer Penguins 1:19:40 Summary of screencast
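
A minimal stand-in for the 20:00 model using base R's glm rather than the tidymodels stack used in the screencast; penguins is the palmerpenguins dataset:

```r
library(palmerpenguins)

# logical response: TRUE when the penguin is an Adelie
model <- glm(species == "Adelie" ~ bill_length_mm,
             data = penguins, family = binomial)
summary(model)
```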

European Energy

Back to summary

Screencast Time Description
European Energy 01:50 Using count to get an overview of categorical data
European Energy 07:25 Using pivot_longer and gather to pivot date variables from wide to long
European Energy 09:00 Using as.integer to change year variable from character to integer class
European Energy 10:10 Using fct_reorder to reorder stacked barplot
European Energy 10:30 Using scale_y_continuous with labels = comma from scales package to insert a comma every three digits on the y-axis
European Energy 16:35 Using replace_na and list to replace NA values in country_name column with United Kingdom
European Energy 18:05 Using fct_lump to lump factor levels together except for the 10 most frequent for each facet panel
European Energy 20:10 Using reorder_within with fun = sum and scale_y_reordered to reorder the categories within each facet panel
European Energy 24:30 Using ggflags package to add country flags
European Energy 29:20 (Unsuccessfully) Using fct_recode to rename the ISO two-digit identifier for the United Kingdom from the UK to GB
European Energy 33:20 Using ifelse to replace the ISO two-digit identifier for the United Kingdom from UK to GB and from EL to GR for Greece
European Energy 40:45 Using str_to_lower to convert observations in country column to lower case
European Energy 45:00 Creating a slope graph to show differences in Nuclear production (2016 versus 2018) (see the sketch after this table)
European Energy 47:00 Using scale_x_continuous with breaks = c(2016, 2018) to show only 2016 and 2018 on x-axis
European Energy 48:20 Extend x-axis limits using scale_x_continuous with limits = c(2015, 2019) and geom_text with an ifelse within hjust to alternate labels for the right and left side of slope graph
European Energy 52:40 Creating a slopegraph function
European Energy 1:00:00 Summary of screencast
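
A slope-graph sketch of the 45:00-48:20 steps, assuming a hypothetical table with country, year (2016 or 2018), and gigawatt_hours:

```r
library(ggplot2)

ggplot(nuclear, aes(year, gigawatt_hours, group = country)) +
  geom_line() +
  # alternate label justification: left of the 2016 points, right of 2018
  geom_text(aes(label = country,
                hjust = ifelse(year == 2016, 1.1, -0.1))) +
  scale_x_continuous(breaks = c(2016, 2018), limits = c(2015, 2019))
```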

Plants in Danger

Back to summary

Screencast Time Description
Plants in Danger 2:00 Getting an overview of categorical data
Plants in Danger 5:00 Using fct_relevel to reorder the "Before 1900" level to the first location leaving the other levels in their existing order
Plants in Danger 8:05 Using n and sum in fct_reorder to reorder factor levels when there are multiple categories in count
Plants in Danger 12:00 Using reorder_within and scale_y_reordered such that the values are ordered within each facet (see the sketch after this table)
Plants in Danger 14:55 Using axis.text.x to rotate overlapping labels
Plants in Danger 19:05 Using filter and fct_lump to keep only the 8 most frequent levels for the facet panels
Plants in Danger 26:55 Using separate to separate the character column binomial_name into multiple columns (genus and species)
Plants in Danger 28:20 Using fct_lump within count to lump all levels except for the 8 most frequent genus
Plants in Danger 45:30 Using rvest and SelectorGadget to web scrape list of species
Plants in Danger 49:35 Using str_trim to remove whitespace from character string
Plants in Danger 50:00 Using separate to separate character string into genus, species, and rest/citation columns and using extra = "merge" to merge extra pieces into the rest/citation column
Plants in Danger 51:00 Using rvest and SelectorGadget to web scrape image links
Plants in Danger 57:50 Summary of screencast
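
A reorder_within sketch of the 12:00 step, assuming a hypothetical counts table (continent, country, n):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

plants_counted %>%
  mutate(country = reorder_within(country, n, continent)) %>%
  ggplot(aes(n, country)) +
  geom_col() +
  scale_y_reordered() +                        # strips the ___continent suffix
  facet_wrap(~ continent, scales = "free_y")
```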

Chopped

Back to summary

Screencast Time Description
Chopped 5:20 Use geom_histogram to visualize the distribution of episode ratings.
Chopped 6:30 Use geom_point and geom_line with color = factor(season) to visualize the episode rating for every episode.
Chopped 7:15 Use group_by and summarize to show the average rating for each season and the number of episodes in each season.
Chopped 7:15 Use geom_line and geom_point with size = n_episodes to visualize the average rating for each season with point size indicating the total number of episodes (larger = more episodes, smaller = fewer episodes).
Chopped 10:55 Use fct_reorder to reorder the episode_name factor levels by sorting along the episode_rating variable.
Chopped 10:55 Use geom_point to visualize the top episodes by rating. Use the glue package to place season number and episode number before episode name on the y axis.
Chopped 15:20 Use pivot_longer to combine ingredients into one single column. Use separate_rows with sep = ", " to separate out the ingredients with each ingredient getting its own row (see the sketch after this table).
Chopped 18:10 Use fct_lump to lump ingredients together except for the 10 most frequent. Use fct_reorder to reorder ingredient factor levels by sorting against n.
Chopped 18:10 Use geom_col to create a stacked bar plot to visualize the most common ingredients by course.
Chopped 19:45 Use fct_relevel to reorder course factor levels to appetizer, entree, dessert.
Chopped 21:00 Use fct_rev and scale_fill_discrete with guide = guide_legend(reverse = TRUE) to reorder the segments within the stacked bar plot.
Chopped 23:20 Use the widyr package and pairwise_cor to find out what ingredients appear together. Mentioned: David Robinson - The widyr Package YouTube Talk at 2020 R Conference
Chopped 26:20 Use ggraph, geom_edge_link, geom_node_point, and geom_node_text to create an ingredient network diagram to show their makeup and how they interact.
Chopped 28:00 Use pairwise_count from widyr to count the number of times each pair of items appear together within a group defined by feature.
Chopped 30:15 Use unite from the tidyr package in order to paste together the episode_course and series_episode columns into one column to figure out if any pairs of ingredients appear together in the same course across episodes.
Chopped 31:55 Use summarize with min, mean, max, and n() to create the first_season, avg_season, last_season and n_appearances variables.
Chopped 34:35 Use slice with tail to get the n ingredients that appear in early and late seasons.
Chopped 35:40 Use geom_boxplot to visualize the distribution of each ingredient across all seasons.
Chopped 36:50 Fit predictive models (linear regression, random forest, and natural spline) to determine if episode rating is explained by the ingredients or season. Use pivot_wider with values_fill = list(value = 0) with 1 indicating an ingredient was used and 0 indicating it wasn't used.
Chopped 1:17:25 Summary of screencast
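
A sketch of the 15:20 reshaping, assuming the episode table's appetizer, entree, and dessert columns hold comma-separated ingredients:

```r
library(dplyr)
library(tidyr)

ingredients <- chopped %>%
  pivot_longer(c(appetizer, entree, dessert),
               names_to = "course", values_to = "ingredient") %>%
  separate_rows(ingredient, sep = ", ")
```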

Global Crop Yields

Back to summary

Screencast Time Description
Global Crop Yields 03:35 Using rename to shorten column name
Global Crop Yields 06:40 Using rename_all with str_remove and regex to remove characters in column name
Global Crop Yields 07:40 Using pivot_longer to change data from wide to long
Global Crop Yields 08:25 Create a faceted geom_line chart
Global Crop Yields 09:40 Using fct_reorder to reorder facet panels in ascending order
Global Crop Yields 11:50 Create an interactive Shiny dashboard
Global Crop Yields 33:20 Create a faceted geom_line chart with add_count and filter(n == max(n)) to subset the data for crops that have observations in every year
Global Crop Yields 36:50 Create a faceted geom_point chart showing the crop yields at start and end over a 50 year period (1968 start date and 2018 end date)
Global Crop Yields 45:00 Create a geom_boxplot to visualize the distribution of yield ratios for the different crops to see how efficiency has increased across countries
Global Crop Yields 46:00 Create a geom_col chart to visualize the median yield ratio for each crop
Global Crop Yields 47:50 Create a geom_point chart to visualize efficiency improvement for each country for a specific crop (yield start / yield ratio)
Global Crop Yields 50:25 Using the countrycode package to color geom_point chart by continent names (see the sketch after this table)
Global Crop Yields 56:50 Summary of screencast
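
A one-step countrycode sketch of the 50:25 step, assuming a hypothetical entity column of country names:

```r
library(dplyr)
library(countrycode)

yields %>%
  mutate(continent = countrycode(entity,
                                 origin = "country.name",
                                 destination = "continent"))
```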

Friends

Back to summary

Screencast Time Description
Friends 7:30 Use dplyr package's count function to count the unique values of multiple variables.
Friends 9:35 Use geom_col to show how many lines of dialogue there is for each character. Use fct_reorder to reorder the speaker factor levels by sorting along n.
Friends 12:07 Use semi_join to join friends dataset with main_cast with by = "speaker", returning all rows from friends with a match in main_cast.
Friends 12:30 Use unite to create the episode_number variable which pastes together season and episode with sep = ".". Then, use inner_join to combine above dataset with friends_info with by = c("season", "episode"). Then, use mutate and the glue package instead to combine { season }.{ episode } { title }. Then use fct_reorder(episode_title, season + .001 * episode) to order it by season first then episode.
Friends 15:45 Use geom_point to visualize episode_title and us_views_millions. Use as.integer to change episode_title to integer class. Add labels to geom_point using geom_text with check_overlap = TRUE so text that overlaps previous text in the same layer will not be plotted.
Friends 19:95 Run the above plot again using imdb_rating instead of us_views_millions
Friends 21:35 Ahead of modeling: Use geom_boxplot to visualize the distribution of speaking for main characters. Use the complete function with fill = list(n = 0) to turn implicit missing combinations into explicit rows with n = 0. Demonstration of how to account for missing imdb_rating values using the fill function with .direction = "downup" to keep the imdb rating across the same title.
Friends 26:45 Ahead of modeling: Use summarize with cor(log2(n), imdb_rating) to find the correlation between speaker and imdb rating -- the fact that the correlation is positive for all speakers gives David a suspicion that some episodes are longer than others because they're in 2 parts with higher ratings due to important moments. David addresses this confounding factor by including percentage of lines instead of number of lines. Visualize results with geom_boxplot, geom_point with geom_smooth.
Friends 34:05 Use a linear model to predict imdb rating based on various variables.
Friends 42:00 Use the tidytext and tidylo packages to see what words are most common amongst characters, and whether they are said more times than would be expected by chance. Use geom_col to visualize the most overrepresented words per character according to log_odds_weighted (see the sketch after this table).
Friends 54:15 Use the widyr package and pairwise correlation to determine which characters tend to appear in the same scenes together. Use geom_col to visualize the correlation between characters.
Friends 1:00:25 Summary of screencast
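
A minimal tidylo sketch of the 42:00 step, assuming a hypothetical counts table (speaker, word, n):

```r
library(dplyr)
library(tidylo)

word_log_odds <- speaker_words %>%
  bind_log_odds(speaker, word, n) %>%
  arrange(desc(log_odds_weighted))   # most overrepresented words first
```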

Government Spending on Kids

Back to summary

Screencast Time Description
Government Spending on Kids 6:15 Using geom_line and summarize to visualize education spending over time. First for all states. Then individual states. Then small groups of states using %in%. Then in random groups of size n using %in% and sample with unique. fct_reorder is used to reorder state factor levels by sorting along the inf_adj variable. geom_vline used to add reference to the 2009 financial crisis.
" Government Spending on Kids 16:00
Government Spending on Kids 23:35 Create a function named plot_changed_faceted to make it easier to visualize the many other variables included in the dataset.
Government Spending on Kids 27:25 Create a function named plot_faceted with a {{ y_axis }} embracing argument. Adding this function creates two stages: one for data transformation and another for plotting.
Government Spending on Kids 37:05 Use the dir function with pattern and the purrr package's map_df function to read in many different .csv files with GDP values for each state (see the sketch after this table). Troubleshooting the Can't combine <character> and <double> columns error using a function and mutate with across and as.numeric. Extract the state name from each filename using extract from tidyr and a regular expression.
Government Spending on Kids 50:50 Unsuccessful attempt at importing state population data from a user-unfriendly census.gov dataset by skipping the first 3 rows of the Excel file.
Government Spending on Kids 54:22 Use geom_col to see which states spend the most per child, for a single variable and for multiple variables using %in%. Use scale_fill_discrete with guide_legend(reverse = TRUE) to change the ordering of the legend.
Government Spending on Kids 57:40 Use geom_col and pairwise_corr to visualize the correlation between variables across states in 2016 using pairwise correlation.
Government Spending on Kids 1:02:02 Use geom_point to plot inf_adjust_perchild_PK12ed versus inf_adj_perchild_highered. geom_text used to apply state names to each point.
Government Spending on Kids 1:05:00 Summary of screencast
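
A sketch of the 37:05 bulk import, assuming a hypothetical folder of per-state .csv files:

```r
library(purrr)
library(readr)

files <- dir("gdp", pattern = "\\.csv$", full.names = TRUE)

gdp <- files %>%
  set_names() %>%                     # name each element by its path
  map_df(read_csv, .id = "filename")  # filename column records the source file
```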

Himalayan Climbers

Back to summary

Screencast Time Description
Himalayan Climbers 3:00 Create a geom_col chart to visualize the top 50 tallest mountains. Use fct_reorder to reorder the peak_name factor levels by sorting along the height_metres variable.
Himalayan Climbers 8:50 Use summarize with across to get the total number of climbs, climbers, deaths, and first year climbed. Use mutate to calculate the percent death rate for members and hired staff. Use inner_join and select to join with peaks dataset by peak_id.
Himalayan Climbers 11:20 Touching on statistical noise and how it impacts the death rate for mountains with fewer number of climbs, and how to account for it using various statistical methods including Beta Binomial Regression & Empirical Bayes.
Himalayan Climbers 14:30 Further description of Empirical Bayes and how to account for not overestimating death rate for mountains with fewer climbers. Recommended reading: Introduction to Empirical Bayes: Examples from Baseball Statistics by David Robinson.
Himalayan Climbers 17:00 Use the ebbr package (Empirical Bayes for Binomial in R) to create an Empirical Bayes Estimate for each mountain by fitting prior distribution across data and adjusting the death rates down or up based on the prior distributions. Use a geom_point chart to visualize the difference between the raw death rate and new ebbr fitted death rate.
Himalayan Climbers 21:20 Use geom_point to visualize how deadly each mountain is with geom_errorbarh representing the 95% credible interval between minimum and maximum values.
Himalayan Climbers 26:35 Use geom_point to visualize the relationship between death rate and height of mountain. There is not a clear relationship, but David does briefly mention how one could use Beta Binomial Regression to further inspect for possible relationships / trends.
Himalayan Climbers 28:00 Use geom_histogram and geom_boxplot to visualize the distribution of time it took climbers to go from basecamp to the mountain's high point, for successful climbs only. Use mutate to calculate the number of days from basecamp to the highpoint. Add a column using case_when and str_detect to relabel any termination_reason containing the word Success as Success, use a vector with %in% to change multiple termination_reason values to NA, and label the rest Failed. Use fct_lump to show the top 10 mountains while lumping the other factor levels (mountains) into Other.
Himalayan Climbers 35:30 For just Mount Everest, use geom_histogram and geom_density with fill = success to visualize the days from basecamp to highpoint for climbs that ended in success, failure or other.
Himalayan Climbers 38:40 For just Mount Everest, use geom_histogram to see the distribution of climbs per year.
Himalayan Climbers 39:55 For just Mount Everest, use geom_line and geom_point to visualize pct_death over time by decade. Use mutate with pmax and integer division to create a decade variable that lumps together the data for 1970 and before.
Himalayan Climbers 41:30 Write a function for summary statistics such as n_climbs, pct_success, first_climb, pct_death, and pct_hired_staff_death.
Himalayan Climbers 46:20 For just Mount Everest, use geom_line and geom_point to visualize pct_success over time by decade.
Himalayan Climbers 47:10 For just Mount Everest, use geom_line and geom_point to visualize pct_hired_staff_deaths over time by decade. David decides to visualize the pct_hired_staff_deaths and pct_death charts together on the same plot.
Himalayan Climbers 50:45 For just Mount Everest, fit a logistic regression model to predict the probability of death, with format.pval to format the p-value (see the sketch after this table). Use fct_lump to lump together all expedition_role factors except for the n most frequent.
Himalayan Climbers 56:30 Use group_by with integer division and summarize to calculate n_climbers and pct_death for age bucketed into decades.
Himalayan Climbers 59:45 Use geom_point and geom_errorbarh to visualize the logistic regression model with confidence intervals.
Himalayan Climbers 1:03:30 Summary of screencast
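
A minimal glm sketch of the 50:45 model; the predictors here are illustrative, not the exact ones used on stream:

```r
# members: hypothetical climber-level table with a died indicator
model <- glm(died ~ year + age + sex, data = members, family = binomial)
summary(model)   # format.pval() can then pretty-print the p-values
```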

Beyoncé and Taylor Swift Lyrics

Back to summary

Screencast Time Description
Beyonce and Taylor Swift Lyrics 7:50 Use fct_reorder from the forcats package to reorder title factor levels by sorting along the sales variable in geom_col plot.
Beyonce and Taylor Swift Lyrics 8:10 Use labels = dollar from the scales package to format the geom_col x-axis values as currency.
Beyonce and Taylor Swift Lyrics 11:15 Use rename_all(str_to_lower) to convert variable names to lowercase.
Beyonce and Taylor Swift Lyrics 12:45 Use unnest_tokens from the tidytext package to split the lyrics into one-lyric-per-row.
Beyonce and Taylor Swift Lyrics 13:00 Use anti_join from the dplyr package to find the most common words in the lyrics without tidytext's stop_words.
Beyonce and Taylor Swift Lyrics 15:15 Use bind_tf_idf from the tidytext package to determine tf (the proportion each word has in each album) and idf (how specific each word is to each particular album) (see the sketch after this table).
Beyonce and Taylor Swift Lyrics 17:45 Use reorder_within with scale_y_reordered in order to reorder the bars within each facet panel. David replaces top_n with slice_max from the dplyr package in order to show the top 10 words with ties = FALSE.
Beyonce and Taylor Swift Lyrics 20:45 Use bind_log_odds from the tidylo package to calculate the log odds ratio of album and words, that is how much more common is the word in a specific album than across all the other albums.
Beyonce and Taylor Swift Lyrics 23:10 Use filter(str_length(word) <= 3) to come up with a list in order to remove common filler words like ah, uh, ha, ey, eeh, and huh.
Beyonce and Taylor Swift Lyrics 27:00 Use mdy from the lubridate package and str_remove(released, " \\(.*\\)") from the stringr package to parse the dates in the released variable.
Beyonce and Taylor Swift Lyrics 28:15 Use inner_join from the dplyr package to join taylor_swift_words with release_dates. David ends up having to use fct_recode since the albums reputation and folklore were not lowercase in a previous table thus excluding them from the inner_join.
Beyonce and Taylor Swift Lyrics 28:30 Use fct_reorder from the forcats package to reorder album factor levels by sorting along the released variable to be used in the faceted geom_col.
Beyonce and Taylor Swift Lyrics 34:40 Use bind_rows from the dplyr package to bind ts with beyonce, then unnest_tokens from the tidytext package to get one lyric per row per artist.
Beyonce and Taylor Swift Lyrics 38:40 Use bind_log_odds to figure out which words are more likely to come from a Taylor Swift or a Beyonce song.
Beyonce and Taylor Swift Lyrics 41:10 Use slice_max from the dplyr package to select the top 100 words by num_words_total and then the top 25 by log_odds_weighted. Results are used to create a diverging bar chart showing which words are most common between Beyonce and Taylor Swift songs.
Beyonce and Taylor Swift Lyrics 44:40 Use scale_x_continuous to make the log_odds_weighted scale more interpretable.
Beyonce and Taylor Swift Lyrics 50:45 Take the previous plot and turn it into a lollipop graph with geom_point(aes(size = num_words_total, color = direction))
Beyonce and Taylor Swift Lyrics 53:05 Use ifelse to change the "1x" value on the x-axis to "same".
Beyonce and Taylor Swift Lyrics 54:15 Create a geom_point with geom_abline to show the most popular words they use in common.
Beyonce and Taylor Swift Lyrics 1:01:55 Summary of screencast
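
The core text-mining pipeline in this screencast (unnest_tokens, an anti_join against stop_words, then bind_log_odds) can be sketched on toy data; the lyric lines below are invented placeholders, not the real dataset:

```r
library(dplyr)
library(tidytext)
library(tidylo)

# Invented stand-in for the lyrics data: one line of lyrics per row
lyrics <- tibble(
  artist = c("Beyonce", "Beyonce", "Taylor Swift", "Taylor Swift"),
  line   = c("crazy in love got me looking so crazy",
             "all the single ladies put your hands up",
             "we are never ever getting back together",
             "shake it off shake it off")
)

word_counts <- lyrics %>%
  unnest_tokens(word, line) %>%           # one word per row
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  count(artist, word, sort = TRUE)

# Weighted log odds: which words are disproportionately one artist's?
word_counts %>%
  bind_log_odds(artist, word, n) %>%
  arrange(desc(log_odds_weighted))
```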

NCAA Women's Basketball

Back to summary

Screencast Time Description
NCAA Women's Basketball 15:00 Use fct_relevel from the forcats package to order the factor levels for the tourney_finish variable.
NCAA Women's Basketball 16:35 Use geom_tile from the ggplot2 package to create a heatmap to show how far a particular seed ends up going in the tournament.
NCAA Women's Basketball 20:35 Use scale_y_continuous from the ggplot2 package with breaks = seq(1, 16) in order to include all 16 seeds.
NCAA Women's Basketball 20:55 Use geom_text from the ggplot2 package with label = percent(pct) to label each tile in the heatmap with its percentage (a sketch of the heatmap follows this table).
NCAA Women's Basketball 21:40 Use scale_x_discrete and scale_y_continuous both with expand = c(0, 0) to remove the space between the x and y axis and the heatmap tiles. David calls this flattening.
NCAA Women's Basketball 32:15 Use scale_y_reverse to flip the order of the y-axis from 1-16 to 16-1.
NCAA Women's Basketball 34:45 Use cor from the stats package to calculate the correlation between seed and tourney_finish. The correlation is then plotted by year to see whether it changes over time.
NCAA Women's Basketball 39:50 Use geom_smooth with method = "loess" to add a smoothing line with confidence bound to aid in seeing the trend between seed and reg_percent.
NCAA Women's Basketball 42:10 Use fct_lump from the forcats package to lump together all the conferences except for the n most frequent.
NCAA Women's Basketball 42:55 Use geom_jitter from the ggplot2 package instead of geom_boxplot to avoid overplotting which makes it easier to visualize the points that make up the distribution of the seed variable.
NCAA Women's Basketball 47:05 Use geom_smooth with method = "lm" to aid in seeing the trend between reg_percent and tourney_w.
NCAA Women's Basketball 54:20 Create an anonymous function using . (dot) and %>% to avoid duplicating the summary statistics computed with summarize.
NCAA Women's Basketball 56:35 Use glue from the glue package to concatenate together school and n_entries on the geom_col y-axis.
NCAA Women's Basketball 59:50 Summary of screencast
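
A minimal sketch of the heatmap pattern from 16:35 through 32:15, using made-up seed-by-finish percentages (only four seeds and four finishes, to keep it short):

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)

# Made-up percentages standing in for the tournament results
set.seed(42)
heat <- crossing(seed = 1:4,
                 finish = c("1st Round", "2nd Round", "Semis", "Champ")) %>%
  mutate(pct = runif(n()))

ggplot(heat, aes(finish, seed, fill = pct)) +
  geom_tile() +
  geom_text(aes(label = percent(pct, accuracy = 1))) +    # percentage on each tile
  scale_x_discrete(expand = c(0, 0)) +                    # "flatten": no axis gap
  scale_y_reverse(breaks = seq(1, 4), expand = c(0, 0)) + # seed 1 at the top
  labs(x = "Tournament finish", y = "Seed", fill = "pct")
```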

Great American Beer Festival

Back to summary

Screencast Time Description
Great American Beer Festival 8:20 Use pivot_wider with values_fill = list(value = 0) from the tidyr package, along with mutate(value = 1), to pivot the medal variable from long to wide, adding a 1 for the medal type awarded and a 0 for the remaining medal types in the row.
Great American Beer Festival 11:25 Use fct_lump from the forcats package to lump together all the beers except for the N most frequent.
Great American Beer Festival 12:25 Use str_to_upper from the stringr package to convert the case of the state variable to uppercase.
Great American Beer Festival 12:25 Use fct_relevel from the forcats package in order to reorder the medal factor levels.
Great American Beer Festival 13:25 Use fct_reorder from the forcats package to sort beer_name factor levels by sorting along n.
Great American Beer Festival 14:30 Use glue from the glue package to concatenate beer_name and brewery on the y-axis.
Great American Beer Festival 15:00 Use ties.method = "first" within fct_lump to show only the first brewery when a tie exists between them.
Great American Beer Festival 19:25 Use setdiff from the dplyr package and the state.abb built in vector from the datasets package to check which states are missing from the dataset.
Great American Beer Festival 21:25 Use summarize from the dplyr package to calculate the number of medals with n_medals = n(), the number of beers with n_distinct, the number of gold medals with sum(), and weighted medal totals with sum(as.integer(medal)), which works because medal is an ordered factor: 1 for each bronze, 2 for each silver, and 3 for each gold.
Great American Beer Festival 26:05 Import Craft Beers Dataset from Kaggle using read_csv from the readr package.
Great American Beer Festival 28:00 Use inner_join from the dplyr package to join the two datasets from Kaggle.
Great American Beer Festival 29:40 Use semi_join from the dplyr package to check whether the beer names match those in the Kaggle dataset. This ends up at a dead end: there are not enough matches between the datasets.
Great American Beer Festival 33:05 Use bind_log_odds from the tidylo package to show the representation of each beer category for each state compared to the categories across the other states.
Great American Beer Festival 33:35 Use complete from the tidyr package in order to turn implicit missing values into explicit ones.
Great American Beer Festival 35:30 Use reorder_within and scale_y_reordered from the tidytext package in order to reorder the bars within each facet panel.
Great American Beer Festival 36:40 Use fct_reorder from the forcats package to reorder the facet panels in descending order.
Great American Beer Festival 39:35 For the previous plot, use fill = log_odds_weighted > 0 in the ggplot aes argument to highlight the positive and negative values.
Great American Beer Festival 41:45 Use add_count from the dplyr package to add a year_total variable which shows the total awards for each year. Then use this to calculate each state's share of each year's medals using mutate(pct_year = n / year_total).
Great American Beer Festival 44:40 Use glm from the stats package to create a logistic regression model to find out whether there is a statistical trend in the probability of award success over time.
Great American Beer Festival 47:15 Expand on the previous model by using the broom package to fit logistic regressions across all states at once instead of one state at a time (see the sketch after this table).
Great American Beer Festival 50:25 Use conf.int = TRUE to add confidence bounds to the logistic regression output then use it to create a TIE Fighter plot to show which states become more or less frequent medal winners over time.
Great American Beer Festival 53:00 Use the state.name built-in vector with match from base R to change state abbreviations to state names.
Great American Beer Festival 55:00 Summary of screencast
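
The many-models step at 47:15 through 50:25 is the nest/map/tidy pattern from broom and purrr. A rough sketch on simulated per-state counts (the columns n and year_total mirror the ones built at 41:45, but the numbers are invented):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Simulated medal counts: n medals per state-year out of year_total awarded
set.seed(123)
by_year <- crossing(state = state.abb[1:10], year = 1990:2020) %>%
  mutate(n = rpois(n(), 5), year_total = 100)

# One logistic regression per state: does medal share trend over time?
slopes <- by_year %>%
  nest(data = -state) %>%
  mutate(model  = map(data, ~ glm(cbind(n, year_total - n) ~ year,
                                  data = .x, family = "binomial")),
         tidied = map(model, tidy, conf.int = TRUE)) %>%
  unnest(tidied) %>%
  filter(term == "year")

# slopes now feeds the TIE fighter plot: geom_point + geom_errorbarh
slopes %>% select(state, estimate, conf.low, conf.high)
```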

IKEA Furniture

Back to summary

Screencast Time Description
IKEA Furniture 4:30 Use fct_reorder from the forcats package to reorder the factor levels for category sorted along n.
IKEA Furniture 6:00 Brief explanation of why scale_x_log10 is needed given the distribution of category and price with geom_boxplot.
IKEA Furniture 7:00 Using geom_jitter with geom_boxplot to show how many items are within each category.
IKEA Furniture 8:00 Use add_count from the dplyr package and glue from the glue package to concatenate the category name with category_total on the geom_boxplot y-axis.
IKEA Furniture 9:00 Convert from Saudi Riyals to United States Dollars.
IKEA Furniture 11:05 Create a ridgeplot (AKA joyplot) using the ggridges package showing the distribution of price across category.
IKEA Furniture 12:50 Discussion on distributions and when to use a log scale.
IKEA Furniture 19:20 Use fct_lump from the forcats package to lump together all the levels in category except for the n most frequent.
IKEA Furniture 21:00 Use scale_fill_discrete from the ggplot2 package with guide = guide_legend(reverse = TRUE) to reverse the fill legend.
IKEA Furniture 24:20 Use str_trim from the stringr package to remove whitespace from the short_description variable. David then decides to use str_replace_all instead with the following regular expression "\\s+", " " to replace all whitespace with a single space instead.
IKEA Furniture 25:30 Use separate from the tidyr package with extra = "merge" and fill = "right" to separate item description from item dimension.
IKEA Furniture 26:45 Use extract from the tidyr package with the regular expression "([\\d\\-xX]+) cm" to extract the numbers before cm (see the sketch after this table).
IKEA Furniture 29:50 Use unite from the tidyr package to paste together the category and main_description columns into a new column named category_and_description.
IKEA Furniture 32:45 Calculate the volume given the depth, height, and width of each item in dataset in liters using depth * height * width / 1000. At 36:15, David decides to change to cubic meters instead using depth * height * width / 1000000.
IKEA Furniture 44:20 Use str_squish from the stringr package to trim whitespace at the start and end of the short_description variable and collapse repeated internal whitespace.
IKEA Furniture 48:00 Use lm from the stats package to create a linear model on a log-log scale to predict the price of an item based on volume + category. David then uses fct_relevel to reorder the factor levels for category such that tables & desks is first (the baseline), since it's the most frequent level of the category variable and its price distribution is in the middle.
IKEA Furniture 53:00 Use the broom package to turn the model output into a coefficient / TIE fighter plot.
IKEA Furniture 56:20 Use str_remove from the stringr package to remove category from the start of the strings on the y-axis using the regular expression "^category"
IKEA Furniture 57:50 Summary of screencast
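
The string wrangling at 24:20 through 26:45 can be condensed into a few lines. A sketch on invented short_description strings (the real data has many more formats):

```r
library(dplyr)
library(tidyr)
library(stringr)

# Invented examples in the "description, dimensions cm" format
ikea <- tibble(short_description = c("Wardrobe,   202x236 cm",
                                     "Bed frame, 140x200 cm",
                                     "Chair"))

ikea %>%
  mutate(short_description = str_squish(short_description)) %>%   # collapse whitespace
  separate(short_description, c("main_description", "rest"),
           sep = ", ", extra = "merge", fill = "right") %>%       # split off dimensions
  extract(rest, "dimensions", "([\\d\\-xX]+) cm", remove = FALSE) # capture "202x236"
```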

Historical Phones

Back to summary

Screencast Time Description
Historical Phones 2:15 Use bind_rows from the dplyr package to combine the two data sets.
Historical Phones 7:30 Use group = interaction(type, country) within ggplot aes() to draw a separate line for every combination of type and country on one plot.
Historical Phones 9:30 Use semi_join from the dplyr package to join rows from phones with a match in country_sizes.
Historical Phones 14:00 Use quantile from the stats package within summarize to show the 25th and 75th percentiles (the interquartile range) on the plot.
Historical Phones 17:50 Use the WDI package (World Development Indicators from the World Bank) with extra = TRUE in order to get the iso3c code and income level for each country.
Historical Phones 19:45 Use inner_join from the dplyr package to join the WDI data with the phones data.
Historical Phones 20:35 Use fct_relevel from the forcats package to reorder income factor levels in ascending order.
Historical Phones 21:05 Create an anonymous function using . (dot).
Historical Phones 29:30 Use inner_join from the dplyr package to join the mobile data and landline data together with a geom_abline to see how different the total populations are between the two datasets.
Historical Phones 31:00 Use geom_hline to add a reference line to the plot showing when each country crossed the 50-per-100 subscription mark.
Historical Phones 35:20 Use summarize from the dplyr package with min(year[Mobile >= 50]) to find the year in which each country crossed the 50-per-100 subscription mark (see the sketch after this table).
Historical Phones 35:20 Use summarize from the dplyr package with max(Mobile) to find the peak number of mobile subscriptions per country.
Historical Phones 35:20 Use na_if from the dplyr package within summarize to change Inf to NA.
Historical Phones 38:20 Using the WDIsearch function to search the WDI package for the proper GDP per capita indicator. Ended up using the NY.GDP.PCAP.PP.KD indicator.
Historical Phones 39:05 Adding the GDP data from the WDI package to the country_incomes table.
Historical Phones 39:52 Using the inner_join function from the dplyr package to join the phones table with the country_incomes table pulling in the gdp_per_capita variable.
Historical Phones 42:25 Using the WDIsearch function to search the WDI package for the proper population indicator. Ended up using the SP.POP.TOTL indicator.
Historical Phones 50:00 Create an animated choropleth world map with fill = subscriptions.
Historical Phones 1:00:00 Summary of screencast
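
The three summarize tricks at 35:20 fit in one call. A sketch on a toy country-year table (the column names here are assumptions):

```r
library(dplyr)

# Toy subscriptions data: one row per country-year
phones <- tibble(
  country = rep(c("A", "B"), each = 5),
  year    = rep(2000:2004, 2),
  mobile  = c(10, 30, 55, 70, 90, 5, 10, 20, 30, 45)
)

phones %>%
  group_by(country) %>%
  summarize(peak_mobile  = max(mobile),
            # first year at or above 50 per 100; Inf (with a warning) if never reached
            year_past_50 = min(year[mobile >= 50]),
            .groups = "drop") %>%
  mutate(year_past_50 = na_if(year_past_50, Inf))  # turn Inf into NA
```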

Riddler: Simulating a Circular Random Walk

Back to summary

Screencast Time Description
Riddler: Simulating a Circular Random Walk 1:25 Using sample() and cumsum() to simulate a random walk
Riddler: Simulating a Circular Random Walk 2:30 Using %% (modulo operator) to "close" the circle (set the number of people in the circle)
Riddler: Simulating a Circular Random Walk 3:40 Using crossing function to set up "tidy" simulation (gives you all possible combinations of values you provide it)
Riddler: Simulating a Circular Random Walk 5:10 Using the distinct function and its .keep_all argument to keep only the first row for each unique combination of the variables you give it (see the sketch after this table)
Riddler: Simulating a Circular Random Walk 8:15 Visualizing the number of steps it takes for the sauce to reach people at different seats
Riddler: Simulating a Circular Random Walk 13:40 Visualizing the distribution of number of steps it takes to reach each seat
Riddler: Simulating a Circular Random Walk 26:30 Investigating the parabolic shape of average number of steps to reach a given seat
Riddler: Simulating a Circular Random Walk 28:40 Using lm and I functions to calculate formula of the parabola describing average number of steps
Riddler: Simulating a Circular Random Walk 30:15 Starting to vary the size of the table
Riddler: Simulating a Circular Random Walk 38:45 Summary of screencast
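
A compact version of the whole simulation, combining sample, cumsum, %%, crossing, and distinct as described above (the seat and trial counts are arbitrary):

```r
library(dplyr)
library(tidyr)

set.seed(2020)
n_seats <- 20    # people around the table
n_steps <- 400   # steps simulated per trial

# One random walk per trial: +/-1 steps, wrapped onto the circle with %%
sims <- crossing(trial = 1:1000, step = 1:n_steps) %>%
  mutate(direction = sample(c(-1, 1), n(), replace = TRUE)) %>%
  group_by(trial) %>%
  mutate(seat = cumsum(direction) %% n_seats) %>%
  ungroup()

# .keep_all keeps the first row per (trial, seat): the step the sauce arrived
first_reached <- sims %>%
  distinct(trial, seat, .keep_all = TRUE)

# Average number of steps to reach each seat (the parabola from 26:30)
first_reached %>%
  group_by(seat) %>%
  summarize(avg_steps = mean(step))
```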

Ninja Warrior

Back to summary

Screencast Time Description
Ninja Warrior 2:35 Inspecting the dataset
Ninja Warrior 6:40 Using geom_histogram to look at distribution of obstacles in a stage
Ninja Warrior 9:05 Using str_remove function to clean stage names (remove "(Regional/City)")
Ninja Warrior 10:40 Asking, "Are there obstacles that are more common in the Finals than Qualifying rounds?"
Ninja Warrior 10:50 Using bind_log_odds function from tidylo package to calculate log-odds of obstacles within a stage type
Ninja Warrior 16:05 Using unite function to combine two columns
Ninja Warrior 18:20 Graphing the average position of different obstacles with many, many tweaks to make it look nice
Ninja Warrior 23:10 Creating a stacked bar plot of which obstacles appear in which order
Ninja Warrior 30:30 Turning the stacked bar plot visualization into a custom function (see the sketch after this table)
Ninja Warrior 37:40 Asking, "Is there data on how difficult an obstacle is?"
Ninja Warrior 45:30 Visualizing which obstacles appear in different seasons with geom_tile and a lot of tweaking
Ninja Warrior 50:22 Reviewing the result of the previous step (obstacles in different seasons)
Ninja Warrior 59:25 Summary of screencast
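
A minimal version of the custom-function step at 30:30, on simulated stand-in data (the column names round_stage, obstacle_name, and obstacle_order are assumptions about the dataset):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Simulated stand-in for the Ninja Warrior obstacles data
set.seed(1)
ninja <- tibble(
  round_stage    = sample(c("Qualifying", "Finals"), 200, replace = TRUE),
  obstacle_name  = sample(c("Warped Wall", "Salmon Ladder", "Quad Steps",
                            "Log Grip", "Rope Climb"), 200, replace = TRUE),
  obstacle_order = sample(1:6, 200, replace = TRUE)
)

# Wrap the stacked bar plot in a function so it can be reused per stage
plot_obstacle_orders <- function(data, stage) {
  data %>%
    filter(round_stage == stage) %>%
    mutate(obstacle_name = fct_lump(obstacle_name, 4)) %>%
    ggplot(aes(obstacle_order, fill = obstacle_name)) +
    geom_bar() +
    labs(title = stage, x = "Position in course", fill = "Obstacle")
}

plot_obstacle_orders(ninja, "Qualifying")
```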