# Repository for the Coursera Data Science Capstone Project
This repository contains the code for an application that predicts the next word given an input phrase; it is the deliverable for the capstone project.
The application is an R Shiny app that can run locally or on shinyapps.io.
The implementation uses a 5-gram model together with the "Stupid Backoff" algorithm (Brants et al., 2007).
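To illustrate the idea, here is a minimal R sketch of Stupid Backoff scoring. This is not the repository's actual implementation; it assumes `counts` is a named numeric vector mapping space-separated n-grams to their training counts, and `total` is the total number of tokens in the corpus.

```r
# Illustrative Stupid Backoff scorer (a sketch, not the repo's code).
stupid_backoff <- function(words, counts, total, alpha = 0.4) {
  ngram <- paste(words, collapse = " ")
  if (length(words) == 1) {
    # Base case: unigram relative frequency (0 for unseen words).
    n <- counts[ngram]
    return(if (is.na(n)) 0 else n / total)
  }
  context <- paste(head(words, -1), collapse = " ")
  if (!is.na(counts[ngram]) && !is.na(counts[context])) {
    # The full n-gram was observed: use its relative frequency.
    return(unname(counts[ngram] / counts[context]))
  }
  # Otherwise back off to a shorter context with a fixed penalty alpha.
  alpha * stupid_backoff(tail(words, -1), counts, total, alpha)
}
```

Because the scores are not normalized into probabilities, Stupid Backoff is cheap to evaluate at prediction time: candidates are simply ranked by score.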
These are the main artifacts in the repository:
- An intermediate (milestone) report that performs an exploratory analysis of the source data.
- A final report slideset that summarizes the work.
- A set of pre-processing steps that transform the source data into look-up tables for prediction.
- A shiny application that predicts words based on the stupid backoff algorithm and the LUTs.
- A common tokenization module.
The milestone report is contained in the `milestoneReport.Rmd` R Markdown file.
The final slideset is contained in the `finalPresentation.Rpres` file; an accompanying `.css` file
is also present that slightly tweaks the code display.
For efficiency, the pre-processing had to be divided into a chain of R, Unix and Java tools; these
tools are run in sequence to produce the final prediction tables. All of these tools produce their
outputs under sub-directories of the work directory.
- The `downloadData.R` script pulls the original data and decompresses it into `original`.
- The `processOriginal.sh` script uses the Unix `split` command to divide the data into train, test and validation sets, each set in its own folder.
- The `buildNgrams.R` script takes the data in `train` and constructs the n-gram tables, placing them in the `tables` directory as `ngram[1-5].txt` files.
- The `wordParser.sh` script invokes the `wordParser.jar` Java application to build the final LUTs from the n-gram tables; these are output as `p_num_ngram[2-5].csv` files, as well as `p_num_ngram_words.csv`, which contains the word dictionary.
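Assuming the tools are run from the repository root, the full chain might look like the following (the invocations are illustrative; the exact flags and working paths may differ):

```
Rscript downloadData.R      # fetch the corpus and decompress into original
./processOriginal.sh        # split into train / test / validation folders
Rscript buildNgrams.R       # write tables/ngram[1-5].txt from train
./wordParser.sh             # run wordParser.jar -> p_num_ngram*.csv LUTs
```

Each step only depends on the outputs of the previous one, so a failed step can be re-run without repeating the whole chain.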
The shiny application is contained in the `server.R` and `ui.R` files. It uses the functions in
`lut_predict.R` to do the actual prediction.
The following modules are used:
- `tokenize.R` is used by both the pre-processing and prediction applications. It contains the tokenization functions that are applied both to the training data and to the input phrases provided for prediction.
- `ngrams.R` contains the functions to load data and build n-gram tables. It is used both for pre-processing and in the milestone report.
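Sharing one tokenizer between training and prediction matters because the look-up tables only match if both sides normalize text identically. A minimal sketch of what such a function might look like (the real functions live in `tokenize.R` and may differ):

```r
# Illustrative tokenizer: lower-case, strip everything except letters,
# apostrophes and spaces, then split on whitespace. Hypothetical sketch,
# not the repository's actual tokenize.R code.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]", " ", text)       # drop punctuation and digits
  tokens <- unlist(strsplit(text, "\\s+"))  # split on runs of whitespace
  tokens[tokens != ""]                      # discard empty tokens
}
```

Applying `tokenize()` to both the training corpus and the user's input phrase guarantees that, for example, "Hello," and "hello" map to the same dictionary entry.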