Building Predictive Text Models for Twitter, News, and Blogs Corpora

This is an educational capstone project that is part of the Data Science Specialization offered by Johns Hopkins University on Coursera.

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain. A smart keyboard can make it easier for people to type on their mobile devices. One cornerstone of a smart keyboard is a predictive text model. When someone types:

"I went to the"

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, and restaurant. In this capstone we work on understanding and building predictive text models, and we develop a demonstration web app for the resulting model.
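To make the idea concrete, here is a minimal sketch of next-word prediction from n-gram counts with a simple backoff to shorter contexts. It is an illustration only, not the project's method: the project itself is written in R, and the function names (`build_ngram_counts`, `predict_next`), the toy corpus, and the backoff rule are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_ngram_counts(sentences, max_order=3):
    """Map each context (tuple of preceding words) to a Counter of next words.

    A context of length k corresponds to an n-gram of order k + 1.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            for k in range(max_order):
                if i - k < 0:
                    break
                context = tuple(tokens[i - k:i])
                counts[context][word] += 1
    return counts


def predict_next(counts, text, top_n=3, max_order=3):
    """Suggest up to top_n next words, backing off from long to short contexts."""
    tokens = tokenize(text)
    suggestions = []
    longest = min(max_order - 1, len(tokens))
    for k in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        for word, _ in counts.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == top_n:
                return suggestions
    return suggestions


# Hypothetical toy corpus, invented for this example.
corpus = [
    "i went to the gym",
    "i went to the store",
    "i went to the gym today",
    "she went to the restaurant",
    "we drove to the store",
    "he went to the gym",
]

counts = build_ngram_counts(corpus)
print(predict_next(counts, "I went to the"))  # ['gym', 'store', 'restaurant']
```

A real model trained on the full corpora would also need smoothing, pruning of rare n-grams, and a compact storage format, none of which this sketch attempts.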

The capstone consists of three deliverable components:

A predictive text model
A reproducible R Markdown document describing the model-building process
A data product built with Shiny to demonstrate the use of the model

Data

The dataset provided by Coursera can be downloaded here:

Dataset

The files in the dataset are named LOCALE.blogs.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU, and fi_FI. For our tasks we use the English locale (en_US) only. The data come from a corpus called HC Corpora. For more details, see the README file. The files have been language filtered but may still contain some foreign text. Note that the raw data contain offensive and profane words.

Results

The project report and the reproducible R Markdown document describing the model-building and testing process are available here:

Report

Reproducible Rmd

You can access the demonstration web application here:

Word Prediction App (it might take a few seconds for the application to load).


HukoJack/Natural_Language_Processing_Project

Folders and files

Latest commit

History

Repository files navigation

Building Predictive Text Models for Twitter, News, and Blogs Corpora

Introduction

Data

Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages