### Overview

This document describes the data and the procedure used for this project. More detailed information can be found in the comments of the code.
### Input Data

The input data consist of 561 measurements for each of 30 subjects performing 6 different activities. The data are split into two sets originally intended for machine learning: a training set and a test set, each with X inputs (measurements) and y outputs (activity labels).
The data are summarized as follows:
- features = 561 obs, 2 vars - Contains a feature ID and feature description for each of the 561 measurements
- X_train = 7352 obs, 561 vars - Contains the measurements for each feature
- X_test = 2947 obs, 561 vars - Contains the measurements for each feature
- y_train = 7352 obs, 1 var - Contains the activity number for each observation in X_train
- y_test = 2947 obs, 1 var - Contains the activity number for each observation in X_test
- subject_train = 7352 obs, 1 var - Contains the subject number for each observation in X_train
- subject_test = 2947 obs, 1 var - Contains the subject number for each observation in X_test
- activities = 6 obs, 2 vars - Contains the name of the activity for each activity number
Detailed information on how the measurements were made and what the measurements represent can be found in features_info.txt in the input dataset folder.
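As a rough sketch (not part of run_analysis.R itself), the tables above can be loaded with read.table; the paths assume the standard "UCI HAR Dataset" folder layout described under Setup:

```r
# Sketch: loading the input tables. Paths assume the standard
# "UCI HAR Dataset" folder layout shipped in the zip file.
dir <- "UCI HAR Dataset"
features   <- read.table(file.path(dir, "features.txt"),
                         col.names = c("id", "name"))
activities <- read.table(file.path(dir, "activity_labels.txt"),
                         col.names = c("id", "activity"))
X_train       <- read.table(file.path(dir, "train", "X_train.txt"))
y_train       <- read.table(file.path(dir, "train", "y_train.txt"))
subject_train <- read.table(file.path(dir, "train", "subject_train.txt"))
# ...and the corresponding X_test, y_test and subject_test files under "test/"
```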
### Goal

The goal of the project is to find all variables that represent mean or standard deviation values and produce a dataset that averages these values by subject and activity.
There are 79 variables that have mean, std or meanFreq in the name. These were selected for analysis.
- train data = 7352 obs, 79 vars
- test data = 2947 obs, 79 vars
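The selection above can be sketched with a case-sensitive grep on the feature names. The toy names below stand in for the real contents of features.txt:

```r
# Sketch of the variable selection: a case-sensitive grep for
# "mean" or "std" in the feature names. Toy names are illustrative.
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "tBodyAcc-meanFreq()-X", "tBodyAcc-energy()-X",
                   "angle(X,gravityMean)")
# "mean" matches both mean() and meanFreq(); angle(...) variables
# contain "Mean" (capital M) and are therefore excluded
keep <- sort(grep("mean|std", feature_names))
feature_names[keep]   # the selected column names
```

On the real feature list this pattern yields the 79 variables noted above.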
The final dataset is a table of the mean of each variable by subject and activity: 180 rows (30 subjects × 6 activities) and 81 columns (subject, activity, and the 79 measurements).
The final data are written to tidy.csv.
### Procedure

#### Setup

For the script to run, the input data should be extracted to the same directory as run_analysis.R. In other words, the folder "UCI HAR Dataset" and run_analysis.R should be in the same working directory.
The input data can be found here: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
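One way to fetch and extract the data from R (this step is not part of run_analysis.R, which assumes the folder already exists):

```r
# Sketch: download and unzip the dataset into the working directory.
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
if (!dir.exists("UCI HAR Dataset")) {
  download.file(url, "dataset.zip", mode = "wb")
  unzip("dataset.zip")   # creates the "UCI HAR Dataset" folder
}
```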
#### Analysis

The following describes the procedure used in run_analysis.R to manipulate the data and produce a tidy output dataset:
- Get feature names by reading features.txt and subsetting just the names column
- Get feature subset for mean and std dev using grep and sorting an array of column indices
- Read X_train data from X_train.txt and provide column names from features data
- Subset X_train for std and mean only
- Read subject_train data from subject_train.txt and provide column name
- Add subject column to X_train using cbind
- Read X_test data from X_test.txt and provide column names from features data
- Subset X_test for std and mean only
- Read subject_test data from subject_test.txt and provide column name
- Add subject column to X_test using cbind
- Add X_test rows to X_train using rbind
- Read y_train and y_test files and provide column names
- Add y_test rows to y_train using rbind
- Add factors for y_train activity names using cut
- Merge X columns with y column using cbind
- Create tidy dataset with the mean of each measurement for each subject and activity using melt and dcast
- Write tidy dataset to file using write.table
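The final aggregation and output steps can be sketched on a toy data frame. This mirrors the melt/dcast approach used in run_analysis.R, but the two-subject, two-activity data here is purely illustrative, as is the comma separator passed to write.table:

```r
# Toy sketch of the last steps: mean of each measurement by subject
# and activity via melt/dcast, then writing the result to tidy.csv.
library(reshape2)

merged <- data.frame(
  subject  = c(1, 1, 2, 2),
  activity = c("WALKING", "SITTING", "WALKING", "SITTING"),
  `tBodyAcc-mean()-X` = c(0.2, 0.3, 0.25, 0.35),
  check.names = FALSE
)

# Long form: one row per (subject, activity, variable, value)
molten <- melt(merged, id = c("subject", "activity"))
# Wide form: one row per subject/activity, mean of each variable
tidy <- dcast(molten, subject + activity ~ variable, mean)

write.table(tidy, "tidy.csv", sep = ",", row.names = FALSE)
```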
#### Notes
- Since the datasets are large, memory was released when it was no longer needed. This would allow the script to scale better if even larger datasets were used.
- A variable, maxrows, was used to reduce the amount of data read in for testing. This sped up the testing phase.