### Overview

This document describes the data and the procedure used for this project. More detailed information can be found in the comments of the code.
### Input Data

The input data consist of 561 measurements for each of 30 subjects performing 6 different activities. The data are split into two sets originally intended for machine learning: a training set and a test set, each with X inputs (measurements) and y outputs (activity labels).
The data are summarized as follows:
- features = 561 obs, 2 vars - Contains a feature ID and feature description for each of the 561 measurements
- X_train = 7352 obs, 561 vars - Contains the measurements for each feature
- X_test = 2947 obs, 561 vars - Contains the measurements for each feature
- y_train = 7352 obs, 1 var - Contains the activity number for each observation in X_train
- y_test = 2947 obs, 1 var - Contains the activity number for each observation in X_test
- subject_train = 7352 obs, 1 var - Contains the subject number for each observation in X_train
- subject_test = 2947 obs, 1 var - Contains the subject number for each observation in X_test
- activities = 6 obs, 2 vars - Contains the name of the activity for each activity number
Detailed information on how the measurements were made and what the measurements represent can be found in features_info.txt in the input dataset folder.
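As a rough sketch (not part of run_analysis.R itself), the tables above can be loaded with read.table; the paths assume the standard "UCI HAR Dataset" folder layout described under Setup:

```r
# Sketch: loading the input tables. Paths assume the standard
# "UCI HAR Dataset" folder layout shipped in the zip file.
dir <- "UCI HAR Dataset"
features   <- read.table(file.path(dir, "features.txt"),
                         col.names = c("id", "name"))
activities <- read.table(file.path(dir, "activity_labels.txt"),
                         col.names = c("id", "activity"))
X_train       <- read.table(file.path(dir, "train", "X_train.txt"))
y_train       <- read.table(file.path(dir, "train", "y_train.txt"))
subject_train <- read.table(file.path(dir, "train", "subject_train.txt"))
# ...and the corresponding X_test, y_test and subject_test files under "test/"
```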
### Goal

The goal of the project is to find all variables that represent mean or standard deviation values and produce a dataset that averages these values by subject and activity.
There are 79 variables that have mean, std or meanFreq in the name. These were selected for analysis.
- train data = 7352 obs, 79 vars
- test data = 2947 obs, 79 vars
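The selection above can be sketched with a case-sensitive grep on the feature names. The toy names below stand in for the real contents of features.txt:

```r
# Sketch of the variable selection: a case-sensitive grep for
# "mean" or "std" in the feature names. Toy names are illustrative.
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "tBodyAcc-meanFreq()-X", "tBodyAcc-energy()-X",
                   "angle(X,gravityMean)")
# "mean" matches both mean() and meanFreq(); angle(...) variables
# contain "Mean" (capital M) and are therefore excluded
keep <- sort(grep("mean|std", feature_names))
feature_names[keep]   # the selected column names
```

On the real feature list this pattern yields the 79 variables noted above.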
The final dataset is a table of the mean of each variable by subject and activity: 180 rows (30 subjects × 6 activities) and 81 columns (subject, activity, and the 79 measurements).
The final data are written to tidy.csv.
### Procedure

#### Setup

For the script to run, the input data should be extracted to the same directory as run_analysis.R. In other words, the folder "UCI HAR Dataset" and run_analysis.R should be in the same working directory.
The input data can be found here: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
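One way to fetch and extract the data from R (this step is not part of run_analysis.R, which assumes the folder already exists):

```r
# Sketch: download and unzip the dataset into the working directory.
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
if (!dir.exists("UCI HAR Dataset")) {
  download.file(url, "dataset.zip", mode = "wb")
  unzip("dataset.zip")   # creates the "UCI HAR Dataset" folder
}
```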
#### Analysis

The following describes the procedure used in run_analysis.R to manipulate the data and produce a tidy output dataset:
- Get feature names by reading features.txt and subsetting just the names column
- Get feature subset for mean and std dev using grep and sorting an array of column indices
- Read X_train data from X_train.txt and provide column names from features data
- Subset X_train for std and mean only
- Read subject_train data from subject_train.txt and provide column name
- Add subject column to X_train using cbind
- Read X_test data from X_test.txt and provide column names from features data
- Subset X_test for std and mean only
- Read subject_test data from subject_test.txt and provide column name
- Add subject column to X_test using cbind
- Add X_test rows to X_train using rbind
- Read y_train and y_test files and provide column names
- Add y_test rows to y_train using rbind
- Add factors for y_train activity names using cut
- Merge X columns with y column using cbind
- Create tidy dataset with the mean of each measurement for each subject and activity using melt and dcast
- Write tidy dataset to file using write.table
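The final aggregation and output steps can be sketched on a toy data frame. This mirrors the melt/dcast approach used in run_analysis.R, but the two-subject, two-activity data here is purely illustrative, as is the comma separator passed to write.table:

```r
# Toy sketch of the last steps: mean of each measurement by subject
# and activity via melt/dcast, then writing the result to tidy.csv.
library(reshape2)

merged <- data.frame(
  subject  = c(1, 1, 2, 2),
  activity = c("WALKING", "SITTING", "WALKING", "SITTING"),
  `tBodyAcc-mean()-X` = c(0.2, 0.3, 0.25, 0.35),
  check.names = FALSE
)

# Long form: one row per (subject, activity, variable, value)
molten <- melt(merged, id = c("subject", "activity"))
# Wide form: one row per subject/activity, mean of each variable
tidy <- dcast(molten, subject + activity ~ variable, mean)

write.table(tidy, "tidy.csv", sep = ",", row.names = FALSE)
```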
#### Notes
- Since the datasets are large, memory was released when it was no longer needed. This would allow the script to scale better if even larger datasets were used.
- A variable, maxrows, was used to reduce the amount of data read in for testing. This sped up the testing phase.