This repo contains the scripts to prepare a tidy dataset from the data obtained at https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
The scripts create a dataset of the average values of the mean and standard deviation measurements for each activity of each subject, and save it as a text file.
- `run_analysis.R` - the main data processing script
- `download_data.R` - an additional script with a function to load and unpack the data if it is missing from the working directory
- `CodeBook.md` - the description of variables of the resulting tidy dataset
- `README.md` - explanation of data processing performed by the script
- Clone this repo to your desktop
- Launch R and change the working directory to the folder with this repo's contents.
- If you have the `UCI HAR Dataset` folder or the `getdata_projectfiles_UCI HAR Dataset.zip` package, move or copy it to the working directory. If not, don't bother: the script will download it from the source.
- Run the `run_analysis.R` script.
- The resulting dataset will be saved as `TidyDataset.txt` in the working directory.
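
The steps above can be sketched from an R session as follows. This is only an illustration: `repoDir` is a hypothetical placeholder for the path to the cloned repo, and reading the result with `header = TRUE` assumes the file was written with a header row.

```r
# Hypothetical path to the cloned repo; "." means the current working directory.
repoDir <- "."
script <- file.path(repoDir, "run_analysis.R")

if (file.exists(script)) {
  oldWd <- setwd(repoDir)          # the script works relative to the working directory
  source("run_analysis.R")         # runs the whole processing pipeline
  # Assumption: the output has a header row with the variable names.
  tidy <- read.table("TidyDataset.txt", header = TRUE)
  str(tidy)
  setwd(oldWd)                     # restore the previous working directory
} else {
  message("run_analysis.R not found in ", repoDir)
}
```
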
- This script requires the `plyr` package. It will install and/or load it at the beginning.
- It expects the data to be located in the `UCI HAR Dataset` subfolder in the working directory.
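
A minimal sketch of how these two requirements could be checked is shown below. The helper names `ensurePackage` and `ensureDataset` are hypothetical and are not taken from the scripts in this repo (the repo's own function is `DownloadData` in `download_data.R`); only base R and `utils` calls are used.

```r
# Hypothetical helper: install a package if it is missing, then attach it.
ensurePackage <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # may need a CRAN mirror set in non-interactive sessions
  }
  library(pkg, character.only = TRUE)
}

# Hypothetical helper: make sure the data folder exists, unpacking or
# downloading the zip package only when needed.
ensureDataset <- function(
    dataDir = "UCI HAR Dataset",
    zipFile = "getdata_projectfiles_UCI HAR Dataset.zip",
    url = "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip") {
  if (dir.exists(dataDir)) {
    return("present")              # nothing to do
  }
  status <- "unpacked"
  if (!file.exists(zipFile)) {
    download.file(url, destfile = zipFile, mode = "wb")  # binary mode for zip files
    status <- "downloaded"
  }
  unzip(zipFile)                   # extracts into the working directory
  status
}

# Usage (not run here, since it may install a package or download data):
# ensurePackage("plyr")
# ensureDataset()
```
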
- The structure of folders and files in `UCI HAR Dataset` is preserved.
- The descriptive activity names and their corresponding numeric labels are in the `activity_labels.txt` file. They are ordered by the numeric label.
- The names of measurements are in the `features.txt` file. They are indexed and ordered by their indices.
  - The variables containing the means of measurements have `-mean()` in their names, and those containing standard deviations of measurements have `-std()` in their names.
- The training data is in the `train` subfolder, and the test data is in the `test` subfolder.
  - The train and test measurements are in `X_train.txt` and `X_test.txt` respectively.
  - The subject codes for the train and test datasets are in the `subject_train.txt` and `subject_test.txt` files respectively.
  - The numeric activity labels for the train and test datasets are in `y_train.txt` and `y_test.txt` respectively.
- The data in the `Inertial Signals` subfolders is not required to complete this assignment, so the script does not use it.
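
The layout described above can be written down as a set of paths relative to the working directory. This is a sketch only; the variable names `dataDir` and `paths` are illustrative and may differ from the ones used in `run_analysis.R`.

```r
# Folder and file names follow the layout described above.
dataDir <- "UCI HAR Dataset"

paths <- list(
  activityLabels = file.path(dataDir, "activity_labels.txt"),
  features       = file.path(dataDir, "features.txt"),
  trainX         = file.path(dataDir, "train", "X_train.txt"),
  trainY         = file.path(dataDir, "train", "y_train.txt"),
  trainSubject   = file.path(dataDir, "train", "subject_train.txt"),
  testX          = file.path(dataDir, "test", "X_test.txt"),
  testY          = file.path(dataDir, "test", "y_test.txt"),
  testSubject    = file.path(dataDir, "test", "subject_test.txt")
)

# Report which of the expected files are present in the working directory.
present <- vapply(paths, file.exists, logical(1))
print(present)
```
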
- Store the directory and file names listed above in variables.
- Check if the `UCI HAR Dataset` subfolder is present. If not, load the `DownloadData` function from `download_data.R`, which will locate and unpack the `getdata_projectfiles_UCI HAR Dataset.zip` package, downloading it first if it is missing.
- Process activities data:
  - Load the train and test numeric activity labels and bind them into a single dataframe `uciActivity`.
  - Load the descriptive activity names and use them to change the numeric labels in `uciActivity` to descriptive names.
- Process subjects data:
  - Load the train and test subject codes and bind them into a single dataframe `uciSubj`.
- Process measurement names:
  - Load the measurement names, extract only those containing `-std()` or `-mean()` and store these names in the character vector `namesMeanSd`.
  - Get the row numbers of these names (which are the same as their numeric indices) and store them in an integer vector `indexMeanSd`.
  - Clean up the names: process them with the `make.names` function, then remove duplicate dots and dots from the ends. Also replace "BodyBody" in some of the names with just "Body".
- Process measurements:
  - Load the train and test sets of measurements and bind them into a single dataframe `uciData`.
  - Extract a subset `uciSubset`, selecting only the columns with indices present in `indexMeanSd`.
  - Label the columns of the subset with the measurement names from `namesMeanSd`.
- Create the tidy dataset:
  - Bind the `uciSubj`, `uciActivity` and `uciSubset` dataframes into a single `uci` dataframe.
  - Apply the `ddply` function to split the `uci` dataframe by `subject` and `activity`, calculate the means of all measurement columns and save the result into the `uciTidy` dataframe.
  - Save `uciTidy` as `TidyDataset.txt` in the working directory.
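
The processing steps above can be sketched on a tiny made-up dataset. This is an illustration under stated assumptions, not the code from `run_analysis.R`: all values are invented, the column names `index`, `name` and `label` are illustrative, base R `aggregate` is used as a stand-in for `ddply` so that the example runs without extra packages, and the output is written to a temporary folder instead of the working directory.

```r
# Toy stand-ins for features.txt and activity_labels.txt (first rows only).
features <- data.frame(
  index = 1:5,
  name = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X",
           "fBodyBodyGyroMag-mean()", "fBodyAcc-meanFreq()-X"),
  stringsAsFactors = FALSE
)
activityLabels <- data.frame(label = 1:2,
                             name = c("WALKING", "WALKING_UPSTAIRS"),
                             stringsAsFactors = FALSE)

# Activities: bind train and test labels, then swap numbers for names.
uciActivity <- rbind(data.frame(activity = c(1, 2, 1)),   # toy y_train
                     data.frame(activity = c(2, 1)))      # toy y_test
uciActivity$activity <- factor(uciActivity$activity,
                               levels = activityLabels$label,
                               labels = activityLabels$name)

# Subjects: bind train and test subject codes.
uciSubj <- rbind(data.frame(subject = c(1, 1, 1)),        # toy subject_train
                 data.frame(subject = c(2, 2)))           # toy subject_test

# Measurement names: keep only -mean() and -std(), matched literally.
isMeanSd <- grepl("-mean()", features$name, fixed = TRUE) |
  grepl("-std()", features$name, fixed = TRUE)
namesMeanSd <- features$name[isMeanSd]
indexMeanSd <- features$index[isMeanSd]

# Clean up the names.
namesMeanSd <- make.names(namesMeanSd)               # invalid characters become dots
namesMeanSd <- gsub("\\.+", ".", namesMeanSd)        # collapse repeated dots
namesMeanSd <- gsub("^\\.|\\.$", "", namesMeanSd)    # drop dots at the ends
namesMeanSd <- gsub("BodyBody", "Body", namesMeanSd, fixed = TRUE)

# Measurements: bind train and test, keep the selected columns, label them.
xTrain <- data.frame(V1 = c(1, 2, 3), V2 = c(0.1, 0.2, 0.3), V3 = 0,
                     V4 = c(10, 20, 30), V5 = 0)
xTest <- data.frame(V1 = c(4, 6), V2 = c(0.4, 0.6), V3 = 0,
                    V4 = c(40, 60), V5 = 0)
uciData <- rbind(xTrain, xTest)
uciSubset <- uciData[, indexMeanSd]
names(uciSubset) <- namesMeanSd

# Tidy dataset: one row per subject/activity pair with column means.
uci <- cbind(uciSubj, uciActivity, uciSubset)
uciTidy <- aggregate(. ~ subject + activity, data = uci, FUN = mean)
# Assumed plyr equivalent of the line above:
# uciTidy <- ddply(uci, c("subject", "activity"), numcolwise(mean))

# Written to a temporary folder here to avoid touching the working directory.
write.table(uciTidy, file.path(tempdir(), "TidyDataset.txt"), row.names = FALSE)
print(uciTidy)
```

Note that `fixed = TRUE` matters in the name filter: it treats `-mean()` as a literal string, so names such as `fBodyAcc-meanFreq()-X` are not selected.
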