This repo (datacleaning) contains a data cleaning project for a Coursera class.
The purpose of the project is to demonstrate the ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis.
This repo (datacleaning) contains 4 files: README.MD, CodeBook.MD, run_analysis.R and tidy_data.txt
- README.md: a markdown file that contains general information
- CodeBook.Rmd: a markdown file that describes the variables used in tidy_data.txt
- run_analysis.R: an R script that performs the project and generates the final result (tidy_data.txt)
- tidy_data.txt: a text file that contains an independent tidy data set with the average of each variable for each activity and each subject
Notes:
- all Samsung data files were downloaded and extracted into the local working directory before running run_analysis.R
- the Inertial Signals data in both the test and train data sets are ignored
- a single script (run_analysis.R) handles everything, including cleaning the data and generating the new tidy data set
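Because the script expects the data files to be in place already, a quick pre-flight check can save a confusing error later. The following is a minimal sketch, not part of run_analysis.R: the `required_files` vector and the `missing_files()` helper are hypothetical names, and the `test/` and `train/` sub-folder layout is an assumption based on the standard archive layout. Adjust the paths if the files were extracted flat.

```r
# Files the script is assumed to need, relative to the working directory.
# The test/ and train/ sub-folders are an assumption; edit if yours differ.
required_files <- c(
  "features.txt",
  "activity_labels.txt",
  file.path("test",  c("subject_test.txt",  "y_test.txt",  "X_test.txt")),
  file.path("train", c("subject_train.txt", "y_train.txt", "X_train.txt"))
)

# Return the required files that are missing under `data_dir`
# (an empty character vector means everything is in place).
missing_files <- function(data_dir = ".") {
  required_files[!file.exists(file.path(data_dir, required_files))]
}

missing <- missing_files()
if (length(missing) > 0) {
  message("Missing input files: ", paste(missing, collapse = ", "))
}
```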
The script performs the following:
- read in several text files (training and test sets) and merge them to create one data set
- extract only the mean and standard deviation for each measurement
- create a second, independent tidy data set with the average of each variable for each activity and each subject
- write that tidy data set to a text file
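The first step in the list above (reading the training and test files and merging them) could look roughly like this. It is a sketch under assumptions, not the code in run_analysis.R: `read_group()` and `read_all()` are hypothetical helpers, and the paths again assume `test/` and `train/` sub-folders under the data directory.

```r
# Read one group ("test" or "train") and bind subject, activity and
# measurements column-wise. The folder layout is an assumption.
read_group <- function(data_dir, group) {
  path <- function(prefix) {
    file.path(data_dir, group, paste0(prefix, "_", group, ".txt"))
  }
  subject  <- read.table(path("subject"), col.names = "subject")
  activity <- read.table(path("y"),       col.names = "activity")
  measures <- read.table(path("X"))
  cbind(subject, activity, measures)
}

# Stack the two groups row-wise into one data set.
read_all <- function(data_dir = ".") {
  rbind(read_group(data_dir, "train"), read_group(data_dir, "test"))
}
```

With the real files in place, `dim(read_all())` would be expected to match the 10299x563 shape reported in the procedure.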
Procedures:
- set the working directory
- read in the test group files: subject_test.txt, y_test.txt and X_test.txt
- check for NA values
- combine them into 1 data set (test: 2947x563)
- read in the train group files: subject_train.txt, y_train.txt and X_train.txt
- check for NA values
- combine them into 1 data set (train: 7352x563)
- combine train and test into 1 data set (dataset: 10299x563)
- read in features.txt, make a column name list (col_name), and rename the columns of dataset based on features.txt
- read in activity_labels.txt and set the activity values based on it
- select the variables containing mean() and std() from the column names of dataset
- create a data set (data: 10299x68): subject, activity, then all mean variables, followed by all std variables
- create 2 column lists: 1 for mean, 1 for std, both including subject and activity
- subset data into 2 data sets: 1 for mean (data_mean: 10299x35), 1 for std (data_std: 10299x35)
- calculate the averages of the mean and std variables for each combination of subject and activity from data_mean and data_std, generating 2 results (ave_mean: 180x35, ave_std: 180x35)
- merge those 2 results (ave_mean and ave_std) into 1 data set (tidy_data: 180x68), ordered by subject
- write tidy_data to a text file, tidy_data.txt, with column names but no row names
- (optional) read tidy_data.txt back in for checking
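The labelling, selection, averaging and writing steps above can be tried on a toy data frame. This is an illustrative sketch rather than the code in run_analysis.R: the toy values, the three feature columns and the output file name are made up for the example, and it averages all selected columns in a single `aggregate()` call instead of splitting into data_mean and data_std and merging them afterwards.

```r
# Toy stand-in for the merged data set: 2 subjects x 2 activities,
# two rows each, with three illustrative feature columns.
dataset <- data.frame(
  subject  = rep(1:2, each = 4),
  activity = rep(c(1, 1, 2, 2), times = 2),
  a = c(1, 3, 5, 7, 2, 4, 6, 8),
  b = c(10, 30, 50, 70, 20, 40, 60, 80),
  c = 0
)
names(dataset)[-(1:2)] <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                            "tBodyAcc-meanFreq()-X")

# Replace activity codes with descriptive labels (as activity_labels.txt would).
activity_labels <- data.frame(code = 1:2,
                              label = c("WALKING", "WALKING_UPSTAIRS"),
                              stringsAsFactors = FALSE)
dataset$activity <- activity_labels$label[match(dataset$activity,
                                                activity_labels$code)]

# Keep only mean() and std() columns; fixed = TRUE matches the literal
# text "mean()", so the meanFreq() column is excluded.
mean_cols <- grep("mean()", names(dataset), fixed = TRUE, value = TRUE)
std_cols  <- grep("std()",  names(dataset), fixed = TRUE, value = TRUE)
data <- dataset[, c("subject", "activity", mean_cols, std_cols)]

# Average every selected variable per subject/activity pair.
tidy_data <- aggregate(
  data[, c(mean_cols, std_cols), drop = FALSE],
  by = list(subject = data$subject, activity = data$activity),
  FUN = mean
)
tidy_data <- tidy_data[order(tidy_data$subject, tidy_data$activity), ]

# Write with column names but no row names, then read back to check.
out_file <- file.path(tempdir(), "tidy_data_demo.txt")
write.table(tidy_data, out_file, row.names = FALSE)
check <- read.table(out_file, header = TRUE, check.names = FALSE)
```

Passing `check.names = FALSE` when reading back keeps names such as `tBodyAcc-mean()-X` intact; without it, `read.table()` rewrites them into syntactic R names.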
tidy_data.txt is a tidy data set in text format with the average of each variable for each activity and each subject; it has 68 variables and 180 observations.
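The stated shape can be verified after the script has run. The helper below is a hypothetical sketch, not part of the repo: `check_tidy()` is an invented name, its defaults (180 rows, 68 columns) are taken from this README, and it assumes the first two columns are named `subject` and `activity`.

```r
# Check that a tidy data file has the expected shape and exactly one row
# per subject/activity pair. Column names are an assumption.
check_tidy <- function(path, n_rows = 180, n_cols = 68) {
  td <- read.table(path, header = TRUE, check.names = FALSE)
  pairs_unique <- !any(duplicated(td[, c("subject", "activity")]))
  nrow(td) == n_rows && ncol(td) == n_cols && pairs_unique
}
```

Calling `check_tidy("tidy_data.txt")` from the working directory should return `TRUE` if the file matches the description above.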