datacleaning

This repo (datacleaning) presents a project for a data cleaning class on Coursera.

The purpose of the project is to demonstrate the ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis.

This repo (datacleaning) contains 4 files: README.md, CodeBook.Rmd, run_analysis.R and tidy_data.txt

README.md: a markdown file that contains general information

CodeBook.Rmd: a markdown file that describes the variables used in tidy_data.txt

run_analysis.R: an R script that performs the project and generates the final result (tidy_data.txt)

tidy_data.txt: a text file that contains an independent tidy data set with the average of each variable for each activity and each subject

Assumptions and Conditions

1. all Samsung data files have been downloaded and extracted into the local working directory before running run_analysis.R
1. Inertial Signals data in both the test and train data sets are ignored
1. a single script (run_analysis.R) handles everything, including cleaning the data and generating the new tidy data set
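
The first assumption can be checked before running the script. The snippet below is a minimal sketch, not part of run_analysis.R: `expected_files` and `missing_files()` are hypothetical names, and the relative paths assume the files keep the layout and capitalization of the extracted archive (adjust them if yours differ).

```r
# Hypothetical pre-flight check; the paths below are an assumption about
# where the extracted Samsung files sit relative to the working directory.
expected_files <- c(
  "features.txt",
  "activity_labels.txt",
  file.path("test",  c("subject_test.txt",  "y_test.txt",  "X_test.txt")),
  file.path("train", c("subject_train.txt", "y_train.txt", "X_train.txt"))
)

# Return the subset of `paths` that does not exist under `root`.
missing_files <- function(paths, root = getwd()) {
  paths[!file.exists(file.path(root, paths))]
}

missing <- missing_files(expected_files)
if (length(missing) > 0) {
  message("Missing data files: ", paste(missing, collapse = ", "))
}
```

If nothing is reported, the files the script reads are in place under the assumed layout.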

About run_analysis.R (see script for code details)

The script performs the following:

1. read in several text files (training and test sets) and merge them to create one dataset
1. extract only the mean and standard deviation for each measurement
1. create a second, independent tidy data set with the average of each variable for each activity and each subject
1. write that tidy data set to a text file

Procedures:

1. set working directory
1. read in test group files: subject_test.txt, y_test.txt and X_test.txt
1. check for NA values
1. combine into 1 dataset (test:2947x563)
1. read in train group files: subject_train.txt, y_train.txt and X_train.txt
1. check for NA values
1. combine into 1 dataset (train:7352x563)
1. combine train and test into 1 dataset (dataset:10299x563)
1. read in features.txt, build a list of column names (col_name), and rename the columns of dataset from it
1. read in activity_labels.txt and replace the activity codes with the labels from it
1. select the variables containing mean() and std() from the column names of dataset
1. create a data set (data:10299x68): subject, activity, then all mean variables, followed by all std variables
1. create 2 column lists: 1 for mean, 1 for std, both including subject & activity
1. subset data into 2 data sets: 1 for mean (data_mean:10299x35), 1 for std (data_std:10299x35)
1. calculate the average of each mean and std variable for every combination of subject & activity from data_mean and data_std, generating 2 results (ave_mean:180x35, ave_std:180x35)
1. merge those 2 results (ave_mean and ave_std) into 1 data set (tidy_data:180x68), ordered by subject
1. write tidy_data to a text file (tidy_data.txt) with column names but no row names
1. (optional) read tidy_data.txt back in for checking
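
The procedure above can be sketched on toy data. This is an illustration under stated assumptions rather than the code in run_analysis.R: the input values are made up, only three features are used instead of the full set, and `average_by_group()` is a hypothetical helper. The object names `col_name`, `data_mean`, `data_std`, `ave_mean`, `ave_std` and `tidy_data` follow the steps listed above.

```r
# Toy stand-ins for the input files (made-up values, three features only).
subject  <- read.table(text = "1\n1\n2\n2", col.names = "subject")
y        <- read.table(text = "1\n2\n1\n1", col.names = "activity")
x        <- read.table(text = "0.1 0.5 9\n0.3 0.7 8\n0.2 0.4 7\n0.4 0.6 6")
features <- read.table(
  text = "1 tBodyAcc-mean()-X\n2 tBodyAcc-std()-X\n3 tBodyAcc-max()-X",
  stringsAsFactors = FALSE
)
labels   <- read.table(text = "1 WALKING\n2 SITTING", stringsAsFactors = FALSE)

# Check for NA values, then combine subject, activity and measurements.
stopifnot(!anyNA(subject), !anyNA(y), !anyNA(x))
dataset <- cbind(subject, y, x)

# Rename the measurement columns and map activity codes to their labels.
col_name <- features$V2
names(dataset) <- c("subject", "activity", col_name)
dataset$activity <- labels$V2[match(dataset$activity, labels$V1)]

# Split into mean() and std() subsets, each keeping subject and activity.
mean_cols <- grep("mean()", col_name, fixed = TRUE, value = TRUE)
std_cols  <- grep("std()",  col_name, fixed = TRUE, value = TRUE)
data_mean <- dataset[, c("subject", "activity", mean_cols)]
data_std  <- dataset[, c("subject", "activity", std_cols)]

# Hypothetical helper: average every measurement per subject/activity pair.
average_by_group <- function(df) {
  aggregate(df[, -(1:2), drop = FALSE],
            by = list(subject = df$subject, activity = df$activity),
            FUN = mean)
}
ave_mean <- average_by_group(data_mean)
ave_std  <- average_by_group(data_std)

# Merge the two results on subject and activity, then order by subject.
tidy_data <- merge(ave_mean, ave_std, by = c("subject", "activity"))
tidy_data <- tidy_data[order(tidy_data$subject), ]

# Write with column names but no row names (to a temporary file here).
out_file <- tempfile(fileext = ".txt")
write.table(tidy_data, out_file, row.names = FALSE)
```

The toy input has three subject/activity combinations, so `tidy_data` has three rows here, one per combination, in the same way the real result has one row for each of its 180 combinations.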

About tidy_data.txt (see contents for details)

It is a tidy data set in text format with the average of each variable for each activity and each subject. It has 68 variables and 180 observations.
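
To load the file for later analysis, something like the following should work. It is a sketch that assumes tidy_data.txt is in the working directory and has a header row; `read_tidy()` is a hypothetical helper name, not a function from run_analysis.R.

```r
# Hypothetical helper: read the tidy data set, keeping column names as written.
read_tidy <- function(path = "tidy_data.txt") {
  read.table(path, header = TRUE, check.names = FALSE)
}

if (file.exists("tidy_data.txt")) {
  tidy_data <- read_tidy()
  print(dim(tidy_data))    # rows and columns; compare with the description above
  print(anyNA(tidy_data))  # TRUE would indicate missing values
}
```

Setting `check.names = FALSE` keeps characters such as parentheses and hyphens in the column names instead of letting R rewrite them.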
