Bayesian learning for classifying netnews text articles:

Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents. We use the 20 Newsgroups dataset, which contains 20,000 newsgroup messages: 1,000 documents from each of 20 newsgroups.

Download the data from http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html (Newsgroup Data)

How to run the code:

The only variable you need to set before running the code is path: the full path to the directory where the data is located.
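For example (the directory shown is hypothetical; point it at wherever you extracted the newsgroup data):

    path = '/home/user/20_newsgroups'   # full path to the extracted dataset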

Algorithm:

The project has three phases: preprocessing, training, and testing. In the preprocessing phase, we clean the data: some characters need to be removed entirely, and others need to be replaced with a space character.

We keep two lists, remove_list and replace_list: every character in remove_list is deleted from the data, and every character in replace_list is replaced with a space (a sketch of this step follows the lists below).

  • replace_list = ["'", '!', '/', '\\', '=', ',', ':', '\n']

  • remove_list = ['<', '>', '?', '.', '"', ')', '(', '|', '-', '#', '*', '+']
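A minimal sketch of this cleaning step (it also lowercases the text, as described next); the repository's actual clean_text may differ in its details:

    replace_list = ["'", '!', '/', '\\', '=', ',', ':', '\n']
    remove_list = ['<', '>', '?', '.', '"', ')', '(', '|', '-', '#', '*', '+']

    def clean_text(text):
        for ch in remove_list:
            text = text.replace(ch, '')     # delete the character entirely
        for ch in replace_list:
            text = text.replace(ch, ' ')    # turn the character into a space
        return text.lower()                 # lowercase everything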

We also convert all uppercase letters to lowercase; the clean_text function performs this whole cleaning step. In the training phase, we take the first 500 files of each class, clean each file, and split it on the space character. We count the number of occurrences of each word in the class's files and store the counts in a per-class dictionary. After training, we end up with roughly 170,000 distinct words across all classes (from the 10,000 training files).
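A sketch of this training step; the per-class sub-directory layout and the names word_counts and class_totals are illustrative assumptions, not taken from the repository:

    import os
    from collections import Counter

    word_counts = {}     # per class: word -> number of occurrences in its first 500 files
    class_totals = {}    # per class: total number of word occurrences

    for c in os.listdir(path):                               # one sub-directory per newsgroup
        class_dir = os.path.join(path, c)
        word_counts[c] = Counter()
        for name in sorted(os.listdir(class_dir))[:500]:     # first 500 files per class
            with open(os.path.join(class_dir, name), errors='ignore') as f:
                word_counts[c].update(w for w in clean_text(f.read()).split(' ') if w)
        class_totals[c] = sum(word_counts[c].values())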

In the testing phase, we pick files at random from those not used for training; the get_file function returns them. As before, we clean each file with clean_text, and then compute the log-probability of the file under each class:

    For each class c:
        Pc = 0
        For each word in cleaned_file:
            Pc = Pc + log(P(word | c))

The get_probability function computes this score for a given class. The predicted class is the one with the highest score Pc.
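A sketch of this scoring and prediction step. The add-one (Laplace) smoothing and the test_file variable below are assumptions for illustration; the README does not specify how unseen words are handled:

    import math

    vocab_size = len(set().union(*word_counts.values()))   # distinct words seen during training

    def get_probability(words, c):
        # log P(file | class c) under the naive-Bayes independence assumption;
        # add-one smoothing (an assumption) keeps unseen words from producing log(0)
        pc = 0.0
        for word in words:
            count = word_counts[c].get(word, 0)
            pc += math.log((count + 1) / (class_totals[c] + vocab_size))
        return pc

    # score an unseen file (test_file is a hypothetical path) and pick the best class
    with open(test_file, errors='ignore') as f:
        words = [w for w in clean_text(f.read()).split(' ') if w]
    predicted = max(word_counts, key=lambda c: get_probability(words, c))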

The accuracy rate for this code is 87%.
