
Blog Analysis

Web Scraping and Text Analysis

Instructions:

  1. Extract the zip file:

    • Locate the zip file containing all the required files (e.g., program_files.zip).
    • Right-click on the zip file.
    • Choose "Extract All" or a similar option, depending on your operating system.
    • Select a destination folder where you want to extract the files and click "Extract".
  2. Navigate to the extracted folder:

    • Open the folder where you extracted the files. You should see all the necessary files, including main.py, OutputDataStructure.xlsx, requirements.txt, and two text files.
  3. Check Python installation:

    • Ensure that Python is installed on your system. You can do this by opening a command prompt (Windows) or terminal (macOS/Linux) and typing:
      python --version
      
    • If Python is not installed, download and install it from the official Python website: https://www.python.org/downloads/
  4. Install required libraries:

    • Open a command prompt or terminal.
    • Navigate to the directory where you extracted the files using the cd command.
    • Install the required libraries using pip and the requirements.txt file. Run the following command:
      pip install -r requirements.txt
      
  5. Run the Python program:

    • Open a command prompt (Windows) or terminal (macOS/Linux).
    • Navigate to the directory where you extracted the files using the cd command.
    • Run the Python program by typing:
      python main.py
      
    • Press Enter to execute the command.
    • The program should start running.
  6. Review the output:

    • Once the program finishes execution, the output is written to OutputDataStructure.xlsx in the local directory.
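
To inspect the results programmatically rather than in Excel, a minimal sketch using the openpyxl library is shown below (an assumption; any tool that reads .xlsx files works):

    from openpyxl import load_workbook

    # Open the workbook the program just updated.
    wb = load_workbook("OutputDataStructure.xlsx")
    ws = wb.active

    # Print the header row followed by the first data row.
    for row in ws.iter_rows(min_row=1, max_row=2, values_only=True):
        print(row)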

Improved Approach to the Solution:

Introduction:

The task of web scraping the Blackcoffer website involved retrieving the text data from each URL using the requests library. This was done by examining the HTML class name used for article content on the website; the task was straightforward, as the class name was consistent across most URLs.
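
A minimal sketch of this scraping step, assuming requests together with BeautifulSoup for HTML parsing (the class name below is a placeholder; the real one comes from inspecting the Blackcoffer article pages):

    import requests
    from bs4 import BeautifulSoup

    def fetch_article_text(url: str) -> str:
        """Fetch a page and return the text inside the article container."""
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # surfaces 404s and other HTTP errors
        soup = BeautifulSoup(response.text, "html.parser")
        # Placeholder class name; substitute the one observed on the site.
        container = soup.find("div", class_="article-content")
        return container.get_text(separator=" ", strip=True) if container else ""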

Text Cleaning:

Once the articles were retrieved, the next step was to tidy up the text. The NLTK library was used to remove stopwords and punctuation, as mentioned in the objective.
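
A minimal sketch of this cleaning step using NLTK's built-in English stopword list (the actual project may also load custom stopword lists from the bundled text files):

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # one-time download; safe to repeat

    def clean_text(text: str) -> list:
        """Keep alphabetic tokens that are not English stopwords."""
        stop_words = set(stopwords.words("english"))
        # Regex tokenization drops punctuation and digits in one pass.
        tokens = re.findall(r"[a-zA-Z]+", text.lower())
        return [t for t in tokens if t not in stop_words]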

Analysis Methods:

To facilitate analysis, several methods were implemented:

  1. calculate_positive_score
  2. calculate_negative_score
  3. calculate_polarity_score
  4. calculate_subjectivity_score
  5. calculate_complex_words
  6. calculate_average_sentence_length
  7. calculate_avg_word_per_sentence
  8. calculate_fog_index
  9. count_syllables
  10. calculate_average_syllable_count
  11. count_personal_pronouns
  12. calculate_average_word_length

These methods analyze the text and compute the required variables, which are then stored in the workbook automatically.
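
As an illustration, here is a sketch of what two of these methods might look like, assuming a vowel-group heuristic for syllables and the standard Gunning fog formula (the repository's exact definitions may differ):

    import re

    def count_syllables(word: str) -> int:
        """Rough heuristic: count groups of consecutive vowels."""
        word = word.lower()
        count = len(re.findall(r"[aeiouy]+", word))
        if word.endswith("e") and count > 1:  # discount a silent trailing 'e'
            count -= 1
        return max(count, 1)

    def calculate_fog_index(words: list, sentence_count: int) -> float:
        """Gunning fog index: 0.4 * (avg sentence length + % complex words)."""
        if not words or sentence_count == 0:
            return 0.0
        avg_sentence_length = len(words) / sentence_count
        complex_words = sum(1 for w in words if count_syllables(w) > 2)
        percent_complex = complex_words / len(words) * 100
        return 0.4 * (avg_sentence_length + percent_complex)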

Challenges Faced:

  1. Error Handling: Some URLs return a 404 error. In such cases, all output variables are set to 0 (a minimal sketch of this pattern follows the list).
  2. Different Templates: Certain URLs utilize distinct HTML templates for articles, necessitating generalized code adaptable to all templates.
  3. Workbook Automation: Streamlining the process of updating variables within the workbook.
  4. Modular Approach: Adhering to a modular approach to prevent code clutter and enhance readability.
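
For challenge 1, a minimal sketch of the zero-fill pattern, with requests as the HTTP client and a hypothetical subset of the output variable names:

    from typing import Optional

    import requests

    # Hypothetical subset; the real program computes all twelve variables.
    METRICS = ["positive_score", "negative_score", "polarity_score"]

    def safe_fetch(url: str) -> Optional[str]:
        """Return the page HTML, or None when the request fails (e.g., a 404)."""
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            return None

    def zero_filled_metrics() -> dict:
        """A failed URL maps every output variable to 0."""
        return {name: 0 for name in METRICS}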
