- Extract the zip file:
  - Locate the zip file containing all the required files (e.g., `program_files.zip`).
  - Right-click on the zip file.
  - Choose "Extract All" or a similar option, depending on your operating system.
  - Select a destination folder where you want to extract the files and click "Extract".
- Navigate to the extracted folder:
  - Open the folder where you extracted the files. You should see all the necessary files, including `main.py`, `OutputDataStructure.xlsx`, two text files, and `requirements.txt`.
- Check Python installation:
  - Ensure that Python is installed on your system. You can do this by opening a command prompt (Windows) or terminal (macOS/Linux) and typing `python --version`.
  - If Python is not installed, download and install it from the official Python website: Python Downloads.
- Install required libraries:
  - Open a command prompt or terminal.
  - Navigate to the directory where you extracted the files using the `cd` command.
  - Install the required libraries using pip and the `requirements.txt` file. Run the following command: `pip install -r requirements.txt`.
- Run the Python program:
  - Open a command prompt (Windows) or terminal (macOS/Linux).
  - Navigate to the directory where you extracted the files using the `cd` command.
  - Run the Python program by typing `python main.py` and pressing Enter.
  - The program should start running.
- Review the output:
  - Once the program finishes execution, the output is written to `OutputDataStructure.xlsx` in the local directory.
The web scraping task involved fetching the text data of each article from the Blackcoffer website using the requests library. This was done by examining the HTML class name used on the site; the task was straightforward, as the class name was consistent across most URLs.
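The retrieval step can be sketched as follows. The class names `entry-title` and `td-post-content` are illustrative placeholders, since the actual template classes are not reproduced here, and `fetch_article_text` is a hypothetical helper name:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative guesses at the article template's class names,
# not values confirmed from the actual site.
TITLE_CLASS = "entry-title"
BODY_CLASS = "td-post-content"

def extract_article_text(html):
    """Pull the title and body text out of one article's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find(class_=TITLE_CLASS)
    body = soup.find(class_=BODY_CLASS)
    title_text = title.get_text(strip=True) if title else ""
    body_text = body.get_text(separator=" ", strip=True) if body else ""
    return (title_text + " " + body_text).strip()

def fetch_article_text(url):
    """Download an article page and return its plain text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_article_text(response.text)
```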
Once the articles were retrieved, the next step was to tidy up the text. The NLTK library was used to remove all the stopwords and punctuation, as mentioned in the objective.
To facilitate analysis, several methods were implemented:
- calculate_positive_score
- calculate_negative_score
- calculate_polarity_score
- calculate_subjectivity_score
- calculate_complex_words
- calculate_average_sentence_length
- calculate_avg_word_per_sentence
- calculate_fog_index
- count_syllables
- calculate_average_syllable_count
- count_personal_pronouns
- calculate_average_word_length
These methods analyze the text and provide the necessary variables. The resulting variables are automatically stored within the workbook.
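As an illustration, a few of these methods could look like the sketch below. The polarity and fog-index formulas follow the standard definitions (polarity normalized to [-1, 1]; Gunning fog index), and the syllable counter is a deliberately rough heuristic, not necessarily the exact logic in `main.py`:

```python
def calculate_polarity_score(positive_score, negative_score):
    """Polarity in [-1, 1]; the small constant avoids division by zero."""
    return (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)

def calculate_fog_index(avg_sentence_length, percent_complex_words):
    """Gunning fog index: 0.4 * (average sentence length + % complex words)."""
    return 0.4 * (avg_sentence_length + percent_complex_words)

def count_syllables(word):
    """Rough syllable count: vowel groups, ignoring a trailing 'es'/'ed'."""
    word = word.lower()
    if word.endswith(("es", "ed")):
        word = word[:-2]
    count = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in "aeiou"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)
```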
- Error Handling: Some URLs return a 404 error. In such cases, all variables are set to 0.
- Different Templates: Certain URLs utilize distinct HTML templates for articles, necessitating generalized code adaptable to all templates.
- Workbook Automation: Streamlining the process of updating variables within the workbook.
- Modular Approach: Adhering to a modular approach to prevent code clutter and enhance readability.