Journalism Scrapers


A collection of scripts that scrape and format comments from several major news publications.

Getting Started

Dependencies

Install the required dependencies from your terminal using pip install.
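This README does not pin an exact package list, but based on the features it describes (Selenium-driven scraping, pandas dataframes, and Google Sheets output), something like the following is a reasonable starting point; the package names other than selenium are assumptions:

pip install selenium pandas gspread oauth2client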

New York Times Scraper Requirements

To obtain user comments with this scraper, you need a New York Times Developer API key.

FiveThirtyEight Scraper Requirements

To obtain user comments with this scraper, you need Selenium installed, along with a ChromeDriver executable that matches your browser version.

Washington Post Scraper Requirements

To obtain user comments with this scraper, you likewise need Selenium and a ChromeDriver executable installed.

Additional Requirements

To use the scrapers' write_to_gsheet() methods, you must have a service account and OAuth2 credentials from the Google API Console.
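As a rough sketch of what that setup involves (assuming the gspread library and a service-account JSON key downloaded from the Google API Console; the scrapers' actual credential handling may differ):

import gspread

# Hypothetical example: authorize with a downloaded service-account key file.
gc = gspread.service_account(filename="my-service-account.json")

# Open a spreadsheet the service account has been granted access to,
# then select a worksheet by index (placeholders throughout).
spreadsheet = gc.open_by_url(spreadsheet_url)
worksheet = spreadsheet.get_worksheet(sheet_number)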

Limitations

The New York Times scraper obtains a comment's Article URL, Parent ID, Comment ID, User Display Name, Comment Body, Upload Date, Number of Likes, Number of Replies, and Editor's Selection.

The Washington Post scraper obtains a comment's Article URL, User Display Name, Comment Body, Upload Date, and Number of Likes.

The FiveThirtyEight scraper obtains a comment's Article URL, User Display Name, Comment Body, and Upload Date.
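For reference, the fields above correspond to each scraper's output roughly as follows; the exact column labels in the resulting dataframes may be spelled differently, so treat this listing as illustrative:

# Fields collected by each scraper, as listed above.
COMMENT_FIELDS = {
    "nyt": ["Article URL", "Parent ID", "Comment ID", "User Display Name",
            "Comment Body", "Upload Date", "Number of Likes",
            "Number of Replies", "Editor's Selection"],
    "washingtonpost": ["Article URL", "User Display Name", "Comment Body",
                       "Upload Date", "Number of Likes"],
    "fivethirtyeight": ["Article URL", "User Display Name", "Comment Body",
                        "Upload Date"],
}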

Code Walkthrough

Begin by initializing a new instance of your desired scraper.

WaPo_Scraper = washingtonpost(my_chromedriver_path)
NYT_Scraper = nyt(my_api_key)
FiveThirtyEight_Scraper = fivethirtyeight(my_chromedriver_path)

You can retrieve a list of comments from a single article using the article URL with the get_article_comments() method.

my_article = "https://www.washingtonpost.com/politics/2021/04/13/risk-reward-calculus-johnson-johnson-vaccine-visualized/"
comments_list = WaPo_Scraper.get_article_comments(my_article)

You can retrieve a list of comments from a list of articles with the get_comments_from_multiple_articles() method.

my_article_list = [
    "https://www.nytimes.com/2015/04/12/opinion/sunday/david-brooks-the-moral-bucket-list.html",
    "https://www.nytimes.com/2019/06/21/science/giant-squid-cephalopod-video.html",
    "https://www.nytimes.com/2021/08/01/insider/the-olympics-that-feel-like-only-competitions.html",
]

NYT_Scraper.get_comments_from_multiple_articles(my_article_list)

You can retrieve a list of articles from a Google Spreadsheet with the get_articles_from_spreadsheet() method.

FiveThirtyEight_Scraper.get_articles_from_spreadsheet(spreadsheet_url, sheet_number)

You can convert a list of comments into a Pandas dataframe with the get_dataframe() method.

WaPo_Scraper.get_dataframe(comments_list)

You can write a dataframe of comments into a Google Spreadsheet with the write_to_gsheet() method.

NYT_Scraper.write_to_gsheet(dataframe, gsheet_path, gsheet_name, sheet_number)
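Putting these pieces together, a typical end-to-end run might look like the sketch below. It assumes each scraper exposes the same methods and return values described above; all paths, URLs, and sheet parameters are placeholders.

# Hypothetical end-to-end workflow; placeholders throughout.
WaPo_Scraper = washingtonpost(my_chromedriver_path)

# Pull article URLs from a spreadsheet, scrape their comments,
# convert them to a dataframe, and write the results back to Google Sheets.
article_urls = WaPo_Scraper.get_articles_from_spreadsheet(spreadsheet_url, sheet_number)
comments = WaPo_Scraper.get_comments_from_multiple_articles(article_urls)
dataframe = WaPo_Scraper.get_dataframe(comments)
WaPo_Scraper.write_to_gsheet(dataframe, gsheet_path, gsheet_name, sheet_number)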
