Journalism Scrapers


A collection of scripts that scrape and format comments from several major news publications.

Getting Started

Dependencies

Install the required dependencies from your terminal using pip install.
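This README does not pin an exact package list, but based on the features it describes (Selenium-driven scraping, pandas dataframes, and Google Sheets output), something like the following is a reasonable starting point; the package names other than selenium are assumptions:

pip install selenium pandas gspread oauth2client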

New York Times Scraper Requirements

To obtain user comments with this scraper, you need a New York Times Developer API key.

FiveThirtyEight Scraper Requirements

To obtain user comments with this scraper, you need Selenium installed, along with a ChromeDriver executable that matches your browser version.

Washington Post Scraper Requirements

To obtain user comments with this scraper, you likewise need Selenium and a ChromeDriver executable installed.

Additional Requirements

To use the scrapers' write_to_gsheet() methods, you must have a service account and OAuth2 credentials from the Google API Console.
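As a rough sketch of what that setup involves (assuming the gspread library and a service-account JSON key downloaded from the Google API Console; the scrapers' actual credential handling may differ):

import gspread

# Hypothetical example: authorize with a downloaded service-account key file.
gc = gspread.service_account(filename="my-service-account.json")

# Open a spreadsheet the service account has been granted access to,
# then select a worksheet by index (placeholders throughout).
spreadsheet = gc.open_by_url(spreadsheet_url)
worksheet = spreadsheet.get_worksheet(sheet_number)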

Limitations

The New York Times scraper obtains a comment's Article URL, Parent ID, Comment ID, User Display Name, Comment Body, Upload Date, Number of Likes, Number of Replies, and Editor's Selection.

The Washington Post scraper obtains a comment's Article URL, User Display Name, Comment Body, Upload Date, and Number of Likes.

The FiveThirtyEight scraper obtains a comment's Article URL, User Display Name, Comment Body, and Upload Date.
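For reference, the fields above correspond to each scraper's output roughly as follows; the exact column labels in the resulting dataframes may be spelled differently, so treat this listing as illustrative:

# Fields collected by each scraper, as listed above.
COMMENT_FIELDS = {
    "nyt": ["Article URL", "Parent ID", "Comment ID", "User Display Name",
            "Comment Body", "Upload Date", "Number of Likes",
            "Number of Replies", "Editor's Selection"],
    "washingtonpost": ["Article URL", "User Display Name", "Comment Body",
                       "Upload Date", "Number of Likes"],
    "fivethirtyeight": ["Article URL", "User Display Name", "Comment Body",
                        "Upload Date"],
}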

Code Walkthrough

Begin by initializing a new instance of your desired scraper.

WaPo_Scraper = washingtonpost(my_chromedriver_path)
NYT_Scraper = nyt(my_api_key)
FiveThirtyEight_Scraper = fivethirtyeight(my_chromedriver_path)

You can retrieve a list of comments from a single article using the article URL with the get_article_comments() method.

my_article = "https://www.washingtonpost.com/politics/2021/04/13/risk-reward-calculus-johnson-johnson-vaccine-visualized/"
comments_list = WaPo_Scraper.get_article_comments(my_article)

You can retrieve a list of comments from a list of articles with the get_comments_from_multiple_articles() method.

my_article_list = [
    "https://www.nytimes.com/2015/04/12/opinion/sunday/david-brooks-the-moral-bucket-list.html",
    "https://www.nytimes.com/2019/06/21/science/giant-squid-cephalopod-video.html",
    "https://www.nytimes.com/2021/08/01/insider/the-olympics-that-feel-like-only-competitions.html",
]

NYT_Scraper.get_comments_from_multiple_articles(my_article_list)

You can retrieve a list of articles from a Google Spreadsheet with the get_articles_from_spreadsheet() method.

FiveThirtyEight_Scraper.get_articles_from_spreadsheet(spreadsheet_url, sheet_number)

You can convert a list of comments into a Pandas dataframe with the get_dataframe() method.

WaPo_Scraper.get_dataframe(comments_list)

You can write a dataframe of comments into a Google Spreadsheet with the write_to_gsheet() method.

NYT_Scraper.write_to_gsheet(dataframe, gsheet_path, gsheet_name, sheet_number)
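Putting these pieces together, a typical end-to-end run might look like the sketch below. It assumes each scraper exposes the same methods and return values described above; all paths, URLs, and sheet parameters are placeholders.

# Hypothetical end-to-end workflow; placeholders throughout.
WaPo_Scraper = washingtonpost(my_chromedriver_path)

# Pull article URLs from a spreadsheet, scrape their comments,
# convert them to a dataframe, and write the results back to Google Sheets.
article_urls = WaPo_Scraper.get_articles_from_spreadsheet(spreadsheet_url, sheet_number)
comments = WaPo_Scraper.get_comments_from_multiple_articles(article_urls)
dataframe = WaPo_Scraper.get_dataframe(comments)
WaPo_Scraper.write_to_gsheet(dataframe, gsheet_path, gsheet_name, sheet_number)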
