GitHub - buco/fulltext: An R api to search across and get full text for open access journals

  _____     .__  .__   __                   __
_/ ____\_ __|  | |  |_/  |_  ____ ___  ____/  |_
\   __\  |  \  | |  |\   __\/ __ \\  \/  /\   __\
 |  | |  |  /  |_|  |_|  | \  ___/ >    <  |  |
 |__| |____/|____/____/__|  \___  >__/\_ \ |__|
                                \/      \/

Get full text across all the (open access) journals

rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.

fulltext attempts to make it easy to do text-mining by supporting the following steps:

Search for articles
Fetch articles
Get links for full text articles (xml, pdf)
Extract text from articles / convert formats
Collect bits of articles that you actually need

Additional steps we hope to include in future versions:

Analysis enabled via the tm package and friends, or via Spark-R
Visualization

You can also download supplementary materials from papers.

Data sources in fulltext include:

Crossref - via the rcrossref package
Public Library of Science (PLOS) - via the rplos package
Biomed Central
arXiv - via the aRxiv package
bioRxiv - via the biorxivr package
PMC/Pubmed via Entrez - via the rentrez package

We will add more as publishers open up and as we have time. See the master list here

We'd love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher:

https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd

Installation

Stable version from CRAN

install.packages("fulltext")

Development version from GitHub

devtools::install_github("ropensci/fulltext")

Load library

library('fulltext')

Extraction tools

If you want to use the ft_extract() function, it currently has two options for extracting text from PDFs: xpdf and ghostscript.

xpdf installation: See http://www.foolabs.com/xpdf/download.html for instructions on how to download and install xpdf. On OS X, you can also get xpdf via Homebrew with brew install xpdf. You can optionally install Poppler, which is built on xpdf; get it at http://poppler.freedesktop.org/
ghostscript installation: See http://www.ghostscript.com/doc/9.16/Install.htm for instructions on how to download and install ghostscript. On OS X, you can also get ghostscript via Homebrew with brew install gs.
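
Before calling ft_extract(), it can help to confirm that the external tools are visible on your PATH. The following is a minimal sketch using base R's Sys.which(). The executable names pdftotext (the text extractor that ships with xpdf) and gs are assumptions about a typical install and may differ on your system, particularly on Windows.

```r
# Executable names are assumptions about a typical install; adjust as needed.
tools <- c(xpdf = "pdftotext", ghostscript = "gs")

# Sys.which() returns the full path of each executable, or "" if it is not found.
status <- unname(Sys.which(tools))
status[!nzchar(status)] <- "not found"
names(status) <- names(tools)

# Report one line per backend.
for (nm in names(status)) {
  message(sprintf("%-12s %s", nm, status[[nm]]))
}
```

If a backend is reported as "not found", install it as described above or use the other backend.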

Search

ft_search() - get metadata on a search query.

ft_search(query = 'ecology', from = 'plos')
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 29496; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]

Get full text links

ft_links() - get links for articles (xml and pdf).

res1 <- ft_search(query = 'ecology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 4 
#> [IDs] ID_26420471 ID_26419522 ID_26419355 ID_26419232 ...

Or pass in DOIs directly

ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 4 
#> [IDs] ID_26420471 ID_26419522 ID_26419355 ID_26419232 ...

Get full text

ft_get() - get full or partial text of articles.

ft_get('10.1371/journal.pone.0086169', from = 'plos')
#> <fulltext text>
#> [Docs] 1 
#> [Source] R session  
#> [IDs] 10.1371/journal.pone.0086169 ...

Extract chunks

library("rplos")
(dois <- searchplos(q = "*:*", fl = 'id',
   fq = list('doc_type:full',"article_type:\"research article\""), limit = 5)$data$id)
#> [1] "10.1371/journal.pone.0082888" "10.1371/journal.pone.0133894"
#> [3] "10.1371/journal.pone.0082883" "10.1371/journal.pone.0050020"
#> [5] "10.1371/journal.pone.0066417"
x <- ft_get(dois, from = "plos")
x %>% chunks("publisher") %>% tabularize()
#> $plos
#>                                               publisher
#> 1     Public Library of Science\n    San Francisco, USA
#> 2      Public Library of Science\nSan Francisco, CA USA
#> 3     Public Library of Science\n    San Francisco, USA
#> 4 Public Library of Science\n        San Francisco, USA
#> 5         Public Library of Science\nSan Francisco, USA

x %>% chunks(c("doi","publisher")) %>% tabularize()
#> $plos
#>                            doi
#> 1 10.1371/journal.pone.0082888
#> 2 10.1371/journal.pone.0133894
#> 3 10.1371/journal.pone.0082883
#> 4 10.1371/journal.pone.0050020
#> 5 10.1371/journal.pone.0066417
#>                                               publisher
#> 1     Public Library of Science\n    San Francisco, USA
#> 2      Public Library of Science\nSan Francisco, CA USA
#> 3     Public Library of Science\n    San Francisco, USA
#> 4 Public Library of Science\n        San Francisco, USA
#> 5         Public Library of Science\nSan Francisco, USA

Use dplyr to munge the data

library("dplyr")
x %>%
 chunks(c("doi", "publisher", "permissions")) %>%
 tabularize() %>%
 .$plos %>%
 select(-permissions.license)
#>                            doi
#> 1 10.1371/journal.pone.0082888
#> 2 10.1371/journal.pone.0133894
#> 3 10.1371/journal.pone.0082883
#> 4 10.1371/journal.pone.0050020
#> 5 10.1371/journal.pone.0066417
#>                                               publisher
#> 1     Public Library of Science\n    San Francisco, USA
#> 2      Public Library of Science\nSan Francisco, CA USA
#> 3     Public Library of Science\n    San Francisco, USA
#> 4 Public Library of Science\n        San Francisco, USA
#> 5         Public Library of Science\nSan Francisco, USA
#>   permissions.copyright.year permissions.copyright.holder
#> 1                       2013                    Jing Wang
#> 2                       2015               Voorwald et al
#> 3                       2013        Nejati Javaremi et al
#> 4                       2012                     Pi et al
#> 5                       2013                   Wang et al
#>                       permissions.license_url
#> 1 http://creativecommons.org/licenses/by/4.0/
#> 2 http://creativecommons.org/licenses/by/4.0/
#> 3                                        <NA>
#> 4                                        <NA>
#> 5                                        <NA>
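
The publisher values in the output above contain embedded newlines and runs of spaces. As a small sketch of one way to tidy them, the base R snippet below collapses that whitespace. It uses a hard-coded vector copied from the output above rather than a live ft_get() call, so it runs without network access.

```r
# Two publisher strings copied from the tabularize() output above.
publisher <- c(
  "Public Library of Science\n    San Francisco, USA",
  "Public Library of Science\nSan Francisco, CA USA"
)

# Collapse every run of whitespace (including newlines) into one space, then trim.
clean <- trimws(gsub("\\s+", " ", publisher))
print(clean)
```

The same gsub() call could be applied to the publisher column of the data frame returned by tabularize().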

Supplementary materials

Grab supplementary materials for (re-)analysis of data

catching.crabs <- read.csv(ft_get_si("10.6084/m9.figshare.979288", 2))
head(catching.crabs)
#>   trap.no. length.deployed no..crabs
#> 1        1          10 sec         0
#> 2        2          10 sec         0
#> 3        3          10 sec         0
#> 4        4          10 sec         0
#> 5        5          10 sec         0
#> 6        1           1 min         0

Cache

When dealing with full text data, you can accumulate a lot of it quickly, and it can take a long time to download. That's where caching comes in. After you pull down a bunch of data within an R session, you don't want to lose it if the session crashes. When you search, you will be able to (this is not ready yet) optionally cache the raw JSON/XML/etc. of each request locally. When you run that exact search again, we'll give you the local data, unless you ask for new data.

ft_get('10.1371/journal.pone.0086169', from='plos', cache=TRUE)
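
To illustrate the general idea of caching on disk, here is a minimal sketch in base R. It is not how fulltext implements cache = TRUE. The helper cached_fetch(), its refresh argument, and the stand-in fake_fetch() are hypothetical names introduced only for this example.

```r
# Hypothetical helper (not part of fulltext): fetch once, then reuse the copy on disk.
cached_fetch <- function(key, fetch_fun,
                         cache_dir = file.path(tempdir(), "ft_cache"),
                         refresh = FALSE) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # Turn the key (for example a DOI) into a safe file name.
  path <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]", "_", key), ".rds"))
  if (!refresh && file.exists(path)) {
    return(readRDS(path))  # cache hit: skip the fetch entirely
  }
  value <- fetch_fun(key)  # cache miss, or fresh data explicitly requested
  saveRDS(value, path)
  value
}

# Stand-in for a slow network request; counts how often it is actually called.
calls <- 0
fake_fetch <- function(doi) {
  calls <<- calls + 1
  paste("full text for", doi)
}

doi <- "10.1371/journal.pone.0086169"
first  <- cached_fetch(doi, fake_fetch, refresh = TRUE)  # always fetches
second <- cached_fetch(doi, fake_fetch)                  # served from disk
```

Because the cache lives in files rather than in memory, the data would survive a crashed session if cache_dir pointed somewhere more permanent than tempdir().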

Extract text from PDFs

Some results you find with ft_search() have full text available in text, XML, or other machine-readable formats, while others are open access but available only as PDF. This package has a series of convenience functions to help extract text from PDFs, both locally and remotely.

Locally, using code adapted from the package tm, and various pdf to text parsing backends

pdf <- system.file("examples", "example2.pdf", package = "fulltext")

Using ghostscript

(res_gs <- ft_extract(pdf, "gs"))
#> <document>/Users/sacmac/github/ropensci/fulltext/inst/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Using xpdf

(res_xpdf <- ft_extract(pdf, "xpdf"))
#> <document>/Users/sacmac/github/ropensci/fulltext/inst/examples/example2.pdf
#>   Pages: 10
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Or extract directly into a tm Corpus

paths <- sapply(paste0("example", 2:5, ".pdf"), function(x) system.file("examples", x, package = "fulltext"))
(corpus_xpdf <- ft_extract_corpus(paths, "xpdf"))
#> $meta
#>           names                           class
#> 1 content, meta PlainTextDocument, TextDocument
#> 2 content, meta PlainTextDocument, TextDocument
#> 3 content, meta PlainTextDocument, TextDocument
#> 4 content, meta PlainTextDocument, TextDocument
#> 
#> $data
#> <<VCorpus>>
#> Metadata:  corpus specific: 0, document level (indexed): 0
#> Content:  documents: 4
#> 
#> attr(,"class")
#> [1] "xpdf"

Extract pdf remotely on the web, using a service called PDFX

pdf5 <- system.file("examples", "example5.pdf", package = "fulltext")
pdfx(file = pdf5)

#> $meta
#> $meta$job
#> [1] "34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044"
#>
#> $meta$base_name
#> [1] "5kpp"
#>
#> $meta$doi
#> [1] "10.7554/eLife.03640"
#>
#>
#> $data
#> <?xml version="1.0" encoding="UTF-8"?>
#> <pdfx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://pdfx.cs.man.ac.uk/static/article-schema.xsd">
#>   <meta>
#>     <job>34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044</job>
#>     <base_name>5kpp</base_name>
#>     <doi>10.7554/eLife.03640</doi>
#>   </meta>
#>    <article>
#>  .....

TODO

ft_plot() - visualize metadata or full text data
