fulltext
Get full text across all the (open access) journals
rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.
fulltext attempts to make it easy to do text-mining by supporting the following steps:
- Search for articles
- Fetch articles
- Get links for full text articles (xml, pdf)
- Extract text from articles / convert formats
- Collect bits of articles that you actually need
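As a rough illustration of how the fetch and collect steps chain together, here is a minimal sketch that uses only calls demonstrated elsewhere in this README (ft_get(), chunks(), tabularize()). The wrapper name fetch_publisher_table() is hypothetical, and the sketch assumes the installed version of fulltext exports these functions and that network access is available when it is called.

```r
# Hypothetical helper (not part of fulltext): fetch articles by DOI and
# return a table of selected pieces, using calls shown in this README.
fetch_publisher_table <- function(dois, from = "plos",
                                  what = c("doi", "publisher")) {
  if (!requireNamespace("fulltext", quietly = TRUE)) {
    stop("Please install the 'fulltext' package first.")
  }
  # Fetch step: download the full text for each DOI
  articles <- fulltext::ft_get(dois, from = from)
  # Collect step: keep only the requested pieces of each article
  pieces <- fulltext::chunks(articles, what)
  # Convert the pieces into one data.frame per data source
  fulltext::tabularize(pieces)
}
```

Calling fetch_publisher_table('10.1371/journal.pone.0086169') would then be expected to return a list with one data.frame per source, in the same shape as the chunks() examples in this README.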
Additional steps we hope to include in future versions:
You can also download supplementary materials from papers.
Data sources in fulltext include:
- Crossref - via the rcrossref package
- Public Library of Science (PLOS) - via the rplos package
- Biomed Central
- arXiv - via the aRxiv package
- bioRxiv - via the biorxivr package
- PMC/Pubmed via Entrez - via the rentrez package
- We will add more, as publishers open up, and as we have time... See the master list here
We'd love your feedback. Let us know what you think in the issue tracker.
Article full text formats by publisher:
Stable version from CRAN
install.packages("fulltext")
Development version from GitHub
devtools::install_github("ropensci/fulltext")
Load library
library('fulltext')
If you want to use the ft_extract() function, it currently has two options for extracting text from PDFs: xpdf and ghostscript.
- xpdf installation: See http://www.foolabs.com/xpdf/download.html for instructions on how to download and install xpdf. For OSX, you can also get xpdf via Homebrew with brew install xpdf. You can optionally install Poppler, which is built on xpdf. Get it at http://poppler.freedesktop.org/
- ghostscript installation: See http://www.ghostscript.com/doc/9.16/Install.htm for instructions on how to download and install ghostscript. For OSX, you can also get ghostscript via Homebrew with brew install gs.
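Before calling ft_extract(), it can help to confirm that a backend is actually on your PATH. The following is a minimal base R sketch, not part of fulltext. The helper name find_pdf_backends() is hypothetical, and the executable names are assumptions: xpdf (and Poppler) provide a pdftotext command, and Ghostscript is usually installed as gs on OSX and Linux (Windows builds use different names).

```r
# Hypothetical helper: report which PDF-to-text command-line tools are
# visible on the PATH. The executable names are assumptions (see above).
find_pdf_backends <- function(tools = c(xpdf = "pdftotext", gs = "gs")) {
  # Sys.which() returns the full path, or "" when the tool is not found
  paths <- unname(Sys.which(tools))
  data.frame(
    backend = names(tools),
    path = paths,
    found = !is.na(paths) & nzchar(paths),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

find_pdf_backends()
```

If the found column is FALSE for a backend, install it as described above before passing "xpdf" or "gs" to ft_extract().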
ft_search() - get metadata on a search query.
ft_search(query = 'ecology', from = 'plos')
#> Query:
#> [ecology]
#> Found:
#> [PLoS: 29496; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]
#> Returned:
#> [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]
ft_links() - get links for articles (xml and pdf).
res1 <- ft_search(query = 'ecology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 4
#> [IDs] ID_26420471 ID_26419522 ID_26419355 ID_26419232 ...
Or pass in DOIs directly
ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 4
#> [IDs] ID_26420471 ID_26419522 ID_26419355 ID_26419232 ...
ft_get() - get full or partial text of articles.
ft_get('10.1371/journal.pone.0086169', from = 'plos')
#> <fulltext text>
#> [Docs] 1
#> [Source] R session
#> [IDs] 10.1371/journal.pone.0086169 ...
library("rplos")
(dois <- searchplos(q = "*:*", fl = 'id',
fq = list('doc_type:full',"article_type:\"research article\""), limit = 5)$data$id)
#> [1] "10.1371/journal.pone.0082888" "10.1371/journal.pone.0133894"
#> [3] "10.1371/journal.pone.0082883" "10.1371/journal.pone.0050020"
#> [5] "10.1371/journal.pone.0066417"
x <- ft_get(dois, from = "plos")
x %>% chunks("publisher") %>% tabularize()
#> $plos
#> publisher
#> 1 Public Library of Science\n San Francisco, USA
#> 2 Public Library of Science\nSan Francisco, CA USA
#> 3 Public Library of Science\n San Francisco, USA
#> 4 Public Library of Science\n San Francisco, USA
#> 5 Public Library of Science\nSan Francisco, USA
x %>% chunks(c("doi","publisher")) %>% tabularize()
#> $plos
#> doi
#> 1 10.1371/journal.pone.0082888
#> 2 10.1371/journal.pone.0133894
#> 3 10.1371/journal.pone.0082883
#> 4 10.1371/journal.pone.0050020
#> 5 10.1371/journal.pone.0066417
#> publisher
#> 1 Public Library of Science\n San Francisco, USA
#> 2 Public Library of Science\nSan Francisco, CA USA
#> 3 Public Library of Science\n San Francisco, USA
#> 4 Public Library of Science\n San Francisco, USA
#> 5 Public Library of Science\nSan Francisco, USA
Use dplyr to munge the data
library("dplyr")
x %>%
chunks(c("doi", "publisher", "permissions")) %>%
tabularize() %>%
.$plos %>%
select(-permissions.license)
#> doi
#> 1 10.1371/journal.pone.0082888
#> 2 10.1371/journal.pone.0133894
#> 3 10.1371/journal.pone.0082883
#> 4 10.1371/journal.pone.0050020
#> 5 10.1371/journal.pone.0066417
#> publisher
#> 1 Public Library of Science\n San Francisco, USA
#> 2 Public Library of Science\nSan Francisco, CA USA
#> 3 Public Library of Science\n San Francisco, USA
#> 4 Public Library of Science\n San Francisco, USA
#> 5 Public Library of Science\nSan Francisco, USA
#> permissions.copyright.year permissions.copyright.holder
#> 1 2013 Jing Wang
#> 2 2015 Voorwald et al
#> 3 2013 Nejati Javaremi et al
#> 4 2012 Pi et al
#> 5 2013 Wang et al
#> permissions.license_url
#> 1 http://creativecommons.org/licenses/by/4.0/
#> 2 http://creativecommons.org/licenses/by/4.0/
#> 3 <NA>
#> 4 <NA>
#> 5 <NA>
Grab supplementary materials for (re-)analysis of data
catching.crabs <- read.csv(ft_get_si("10.6084/m9.figshare.979288", 2))
head(catching.crabs)
#> trap.no. length.deployed no..crabs
#> 1 1 10 sec 0
#> 2 2 10 sec 0
#> 3 3 10 sec 0
#> 4 4 10 sec 0
#> 5 5 10 sec 0
#> 6 1 1 min 0
Full text data adds up quickly and can take a long time to download. That's where caching comes in. After you pull down a bunch of data within an R session, you don't want to lose it if the session crashes. When you search, you will be able to (i.e., this is not ready yet) optionally cache the raw JSON/XML/etc. of each request locally. When you run that exact search again, we'll give you the local data, unless you ask for new data.
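The caching idea described above can be sketched in base R. This is only an illustration of the general technique (store each result on disk under a key, and reuse it on repeat requests), not how fulltext implements its cache. The names cached_fetch(), fetch_fn, and refresh are hypothetical, and fake_fetch() is a stand-in for a real download.

```r
# Hypothetical helper: cache the result of fetch_fn(key) on disk as an
# .rds file and reuse it on later calls unless refresh = TRUE.
cached_fetch <- function(key, fetch_fn, cache_dir, refresh = FALSE) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # Replace characters that are unsafe in file names (e.g. "/" in DOIs).
  # Note: two different keys could map to the same file name.
  safe_key <- gsub("[^A-Za-z0-9._-]", "_", key)
  path <- file.path(cache_dir, paste0(safe_key, ".rds"))
  if (!refresh && file.exists(path)) {
    return(readRDS(path))   # cache hit: no new request is made
  }
  result <- fetch_fn(key)   # cache miss (or forced refresh)
  saveRDS(result, path)
  result
}

# Usage with a stand-in fetcher that counts how often it is called
calls <- 0
fake_fetch <- function(doi) {
  calls <<- calls + 1
  paste("full text for", doi)
}
cache_dir <- tempfile("ft_cache")
first <- cached_fetch("10.1371/journal.pone.0086169", fake_fetch, cache_dir)
second <- cached_fetch("10.1371/journal.pone.0086169", fake_fetch, cache_dir)
identical(first, second)  # the second call is served from disk
```

Passing refresh = TRUE forces a new request and overwrites the stored copy, which mirrors the "unless you want new data" behaviour described above.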
ft_get('10.1371/journal.pone.0086169', from='plos', cache=TRUE)
Some results you find with ft_search() have full text available in text, xml, or other machine-readable formats, while others may be open access but available only in pdf format. We have a series of convenience functions in this package to help extract text from pdfs, both locally and remotely.
Locally, using code adapted from the package tm, and various pdf to text parsing backends
pdf <- system.file("examples", "example2.pdf", package = "fulltext")
Using ghostscript
(res_gs <- ft_extract(pdf, "gs"))
#> <document>/Users/sacmac/github/ropensci/fulltext/inst/examples/example2.pdf
#> Title: pone.0107412 1..10
#> Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#> Creation date: 2014-09-18
Using xpdf
(res_xpdf <- ft_extract(pdf, "xpdf"))
#> <document>/Users/sacmac/github/ropensci/fulltext/inst/examples/example2.pdf
#> Pages: 10
#> Title: pone.0107412 1..10
#> Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#> Creation date: 2014-09-18
Or extract directly into a tm Corpus
paths <- sapply(paste0("example", 2:5, ".pdf"), function(x) system.file("examples", x, package = "fulltext"))
(corpus_xpdf <- ft_extract_corpus(paths, "xpdf"))
#> $meta
#> names class
#> 1 content, meta PlainTextDocument, TextDocument
#> 2 content, meta PlainTextDocument, TextDocument
#> 3 content, meta PlainTextDocument, TextDocument
#> 4 content, meta PlainTextDocument, TextDocument
#>
#> $data
#> <<VCorpus>>
#> Metadata: corpus specific: 0, document level (indexed): 0
#> Content: documents: 4
#>
#> attr(,"class")
#> [1] "xpdf"
Extract pdf remotely on the web, using a service called PDFX
pdf5 <- system.file("examples", "example5.pdf", package = "fulltext")
pdfx(file = pdf5)
#> $meta
#> $meta$job
#> [1] "34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044"
#>
#> $meta$base_name
#> [1] "5kpp"
#>
#> $meta$doi
#> [1] "10.7554/eLife.03640"
#>
#>
#> $data
#> <?xml version="1.0" encoding="UTF-8"?>
#> <pdfx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://pdfx.cs.man.ac.uk/static/article-schema.xsd">
#> <meta>
#> <job>34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044</job>
#> <base_name>5kpp</base_name>
#> <doi>10.7554/eLife.03640</doi>
#> </meta>
#> <article>
#> .....
ft_plot() - visualize metadata or full text data
- Please report any issues or bugs.
- License: MIT
- Get citation information for fulltext: citation(package = 'fulltext')
- Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

