Paperity: a multidisciplinary Open Access content aggregator

Although the main target of Open Access is … well, just that, content being freely and openly accessible for anyone in perpetuity, it does have the additional advantage that content may be re-used. Of course that use has to fit with the license given, but with most OA having a CC-BY license, there are a lot of opportunities for aggregation, text mining and the like. It is interesting that up to now only a few aggregation initiatives have sprung up, most notably PubMed Central (3.2M full text OA papers) and Europe PubMed Central (570K full text OA papers), which aggregate OA content in the biomedical and life sciences. In PMC and Europe PMC most content is deposited by publishers and authors. Apart from these subject-specific initiatives there aren’t many full text OA aggregators. Other sites either are not limited to OA and do not aggregate the papers in one place (e.g. Google Scholar) or do no full text indexing and also no aggregation (e.g. BASE, OAIster, DOAJ).

Enter Paperity, which was launched last week (so in October 2014): a multidisciplinary aggregator of Open Access scholarly content. It holds over 160K articles from over 2,100 journals. It is an initiative from Poland (well, at least the founder is from Poland, although the website is registered in France) and is led by Marcin Wojnarski. Paperity has a slick and friendly website that offers access to aggregated OA content, with full text search and a built-in PDF reader. It promises more functionality for communicating around papers and using Web 2.0 options. It has a list of journals covered, and links to versions of the same paper on publisher websites. Publishers and editors of OA journals can request that their journals be included.

Almost immediately, on Twitter and in a thread over at the GOAL Open Access discussion list, questions were raised, and answers given, that I will summarize here.

1) What are the inclusion criteria used by Paperity? Paperity aims at 100% of Open Access peer reviewed papers. Currently it is at 160K papers, which is somewhere around 10% of Gold OA papers but below 2% if you include Green OA content. It is not stated explicitly, but it seems logical that Paperity only aggregates content that it is allowed to aggregate (so not if ‘no derivatives’ is in the CC license).
2) What is the business model of Paperity? Paperity seems to have started as a ‘non-profit academic project’, but it will have to look for more structural funding, which might include ads or charging journals.
3) Will Paperity allow text mining through an API or otherwise? According to Wojnarski that is not possible currently, but Paperity is certainly sympathetic to the idea.
4) Why does Paperity focus on Gold OA journals? Paperity regards this content as the most reliable in terms of bibliographic data. Although repositories are easy to harvest, Paperity says that determining the version and status of texts is more difficult than with publisher provided full text journals. This initial focus on Gold OA also makes it easier to strictly have only peer reviewed content, according to Paperity.

If Paperity develops further, I would like to see them start aggregating Green OA soon and also add more functionality to the built-in PDF reader (e.g. annotations), text mining options, more advanced search and browsing, and faceted search results.

Jeroen Bosman, @jeroenbosman

  1. tepronk

    That is a really good initiative. With regard to text mining: will Paperity also negotiate the licence terms to enable text mining in a practical sense? CC-BY is a pretty lenient licence, only attribution is required. Nevertheless, if you mine all content, how do you prevent your attribution list from holding more than -xxxxx- references? And if you merge two text mining papers, will you have to attribute everything in both papers, etc.? I don’t know this subject in detail, but I know the library in Leiden has a special focus point for negotiating text mining licences with publishers.

  2. Marcin Wojnarski

    Great post, thanks Jeroen! We’ll do our best to fulfill the “wish list”, and advanced search is for sure top priority.

    As to text mining, there is a bit of a misconception going around: many think that TM is forbidden unless explicitly allowed. Not true. If you can lawfully read a paper, no matter by what means and under what license, there’s no law to prevent you from analyzing – in any way you wish – the information conveyed by this paper. The issue between libraries and publishers is of a different sort: publishers disallow *massive download* of *toll-access* articles. That’s why text mining becomes impossible or difficult – but only TM on toll-access papers, and not because TM itself is forbidden, but because subscription access is restricted by publishers to a limited bandwidth (max no. of downloads). However, if you stay with OA literature alone, there is no way to prevent you from text-mining it. “The right to read is the right to mine”, as Peter Murray-Rust likes to say 🙂

  3. tepronk

    That clears up a lot of things. I didn’t know this. That makes TM possibilities far more practical than I thought. Thanks Marcin.


