OpenCitations

Corpus

The OpenCitations Project created the Open Citations Corpus (OCC) in 2016 and is now populating it. The OCC is an open repository of scholarly citation data made available under a Creative Commons public domain dedication (CC0). It provides accurate bibliographic references harvested from the scholarly literature that others may freely build upon, enhance and reuse for any purpose, without restriction under copyright or database law.

The OCC includes information about six different kinds of bibliographic entities:

The corpus URL (https://w3id.org/oc/corpus/) identifies the entire OCC, which is composed of several sub-datasets, one for each kind of bibliographic entity included in the corpus. Each sub-dataset has a URL formed by appending to the corpus URL the two-letter short name for the class of entity (e.g. be for a bibliographic entry) followed by a forward slash (e.g. https://w3id.org/oc/corpus/be/). Individual members of each sub-dataset are identified by incrementing numbers, unique within that sub-dataset, e.g. https://w3id.org/oc/corpus/br/1 or https://w3id.org/oc/corpus/be/24.
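The URL scheme just described can be sketched in a few lines of Python. The function names `dataset_url` and `entity_url` are illustrative and are not part of the OpenCitations software. The input checks are assumptions, in particular that short names are lowercase and that numbering starts at 1, as the examples suggest.

```python
CORPUS_URL = "https://w3id.org/oc/corpus/"


def dataset_url(short_name: str) -> str:
    """Return the URL of a sub-dataset, e.g. 'be' -> https://w3id.org/oc/corpus/be/."""
    # Assumption: short names are exactly two lowercase ASCII letters.
    if not (len(short_name) == 2 and short_name.isascii()
            and short_name.isalpha() and short_name.islower()):
        raise ValueError(f"expected a two-letter short name, got {short_name!r}")
    return f"{CORPUS_URL}{short_name}/"


def entity_url(short_name: str, number: int) -> str:
    """Return the URL of one member of a sub-dataset, e.g. ('br', 1)."""
    # Assumption: members are numbered with positive integers.
    if number < 1:
        raise ValueError(f"expected a positive number, got {number!r}")
    return f"{dataset_url(short_name)}{number}"


if __name__ == "__main__":
    print(entity_url("br", 1))
    print(entity_url("be", 24))
```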

The OCC stores data relevant to these citations in RDF, encoded as JSON-LD, and each dataset is described appropriately by means of the Data Catalog Vocabulary and the VoID Vocabulary.

All the data within the OCC are available via a SPARQL endpoint and as data dumps, which are created monthly for the corpus as a whole and for each major datatype within it, and are stored in Figshare. Previous versions of the OCC are archived in earlier data dumps. Data can be unpacked from the zip files of these data dumps and exported in a number of commonly used reference management formats.

In addition, the metadata for each individual bibliographic resource recorded in the OCC are available for browsing in a variety of formats (plain text, RDF/XML, Turtle and JSON-LD) by means of a simple Web interface. This shows only the data concerning that one bibliographic entity (e.g. https://w3id.org/oc/corpus/br/1) and the citations it contains.

The ingestion workflow

The ingestion of citation data into the OCC, briefly summarised in Figure 1, is handled by two Python scripts called the Bibliographic Entries Extractor (BEE) and the SPAR Citation Indexer (SPACIN), available in the OpenCitations GitHub repository.

Figure 1. The steps involving BEE and SPACIN, and their related Python classes, in the production of the OpenCitations Corpus.

BEE is responsible for the creation of JSON files containing information about the articles in the Open Access subset of PubMed Central (retrieved by using the Europe PubMed Central API). Each of these JSON files is created by asking Europe PubMed Central for the metadata of every article it stores for which the source XML file is available. Once these articles are identified, BEE processes each XML source to extract the complete reference list of the paper under consideration, and includes all these data in the final JSON file. An excerpt of one of those JSON files is shown below:

{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    ...
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    },
    ...
  ]
}

In particular, for each article retrieved by means of the Europe PubMed Central API, BEE stores all the possible identifiers (in the example, doi, pmid, pmcid, and localid) and all the textual references, enriched by their own related identifiers if these are available. In addition, the JSON file also includes provenance information about the source, its provider and the curator (i.e. the particular BEE Python class responsible for the extraction of these metadata from the source).
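As a minimal sketch of how a consumer might read such a file, the Python below pulls out the identifiers of the citing article and of each reference. The key names come from the excerpt above. The helper names `article_identifiers` and `reference_identifiers` are hypothetical, and the sample contains only the one reference shown in the excerpt, with the elided entries left out.

```python
import json

# Keys taken from the excerpt; a real BEE file may carry others.
ARTICLE_ID_KEYS = ("doi", "pmid", "pmcid", "localid")
REFERENCE_ID_KEYS = ("doi", "pmid", "pmcid")


def article_identifiers(record: dict) -> dict:
    """Identifiers of the citing article that are present and non-empty."""
    return {key: record[key] for key in ARTICLE_ID_KEYS if record.get(key)}


def reference_identifiers(record: dict) -> list:
    """One (bibentry text, identifiers) pair per entry in the reference list."""
    pairs = []
    for ref in record.get("references", []):
        ids = {key: ref[key] for key in REFERENCE_ID_KEYS if ref.get(key)}
        pairs.append((ref.get("bibentry", ""), ids))
    return pairs


# Sample built from the excerpt, keeping only the single reference it shows.
sample = json.loads("""
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    }
  ]
}
""")

if __name__ == "__main__":
    print(article_identifiers(sample))
    for text, ids in reference_identifiers(sample):
        print(ids, "<-", text[:40])
```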

Starting from the output provided by BEE, SPACIN processes each JSON file, retrieving metadata about all the citing/cited articles described in it by querying the Crossref API and the ORCID API. These APIs are also used to disambiguate bibliographic resources and agents by means of the identifiers retrieved (e.g., DOI, ISSN, ISBN, ORCID, URL, and Crossref Member URL). Once SPACIN has retrieved all these metadata, appropriate RDF resources are created (or reused, if they have already been added to the OCC in the past). These are stored in the file system in JSON-LD format and additionally within the OCC triplestore. It is worth noting that, for space and performance reasons, the triplestore includes all the data about the curated entities, but stores neither their provenance data nor the descriptions of the datasets themselves, which are accessible only via HTTP.
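The "create or reuse" step can be illustrated with a small in-memory registry keyed by identifier. This is only a sketch of the idea, not SPACIN's implementation: the class `ResourceRegistry` and its methods are hypothetical, the dictionary stands in for lookups against the corpus, and lower-casing identifier values before comparison is a simplifying assumption.

```python
from typing import Dict, Optional, Tuple


class ResourceRegistry:
    """Hypothetical in-memory stand-in for looking up existing resources."""

    def __init__(self, base_url: str = "https://w3id.org/oc/corpus/br/") -> None:
        self._base_url = base_url
        self._by_identifier: Dict[Tuple[str, str], str] = {}
        self._next_number = 1

    @staticmethod
    def _key(scheme: str, value: str) -> Tuple[str, str]:
        # Assumption: identifiers are compared case-insensitively.
        return (scheme.lower(), value.strip().lower())

    def find(self, identifiers: Dict[str, str]) -> Optional[str]:
        """Return the URL of a known resource sharing any identifier, if one exists."""
        for scheme, value in identifiers.items():
            url = self._by_identifier.get(self._key(scheme, value))
            if url is not None:
                return url
        return None

    def get_or_create(self, identifiers: Dict[str, str]) -> Tuple[str, bool]:
        """Return (resource URL, created flag), reusing a resource when possible."""
        url = self.find(identifiers)
        created = url is None
        if url is None:
            url = f"{self._base_url}{self._next_number}"
            self._next_number += 1
        # Record every identifier so that later lookups by any of them succeed.
        for scheme, value in identifiers.items():
            self._by_identifier[self._key(scheme, value)] = url
        return url, created


if __name__ == "__main__":
    registry = ResourceRegistry()
    print(registry.get_or_create({"doi": "10.1093/intimm/dxp039"}))
    print(registry.get_or_create({"pmid": "19461125", "doi": "10.1093/intimm/dxp039"}))
```

In this sketch the second call reuses the resource created by the first because the DOI matches, and it also records the PMID against that resource so that a later lookup by PMID alone finds it.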

The SPACIN workflow introduced in Figure 1 is a process that runs until no more JSON files are available from BEE. Thus, the current instance of the OCC is evolving dynamically in time, and can be easily extended beyond ingest from Europe PubMed Central by reconfiguring it to interact with additional REST APIs provided by different bibliographic sources, so as to gather new article metadata and their related references, thereby expanding the scope and coverage provided by the OCC.
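The "run until no more JSON files are available" behaviour can be sketched as a loop that drains a directory of pending files. Everything here is an assumption made for illustration: the function `drain_queue`, the directory-as-queue layout, and the convention of renaming processed files to `.done` are not taken from the OpenCitations code.

```python
import json
from pathlib import Path
from typing import Callable


def drain_queue(queue_dir: Path, handle: Callable[[dict], None]) -> int:
    """Process JSON files in queue_dir until none are left; return the count."""
    processed = 0
    while True:
        # Re-scan on every pass so files added during processing are picked up.
        pending = sorted(queue_dir.glob("*.json"))
        if not pending:
            return processed
        for path in pending:
            with path.open(encoding="utf-8") as handle_file:
                handle(json.load(handle_file))
            # Assumption: mark a file as processed by renaming it to *.done.
            path.rename(path.with_suffix(".done"))
            processed += 1


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        queue = Path(tmp)
        (queue / "a.json").write_text('{"localid": "A"}', encoding="utf-8")
        count = drain_queue(queue, lambda record: print(record["localid"]))
        print(count, "file(s) processed")
```

Passing the handler in as a callable mirrors the extensibility noted above: a different source would only need to supply its own files and handler.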

At present, the workflow adds approximately 20,000 new citing/cited bibliographic resources and approximately 3,000 new ORCID identifiers to the OCC each day. We plan to accelerate this rate of ingest considerably in the coming months.