OpenCitations

Corpus

The OpenCitations project is currently creating an open citation database, the Open Citations Corpus (OCC). The OCC is an open repository of scholarly citation data made available under a Creative Commons public domain dedication (CC0). It provides accurate bibliographic references harvested from the scholarly literature that others may freely build upon, enhance and reuse for any purpose, without restriction under copyright or database law.

The OCC stores metadata relevant to these citations in RDF, encoded as JSON-LD. Around the end of August, all the data will also be available as downloadable datasets. In the meantime, an exemplar dataset compliant with the OCC metadata model has been made available, based on article metadata gathered via Europe PubMed Central.

The OCC includes information about six different kinds of bibliographic entities:

The corpus URL (https://w3id.org/oc/corpus/) identifies the entire OCC, which is composed of several sub-datasets, one for each kind of bibliographic entity included in the corpus. Each sub-dataset has a URL formed by suffixing the corpus URL with the two-letter short name for the class of entity (e.g. be for a bibliographic entry) followed by a slash (e.g. https://w3id.org/oc/corpus/be/). Each dataset is described by means of the Data Catalog Vocabulary and the VoID Vocabulary, and a SPARQL endpoint is made available for all the entities included in the entire OCC.
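
The URL scheme described above can be sketched in a few lines of Python. This is only an illustration: the helper name dataset_url is hypothetical and not part of any OpenCitations software, and only the corpus URL and the short name be are taken from the text.

```python
CORPUS_URL = "https://w3id.org/oc/corpus/"


def dataset_url(short_name: str) -> str:
    """Build a sub-dataset URL from a two-letter entity short name.

    Hypothetical helper: it only mirrors the naming rule described in the text.
    """
    if len(short_name) != 2 or not short_name.isalpha():
        raise ValueError(f"expected a two-letter short name, got {short_name!r}")
    # Corpus URL + short name + trailing slash.
    return f"{CORPUS_URL}{short_name.lower()}/"


print(dataset_url("be"))
```

For the short name be, the helper yields the bibliographic entry dataset URL given in the text.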

The ingestion workflow

The ingestion of citation data into the OCC, briefly summarised in Figure 1, is handled by two Python scripts, the Bibliographic Entries Extractor (BEE) and the SPAR Citation Indexer (SPACIN), both available in the OpenCitations GitHub repository.

Figure 1. The steps involving BEE and SPACIN, and their related Python classes, in the production of the OpenCitations Corpus.

BEE is responsible for the creation of JSON files containing information about the articles in the OA subset of PubMed Central (retrieved by using the Europe PubMed Central API). Each of these JSON files is created by asking Europe PubMed Central for the metadata of the articles it stores whose source XML file is available. Once these articles are identified, BEE processes their XML sources to extract the complete reference list of each paper, and includes all the data in the final JSON file. An excerpt from one of these JSON files follows:

{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    ...
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    },
    ...
  ]
}

In particular, for each article retrieved by means of the Europe PubMed Central API, BEE stores all the possible identifiers (in the example, doi, pmid, pmcid, and localid) and all the textual references, enriched by their own related identifiers if they are available. In addition, the JSON file also includes provenance information about the source, its provider and the curator (i.e. the particular BEE Python class responsible for the extraction of these metadata from the source).
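
As a rough illustration of how such a file might be consumed, the following Python sketch loads a trimmed copy of the excerpt and separates identifiers, references and provenance. It is not BEE or SPACIN code: the helper identifiers and the constant ID_KEYS are hypothetical names, the elided entries are omitted, and the bibentry string is shortened for readability.

```python
import json

# Trimmed copy of the excerpt above; elided entries are left out and the
# bibentry text is shortened.
bee_output = json.loads("""
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome (shortened)",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    }
  ]
}
""")

# Identifier fields that appear in the excerpt.
ID_KEYS = ("doi", "pmid", "pmcid", "localid")


def identifiers(record: dict) -> dict:
    """Collect whichever identifier fields are present in a record."""
    return {key: record[key] for key in ID_KEYS if key in record}


citing_ids = identifiers(bee_output)
cited_ids = [identifiers(ref) for ref in bee_output.get("references", [])]
provenance = {
    key: bee_output.get(key) for key in ("curator", "source", "source_provider")
}

print(citing_ids)
print(cited_ids)
print(provenance)
```

Using get for the references list and a presence check for each identifier keeps the sketch tolerant of records in which some fields are missing, which the text suggests can happen for references.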

Starting from the output provided by BEE, SPACIN processes each JSON file, retrieving metadata about all the citing/cited articles described in it by querying the Crossref API and the ORCID API. These APIs are also used to disambiguate bibliographic resources and agents by means of the identifiers retrieved (e.g., DOI, ISSN, ISBN, ORCID, URL, and Crossref member URL). Once SPACIN has retrieved all these metadata, appropriate RDF resources are created (or reused, if they were already added in the past) and stored in the file system in JSON-LD format and additionally within the OCC triplestore. It is worth noting that, for space and performance reasons, the triplestore includes all the data about the curated entities, but stores neither their provenance data nor the descriptions of the datasets themselves, which are accessible only via HTTP.
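
The "create or reuse" step can be pictured with a small in-memory index keyed by identifier. The sketch below is an assumption-laden toy, not SPACIN's actual implementation: the class ResourceRegistry, its method get_or_create and the example.org base URL are all hypothetical, and real disambiguation also relies on the Crossref and ORCID APIs.

```python
from itertools import count
from typing import Dict, Tuple


class ResourceRegistry:
    """Toy in-memory index mapping (scheme, value) identifiers to resource URLs."""

    def __init__(self, base_url: str = "https://example.org/resource/") -> None:
        # The base URL is a placeholder, not an OCC URL.
        self._base_url = base_url
        self._by_identifier: Dict[Tuple[str, str], str] = {}
        self._next_number = count(1)

    def get_or_create(self, identifiers: Dict[str, str]) -> Tuple[str, bool]:
        """Return (resource_url, created) for a set of identifiers."""
        # Normalise case so that the same identifier always maps to one key.
        keys = [(scheme, value.lower()) for scheme, value in identifiers.items()]
        for key in keys:
            if key in self._by_identifier:
                # A known identifier: reuse the existing resource.
                url, created = self._by_identifier[key], False
                break
        else:
            # No identifier is known yet: mint a new resource URL.
            url, created = f"{self._base_url}{next(self._next_number)}", True
        # Remember every identifier so later lookups by any of them succeed.
        for key in keys:
            self._by_identifier.setdefault(key, url)
        return url, created


registry = ResourceRegistry()
first, created_first = registry.get_or_create(
    {"doi": "10.1093/intimm/dxp039", "pmid": "19461125"}
)
second, created_second = registry.get_or_create({"pmid": "19461125"})
print(first, created_first)
print(second, created_second)
```

In this usage, the second call supplies only the PMID, yet it resolves to the resource created by the first call, which is the behaviour the text describes as reusing a resource added in the past.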

The SPACIN workflow introduced in Figure 1 is a process that runs until no more JSON files are available from BEE. Thus, the current instance of the OCC evolves dynamically over time, and can be easily extended beyond Europe PubMed Central by reconfiguring it to interact with additional REST APIs from different sources, so as to gather new article metadata and their related references.
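
A loop of this kind, which keeps consuming files until none remain, might look like the following sketch. It assumes, purely for illustration, that BEE output arrives as .json files in a single directory and that a consumed file is renamed with a .done suffix; the function process_pending and the handle callback are hypothetical names, not part of SPACIN.

```python
import json
import tempfile
from pathlib import Path


def process_pending(inbox: Path, handle) -> int:
    """Process JSON files in `inbox` until none remain; return how many were handled."""
    processed = 0
    while True:
        pending = sorted(inbox.glob("*.json"))
        if not pending:
            # No more input available: the workflow stops.
            return processed
        for path in pending:
            with path.open(encoding="utf-8") as source:
                handle(json.load(source))
            # Rename the file so that the next scan does not pick it up again.
            path.rename(path.with_name(path.name + ".done"))
            processed += 1


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    (inbox / "a.json").write_text('{"pmcid": "PMC4863913"}', encoding="utf-8")
    seen = []
    total = process_pending(inbox, seen.append)

print(total, seen)
```

Rescanning the directory after each pass lets the loop pick up files that arrive while earlier ones are being processed, which matches the idea of a corpus that grows as long as new input is produced.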

Each day the workflow adds 20,000 new citing/cited bibliographic resources and approximately 3,000 new ORCID identifiers.