Open Biomedical Citations in Context Corpus

The Open Biomedical Citations in Context Corpus (a.k.a. CCC) is an RDF dataset, developed with the financial support of the Wellcome Trust, that includes bibliographic and citation data mined from the full text of articles in the Open Access Subset of PubMed Central. Articles are served as JATS/XML documents and have been harvested using the Europe PubMed Central REST API.

Scope

Like the OpenCitations Corpus, CCC includes information about bibliographic entities such as bibliographic resources (br), resource embodiments (re), bibliographic references (be), citations (ci), responsible agents (ra), agents' roles (ar), and identifiers (id). In addition, CCC includes detailed information about in-text references (rp), e.g. "(Daquino et al. 2020)"; groupings of in-text references (pl); discourse elements (de), including sentences, paragraphs, footnotes, captions, tables, and sections; and citation annotations (an). Read about the OCDM model for further details.

It's worth noting that CCC extends OpenCitations identifiers to address the following aspects:

Data access

The corpus is composed of several subsets, one for each type of entity. For instance, https://w3id.org/oc/ccc/rp/ is the subset of all in-text reference pointers. Each entity in a subset is identified by an incremental number that is unique within that subset (e.g. https://w3id.org/oc/ccc/rp/1). Data are served in JSON-LD (see the context.json file), and each subset is described by means of the Data Catalog Vocabulary and the VoID Vocabulary.
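The URL scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names entity_iri and split_iri are hypothetical, the two-letter subset codes are taken from the Scope section, and the local identifier is kept as a string because some entities shown later on this page (e.g. citations) use compound identifiers such as 0701-0702.

```python
import re

# Base IRI of the corpus, as it appears in the examples on this page.
BASE = "https://w3id.org/oc/ccc/"

# Two-letter subset codes listed in the Scope section.
SUBSETS = {"br", "re", "be", "ci", "ra", "ar", "id", "rp", "pl", "de", "an"}

# Pattern for an entity IRI: base, subset code, then the local identifier.
ENTITY_RE = re.compile(r"^https://w3id\.org/oc/ccc/([a-z]{2})/(.+)$")


def entity_iri(subset: str, local_id) -> str:
    """Build the IRI of an entity from its subset code and local identifier."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    return f"{BASE}{subset}/{local_id}"


def split_iri(iri: str) -> tuple:
    """Split an entity IRI into (subset code, local identifier)."""
    match = ENTITY_RE.match(iri)
    if match is None or match.group(1) not in SUBSETS:
        raise ValueError(f"not a CCC entity IRI: {iri!r}")
    return match.group(1), match.group(2)
```

For instance, entity_iri("rp", 1) yields the in-text reference pointer IRI quoted above, and split_iri reverses the operation.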

CCC data are available as dumps on Figshare.

Licensing and disclaimer

CCC is (proudly) released under a Creative Commons CC0 public domain waiver. For this reason, CCC does not include information that may fall under IP restrictions (e.g. the full text of articles). When searching and browsing data via the CCC web search interface, the text of sentences that include in-text references is shown to users for a better experience. However, such data are not persistently stored in CCC, and their reuse must follow publishers' rules.

An example

Below is a brief example (in Turtle syntax, with prefix declarations omitted) of some aspects already addressed in the OpenCitations Corpus and of new ones introduced in CCC.

The journal article br/0701 cites the journal article br/0702. The journal article br/0702 is referenced in the list of bibliographic references by the reference identified as be/0701.

# a simple citation
ccc:br/0701 a fabio:JournalArticle ;
  cito:cites ccc:br/0702 ;
  frbr:part ccc:be/0701 .

ccc:br/0702 a fabio:JournalArticle .

ccc:be/0701 a biro:BibliographicReference ;
  biro:references ccc:br/0702 ;
  oco:hasAnnotation ccc:an/0701 .

ccc:ci/0701-0702 a cito:Citation ;
  cito:hasCitingEntity ccc:br/0701 ;
  cito:hasCitedEntity ccc:br/0702 ;
  datacite:hasIdentifier ccc:id/1 .

ccc:id/1 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:oci ;
  literal:hasLiteralValue "oci:0701-0702".

ccc:an/0701 a oa:Annotation ;
  oa:hasBody ccc:ci/0701-0702 .

In br/0701 there are two in-text references (rp/0701 and rp/0702) denoting the bibliographic reference be/0701. Each in-text reference is identified by an InTRePID, and it is associated with an annotation that refers to a (new, specific) citation.

# first in-text reference and citation
ccc:rp/0701 a c4o:InTextReferencePointer ;
  oco:hasAnnotation ccc:an/0702 ;
  datacite:hasIdentifier ccc:id/0705 .

ccc:an/0702 a oa:Annotation ;
  oa:hasBody ccc:ci/0701-0702/1 .

ccc:id/0705 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:intrepid ;
  literal:hasLiteralValue "intrepid:0701-0702/1-2".

ccc:ci/0701-0702/1 a cito:Citation ;
  cito:hasCitingEntity ccc:br/0701 ;
  cito:hasCitedEntity ccc:br/0702 ;
  datacite:hasIdentifier ccc:id/0702 .

ccc:id/0702 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:oci ;
  literal:hasLiteralValue "oci:0701-0702/1".

# second in-text reference and citation
ccc:rp/0702 a c4o:InTextReferencePointer ;
  oco:hasAnnotation ccc:an/0703 ;
  datacite:hasIdentifier ccc:id/0706 .

ccc:an/0703 a oa:Annotation ;
  oa:hasBody ccc:ci/0701-0702/2 .

ccc:id/0706 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:intrepid ;
  literal:hasLiteralValue "intrepid:0701-0702/2-2".

ccc:ci/0701-0702/2 a cito:Citation ;
  cito:hasCitingEntity ccc:br/0701 ;
  cito:hasCitedEntity ccc:br/0702 ;
  datacite:hasIdentifier ccc:id/0703 .

ccc:id/0703 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:oci ;
  literal:hasLiteralValue "oci:0701-0702/2".
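The identifier literals used above can be taken apart with a couple of regular expressions. The sketch below is not an official parser: the names parse_oci and parse_intrepid are hypothetical, and reading the InTRePID suffix "n-m" as "the n-th of m in-text references to the same bibliographic reference" is an assumption inferred from the values 1-2 and 2-2 in this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# "oci:<citing>-<cited>" with an optional "/<ordinal>" for in-text citations.
OCI_RE = re.compile(r"^oci:(\d+)-(\d+)(?:/(\d+))?$")
# "intrepid:<citing>-<cited>/<ordinal>-<total>" (interpretation assumed, see above).
INTREPID_RE = re.compile(r"^intrepid:(\d+)-(\d+)/(\d+)-(\d+)$")


@dataclass(frozen=True)
class Oci:
    citing: str
    cited: str
    ordinal: Optional[int]  # None for the plain article-to-article citation


@dataclass(frozen=True)
class InTRePID:
    citing: str
    cited: str
    ordinal: int  # assumed: position of this in-text reference
    total: int    # assumed: number of in-text references to the same entry


def parse_oci(value: str) -> Oci:
    match = OCI_RE.match(value)
    if match is None:
        raise ValueError(f"not an OCI literal: {value!r}")
    citing, cited, ordinal = match.groups()
    return Oci(citing, cited, int(ordinal) if ordinal is not None else None)


def parse_intrepid(value: str) -> InTRePID:
    match = INTREPID_RE.match(value)
    if match is None:
        raise ValueError(f"not an InTRePID literal: {value!r}")
    citing, cited, ordinal, total = match.groups()
    return InTRePID(citing, cited, int(ordinal), int(total))
```

The citing and cited parts are kept as strings so that leading zeros (as in 0701) are preserved.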

The first in-text reference rp/0701 appears as "Doe et al. 2020". It occurs in the first section de/0701, titled "Introduction", in the second paragraph de/0702 and the third sentence de/0703 (section, paragraph, and sentence numbers are relative to the entire document, not to the parent element). Both in-text references and discourse elements are also identified by an XPath.

# the sentence
ccc:de/0703 a deo:DiscourseElement , doco:Sentence ;
  c4o:isContextOf ccc:rp/0701 ;
  fabio:hasSequenceIdentifier "3" ;
  datacite:hasIdentifier ccc:id/0708 .

# the in-text reference
ccc:rp/0701 c4o:hasContent "Doe et al. 2020";
  fabio:hasSequenceIdentifier "1" ;
  datacite:hasIdentifier ccc:id/0707 .

ccc:id/0707 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
  literal:hasLiteralValue "/article/body/sec[1]/p[2]/xref[1]".

ccc:id/0708 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
  literal:hasLiteralValue "substring(string(/article/body/sec[1]/p[2]),190,278)".

# the paragraph
ccc:de/0702 a deo:DiscourseElement , doco:Paragraph ;
  frbr:part ccc:de/0703 ;
  fabio:hasSequenceIdentifier "2" ;
  datacite:hasIdentifier ccc:id/0709 .

ccc:id/0709 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
  literal:hasLiteralValue "/article/body/sec[1]/p[2]".

# the section
ccc:de/0701 a deo:DiscourseElement , doco:Section ;
  dcterms:title "Introduction" ;
  frbr:part ccc:de/0702 ;
  fabio:hasSequenceIdentifier "1" ;
  datacite:hasIdentifier ccc:id/07010 .

ccc:id/07010 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
  literal:hasLiteralValue "/article/body/sec[1]".
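To give an idea of how the XPath identifiers above could be resolved against a JATS/XML document, here is a minimal sketch based on Python's standard library. It is an illustration under stated assumptions, not the tooling used to build CCC: resolve_identifier and SAMPLE_JATS are hypothetical, the sample document is invented, and only the two identifier shapes shown on this page are handled (a plain element path, and substring(string(path),start,length) with 1-based start as in XPath 1.0). Since ElementTree does not evaluate XPath functions or absolute paths, the expression is unpacked by hand.

```python
import re
import xml.etree.ElementTree as ET

# Matches "substring(string(<path>),<start>,<length>)" as used for sentences.
SUBSTRING_RE = re.compile(
    r"^substring\(string\((?P<path>.+)\),(?P<start>\d+),(?P<length>\d+)\)$"
)


def resolve_identifier(root: ET.Element, identifier: str) -> str:
    """Return the text selected by an identifier of one of the two shapes above."""
    match = SUBSTRING_RE.match(identifier)
    path = match.group("path") if match else identifier

    # ElementTree rejects absolute paths, so strip the leading "/<root tag>"
    # and search relative to the root element instead.
    prefix = "/" + root.tag
    if path != prefix and not path.startswith(prefix + "/"):
        raise ValueError(f"path does not start at the root element: {path!r}")
    element = root.find("." + path[len(prefix):])
    if element is None:
        raise LookupError(f"no element matches {path!r}")

    # XPath string() of an element is the concatenation of its text nodes.
    text = "".join(element.itertext())
    if match is None:
        return text

    # XPath substring() counts characters from 1; Python slices from 0.
    start = int(match.group("start"))
    length = int(match.group("length"))
    return text[start - 1:start - 1 + length]


# Invented sample document, loosely shaped like a JATS article body.
SAMPLE_JATS = (
    "<article><body><sec><title>Introduction</title>"
    "<p>First paragraph.</p>"
    "<p>Some text (<xref>Doe et al. 2020</xref>) here.</p>"
    "</sec></body></article>"
)
sample_root = ET.fromstring(SAMPLE_JATS)
```

With this sample, both "/article/body/sec[1]/p[2]/xref[1]" and "substring(string(/article/body/sec[1]/p[2]),12,15)" select the text of the in-text reference.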

The second in-text reference rp/0702 appears in a list of in-text references, pl/0701, which includes three other pointers.

# the list of in-text references
ccc:pl/0701 a c4o:SingleLocationPointerList ;
  c4o:hasContent "(Doe et al. 2020 ; Smith 2019 ; Ellis et al. 2011 ; Phillips 2020)" ;
  co:element ccc:rp/0702 , ccc:rp/0703 , ccc:rp/0704 , ccc:rp/0705 ;
  datacite:hasIdentifier ccc:id/07011 .

ccc:id/07011 a datacite:Identifier ;
  datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
  literal:hasLiteralValue "substring(string(/article/body/sec[3]/p[2]),247,66)".