The Open Biomedical Citations in Context Corpus (a.k.a. CCC) is an RDF dataset, developed thanks to the financial support of the Wellcome Trust, which includes bibliographic and citation data mined from the full text of articles collected in the Open Access Subset of PubMed Central. Articles are served as JATS/XML documents and have been harvested using the Europe PubMed Central REST API.
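As a rough illustration of the harvesting step, the sketch below builds the request URL for the JATS/XML full text of one article. It assumes the `fullTextXML` endpoint of the Europe PMC RESTful web service; the function names and the example PMCID are hypothetical and are not taken from the CCC code base.

```python
from urllib.request import urlopen

# Assumed base URL of the Europe PMC RESTful web service.
EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def full_text_url(pmcid: str) -> str:
    """Return the URL that serves the JATS/XML full text of an article."""
    pmcid = pmcid.strip().upper()
    if not pmcid.startswith("PMC") or not pmcid[3:].isdigit():
        raise ValueError(f"not a PMCID: {pmcid!r}")
    return f"{EUROPE_PMC_REST}/{pmcid}/fullTextXML"


def fetch_full_text(pmcid: str, timeout: float = 30.0) -> bytes:
    """Download the JATS/XML document (performs a network request)."""
    with urlopen(full_text_url(pmcid), timeout=timeout) as response:
        return response.read()


# Hypothetical identifier, used only to show the URL shape.
print(full_text_url("PMC1234567"))
```

Only `full_text_url` is exercised here; `fetch_full_text` needs network access and an article that is actually part of the Open Access Subset.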
Similarly to the OpenCitations Corpus, CCC includes information about bibliographic entities such as bibliographic resources (br), resource embodiments (re), bibliographic references (be), citations (ci), responsible agents (ra), agents' roles (ar), and identifiers (id).
In addition, CCC includes detailed information about in-text references (rp), e.g. "(Daquino et al. 2020)"; groupings of in-text references (pl); discourse elements (de), including sentences, paragraphs, footnotes, captions, tables, and sections; and citation annotations (an).
See the OpenCitations Data Model (OCDM) for further details.
It's worth noting that CCC extends OpenCitations identifiers to address the following aspects:
XPath: in-text reference pointers and discourse elements mined from JATS/XML documents are identified by means of a local identifier, that is, an XPath selector (e.g. /article/body/sec[1]/p[3]/xref[1]).
XPath identifiers allow one to parse the XML source document and to extract the full-text of the entity at hand.
However, while strings identifying in-text reference pointers are available (e.g. "(Daquino et al. 2020)", "[13-18]"), the full text of sentences, paragraphs, and sections is not available due to Intellectual Property restrictions.
CCC includes XPath identifiers of all the in-text references, and only of discourse elements including at least one in-text reference.
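The way an XPath identifier can be resolved against the source document may be sketched as follows. This is a minimal illustration, not the CCC implementation: the JATS-like snippet and the function names are hypothetical, and because Python's standard `xml.etree.ElementTree` supports only a subset of XPath, the `substring(string(...),start,length)` form used for sentence-level identifiers is handled by hand.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature JATS-like document, used only for illustration.
JATS = """<article><body>
<sec><title>Introduction</title>
<p>First paragraph.</p>
<p>Prior work <xref ref-type="bibr" rid="r1">(Daquino et al. 2020)</xref> covers this.</p>
</sec></body></article>"""

SUBSTRING = re.compile(
    r"^substring\(string\((?P<path>.+)\),(?P<start>\d+),(?P<length>\d+)\)$"
)


def find(root: ET.Element, xpath: str) -> ET.Element:
    """Resolve an absolute, purely positional path such as /article/body/sec[1]/p[2]."""
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        raise ValueError(f"path does not start at <{root.tag}>")
    # ElementTree only accepts relative paths, so the root step is dropped.
    element = root if len(steps) == 1 else root.find("./" + "/".join(steps[1:]))
    if element is None:
        raise LookupError(f"no element matches {xpath}")
    return element


def resolve(root: ET.Element, identifier: str) -> str:
    """Return the text selected by a plain path or by a substring(string(...)) form."""
    match = SUBSTRING.match(identifier.strip())
    if match is None:
        return "".join(find(root, identifier).itertext())
    text = "".join(find(root, match["path"]).itertext())
    start, length = int(match["start"]), int(match["length"])
    return text[start - 1:start - 1 + length]  # XPath string positions are 1-based


root = ET.fromstring(JATS)
print(resolve(root, "/article/body/sec[1]/p[2]/xref[1]"))
print(resolve(root, "substring(string(/article/body/sec[1]/p[2]),12,21)"))
```

Both calls select the pointer string "(Daquino et al. 2020)" from the snippet: the first through the element path, the second through a character range inside the paragraph.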
Sequence number: along with XPath identifiers, discourse elements are identified with a more human-readable sequence number (e.g. Section n. 1, Paragraph n. 3, Table n. 2), identifying their relative position in the document. CCC stores sequence numbers of discourse elements that include at least one in-text reference.
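One way to derive such sequence numbers is sketched below. It is only an illustration under stated assumptions: the JATS element names (`sec`, `p`, `table-wrap`), the treatment of every `xref` as an in-text reference pointer, and the function name are all choices made for this example, and elements are numbered in document order across the whole document.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature JATS-like document, used only for illustration.
JATS = """<article><body>
<sec><title>Introduction</title><p>One.</p><p>Two <xref rid="r1">[1]</xref>.</p></sec>
<sec><title>Methods</title><p>Three <xref rid="r2">[2]</xref>.</p></sec>
</body></article>"""

# Assumed mapping from JATS element names to human-readable labels.
LABELS = {"sec": "Section", "p": "Paragraph", "table-wrap": "Table"}


def sequence_numbers(root: ET.Element) -> list:
    """Label the elements that contain at least one xref, numbered document-wide."""
    counters = {tag: 0 for tag in LABELS}
    labels = []
    for element in root.iter():  # iter() walks the tree in document order
        if element.tag in LABELS:
            counters[element.tag] += 1  # every element counts, stored or not
            if element.find(".//xref") is not None:
                labels.append(f"{LABELS[element.tag]} n. {counters[element.tag]}")
    return labels


print(sequence_numbers(ET.fromstring(JATS)))
```

The first paragraph is counted but not reported because it contains no pointer, so the second paragraph still receives number 2.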
OCI: an OCI is a global persistent identifier of citations. It usually appears in the form oci:<citing>-<cited>, where <citing> and <cited> are locally assigned numerical identifiers of a citing document and a cited document, respectively.
In CCC an OCI is assigned both to the general citation, in the same form <citing>-<cited>, and to every occurrence of an in-text reference in the citing document relevant to that citation.
For instance: the article identified as 0701 in CCC cites the article identified as 07090, and two in-text references appear in the citing article referencing the cited article. The general OCI for the citation will be 0701-07090, while the two specific citations instantiated by in-text references will be addressed as 0701-07090/1 and 0701-07090/2 respectively.
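The naming scheme just described can be sketched in a few lines. The function names are hypothetical; the `oci:` prefix follows the general form given above, and identifiers are kept as strings so that leading zeros are preserved.

```python
def general_oci(citing: str, cited: str) -> str:
    """Build the OCI of the general citation between two documents."""
    if not (citing.isdigit() and cited.isdigit()):
        raise ValueError("citing and cited identifiers must be numerical")
    return f"oci:{citing}-{cited}"


def in_text_ocis(citing: str, cited: str, occurrences: int) -> list:
    """Build one OCI per in-text reference occurrence, numbered from 1."""
    base = general_oci(citing, cited)
    return [f"{base}/{n}" for n in range(1, occurrences + 1)]


print(general_oci("0701", "07090"))      # oci:0701-07090
print(in_text_ocis("0701", "07090", 2))  # ['oci:0701-07090/1', 'oci:0701-07090/2']
```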
InTRePID: the In-Text Reference Pointer Identifier (InTRePID) is a globally unique persistent identifier (PID) of in-text reference pointers.
InTRePID is an extension of OCI that appears in the following form: intrepid:<oci>/<ordinal>-<total>, where <oci> is the numerical part of the OCI identifying a citation between a citing and a cited entity, <ordinal> is the nth occurrence of an in-text reference pointer within the text of the citing entity relevant to the cited entity addressed in the OCI, and <total> is the total number of in-text reference pointers that appear in the full text of the citing entity relevant to the (same) cited entity.
Following the prior example, the two in-text references addressing the citation between 0701 and 07090 will be respectively associated with the following InTRePIDs: intrepid:0701-07090/1-2 and intrepid:0701-07090/2-2.
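A small parser makes the structure of an InTRePID explicit. This is a sketch under the form described above; the class and function names are hypothetical, and the check that the ordinal does not exceed the total is an assumption that follows from the definitions rather than a documented validation rule.

```python
import re
from dataclasses import dataclass

INTREPID = re.compile(
    r"^intrepid:(?P<citing>\d+)-(?P<cited>\d+)/(?P<ordinal>\d+)-(?P<total>\d+)$"
)


@dataclass(frozen=True)
class InTRePID:
    citing: str   # kept as strings to preserve leading zeros
    cited: str
    ordinal: int  # nth pointer to the cited entity in the citing text
    total: int    # number of pointers to the same cited entity

    @property
    def oci(self) -> str:
        """OCI of the specific citation instantiated by this pointer."""
        return f"oci:{self.citing}-{self.cited}/{self.ordinal}"

    def __str__(self) -> str:
        return f"intrepid:{self.citing}-{self.cited}/{self.ordinal}-{self.total}"


def parse_intrepid(value: str) -> InTRePID:
    match = INTREPID.match(value)
    if match is None:
        raise ValueError(f"not an InTRePID: {value!r}")
    ordinal, total = int(match["ordinal"]), int(match["total"])
    if not 1 <= ordinal <= total:
        raise ValueError("ordinal must be between 1 and total")
    return InTRePID(match["citing"], match["cited"], ordinal, total)


pointer = parse_intrepid("intrepid:0701-07090/2-2")
print(pointer.oci)  # oci:0701-07090/2
```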
The corpus is composed of several subsets, one for each type of entity.
For instance, https://w3id.org/oc/ccc/rp/ is the subset of all the in-text reference pointers.
Each entity of the subset is identified by an incremental number, unique within the subset (e.g. https://w3id.org/oc/ccc/rp/1).
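The subset and entity naming can be illustrated with a pair of helpers that build an entity IRI and split it back into its parts. The base IRI and the short names come from the text above; the function names are hypothetical, and numbers are handled as strings so that no digit is lost.

```python
BASE = "https://w3id.org/oc/ccc/"

# Short names of the entity types listed earlier in this document.
SUBSETS = {
    "br": "bibliographic resource", "re": "resource embodiment",
    "be": "bibliographic reference", "ci": "citation",
    "ra": "responsible agent", "ar": "agent role", "id": "identifier",
    "rp": "in-text reference pointer", "pl": "pointer list",
    "de": "discourse element", "an": "annotation",
}


def entity_iri(subset: str, number: str) -> str:
    """Build the IRI of an entity from its subset short name and number."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    return f"{BASE}{subset}/{number}"


def split_iri(iri: str) -> tuple:
    """Return (subset, number) for an entity IRI."""
    if not iri.startswith(BASE):
        raise ValueError(f"not a CCC IRI: {iri!r}")
    subset, _, number = iri[len(BASE):].partition("/")
    if subset not in SUBSETS or not number:
        raise ValueError(f"not an entity IRI: {iri!r}")
    return subset, number


print(entity_iri("rp", "1"))                      # https://w3id.org/oc/ccc/rp/1
print(split_iri("https://w3id.org/oc/ccc/rp/1"))  # ('rp', '1')
```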
Data are served in JSON-LD (see the context.json file) and each subset is described by means of the Data Catalog Vocabulary and the VoID Vocabulary.
CCC data are available as dumps on Figshare.
CCC is (proudly) released under a Creative Commons CC0 public domain waiver. For this reason, CCC does not include information that may fall under IP restrictions (e.g. the full text of articles). When searching and browsing data via the CCC web search interface, the text of sentences including in-text references is shown to users for a better experience. However, such text is not persistently stored in CCC and its reuse must follow publishers' rules.
Below is a brief example (in Turtle syntax) of some aspects already addressed in the OpenCitations Corpus and new ones introduced in CCC.
The journal article br/0701 cites the journal article br/0702.
The journal article br/0702 is referenced in the list of bibliographic references by the reference identified as be/0701.
# a simple citation
ccc:br/0701 a fabio:JournalArticle ;
    cito:cites ccc:br/0702 ;
    frbr:part ccc:be/0701 .
ccc:br/0702 a fabio:JournalArticle .
ccc:be/0701 a biro:BibliographicReference ;
    biro:references ccc:br/0702 ;
    oco:hasAnnotation ccc:an/0701 .
ccc:ci/0701-0702 a cito:Citation ;
    cito:hasCitingEntity ccc:br/0701 ;
    cito:hasCitedEntity ccc:br/0702 ;
    datacite:hasIdentifier ccc:id/1 .
ccc:id/1 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:oci ;
    literal:hasLiteralValue "oci:0701-0702" .
ccc:an/0701 a oa:Annotation ;
    oa:hasBody ccc:ci/0701-0702 .
In br/0701 there are two in-text references (rp/0701 and rp/0702) denoting the bibliographic reference be/0701.
Each in-text reference is identified by an InTRePID, and it is associated with an annotation that refers to a (new, specific) citation.
# first in-text reference and citation
ccc:rp/0701 a c4o:InTextReferencePointer ;
    oco:hasAnnotation ccc:an/0702 ;
    datacite:hasIdentifier ccc:id/0705 .
ccc:an/0702 a oa:Annotation ;
    oa:hasBody ccc:ci/0701-0702/1 .
ccc:id/0705 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:intrepid ;
    literal:hasLiteralValue "intrepid:0701-0702/1-2" .
ccc:ci/0701-0702/1 a cito:Citation ;
    cito:hasCitingEntity ccc:br/0701 ;
    cito:hasCitedEntity ccc:br/0702 ;
    datacite:hasIdentifier ccc:id/0702 .
ccc:id/0702 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:oci ;
    literal:hasLiteralValue "oci:0701-0702/1" .
# second in-text reference and citation
ccc:rp/0702 a c4o:InTextReferencePointer ;
    oco:hasAnnotation ccc:an/0703 ;
    datacite:hasIdentifier ccc:id/0706 .
ccc:an/0703 a oa:Annotation ;
    oa:hasBody ccc:ci/0701-0702/2 .
ccc:id/0706 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:intrepid ;
    literal:hasLiteralValue "intrepid:0701-0702/2-2" .
ccc:ci/0701-0702/2 a cito:Citation ;
    cito:hasCitingEntity ccc:br/0701 ;
    cito:hasCitedEntity ccc:br/0702 ;
    datacite:hasIdentifier ccc:id/0703 .
ccc:id/0703 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:oci ;
    literal:hasLiteralValue "oci:0701-0702/2" .
The first in-text reference rp/0701 appears as "Doe et al. 2020". It appears in the first section de/0701, called "Introduction", second paragraph de/0702, third sentence de/0703 (section, paragraph, and sentence numbers are relative to the entire document, not to the parent element).
Both in-text references and the discourse elements are also identified by an XPath.
# the sentence
ccc:de/0703 a deo:DiscourseElement , doco:Sentence ;
    c4o:isContextOf ccc:rp/0701 ;
    fabio:hasSequenceIdentifier "3" ;
    datacite:hasIdentifier ccc:id/0708 .
# the in-text reference
ccc:rp/0701 c4o:hasContent "Doe et al. 2020" ;
    fabio:hasSequenceIdentifier "1" ;
    datacite:hasIdentifier ccc:id/0707 .
ccc:id/0707 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
    literal:hasLiteralValue "/article/body/sec[1]/p[2]/xref[1]" .
ccc:id/0708 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
    literal:hasLiteralValue "substring(string(/article/body/sec[1]/p[2]),190,278)" .
# the paragraph
ccc:de/0702 a deo:DiscourseElement , doco:Paragraph ;
    frbr:part ccc:de/0703 ;
    fabio:hasSequenceIdentifier "2" ;
    datacite:hasIdentifier ccc:id/0709 .
ccc:id/0709 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
    literal:hasLiteralValue "/article/body/sec[1]/p[2]" .
# the section
ccc:de/0701 a deo:DiscourseElement , doco:Section ;
    dcterms:title "Introduction" ;
    frbr:part ccc:de/0702 ;
    fabio:hasSequenceIdentifier "1" ;
    datacite:hasIdentifier ccc:id/07010 .
ccc:id/07010 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
    literal:hasLiteralValue "/article/body/sec[1]" .
The second in-text reference rp/0702 appears in a list of in-text references pl/0701, which includes three other pointers.
# the list of in-text references
ccc:pl/0701 a c4o:SingleLocationPointerList ;
    c4o:hasContent "(Doe et al. 2020 ; Smith 2019 ; Ellis et al. 2011 ; Phillips 2020)" ;
    co:element ccc:rp/0702 , ccc:rp/0703 , ccc:rp/0704 , ccc:rp/0705 ;
    datacite:hasIdentifier ccc:id/07011 .
ccc:id/07011 a datacite:Identifier ;
    datacite:usesIdentifierScheme datacite:local-resource-identifier-scheme ;
    literal:hasLiteralValue "substring(string(/article/body/sec[3]/p[2]),247,66)" .