Dataset: yago-annotated-facts (development version)

This is a subset of the YAGO 4 knowledge base (paper), based on Wikidata, in the version from February 24, 2020. The dataset includes only the fact annotations expressed in RDF-star, that is, facts about facts. Each stream element corresponds to one item in Wikidata.

Stream preview
0000000000.ttl
<< <http://yago-knowledge.org/resource/_Q56236170> <http://schema.org/dissolutionDate> "2009"^^<http://www.w3.org/2001/XMLSchema#gYear> >>
        <http://schema.org/endDate>    "2009-12"^^<http://www.w3.org/2001/XMLSchema#gYearMonth>;
        <http://schema.org/startDate>  "2009-06"^^<http://www.w3.org/2001/XMLSchema#gYearMonth> .
0000000010.ttl
<< <http://yago-knowledge.org/resource/Open_Science_Radio_Q18744554> <http://schema.org/creator> <http://yago-knowledge.org/resource/Matthias_Fromm_Q18748012> >>
        <http://schema.org/startDate>  "2013-01-02"^^<http://www.w3.org/2001/XMLSchema#date> .

<< <http://yago-knowledge.org/resource/Open_Science_Radio_Q18744554> <http://schema.org/creator> <http://yago-knowledge.org/resource/Konrad_Förstner_Q18744528> >>
        <http://schema.org/startDate>  "2014-01-19"^^<http://www.w3.org/2001/XMLSchema#date> .
0000000100.ttl
<< <http://yago-knowledge.org/resource/Carnaval_na_avenida_Central,_atual_avenida_Rio_Branco_Q65621070> <http://schema.org/dateCreated> "1906-06-22"^^<http://www.w3.org/2001/XMLSchema#date> >>
        <http://schema.org/endDate>    "1906"^^<http://www.w3.org/2001/XMLSchema#gYear>;
        <http://schema.org/startDate>  "1906"^^<http://www.w3.org/2001/XMLSchema#gYear> .
0000001000.ttl
<< <http://yago-knowledge.org/resource/Margherita_Cagol> <http://schema.org/nationality> <http://yago-knowledge.org/resource/Kingdom_of_Italy> >>
        <http://schema.org/endDate>    "1946-06-18"^^<http://www.w3.org/2001/XMLSchema#date>;
        <http://schema.org/startDate>  "1945-04-08"^^<http://www.w3.org/2001/XMLSchema#date> .

<< <http://yago-knowledge.org/resource/Margherita_Cagol> <http://schema.org/nationality> <http://yago-knowledge.org/resource/Italy> >>
        <http://schema.org/endDate>    "1975-06-05"^^<http://www.w3.org/2001/XMLSchema#date>;
        <http://schema.org/startDate>  "1946-06-18"^^<http://www.w3.org/2001/XMLSchema#date> .
0000010000.ttl
<< <http://yago-knowledge.org/resource/Mihrengiz_Kadın> <http://schema.org/nationality> <http://yago-knowledge.org/resource/Ottoman_Empire> >>
        <http://schema.org/endDate>    "1923"^^<http://www.w3.org/2001/XMLSchema#gYear>;
        <http://schema.org/startDate>  "1869"^^<http://www.w3.org/2001/XMLSchema#gYear> .
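The preview shows the shape of each stream element: a quoted triple in `<< ... >>` followed by its annotations as predicate-object pairs. As a minimal sketch, the annotations could be pulled apart with regular expressions, as below. This is not a Turtle-star parser. It only assumes the layout seen in the preview (full IRIs in angle brackets, datatyped literals, no prefixes or blank nodes), and the names `TERM`, `QUOTED`, `PRED_OBJ`, and `parse_annotations` are hypothetical helpers introduced for illustration.

```python
import re

# One RDF term as it appears in the preview: an IRI or a (possibly datatyped) literal.
TERM = r'(?:<[^>]*>|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>)?)'
# A quoted triple: << subject predicate object >>
QUOTED = re.compile(rf'<<\s*({TERM})\s+({TERM})\s+({TERM})\s*>>')
# One annotation: predicate and object, terminated by ';' or '.'
PRED_OBJ = re.compile(rf'({TERM})\s+({TERM})\s*[;.]')


def parse_annotations(text):
    """Return a list of (quoted_triple, [(predicate, object), ...]) pairs."""
    results = []
    matches = list(QUOTED.finditer(text))
    for i, match in enumerate(matches):
        # The annotations of a quoted triple run until the next quoted triple starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        results.append((match.groups(), PRED_OBJ.findall(body)))
    return results


# Sample copied from the 0000001000.ttl preview above.
sample = r'''
<< <http://yago-knowledge.org/resource/Margherita_Cagol> <http://schema.org/nationality> <http://yago-knowledge.org/resource/Kingdom_of_Italy> >>
        <http://schema.org/endDate>    "1946-06-18"^^<http://www.w3.org/2001/XMLSchema#date>;
        <http://schema.org/startDate>  "1945-04-08"^^<http://www.w3.org/2001/XMLSchema#date> .

<< <http://yago-knowledge.org/resource/Margherita_Cagol> <http://schema.org/nationality> <http://yago-knowledge.org/resource/Italy> >>
        <http://schema.org/endDate>    "1975-06-05"^^<http://www.w3.org/2001/XMLSchema#date>;
        <http://schema.org/startDate>  "1946-06-18"^^<http://www.w3.org/2001/XMLSchema#date> .
'''

parsed = parse_annotations(sample)
for (subject, predicate, obj), annotations in parsed:
    print(subject, predicate, obj)
    for ann_predicate, ann_object in annotations:
        print("   ", ann_predicate, ann_object)
```

For anything beyond a quick inspection, a parser that implements the RDF-star draft would be the safer choice, since this sketch does not handle prefixed names, language tags, or nested quoted triples.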

General information

Technical metadata

  • Has stream type usage:
    • RDF stream type usage (1)
    • RDF stream type usage (2)
      • Type: RDF stream type usage (stax:RdfStreamTypeUsage)
      • Comment: The dataset can be viewed as a stream of graphs. Each graph corresponds to the RDF-star annotations of one Wikidata item. (en)
      • Has stream type: RDF subject graph stream (stax:subjectGraphStream)
  • Has stream element count: 617,768
  • Has stream element split:
    • Type: Stream elements split by topic (rb:TopicStreamElementSplit)
    • Comment: Every stream element corresponds to one Wikidata item. (en)
    • Has subject shape:
      • Comment: Custom target – subject of any quoted triple in the subject position. (en)
      • Target custom: YAGO annotated facts target (rb:yagoTarget)
  • Uses vocabulary: http://schema.org/
  • Conforms to W3C RDF 1.1 specification: no
  • Conforms to W3C RDF-star draft specification as of December 17, 2021: yes
  • Uses generalized triples: no
  • Uses generalized RDF datasets: no
  • Uses RDF-star: yes

Distributions

The dataset is published in three size variants (10K, 100K, and full), each containing a specific number of stream elements. For each size, three distribution types are available: flat (an N-Triples/N-Quads file in the RDF Message Log format), streaming (a .tar.gz archive with Turtle/TriG files, one file per stream element), and Jelly (a native binary format for streaming RDF). See the documentation for more details.

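| Distribution size | Statements | Flat     | Streaming | Jelly    |
|-------------------|------------|----------|-----------|----------|
| 10K               | 22,977     | 267.8 KB | 376.3 KB  | 257.1 KB |
| 100K              | 226,648    | 2.5 MB   | 3.6 MB    | 2.7 MB   |
| Full              | 2,484,547  | 29.4 MB  | 36.2 MB   | 28.9 MB  |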
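Since the streaming distribution is described as a .tar.gz archive with one file per stream element, it can be read sequentially without unpacking it to disk. The sketch below uses only Python's standard `tarfile` module. The internal layout of the archive is an assumption (member names such as `0000000000.ttl` are taken from the stream preview), and `iter_stream_elements` and `build_demo_archive` are hypothetical helper names. The demo builds a tiny in-memory archive, so no download is required.

```python
import io
import tarfile


def iter_stream_elements(fileobj):
    """Yield (member_name, text) for each regular file in a .tar.gz archive, in archive order."""
    # "r|gz" reads the archive as a forward-only stream, which suits large downloads.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            extracted = archive.extractfile(member)
            yield member.name, extracted.read().decode("utf-8")


def build_demo_archive(files):
    """Build a small in-memory .tar.gz from a {name: text} mapping (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


# Placeholder contents; real stream elements hold Turtle-star like the preview.
demo = build_demo_archive({
    "0000000000.ttl": "# first stream element\n",
    "0000000001.ttl": "# second stream element\n",
})
elements = list(iter_stream_elements(demo))
for name, text in elements:
    print(name, len(text))
```

To read a downloaded archive instead, an open binary file object (for example from `open(path, "rb")`) can be passed in place of the in-memory buffer.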

The full metadata of all distributions can be found below.

Full flat distribution

Full stream distribution

Full Jelly distribution

100K elements flat distribution

100K elements stream distribution

100K elements Jelly distribution

10K elements flat distribution

10K elements stream distribution

10K elements Jelly distribution

Statistics

Statistics for full distributions

  • Title: Statistics for full distributions
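| Metric              | Sum       | Unique     | Mean   | St. dev. | Min.  | Max.       |
|---------------------|-----------|------------|--------|----------|-------|------------|
| IRIs                | 3,631,687 | ~591,866   | 5.88   | 3.22     | 3     | 853        |
| Blank nodes         | 0         | N/A        | 0.00   | 0.00     | 0     | 0          |
| Literals            | 1,736,327 | ~57,521    | 2.81   | 2.50     | 1     | 66         |
| Simple literals     | 211       | ~174       | 0.00   | 0.02     | 0     | 3          |
| Datatype literals   | 1,736,116 | ~57,356    | 2.81   | 2.50     | 1     | 66         |
| Language literals   | 0         | ~0         | 0.00   | 0.00     | 0     | 0          |
| Datatypes           | 647,861   | 6          | 1.05   | 0.22     | 1     | 3          |
| ASCII control chars | 0         | N/A        | 0.00   | 0.00     | 0     | 0          |
| Quoted triples      | 2,484,547 | N/A        | 4.02   | 6.10     | 1     | 1,455      |
| Subjects            | 2,009,932 | ~1,905,199 | 3.25   | 3.04     | 2     | 850        |
| Predicates          | 1,622,855 | ~75        | 2.63   | 0.48     | 2     | 3          |
| Objects             | 3,127,393 | ~165,841   | 5.06   | 5.06     | 1     | 853        |
| Graphs              | 617,768   | ~1         | 1.00   | 0.00     | 1     | 1          |
| Statements          | 2,484,547 | N/A        | 4.02   | 6.10     | 1     | 1,455      |
| Bytes per statement | N/A       | N/A        | 336.15 | 642.39   | 0.78  | 311,033.00 |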

Statistics for 100K distributions

  • Title: Statistics for 100K distributions
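| Metric              | Sum     | Unique   | Mean   | St. dev. | Min.  | Max.       |
|---------------------|---------|----------|--------|----------|-------|------------|
| IRIs                | 502,972 | ~102,634 | 5.03   | 5.30     | 3     | 853        |
| Blank nodes         | 0       | N/A      | 0.00   | 0.00     | 0     | 0          |
| Literals            | 187,612 | ~37,278  | 1.88   | 0.98     | 1     | 49         |
| Simple literals     | 66      | ~66      | 0.00   | 0.03     | 0     | 3          |
| Datatype literals   | 187,546 | ~37,218  | 1.88   | 0.97     | 1     | 49         |
| Language literals   | 0       | ~0       | 0.00   | 0.00     | 0     | 0          |
| Datatypes           | 110,424 | 5        | 1.10   | 0.31     | 1     | 3          |
| ASCII control chars | 0       | N/A      | 0.00   | 0.00     | 0     | 0          |
| Quoted triples      | 226,648 | N/A      | 2.27   | 9.28     | 1     | 1,455      |
| Subjects            | 246,103 | ~238,851 | 2.46   | 5.24     | 2     | 850        |
| Predicates          | 257,646 | ~32      | 2.58   | 0.49     | 2     | 3          |
| Objects             | 332,939 | ~52,703  | 3.33   | 5.46     | 1     | 853        |
| Graphs              | 100,000 | ~1       | 1.00   | 0.00     | 1     | 1          |
| Statements          | 226,648 | N/A      | 2.27   | 9.28     | 1     | 1,455      |
| Bytes per statement | N/A     | N/A      | 291.52 | 1,282.61 | 0.78  | 311,033.00 |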

Statistics for 10K distributions

  • Title: Statistics for 10K distributions
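| Metric              | Sum    | Unique  | Mean   | St. dev. | Min.  | Max.     |
|---------------------|--------|---------|--------|----------|-------|----------|
| IRIs                | 49,533 | ~10,228 | 4.95   | 0.93     | 3     | 10       |
| Blank nodes         | 0      | N/A     | 0.00   | 0.00     | 0     | 0        |
| Literals            | 19,576 | ~7,330  | 1.96   | 0.90     | 1     | 8        |
| Simple literals     | 0      | ~0      | 0.00   | 0.00     | 0     | 0        |
| Datatype literals   | 19,576 | ~7,330  | 1.96   | 0.90     | 1     | 8        |
| Language literals   | 0      | ~0      | 0.00   | 0.00     | 0     | 0        |
| Datatypes           | 11,195 | 3       | 1.12   | 0.33     | 1     | 3        |
| ASCII control chars | 0      | N/A     | 0.00   | 0.00     | 0     | 0        |
| Quoted triples      | 22,977 | N/A     | 2.30   | 1.34     | 1     | 10       |
| Subjects            | 23,762 | ~23,838 | 2.38   | 0.53     | 2     | 7        |
| Predicates          | 26,100 | ~12     | 2.61   | 0.49     | 2     | 3        |
| Objects             | 33,009 | ~7,548  | 3.30   | 1.39     | 1     | 13       |
| Graphs              | 10,000 | ~1      | 1.00   | 0.00     | 1     | 1        |
| Statements          | 22,977 | N/A     | 2.30   | 1.34     | 1     | 10       |
| Bytes per statement | N/A    | N/A     | 293.82 | 197.27   | 11.50 | 2,500.00 |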
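
The Mean column for Statements appears to be a per-stream-element average. Dividing the Statements sum by the Graphs sum (one graph per stream element) reproduces the reported means for all three sizes. The short check below uses only figures from the tables above. Reading the Mean column this way is an inference from the numbers, not something the page states.

```python
# (statements sum, graphs sum) for each distribution size, copied from the statistics tables.
distributions = {
    "10K": (22_977, 10_000),
    "100K": (226_648, 100_000),
    "Full": (2_484_547, 617_768),
}

# Mean statements per stream element, rounded to two decimals as in the tables.
means = {
    name: round(statements / graphs, 2)
    for name, (statements, graphs) in distributions.items()
}

for name, mean in means.items():
    print(f"{name}: {mean:.2f}")
# 10K: 2.30
# 100K: 2.27
# Full: 4.02
```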