Dataset source format
This page describes the source format of datasets in RiverBench. This format is used only internally and is different to the release format.
Overall file structure
This file structure will be created for you when your repository is created by an admin using the template.
- .github/ – the directory with CI configuration needed to package and publish your dataset. You don't need to change anything there.
- LICENSE – specifies the license for the dataset.
- metadata.ttl – describes the dataset in a machine-readable manner. See the metadata documentation for more details.
- README.md – auto-generated from metadata.ttl. You don't need to touch it.
You can also add more files and directories (like .gitignore, etc.) to the repository.
Source files
- Source files must be uploaded as a GitHub release to your repository, following this guide.
- There must be exactly one source file per dataset (either triples.tar.gz, graphs.tar.gz, or quads.tar.gz).
- The source file must be a .tar.gz archive, with a structure as outlined below.
- The archive can contain only directories (nesting is allowed) and stream element files.
- The file extension of the stream elements depends on the stream type. See subsections below for more details.
- The files must be named starting from 0000000000.Y, and sequentially up to X.Y, where X + 1 is the number of stream elements in the dataset, and Y is the file extension. All numbers must be zero-padded to exactly ten digits.
- Important! All files must be stored in the tar sequentially in lexicographic order. By default, the tar command on Linux does not sort files, so their order in the archive is effectively arbitrary. See the creating a source archive section below for more details.
- There are no special rules for grouping files into directories – but examples of what would work are presented below. It is recommended to have at most ~1000 files per directory to avoid issues with filesystems and file browsers.
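The naming rules above can be sketched as a small check. This is a minimal illustration, not part of the RiverBench tooling: the function names `expected_name` and `check_names` are hypothetical, and the extension table is taken from the stream format subsections of this page. It checks only that file names follow the sequence in archive order, not the archive itself.

```python
import posixpath

# File extension per stream type, as given in the format subsections of this page.
EXTENSIONS = {"triples": ".ttl", "graphs": ".trig", "quads": ".trig"}


def expected_name(index: int, stream_type: str) -> str:
    """Return the file name of the stream element with the given 0-based index."""
    if not 0 <= index < 10**10:
        raise ValueError("index does not fit in ten digits")
    # Zero-pad the index to exactly ten digits and append the extension.
    return f"{index:010d}{EXTENSIONS[stream_type]}"


def check_names(paths: list[str], stream_type: str) -> list[str]:
    """Return a list of problems found in the given file paths.

    `paths` must list only files (no directories), in the order in which
    they are stored in the archive. Directory parts are ignored.
    """
    problems = []
    for index, path in enumerate(paths):
        name = posixpath.basename(path)
        wanted = expected_name(index, stream_type)
        if name != wanted:
            problems.append(f"position {index}: expected {wanted}, found {name}")
    return problems
```

For example, `check_names(["0000000000.ttl", "0000000002.ttl"], "triples")` reports one problem, because the second file should be named 0000000001.ttl.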
Example 1: flat file structure, graph stream, 431256 elements:
- 0000000000.trig
- 0000000001.trig
- ...
- 0000431255.trig
Example 2: files in directories, triple stream, 201900 elements:
- 0000/
  - 0000000000.ttl
  - 0000000001.ttl
  - ...
  - 0000000999.ttl
- ...
- 0201/
  - 0000201000.ttl
  - 0000201001.ttl
  - ...
  - 0000201899.ttl
Example 3: files in nested directories, triple stream, 201900 elements:
- 00/
  - 00/
    - 0000000000.ttl
    - ...
    - 0000000099.ttl
  - ...
  - 99/
    - 0000009900.ttl
    - ...
    - 0000009999.ttl
- ...
- 20/
  - ...
  - 18/
    - 0000201800.ttl
    - ...
    - 0000201899.ttl
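The layouts in Examples 2 and 3 can be derived from the element index alone. The sketch below shows one way to compute the paths; the function names `grouped_path` and `nested_path` are hypothetical, and the directory sizes (1000 and 100 files) are simply those used in the examples, not a requirement.

```python
def grouped_path(index: int, ext: str, per_dir: int = 1000) -> str:
    """Layout of Example 2: one directory level, `per_dir` files per directory."""
    # Four-digit directory names sort correctly for up to 10000 directories.
    return f"{index // per_dir:04d}/{index:010d}{ext}"


def nested_path(index: int, ext: str) -> str:
    """Layout of Example 3: two directory levels, 100 files per leaf directory."""
    top = index // 10_000       # each top-level directory covers 10000 elements
    sub = (index // 100) % 100  # each leaf directory covers 100 elements
    # Two-digit names sort correctly for up to 100 top-level directories.
    return f"{top:02d}/{sub:02d}/{index:010d}{ext}"
```

With these definitions, `grouped_path(201899, ".ttl")` gives 0201/0000201899.ttl and `nested_path(201800, ".ttl")` gives 20/18/0000201800.ttl, matching the examples. Because directory names are zero-padded to a fixed width, sorting the full paths lexicographically yields the same order as sorting by index, as long as the number of directories stays within the limits noted in the comments.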
Creating a source archive
The stream element files must be stored in the source archive sequentially, so that the archive can be processed by the CI jobs in a streaming manner, speeding up packaging and validation.
Let's say you have a directory named "dataset" with .ttl files (possibly in nested directories) that you want to add to the archive. On Linux you can run:
You can then verify that the files were stored sequentially in the tar by running:
You should see a list of files in the archive, in lexicographic order.
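As a platform-independent alternative, the same can be sketched with Python's standard tarfile module. This is an illustration under assumptions, not the project's official procedure: the function names are hypothetical, and only regular files are written to the archive (directories are created implicitly on extraction). GNU tar also has a `--sort=name` option that serves the same purpose.

```python
import os
import tarfile


def create_sorted_archive(source_dir: str, archive_path: str) -> None:
    """Pack all files under source_dir into a .tar.gz in lexicographic path order."""
    members = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            full = os.path.join(root, name)
            # Path stored in the archive: relative to source_dir, with "/" separators.
            rel = os.path.relpath(full, source_dir).replace(os.sep, "/")
            members.append((rel, full))
    members.sort()  # sort by the path that will be stored in the archive
    with tarfile.open(archive_path, "w:gz") as tar:
        for rel, full in members:
            tar.add(full, arcname=rel, recursive=False)


def list_files(archive_path: str) -> list[str]:
    """Return the paths of regular files in the order they are stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [member.name for member in tar if member.isfile()]


def is_sorted(archive_path: str) -> bool:
    """Check that the files are stored in lexicographic order."""
    names = list_files(archive_path)
    return names == sorted(names)
```

A typical call would be `create_sorted_archive("dataset", "triples.tar.gz")`, followed by `is_sorted("triples.tar.gz")` to confirm the order. The archive should be written outside the "dataset" directory so that it does not end up packed into itself.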
Graph stream format
In the graph stream format, every stream element is an RDF dataset, and every RDF dataset corresponds to exactly one file. In the dataset there must be exactly one named RDF graph pair <n, G>, where G is an RDF graph, and n is the graph name. Apart from graph G, the RDF dataset may contain any number of triples in the default graph. If the stream is a timestamped stream, then the default graph must include exactly one timestamp triple <n, p, t>, where p is the designated timestamp property, as specified in metadata.
Note
The above format specification is meant to be compatible with the draft RSP Data model, when the stream is timestamped.
The files must be in the RDF 1.1 TriG format, or in the TriG-star format, if the dataset uses RDF-star. The extensions of the files must be .trig. The files must be encoded in UTF-8.
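To make the structure concrete, the following sketch renders one timestamped graph stream element as TriG text. The function name `graph_element` and all IRIs in the usage note are hypothetical; the actual timestamp property is whatever the dataset's metadata designates.

```python
def graph_element(graph_name, triples, timestamp_property=None, timestamp=None):
    """Render one graph stream element as TriG text.

    graph_name and timestamp_property are IRIs without angle brackets.
    triples is a list of (subject, predicate, object) terms already written
    in TriG syntax. timestamp is a literal already written in TriG syntax.
    """
    # The single named graph <n, G>.
    lines = [f"<{graph_name}> {{"]
    lines += [f"    {s} {p} {o} ." for s, p, o in triples]
    lines.append("}")
    # The timestamp triple <n, p, t> goes into the default graph.
    if timestamp_property is not None:
        lines.append(f"<{graph_name}> <{timestamp_property}> {timestamp} .")
    return "\n".join(lines) + "\n"
```

For instance, calling it with the graph name `http://example.org/graph/0`, one triple, the property `http://example.org/timestamp`, and the literal `"2024-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>` produces a named graph block followed by a single default-graph triple. The result would then be written to 0000000000.trig with UTF-8 encoding.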
Example graphs dataset: citypulse-traffic-graphs
Quad stream format
In the quad stream format, every stream element is an RDF dataset, and every RDF dataset corresponds to exactly one file. In the dataset there can be zero or more named RDF graphs, and the default graph (which may be empty).
Note
The above format specification is meant to cover all valid RDF 1.1 datasets. Because of this, a completely empty file is also a valid stream element.
The files must be in the RDF 1.1 TriG format, or in the TriG-star format, if the dataset uses RDF-star. The extensions of the files must be .trig. The files must be encoded in UTF-8.
Example quads dataset: nanopubs
Triple stream format
In the triple stream format, every stream element is an unnamed (default) RDF graph, and every RDF graph corresponds to exactly one file.
Note
The above format specification is meant to cover all valid RDF 1.1 graphs. Because of this, a completely empty file is also a valid stream element.
The files must be in the RDF 1.1 Turtle format, or in the Turtle-star format, if the dataset uses RDF-star. The extensions of the files must be .ttl. The files must be encoded in UTF-8.
Example triples dataset: yago-annotated-facts