Stable: 2.1.0

Creating a new dataset

This guide explains step-by-step how to propose a new dataset for inclusion in RiverBench. You only need to prepare the dataset and its metadata – the rest of the process will be carried out by a RiverBench admin and automated scripts.

Step 0: Check the requirements

Before you start, have a look at the requirements for new datasets. If your dataset does not meet these requirements, it will not be accepted.

Step 1: Create a dataset proposal

Open a new dataset proposal in the RiverBench repository: New dataset proposal

Fill in the fields with the required information, using the instructions embedded in the form.

Note

If you have trouble filling in any of the fields, you can leave them blank and ask the maintainer for help.

Step 2: Wait for approval

An administrator will be notified of your request and will review the form and the dataset. The administrator may ask for additional information or clarifications. Once the review is complete, the administrator will create a new repository for you and give you access to it.

Step 3: Upload the dataset sources

  1. Create a source archive by following the guide on preparing a source archive.
  2. Access the repository created for your dataset and click on "Releases" in the right sidebar.
  3. Click on the "Create a new release" button to start the process of creating a new release for your dataset.
  4. Fill in the following fields for the new release:
    • Input "source" as the tag for the release.
    • Enter "Source" as the release name.
    • Check the "Set as a pre-release" option (it's below the large text field).
    • Leave other options unchanged.
  5. Upload the prepared source archive (source.tar.gz) by dragging and dropping the file into the designated area.
  6. Once the source archive is attached, click on the "Publish release" button to finalize the upload.
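
If you prefer the command line to the web form, the same release can be sketched with the GitHub CLI (`gh`). The snippet below is only an illustration, not part of the official workflow: it assumes `gh` is installed and authenticated, that your repository is named `RiverBench/dataset-[IDENTIFIER]`, and it uses `my-dataset` as a hypothetical identifier. It prints the command rather than running it, so you can check it first.

```python
import shlex


def release_command(dataset_id: str, archive: str = "source.tar.gz") -> list[str]:
    """Build a `gh release create` call that mirrors the form fields above."""
    return [
        "gh", "release", "create",
        "source",                                      # tag for the release
        archive,                                       # file attached to the release
        "--repo", f"RiverBench/dataset-{dataset_id}",  # assumed repository name
        "--title", "Source",                           # release name
        "--notes", "",                                 # empty release description
        "--prerelease",                                # "Set as a pre-release"
    ]


if __name__ == "__main__":
    # "my-dataset" is a hypothetical identifier; replace it with your own.
    print(shlex.join(release_command("my-dataset")))
```

Run the printed command from the directory that contains `source.tar.gz`.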

Step 4: Fill out the metadata

  1. Open the metadata.ttl file in your new dataset repository.
  2. Use the information from the issue template you filled out earlier to complete the required fields in the metadata.ttl file. Replace the placeholder text with the appropriate information from the template.
    • You can have a look at the metadata of other datasets for reference.
    • In the dcterms:description field and other free-text fields you can use Markdown formatting.
    • For dcat:theme use concepts from the EuroVoc thesaurus. Use only elements of type "Concept" (without a number in their name), not "Concept scheme" or "Domain concept".
  3. Open the LICENSE file and replace the placeholder text with the license of the dataset. You can find commonly used templates here.
  4. Save your changes and commit to the main branch.
  5. Inform the administrator in your issue that you have completed the metadata for your dataset. The admin will then finalize adding the dataset to the suite and provide any necessary assistance.
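
The rule for dcat:theme above can be turned into a quick sanity check on the labels you picked. This is a rough heuristic, not an official validator: it assumes that "Concept scheme" and "Domain concept" entries are the ones whose label starts with a numeric code, as the parenthetical in the step suggests. The function name and the sample labels are illustrative only.

```python
import re

# Assumption: scheme and domain entries start with a numeric code,
# for example "04 POLITICS", while plain concepts do not.
_LEADING_CODE = re.compile(r"^\s*\d+\b")


def looks_like_plain_concept(label: str) -> bool:
    """Return True if the label does not start with a standalone numeric code."""
    return _LEADING_CODE.match(label) is None


if __name__ == "__main__":
    # Illustrative labels; check the real ones in the EuroVoc browser.
    for label in ["open data", "04 POLITICS", "0406 political framework"]:
        verdict = "ok" if looks_like_plain_concept(label) else "not a plain concept"
        print(f"{label!r}: {verdict}")
```

A label that fails this check is worth a second look in the thesaurus before you commit the metadata.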

Instructions for admins

  • Review the issue template – make sure all required information is provided.
  • Create a new repository for the dataset with name dataset-[IDENTIFIER]. In the repository settings:
    • Use the RiverBench/dataset-template repository as the template.
    • Mark the repo as public.
  • Add the dataset maintainer as a collaborator to the repository in repo settings.
  • Reply in the issue to the maintainer with the link to the repository and a link to step 3 of this guide.
  • After the maintainer completes steps 3 and 4, check that the CI passes up to the dataset and documentation update steps (these two steps are expected to fail at this point). If an earlier step fails, try to fix the issue.
  • Go to the organization secret settings. For secrets PAT_DOC_REPO_HOOKS, PAT_MAIN_REPO_HOOKS, and PAT_DATASET_CAT_REPO_HOOKS add repository access for the new dataset repository.
  • In Zenodo settings enable the new repository.
  • Create a new branch in the main repo (RiverBench/RiverBench) for the proposal issue.
  • In the new branch, run git submodule add ../dataset-[ID] datasets/[ID].
  • Commit and push changes to GitHub.
  • Create a pull request for the branch and merge it to main.
  • Re-run the CI in the dataset repo and check if the dataset and documentation update steps pass correctly.
  • After all CI runs finish, check that the dataset list and profiles were updated correctly. Check the dataset's documentation page for any obvious issues.
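
The branch and submodule steps in the list above can be sketched as a short sequence of git commands. The snippet only prints them, so nothing changes in your checkout. The branch name, the commit message, and the `origin` remote are assumptions made for this example; the `git submodule add` line follows the form given in the list.

```python
import shlex


def submodule_commands(dataset_id: str, branch: str) -> list[list[str]]:
    """Return the git commands for adding a dataset submodule on a new branch."""
    return [
        # Create and switch to a branch for the proposal issue.
        ["git", "checkout", "-b", branch],
        # Register the dataset repository as a submodule (relative URL, as in the list above).
        ["git", "submodule", "add", f"../dataset-{dataset_id}", f"datasets/{dataset_id}"],
        # `git submodule add` already stages .gitmodules and the new path.
        ["git", "commit", "-m", f"Add dataset {dataset_id}"],
        # Assumes the GitHub remote is called "origin".
        ["git", "push", "--set-upstream", "origin", branch],
    ]


if __name__ == "__main__":
    # Both arguments are hypothetical placeholders.
    for cmd in submodule_commands("my-dataset", "add-my-dataset"):
        print(shlex.join(cmd))
```

Run the printed commands from the root of a RiverBench/RiverBench clone, then open the pull request as described.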

See also