
Datasets API#

Required properties#

A published dataset has the following required properties. Drafts require only data_catalog and title.

| Field | key | value |
| --- | --- | --- |
| Data Catalog | data_catalog | str |
| Title | title | dict |
| Description | description | dict |
| Access Rights | access_rights | object |
| Actors | actors | object, at least one "creator" role and exactly one "publisher" role |
| Persistent identifier | persistent_identifier | str |

Persistent identifier#

A persistent identifier (PID) uniquely identifies the dataset. The data catalog field allowed_pid_types determines what kinds of identifiers are supported for datasets in the catalog. The possible allowed_pid_types values are:

  • "external": External identifier not maintained by Metax.
  • "URN": URN (Uniform Resource Name) identifier generated by Metax.
  • "DOI": DOI (Digital Object Identifier) identifier generated by Metax.

An external persistent identifier is assigned directly to the dataset, e.g. "persistent_identifier": "value-of-pid". The provider of the dataset is responsible for creating and maintaining the identifier.

To request a generated persistent identifier, include the desired identifier type in the generate_pid_on_publish field, e.g. "generate_pid_on_publish": "URN", and leave persistent_identifier undefined. When the dataset is published successfully, the persistent_identifier is set by Metax and the dataset will have "pid_generated_by_fairdata": true. The persistent identifier of a dataset with a generated PID cannot be changed afterwards.
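
For example, the following request body fragment asks Metax to generate a URN when the dataset is published (a minimal sketch; the title value is illustrative and other required fields are omitted):

{
  "data_catalog": "urn:nbn:fi:att:data-catalog-ida",
  "title": {
    "en": "Dataset with a generated URN"
  },
  "generate_pid_on_publish": "URN"
}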

Access Rights#

Information about who can access the resource or an indication of its security status. 2

A dataset whose files are publicly available should have access type "Open" from the reference data. Access type is used by Etsin to determine who can download IDA files from the dataset. The supported types are:

  • Open: Anyone can access the data.
  • Login: Only logged in users can access the data.
  • Embargo: Data can be accessed starting from the available date. If available is not set, the data will not become accessible.
  • Permit: Data can be accessed only after requesting permission. Not implemented yet.
  • Restricted: Data access is not allowed.

For restrictive access types it is recommended to add one or more restriction_grounds values to indicate why access to the data is restricted.

The available access rights fields are:

| Field | key | value |
| --- | --- | --- |
| Description | description | dict |
| Access Type | access_type (required) | reference data from /v3/reference-data/access-types |
| Restriction Grounds | restriction_grounds | list of reference data from /v3/reference-data/restriction-grounds |
| Available date | available | date when data is available for download |
| License | license | list of objects |

License is a special kind of reference data object, as it can have additional metadata properties that are writable by the user:

  • custom_url
  • description

The custom_url and description fields allow specifying additional information about the selected license. If custom_url is set without providing url, the "Other" license is used by default.

If the dataset has a license that is not in the reference data, choose the best matching "Other" type license in url and add a URL to the actual license as custom_url and/or describe the license in description.

Example

{
  "access_rights": {
    "id": "39f48c3d-1f9b-43a8-9a59-7bf38e1ec2f4",
    "description": {
      "en": "Access is granted freely to anyone in the world, license restrictions apply."
    },
    "license": [
      {
        "custom_url": "https://creativecommons.org/publicdomain/zero/1.0/legalcode",
        "description": {
          "en": "The dataset is available for research, teaching and study."
        },
        "id": "3c37ceab-7138-41f8-8fcd-c93624ffe483",
        "url": "http://uri.suomi.fi/codelist/fairdata/license/code/CC0-1.0",
        "in_scheme": "http://uri.suomi.fi/codelist/fairdata/license",
        "pref_label": {
          "en": "Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication",
          "fi": "Creative Commons Yleismaailmallinen (CC0 1.0) Public Domain -lausuma"
        }
      }
    ],
    "access_type": {
      "id": "20f133ae-6026-4e73-b0f8-6aab18a4963f",
      "url": "http://uri.suomi.fi/codelist/fairdata/access_type/code/open",
      "in_scheme": "http://uri.suomi.fi/codelist/fairdata/access_type",
      "pref_label": {
        "en": "Open",
        "fi": "Avoin"
      }
    }
  }
}
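
For restrictive access types, an access_rights object could combine access_type, available and restriction_grounds roughly as in the sketch below. The embargo access type URL follows the pattern of the reference data above; the restriction grounds URL and the available date are illustrative placeholders, so check the actual codes from /v3/reference-data/access-types and /v3/reference-data/restriction-grounds.

{
  "access_rights": {
    "access_type": {
      "url": "http://uri.suomi.fi/codelist/fairdata/access_type/code/embargo"
    },
    "available": "2024-06-01",
    "restriction_grounds": [
      {
        "url": "http://uri.suomi.fi/codelist/fairdata/restriction_grounds/code/research"
      }
    ]
  }
}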

Data Catalog#

This is the id of the Data Catalog object, which can be seen in the /v3/data-catalogs list.

Actors#

Dataset actors describe persons and organizations that have had a role in creating the dataset.

Each actor has an organization and optionally a person. Actors may have one or more roles from the list of supported roles: creator, publisher, curator, rights_holder, contributor. Each dataset should have at least one creator.

Actors, persons and organizations each have an id value that can be used when creating or updating actors:

  • If no id is provided, identical objects are considered the same. Changes to the object will produce a new object that replaces the existing one.
  • If an id is provided, all objects with the same id are considered the same. This allows updating existing objects.
  • A temporary id in the format #value can be used to create a new object for each different id value. The actual id is generated by Metax.
  • An object can be provided with only the id field if it already exists in the dataset or elsewhere in the request.

Actor, person and organization objects are not shared across datasets (except for reference data); even identical actors will have different id values in different datasets.

Reference data organizations are listed in /v3/organizations. When updating a dataset, reference data organizations are identified by url or id. Using url is preferred as it will be the same across different Metax instances. Any changes to reference data organizations in a request are ignored. The fields url and in_scheme have values only for reference data organizations.

As a special case, roles are ignored when determining which actors are the same. Repeated instances of the same object in the actors list will be merged into one actor that combines all the values in roles.

Example

{
  "actors": [
    {
      "roles": [
        "creator"
      ],
      "organization": {
        "url": "http://uri.suomi.fi/codelist/fairdata/organization/code/09206320",
        "in_scheme": "http://uri.suomi.fi/codelist/fairdata/organization",
        "pref_label": {
          "en": "CSC – IT Center for Science",
          "fi": "CSC - Tieteen tietotekniikan keskus Oy",
          "sv": "CSC – IT Center for Science",
          "und": "CSC - Tieteen tietotekniikan keskus Oy"
        }
      },
      "person": {
        "name": "John Doe"
      }
    },
    {
      "roles": [
        "publisher"
      ],
      "organization": {
        "url": "http://uri.suomi.fi/codelist/fairdata/organization/code/09206320",
        "in_scheme": "http://uri.suomi.fi/codelist/fairdata/organization",
        "pref_label": {
          "en": "CSC – IT Center for Science",
          "fi": "CSC - Tieteen tietotekniikan keskus Oy",
          "sv": "CSC – IT Center for Science",
          "und": "CSC - Tieteen tietotekniikan keskus Oy"
        }
      },
      "person": {
        "name": "Jane Doe"
      }
    }
  ]
}
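
As a sketch of how temporary ids can be used, the same organization could be defined once and then referenced by id only in another actor. Here #org and the organization name are illustrative placeholders for a non-reference-data organization:

{
  "actors": [
    {
      "roles": ["creator"],
      "organization": {
        "id": "#org",
        "pref_label": {
          "en": "Example University"
        }
      }
    },
    {
      "roles": ["publisher"],
      "organization": {
        "id": "#org"
      }
    }
  ]
}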

Publishing datasets#

By default, new datasets are created as drafts and have "state": "draft". Draft datasets are not visible to the general public until they are published. To publish a draft dataset, send a POST request to /v3/datasets/<id>/publish.

To create a dataset and immediately publish it in a single request, create it with "state": "publish". The state of an existing dataset can only be changed with the dataset publishing endpoint.

Updating published datasets#

Changes to a published dataset are public immediately. It is also possible to create a linked temporary draft dataset with /v3/datasets/<id>/create-draft. This returns a copy of the dataset with a new id and a draft_of field that refers to the original dataset. To apply changes made to the linked draft, send a POST request to /v3/datasets/<id of draft>/publish. This will update the published dataset and delete the linked draft dataset.
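
For example, the full flow of editing a published dataset via a linked draft consists of three requests (the ids are placeholders):

POST /v3/datasets/<id>/create-draft
PATCH /v3/datasets/<id of draft>
POST /v3/datasets/<id of draft>/publish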

Removing a published dataset with DELETE /v3/datasets/<id> will hide it from listings by default but it will still be available with the ?include_removed=true query parameter. Deleting a draft removes it permanently.

Optional properties#

There are multiple optional fields that can be used to provide additional information about the dataset:

| Field | key | value |
| --- | --- | --- |
| Issued date | issued | date |
| Keywords | keyword | list of str |
| Field of Science | field_of_science | list of reference data from /v3/reference-data/fields-of-science |
| Language | language | list of reference data from /v3/reference-data/language |
| Research Infrastructure | infrastructure | list of reference data from /v3/reference-data/research-infras |
| Theme | theme | list of reference data from /v3/reference-data/themes |
| Other identifiers | other_identifiers | list of object |
| Provenance | provenance | list of object |
| Spatial coverage | spatial | list of object |
| Temporal coverage | temporal | list of object |
| Remote resources | remote_resources | list of object |
| Metadata owner | metadata_owner | object |
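
The object-valued fields are described in their own sections below. The simpler fields can be set directly; for example, a sketch with an issued date and keywords (values are illustrative):

{
  "issued": "2023-06-28",
  "keyword": ["climate", "test data"]
}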

Language#

A language of the resource. This refers to the natural language used for textual metadata (i.e., titles, descriptions, etc.) of a cataloged resource (i.e., dataset or service) or the textual values of a dataset distribution. 1

The language field is a list of language reference data objects. Only the url field is required to add a language reference. Both of the definitions below are valid objects:

Example

{
  "language": [
    {
      "url": "http://lexvo.org/id/iso639-3/fin"
    },
    {
      "url": "http://lexvo.org/id/iso639-3/eng",
      "pref_label": {
        "fi": "Englanti",
        "en": "English"
      },
      "in_scheme": "http://lexvo.org/id/"
    }
  ]
}

Spatial coverage#

Spatial coverage describes the spatial characteristics of the dataset. The optional reference field should contain a location from reference data. The custom_wkt field allows specifying geometry as WKT strings in the WGS84 coordinate system.

Example

{
  "reference": {
    "in_scheme": "http://www.yso.fi/onto/yso/places",
    "url": "http://www.yso.fi/onto/onto/yso/c_9908ce39",
    "pref_label": {
      "fi": "Alppikylä (Helsinki)",
      "sv": "Alpbyn (Helsingfors)"
    }
  },
  "full_address": "Alppikylä",
  "geographic_name": "Alppikylä"
}
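
The custom_wkt field can be used together with or instead of the reference location. A minimal sketch, assuming custom_wkt takes a list of WKT strings (the point coordinates are illustrative):

{
  "geographic_name": "Example point",
  "custom_wkt": [
    "POINT(24.9384 60.1699)"
  ]
}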

Temporal coverage#

Temporal coverage describes the temporal characteristics of the resource.

Specify start_date and end_date date values to indicate a period of time. Only one of the values is required, e.g. end_date can be left out to signify an ongoing process.

Example

{
  "temporal": [
    {
      "start_date": "2023-09-20",
      "end_date": "2023-11-25"
    },
    {
      "start_date": "2023-12-23"
    }
  ]
}

Relations#

The relation list allows describing other entities that a dataset is related to.

Example

{
  "relation": [
    {
      "entity": {
        "title": {
          "en": "Another dataset"
        },
        "description": {
          "en": "This is the description of a dataset."
        },
        "entity_identifier": "doi:some_dataset",
        "type": {
          "url": "http://uri.suomi.fi/codelist/fairdata/resource_type/code/dataset"
        }
      },
      "relation_type": {
        "url": "http://purl.org/spar/cito/cites"
      }
    }
  ]
}

Remote Resources#

Remote resources allow associating a dataset with data available on the Internet. Dataset files and remote resources are mutually exclusive, so a dataset cannot have both.

Example

{
  "remote_resources": [
    {
      "title": {
        "en": "Documentation for the thing"
      },
      "description": null,
      "use_category": {
        "url": "http://uri.suomi.fi/codelist/fairdata/use_category/code/documentation"
      },
      "download_url": "https://example.com/doc.txt",
      "access_url": null,
      "file_type": {
        "url": "http://uri.suomi.fi/codelist/fairdata/file_type/code/text"
      },
      "mediatype": "text/plain",
      "checksum": "sha256:a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447"
    }
  ]
}

Metadata owner#

Metadata owner describes who owns the dataset. Normally this is determined automatically on dataset creation based on the authenticated user and can be omitted. Special privileges are required to change the automatically assigned values.

{
  "metadata_owner": {
    "user": "test_user",
    "organization": "test_organization"
  }
}

Versions#

Revisions are named changes to a single dataset. A single dataset can have as many revisions as it needs. Versions are two or more datasets that represent a series of datasets in the same set. Versions never share the same persistent identifier or id. Versions are only created when the user explicitly wants to create one. Revisions are created automatically.

Enabling versioning on dataset#

The dataset needs to be in a data catalog with dataset_versioning_enabled set to true.

Working with versions#

Datasets have a dataset_versions field that has links to other versions of the dataset.

Dataset files#

A dataset can have files associated with it, and associated files and directories can have additional dataset-specific metadata. All associated files have to be from the same file storage (e.g. same IDA project).

When viewing a dataset with GET /v3/datasets/<id>, the response includes a summary of its file data in the fileset object. For example:

Example

{
  ...
  "fileset": {
      "storage_service": "ida",
      "csc_project": "project",
      "total_files_count": 2,
      "total_files_size": 2048
  }
}

Browsing dataset files#

Dataset files can be viewed as a flat list or browsed as a directory tree:

  • GET /v3/datasets/<id>/files View flat list of dataset files.
  • GET /v3/datasets/<id>/directories View root directory of dataset files.
  • GET /v3/datasets/<id>/directories?path=<path> View content of path <path>, e.g. ?path=/data/subdir/.

The endpoints support the same parameters as the corresponding /v3/files and /v3/directories endpoints and use pagination by default.

The returned files and directories have dataset-specific metadata included in the dataset_metadata field, or null if metadata is not set.

Adding, updating or removing dataset files#

Files can be added to and removed from unpublished draft datasets. Files can be added to published datasets only if the dataset is cumulative or if there are no existing files in the dataset.

To modify dataset file associations, include the file storage parameters (i.e. storage_service and csc_project) and a directory_actions or file_actions list in the fileset object when creating or updating a dataset.

For example, to update files of an existing dataset, use PATCH /v3/datasets/<id> with {"fileset": <fileset object>}.

The fileset object should look like

{
  "storage_service": <service, e.g. ida>,
  "csc_project": <project>,
  "directory_actions": [
    {
      "pathname": <path, e.g. /data/>,
      "action": <action>,
      "dataset_metadata": <directory metadata object>
    }
  ],
  "file_actions": [
    {
      "id": <id>,
      "action": <action>
      "dataset_metadata": <file metadata object>
    }
  ]
}

where the optional action value is one of

  • "add" (default): Add file or all files in directory to dataset, update dataset_metadata if present.
  • "update": Only update dataset_metadata without adding or removing files.
  • "remove": Remove file or all files in directory and subdirecories from dataset.

In addition to id, file actions also support identifying files by storage_identifier or pathname.

Note

Operations are performed based on the end result of doing all actions in the listed order, with directory_actions always processed before file_actions. E.g. removing /data/ and adding /data/subdir/ in the same request will remove everything from /data/ except subdir. Any metadata attached to subdir will remain.

If dataset_metadata is present but set to null, existing metadata will be removed. Metadata is also removed if the file or directory is no longer in the dataset after the operations.

The response fileset object will include the normal fileset summary and additional values added_files_count and removed_files_count which tell how many files were added and how many files were removed by the operations.
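
For example, a sketch of removing a directory and a single file (identified by pathname) from an existing IDA dataset; the project name and paths are illustrative:

PATCH /v3/datasets/<id>

{
  "fileset": {
    "storage_service": "ida",
    "csc_project": "test_project",
    "directory_actions": [
      {
        "pathname": "/data/old/",
        "action": "remove"
      }
    ],
    "file_actions": [
      {
        "pathname": "/data/obsolete.csv",
        "action": "remove"
      }
    ]
  }
}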

Cumulative datasets#

Cumulative datasets are datasets where new files can be added after publication without creating a new dataset version. The cumulation status can be changed using the cumulative_state parameter. It has the following possible values:

  • 0 (default): Non-cumulative. Files cannot be added after publication.
  • 1: Cumulation active. Files can be added even after publication.
  • 2: Cumulation closed. Files can no longer be added.

If a dataset has been published, the only allowed change to cumulative_state is from active to closed.
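
For example, assuming a published dataset with active cumulation, it could be closed with a partial update like this sketch:

PATCH /v3/datasets/<id>

{
  "cumulative_state": 2
}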

Dataset file and directory metadata fields#

The following fields are supported in dataset-specific file and directory metadata:

| Field | key | value |
| --- | --- | --- |
| Title | title | str |
| Description | description | str |
| File type (files only) | file_type | reference data from /v3/reference-data/file-type |
| Use category | use_category | reference data from /v3/reference-data/use-categories |

Examples#

Creating a minimal dataset with files from a directory#

Example

POST /v3/datasets

{
  "data_catalog": "urn:nbn:fi:att:data-catalog-ida",
  "title": {
    "en": "Test dataset"
  },
  "fileset": {
    "csc_project": "test_project",
    "storage_service": "ida",
    "directory_actions": [
      {
        "pathname": "/data/",
        "dataset_metadata": {
          "title": "Test results",
          "use_category": {
            "url": "http://uri.suomi.fi/codelist/fairdata/use_category/code/outcome"
          }
        }
      }
    ]
  }
}