Datasets API#
Required properties#
A published dataset has the following required properties. Drafts require only data_catalog and title.
Field | key | value |
---|---|---|
Data Catalog | data_catalog | str |
Title | title | dict |
Description | description | dict |
Access Rights | access_rights | object |
Actors | actors | object, at least one "creator" role and exactly one "publisher" role |
Persistent identifier | persistent_identifier | str |
Persistent identifier#
A persistent identifier (PID) uniquely identifies the Dataset.
The data catalog field allowed_pid_types determines what kind of identifiers are supported for datasets in the catalog.
The possible allowed_pid_types values are:
- "external": External identifier not maintained by Metax.
- "URN": URN (Uniform Resource Name) identifier generated by Metax.
- "DOI": DOI (Digital Object Identifier) identifier generated by Metax.
An external persistent identifier is assigned directly to the dataset, e.g. "persistent_identifier": "value-of-pid". The provider of the dataset is responsible for creating and maintaining the identifier.
To request a generated persistent identifier, include the desired identifier type in the generate_pid_on_publish field, e.g. "generate_pid_on_publish": "URN", and leave persistent_identifier undefined.
When the dataset is published successfully, the persistent_identifier is set by Metax and the dataset will have "pid_generated_by_fairdata": true.
The persistent identifier of a dataset with a generated PID cannot be changed afterwards.
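The two ways of supplying a PID can be sketched as a small Python helper that builds the PID-related part of a request body. The helper name pid_fields is hypothetical; only the keys persistent_identifier and generate_pid_on_publish and the values "URN" and "DOI" come from the description above.

```python
def pid_fields(external_pid=None, generate=None):
    """Return the PID-related fields for a dataset request body.

    Hypothetical helper: pass either an externally maintained PID
    or the type of PID ("URN" or "DOI") that Metax should generate.
    """
    if (external_pid is None) == (generate is None):
        raise ValueError("give exactly one of external_pid or generate")
    if external_pid is not None:
        # External PID: assigned directly, maintained by the provider.
        return {"persistent_identifier": external_pid}
    if generate not in ("URN", "DOI"):
        raise ValueError("generated PID type must be 'URN' or 'DOI'")
    # Generated PID: persistent_identifier is intentionally left undefined.
    return {"generate_pid_on_publish": generate}


print(pid_fields(external_pid="value-of-pid"))
print(pid_fields(generate="URN"))
```

The returned dictionary would be merged into the rest of the dataset body before sending it.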
Access Rights#
Information about who can access the resource or an indication of its security status. 2
A dataset whose files are publicly available should have access type "Open" from the reference data. Access type is used by Etsin to determine who can download IDA files from the dataset. The supported types are:
- Open: Anyone can access the data.
- Login: Only logged in users can access the data.
- Embargo: Data can be accessed starting from the available date. If available is not set, data will not become accessible.
- Permit: Data can be accessed only after requesting permission. Not implemented yet.
- Restricted: Data access is not allowed.
For the restrictive access types it is recommended to add one or more restriction grounds values to indicate why access to the data is restricted.
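As a rough illustration of the access types listed above, the following Python sketch models when data would be downloadable. It is a simplified reading of the list, not Etsin's actual logic; the function name can_download and its parameters are hypothetical, and access types are compared by their English labels for brevity.

```python
from datetime import date


def can_download(access_type, today, available=None, logged_in=False):
    """Illustrative model of the access types (not Etsin's code)."""
    if access_type == "Open":
        return True
    if access_type == "Login":
        return logged_in
    if access_type == "Embargo":
        # Without an available date the data never becomes accessible.
        return available is not None and today >= available
    # "Permit" is not implemented yet and "Restricted" never allows access.
    return False


today = date(2024, 6, 1)
print(can_download("Open", today))
print(can_download("Embargo", today, available=date(2024, 1, 1)))
print(can_download("Embargo", today))
```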
The available access rights fields are:
Field | key | value |
---|---|---|
Description | description | dict |
Access Type | access_type (required) | reference data from /v3/reference-data/access-types |
Restriction Grounds | restriction_grounds | list of reference data from /v3/reference-data/restriction-grounds |
Available date | available | date when data is available for download |
License | license | list of objects |
License is a special kind of reference data object, as it can have additional metadata properties that are writable by the user:
- custom_url
- description
The custom_url and description fields allow specifying additional information for the selected license. If custom_url is set without providing url, the "Other" license is used by default.
If the dataset has a license that is not in the reference data, choose the best matching "Other" type license in url and add a URL to the actual license as custom_url and/or describe the license in description.
Example
{
  "access_rights": {
    "id": "39f48c3d-1f9b-43a8-9a59-7bf38e1ec2f4",
    "description": {
      "en": "Access is granted freely to anyone in the world, license restrictions apply."
    },
    "license": [
      {
        "custom_url": "https://creativecommons.org/publicdomain/zero/1.0/legalcode",
        "description": {
          "en": "The dataset is available for research, teaching and study."
        },
        "id": "3c37ceab-7138-41f8-8fcd-c93624ffe483",
        "url": "http://uri.suomi.fi/codelist/fairdata/license/code/CC0-1.0",
        "in_scheme": "http://uri.suomi.fi/codelist/fairdata/license",
        "pref_label": {
          "en": "Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication",
          "fi": "Creative Commons Yleismaailmallinen (CC0 1.0) Public Domain -lausuma"
        }
      }
    ],
    "access_type": {
      "id": "20f133ae-6026-4e73-b0f8-6aab18a4963f",
      "url": "http://uri.suomi.fi/codelist/fairdata/access_type/code/open",
      "in_scheme": "http://uri.suomi.fi/codelist/fairdata/access_type",
      "pref_label": {
        "en": "Open",
        "fi": "Avoin"
      }
    }
  }
}
Data Catalog#
This is the id of the Data Catalog object, as shown in the /v3/data-catalogs list.
Actors#
Dataset actors describe persons and organizations that have had a role in creating the dataset.
Each actor has an organization and optionally a person. Actors may have one or more roles from the list of supported roles: creator, publisher, curator, rights_holder, contributor. Each dataset should have at least one creator.
Actors, persons and organizations each have an id value that can be used when creating or updating actors:
- If no id is provided, identical objects are considered the same. Changes to the object will produce a new object that replaces the existing one.
- If an id is provided, all objects with the same id are considered the same. This allows updating existing objects.
- A temporary id in the format #value can be used to create a new object for each different id value. The actual id is generated by Metax.
- An object can be provided with only the id field if it already exists in the dataset or elsewhere in the request.
Actor, person and organization objects are not shared across datasets (except for reference data) and will have different id values in different datasets.
Reference data organizations are listed in /v3/organizations. When updating a dataset, reference data organizations are identified by url or id. Using url is preferred as it will be the same across different Metax instances.
Any changes to reference data organizations in a request are ignored. The fields url and in_scheme have values only for reference data organizations.
As a special case, roles are ignored when determining which actors are the same. Repeated instances of the same object in the actors list will be merged into one actor with all the values in roles.
Example
{
  "actors": [
    {
      "roles": [
        "creator"
      ],
      "organization": {
        "url": "http://uri.suomi.fi/codelist/fairdata/organization/code/09206320",
        "in_scheme": "http://uri.suomi.fi/codelist/fairdata/organization",
        "pref_label": {
          "en": "CSC – IT Center for Science",
          "fi": "CSC - Tieteen tietotekniikan keskus Oy",
          "sv": "CSC – IT Center for Science",
          "und": "CSC - Tieteen tietotekniikan keskus Oy"
        }
      },
      "person": {
        "name": "John Doe"
      }
    },
    {
      "roles": [
        "publisher"
      ],
      "organization": {
        "url": "http://uri.suomi.fi/codelist/fairdata/organization/code/09206320",
        "in_scheme": "http://uri.suomi.fi/codelist/fairdata/organization",
        "pref_label": {
          "en": "CSC – IT Center for Science",
          "fi": "CSC - Tieteen tietotekniikan keskus Oy",
          "sv": "CSC – IT Center for Science",
          "und": "CSC - Tieteen tietotekniikan keskus Oy"
        }
      },
      "person": {
        "name": "Jane Doe"
      }
    }
  ]
}
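The rule that repeated actors are merged with their roles combined can be illustrated with a small local model in Python. This is only a sketch of the described behaviour for objects without an id, where identical objects count as the same; the function merge_actors is hypothetical and is not part of Metax.

```python
import json


def merge_actors(actors):
    """Merge actors that are identical apart from their roles."""
    merged = {}
    for actor in actors:
        rest = {k: v for k, v in actor.items() if k != "roles"}
        # Roles are left out of the key, so they are ignored in the comparison.
        key = json.dumps(rest, sort_keys=True)
        if key not in merged:
            merged[key] = {**rest, "roles": []}
        for role in actor.get("roles", []):
            if role not in merged[key]["roles"]:
                merged[key]["roles"].append(role)
    return list(merged.values())


org = {"url": "http://uri.suomi.fi/codelist/fairdata/organization/code/09206320"}
actors = [
    {"roles": ["creator"], "organization": org, "person": {"name": "John Doe"}},
    {"roles": ["publisher"], "organization": org, "person": {"name": "John Doe"}},
    {"roles": ["curator"], "organization": org, "person": {"name": "Jane Doe"}},
]
for actor in merge_actors(actors):
    print(actor["person"]["name"], actor["roles"])
```

Here the two John Doe entries collapse into one actor holding both roles, while Jane Doe stays separate.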
Publishing datasets#
By default, new datasets are created as drafts and have "state": "draft". Draft datasets are not visible to the general public until they are published. To publish a draft dataset, send a POST request to /v3/datasets/<id>/publish.
To create a dataset and immediately publish it in a single request, create it with "state": "publish".
The state of an existing dataset can only be changed with the dataset publishing endpoint.
Updating published datasets#
Changes to a published dataset are public immediately. It is also possible to create a linked temporary draft dataset with /v3/datasets/<id>/create-draft. This returns a copy of the dataset with a new id and a draft_of field that refers to the original dataset. To apply changes made to the linked draft, send a POST request to /v3/datasets/<id of draft>/publish. This will update the published dataset and delete the linked draft dataset.
Removing a published dataset with DELETE /v3/datasets/<id> will hide it from listings by default, but it will still be available with the ?include_removed=true query parameter.
Deleting a draft removes it permanently.
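The publishing and draft endpoints described above can be sketched with Python's standard library. The requests are only constructed, not sent. The host in BASE_URL and the dataset id "abc" are placeholders, the use of POST for create-draft is an assumption (the text only names the path), and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://metax.example.org"  # hypothetical host


def publish_request(dataset_id):
    """POST /v3/datasets/<id>/publish (also applies a linked draft)."""
    return Request(f"{BASE_URL}/v3/datasets/{dataset_id}/publish", method="POST")


def create_draft_request(dataset_id):
    """Create a linked draft; the POST method is an assumption."""
    return Request(f"{BASE_URL}/v3/datasets/{dataset_id}/create-draft", method="POST")


def removed_dataset_request(dataset_id):
    """GET a removed dataset by adding ?include_removed=true."""
    query = urlencode({"include_removed": "true"})
    return Request(f"{BASE_URL}/v3/datasets/{dataset_id}?{query}", method="GET")


for req in (
    publish_request("abc"),
    create_draft_request("abc"),
    removed_dataset_request("abc"),
):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would send it, once the credentials required by the service have been added.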
Optional properties#
Several optional fields can be used to provide additional information about the dataset:
Field | key | value |
---|---|---|
Issued date | issued | date |
Keywords | keyword | list of str |
Field of Science | field_of_science | list of reference data from /v3/reference-data/fields-of-science |
Language | language | list of reference data from /v3/reference-data/language |
Research Infrastructure | infrastructure | list of reference data from /v3/reference-data/research-infras |
Theme | theme | list of reference data from /v3/reference-data/themes |
Other identifiers | other_identifiers | list of object |
Provenance | provenance | list of object |
Spatial coverage | spatial | list of object |
Temporal coverage | temporal | list of object |
Remote resources | remote_resources | list of object |
Metadata owner | metadata_owner | object |
Language#
A language of the resource. This refers to the natural language used for textual metadata (i.e., titles, descriptions, etc.) of a cataloged resource (i.e., dataset or service) or the textual values of a dataset distribution. 1
The language field is a list of language reference data objects. Only the url field is required to add a language reference. Both of the definitions below are valid objects:
Example
Spatial coverage#
Spatial coverage describes the spatial characteristics of the dataset.
The optional reference field should contain a location from reference data. The custom_wkt field allows specifying geometry as WKT strings in the WGS84 coordinate system.
Example
Temporal coverage#
Temporal coverage describes the temporal characteristics of the resource.
Specify start_date and end_date date values to indicate a period of time. Only one of the values is required, e.g. end_date can be left out to signify an ongoing process.
Example
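A minimal Python sketch of building temporal coverage entries follows. The helper temporal_period is hypothetical, and serializing dates as ISO 8601 strings (YYYY-MM-DD) is an assumption; the keys temporal, start_date and end_date come from the text above.

```python
from datetime import date


def temporal_period(start=None, end=None):
    """Build one temporal coverage entry; at least one date is required."""
    if start is None and end is None:
        raise ValueError("start_date or end_date is required")
    entry = {}
    if start is not None:
        entry["start_date"] = start.isoformat()
    if end is not None:
        entry["end_date"] = end.isoformat()
    return entry


temporal = [
    temporal_period(date(2020, 1, 1), date(2020, 12, 31)),
    temporal_period(start=date(2023, 5, 1)),  # ongoing process: no end_date
]
print({"temporal": temporal})
```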
Relations#
The relation list allows describing other entities that a dataset is related to.
Example
{
  "relation": [
    {
      "entity": {
        "title": {
          "en": "Another dataset"
        },
        "description": {
          "en": "This is the description of a dataset."
        },
        "entity_identifier": "doi:some_dataset",
        "type": {
          "url": "http://uri.suomi.fi/codelist/fairdata/resource_type/code/dataset"
        }
      },
      "relation_type": {
        "url": "http://purl.org/spar/cito/cites"
      }
    }
  ]
}
Remote Resources#
Remote resources allow associating a dataset with data available on the Internet. Dataset files and remote resources are mutually exclusive, so a dataset cannot have both.
Example
{
  "remote_resources": [
    {
      "title": {
        "en": "Documentation for the thing"
      },
      "description": null,
      "use_category": {
        "url": "http://uri.suomi.fi/codelist/fairdata/use_category/code/documentation"
      },
      "download_url": "https://example.com/doc.txt",
      "access_url": null,
      "file_type": {
        "url": "http://uri.suomi.fi/codelist/fairdata/file_type/code/text"
      },
      "mediatype": "text/plain",
      "checksum": "sha256:a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447"
    }
  ]
}
Metadata owner#
Metadata owner describes who owns the dataset. Normally this is determined automatically on dataset creation based on the authenticated user and can be omitted. Special privileges are required to change the automatically assigned values.
Versions#
Revisions are named changes to a single dataset. A single dataset can have as many revisions as it needs. Versions are two or more datasets that represent a series of datasets in the same set. Versions never share the same persistent identifier or id. Versions are created only when the user explicitly requests one; revisions are created automatically.
Enabling versioning on dataset#
The dataset needs to be in a data catalog that has dataset_versioning_enabled set to true.
Working with versions#
Datasets have a dataset_versions field that has links to other versions of the dataset.
Dataset files#
A dataset can have files associated with it, and associated files and directories can have additional dataset-specific metadata. All associated files have to be from the same file storage (e.g. same IDA project).
When viewing a dataset with GET /v3/datasets/<id>, the response includes a summary of its file data in the fileset object. For example:
Example
Browsing dataset files#
Dataset files can be viewed as a flat list or browsed as a directory tree:
- GET /v3/datasets/<id>/files: View flat list of dataset files.
- GET /v3/datasets/<id>/directories: View root directory of dataset files.
- GET /v3/datasets/<id>/directories?path=<path>: View content of path <path>, e.g. ?path=/data/subdir/.
The endpoints support the same parameters as the corresponding /v3/files and /v3/directories endpoints and use pagination by default.
The returned files and directories have dataset-specific metadata included in the dataset_metadata field, or null if metadata is not set.
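Building the directory browsing URL with a path query parameter can be sketched in Python as follows. The helper directory_url and the dataset id "abc" are hypothetical. Percent-encoding the slashes in the path value is standard URL encoding that a server would normally decode, but how Metax treats an unencoded path is not covered here.

```python
from urllib.parse import urlencode


def directory_url(dataset_id, path=None):
    """Return the relative URL for browsing a dataset's directory tree."""
    url = f"/v3/datasets/{dataset_id}/directories"
    if path is not None:
        # urlencode percent-encodes the slashes in the path value.
        url += "?" + urlencode({"path": path})
    return url


print(directory_url("abc"))
print(directory_url("abc", "/data/subdir/"))
```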
Adding, updating or removing dataset files#
Files can be added to and removed from unpublished draft datasets. Files can be added to published datasets only if the dataset is cumulative or if there are no existing files in the dataset.
To modify dataset file associations, include file storage parameters (i.e. storage_service, csc_project) and a directory_actions or file_actions list in the fileset object when creating or updating a dataset.
For example, to update files of an existing dataset, use PATCH /v3/datasets/<id> with {"fileset": <fileset object>}.
The fileset object should look like
{
  "storage_service": <service, e.g. ida>,
  "csc_project": <project>,
  "directory_actions": [
    {
      "pathname": <path, e.g. /data/>,
      "action": <action>,
      "dataset_metadata": <directory metadata object>
    }
  ],
  "file_actions": [
    {
      "id": <id>,
      "action": <action>,
      "dataset_metadata": <file metadata object>
    }
  ]
}
where the optional action value is one of
- "add" (default): Add file or all files in directory to dataset, update dataset_metadata if present.
- "update": Only update dataset_metadata without adding or removing files.
- "remove": Remove file or all files in directory and subdirectories from dataset.
In addition to id, file actions also support identifying files by storage_identifier or pathname.
Note
Operations are performed based on the end result of doing all actions in the listed order but with directory_actions always before file_actions. E.g. removing /data/ and adding /data/subdir/ in the same request will remove everything from /data/ except subdir. Any metadata attached to subdir will remain.
If dataset_metadata is present but set to null, existing metadata will be removed.
Metadata is also removed if the file or directory is no longer in the dataset after the operations.
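The ordering rule in the note can be illustrated with a simplified local model in Python. This is not how Metax implements it: files are represented as plain path strings, file actions identify files by pathname only, and dataset_metadata handling is left out. The function apply_actions and the sample paths are hypothetical.

```python
def apply_actions(dataset_files, storage_files, directory_actions=(), file_actions=()):
    """Return the dataset's file paths after applying the actions.

    Simplified local model: files are plain path strings and file
    actions identify files by pathname.
    """
    result = set(dataset_files)
    # Directory actions always run before file actions.
    for act in directory_actions:
        prefix = act["pathname"]
        action = act.get("action", "add")  # "add" is the default
        if action == "add":
            result |= {f for f in storage_files if f.startswith(prefix)}
        elif action == "remove":
            result -= {f for f in result if f.startswith(prefix)}
        # "update" only touches metadata, which is not modelled here.
    for act in file_actions:
        action = act.get("action", "add")
        if action == "add":
            result.add(act["pathname"])
        elif action == "remove":
            result.discard(act["pathname"])
    return result


storage = {"/data/a.txt", "/data/subdir/b.txt", "/other/c.txt"}
current = {"/data/a.txt", "/data/subdir/b.txt"}
after = apply_actions(
    current,
    storage,
    directory_actions=[
        {"pathname": "/data/", "action": "remove"},
        {"pathname": "/data/subdir/", "action": "add"},
    ],
)
print(sorted(after))
```

Removing /data/ and then adding /data/subdir/ leaves only the files under subdir in the dataset, matching the example in the note.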
The response fileset object will include the normal fileset summary and additional values added_files_count and removed_files_count, which tell how many files were added and how many files were removed by the operations.
Cumulative datasets#
Cumulative datasets are datasets where new files can be added after publication without creating a new dataset version. The cumulation status can be changed using the cumulative_state parameter. It has the following possible values:
- 0 (default): Non-cumulative. Files cannot be added after publication.
- 1: Cumulation active. Files can be added even after publication.
- 2: Cumulation closed. Files can no longer be added.
If a dataset has been published, the only allowed change to cumulative_state is from active to closed.
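The transition rule for cumulative_state can be written as a small Python check. The function name is hypothetical, and allowing any change for unpublished drafts is an assumption, since the text only restricts published datasets.

```python
NOT_CUMULATIVE, ACTIVE, CLOSED = 0, 1, 2


def can_change_cumulative_state(published, old, new):
    """Check a cumulative_state change against the rule for published datasets."""
    if old == new:
        return True  # nothing changes
    if not published:
        # Assumption: drafts may change freely; only published datasets are restricted.
        return True
    return old == ACTIVE and new == CLOSED


print(can_change_cumulative_state(True, ACTIVE, CLOSED))
print(can_change_cumulative_state(True, NOT_CUMULATIVE, ACTIVE))
```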
Dataset file and directory metadata fields#
The following fields are supported in dataset-specific file and directory metadata:
Field | key | value |
---|---|---|
Title | title | str |
Description | description | str |
File type (files only) | file_type | reference data from /v3/reference-data/file-type |
Use category | use_category | reference data from /v3/reference-data/use-categories |
Examples#
Creating minimal dataset with files from directory#
Example
POST /v3/datasets
{
  "data_catalog": "urn:nbn:fi:att:data-catalog-ida",
  "title": {
    "en": "Test dataset"
  },
  "fileset": {
    "csc_project": "test_project",
    "storage_service": "ida",
    "directory_actions": [
      {
        "pathname": "/data/",
        "dataset_metadata": {
          "title": "Test results",
          "use_category": {
            "url": "http://uri.suomi.fi/codelist/fairdata/use_category/code/outcome"
          }
        }
      }
    ]
  }
}