How the portal works
The data model we use
DCAT-AP is a specification based on the W3C Data Catalog Vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use is to give the public sector access to better data across borders and sectors. This is achieved by exchanging dataset descriptions among data portals.
The DCAT-AP specification was a joint initiative of the Directorate-General for Communications Networks, Content and Technology, the Publications Office (OP) and the Interoperable Europe programme. It was elaborated by a multidisciplinary working group with representatives from 16 EU Member States, some European institutions, and the United States.
DCAT-AP version used
The portal currently uses DCAT-AP version 2.1.0. This version brings the following improvements:
- improved Unified Modelling Language (UML) diagram in accordance with the agreed profile reading;
- improved coherency between the UML diagram and the specification text;
- a usage guide on the relationships between dataset, distribution and data service, and the consequences of this clarification on the model;
- various editorial fixes;
- consolidation of the SHACL shapes;
- minor specification updates:
  - introduction of the named authority list (NAL) planned-availability, NAL access-right and NAL dataset-type,
  - lift of the max-cardinality for dataset dct:type,
  - lift of the max-cardinality for property dct:creator,
  - allow other than SHA1 checksum algorithms,
  - the range for temporal properties is enlarged to contain any temporal XSD (XML schema definitions) datatype,
  - alignment of usage notes for used property adms:status with W3C DCAT,
  - addition of max-cardinality 1 for dcat:temporalResolution and dcat:spatialResolutionInMeters to align with the usage note.
A complete list of the issues and their resolutions can be found on the DCAT-AP GitHub.
In September 2021, the Interoperable Europe programme of the European Commission started the minor release cycle for DCAT-AP, aiming to address change requests received from users through the GitHub repository. DCAT-AP working group members and users were invited to share their comments by 15 November 2021.
Good practices for metadata
1. Mis-encoded DOI
In the EU ODP, the DOI had been entered in a way that produced two identifier entries in the data:
<dcterms:identifier>10.2830/75445</dcterms:identifier>
<dcterms:identifier>DOI:</dcterms:identifier>
In the EU ODP legacy, the DOI was shown without a link.
How to fix that problem
It should be encoded like this in the EU ODP legacy:
Result of the fix
- in the EU ODP legacy: the link works;
- in data.europa.eu: the DOI is not shown yet;
- in the data (directly from the EU ODP legacy):
<rdf:Description rdf:about="http://www.w3.org/ns/adms#Identifier/34ef9846-cbdf-4503-81de-afd3569484ff">
  <rdf:type rdf:resource="http://www.w3.org/ns/adms#Identifier"/>
  <skos:notation rdf:datatype="http://publications.europa.eu/resource/authority/notation-type/DOI">10.2830/75445</skos:notation>
  <skos:notation rdf:datatype="http://purl.org/spar/datacite/doi">10.2830/75445</skos:notation>
  <adms:schemeAgency>Publications Office</adms:schemeAgency>
  <dcterms:creator rdf:resource="http://publications.europa.eu/resource/authority/corporate-body/PUBL"/>
  <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2021-05-04T14:04:29</dcterms:issued>
</rdf:Description>
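A check along the following lines could flag the mis-encoding described in this item before it reaches the portal. This is a minimal sketch, not the portal's own validation: the function name `find_misencoded_dois`, the simplified DOI pattern and the sample record (with its `example.org` URI and `internal-id-42` value) are assumptions made for illustration. It only inspects `dcterms:identifier` literals in an RDF/XML fragment.

```python
import re
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
# Assumed, simplified DOI pattern: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def find_misencoded_dois(rdf_xml: str) -> list[str]:
    """Return dcterms:identifier values that look like DOI fragments."""
    root = ET.fromstring(rdf_xml)
    suspicious = []
    for element in root.iter(f"{{{DCTERMS}}}identifier"):
        value = (element.text or "").strip()
        # A bare DOI or a dangling "DOI:" label suggests the DOI was typed
        # into the plain identifier field instead of a typed adms:Identifier.
        if DOI_PATTERN.match(value) or value.upper().startswith("DOI:"):
            suspicious.append(value)
    return suspicious


# Hypothetical record reproducing the two problematic entries shown above.
sample = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/dataset/1">
    <dcterms:identifier>10.2830/75445</dcterms:identifier>
    <dcterms:identifier>DOI:</dcterms:identifier>
    <dcterms:identifier>internal-id-42</dcterms:identifier>
  </rdf:Description>
</rdf:RDF>
"""
print(find_misencoded_dois(sample))
```

Any value reported by this check would then need to be re-entered by hand in the dedicated DOI field, so that it is exported as an `adms:Identifier`.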
Here is the list of datasets and related titles for Publications Office as publisher (PUBL):
Fix: change the title.
3. Assigning license
The choice of a license should be discussed with the data provider. Following the EC decision from 2019, Directorates-General should try to publish their reusable content under CC BY, but for publications ordered before 2019 the situation is less clear. In any case, the decision is theirs. See the reuse guidelines on CC BY.
4. Assigning DOI
DOIs for datasets are minted upon request. If the publication already has a DOI, there is no need to assign a new one (always check first: reports and other documents very often have one already). When assigning DOIs to datasets with data, we should double-check with the provider.
[Provision of DOI Data Services](https://myintracomm-collab.ec.europa.eu/networks/EDPSPECS/_layouts/15/WopiFrame.aspx?sourcedoc=/networks/EDPSPECS/Shared%20Documents/AO-10801_Annexes/Annexes%20from%20Annex%2012/Specifications(DOI).(03.08.2018.(v.0.2).docx&action=default)
5. Duplication of contact address
http://data.europa.eu/88u/dataset/eu-open-data-portal-api
The same thing is visible on the EU ODP legacy:
In fact, the problem is that we have a duplication of a landing page:
https://data.europa.eu/euodp/en/data/dataset/edit/eu-open-data-portal-api
Fix: delete the duplicate HTTP address. (I also took the opportunity to update the contact form address and delete the '+' at the end of the email address).
Result:
6. Inconsistent use of the DEPRECATED tag in titles
<dcterms:title xml:lang="mt">[DEPRECATED] Data tal-Coronavirus COVID-19</dcterms:title>
<dcterms:title xml:lang="de">[DEPRECATED] COVID-19 – Daten zum Coronavirus</dcterms:title>
.../...
<dcterms:title xml:lang="sl">[DEPRECATED] Podatki o koronavirusu COVID-19</dcterms:title>
<dcterms:title xml:lang="cs">[DEPRECATED] Koronavirus COVID-19 – data</dcterms:title>
<dcterms:title xml:lang="lt">[DEPRECATED] Koronaviruso (COVID-19) duomenys</dcterms:title>
<dcterms:title xml:lang="bg">[DEPRECATED] Данни за коронавируса COVID-19</dcterms:title>
<dcterms:title xml:lang="en">[DEPRECATED] COVID-19 Coronavirus data - daily (up to 14 December 2020)</dcterms:title>
<dcterms:title xml:lang="fr">OBSOLETE! Données relatives au coronavirus COVID-19</dcterms:title>
<dcterms:title xml:lang="nl">[DEPRECATED] Data over het coronavirus (COVID-19)</dcterms:title>
<dcterms:title xml:lang="sk">[DEPRECATED] Údaje o ochorení koronavírusom COVID-19</dcterms:title>
Observation: the 'fr' value is different (OBSOLETE!).
Fix: change it to [DEPRECATED] (not yet done).
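This kind of inconsistency can be spotted automatically. The sketch below assumes the titles of a dataset are available as a mapping from language code to title. The function name `inconsistent_title_languages` is hypothetical and not part of any portal tooling.

```python
DEPRECATED_TAG = "[DEPRECATED]"


def inconsistent_title_languages(titles: dict[str, str]) -> list[str]:
    """Return language codes whose title lacks the tag used by the others.

    If no title carries the tag, the dataset is not treated as deprecated
    and an empty list is returned.
    """
    tagged = {lang for lang, title in titles.items()
              if title.strip().startswith(DEPRECATED_TAG)}
    if not tagged:
        return []
    return sorted(lang for lang in titles if lang not in tagged)


# Three of the titles listed above, keyed by their xml:lang value.
titles = {
    "en": "[DEPRECATED] COVID-19 Coronavirus data - daily (up to 14 December 2020)",
    "fr": "OBSOLETE! Données relatives au coronavirus COVID-19",
    "nl": "[DEPRECATED] Data over het coronavirus (COVID-19)",
}
print(inconsistent_title_languages(titles))
```

With these three titles, only the French one is reported, which matches the observation above.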
7. Link of file is wrong
Observation: the link is wrong. The correct link is: https://data.europa.eu/euodp/repository/CDT/OP_Covid19_IATE_2872020.xlsx.gz
Fix: correct the URL of the link.
8. Links to internal data.europa.eu files not in HTTPS
All data.europa.eu websites and services should use HTTPS where possible. The list of files available for download on the site shows that 95% of them use https, but some still use http:
http://data.europa.eu/euodp/data/storage/f/2014-06-24T133025/influenza-surveillance-overview-11-oct-2013.pdf
http://data.europa.eu/euodp/data/storage/f/2014-06-24T135149/131018-SUR-Weekly-Influenza-Surveillance-Overview.pdf
http://data.europa.eu/euodp/data/storage/f/2014-06-24T135352/influenza-surveillance-overview-15-nov-2013.pdf
http://data.europa.eu/euodp/data/storage/f/2014-06-24T135513/influenza-surveillance-overview-22-nov-2013.pdf
http://data.europa.eu/euodp/data/storage/f/2014-06-24T135632/influenza-weekly-surveillance-overview-29-nov-2013.pdf
http://data.europa.eu/euodp/data/storage/f/2014-06-24T135738/influenza-weekly-surveillance-overview-6-dec-2013.pdf
http://data.europa.eu/euodp/data/storage/f/2014-06-24T135908/influenza-surveillance-overview-13-dec-2013.pdf
Fix: in the EU ODP legacy, change 'http' to 'https' in the file links of the affected distributions.
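When many links are affected, the rewrite can be scripted before the values are re-entered in the back office. This is a minimal sketch: the function name `upgrade_to_https` is hypothetical, and it assumes that only links hosted on `data.europa.eu` itself should be changed, since other hosts may not serve HTTPS.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only links hosted on the portal itself are rewritten.
PORTAL_HOST = "data.europa.eu"


def upgrade_to_https(url: str) -> str:
    """Return the URL with an https scheme if it is an http portal link."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname == PORTAL_HOST:
        return urlunsplit(parts._replace(scheme="https"))
    # Links on other hosts, or already in https, are returned unchanged.
    return url


old_link = ("http://data.europa.eu/euodp/data/storage/f/2014-06-24T133025/"
            "influenza-surveillance-overview-11-oct-2013.pdf")
print(upgrade_to_https(old_link))
```

It is worth confirming that each rewritten link still resolves before saving it.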
9. Link to internal file (in EU ODP) while the file is already on the website of the data provider
Problem: the 'download URL' points to a file stored in the EU ODP, while the 'access URL' is already correct and points to the data provider's website.
Fix: copy the 'access URL' into the 'download URL'.
10. Type of dataset
The list of dataset types is old and not exhaustive. It was discussed in the DCAT-AP group, but the conclusion was that 'types' have various dimensions and it is impossible to create a list that would cover all cases. Some types can still be useful because they allow us to group datasets, for example statistical datasets.
11. Duplication of dataset with a DOI
In the EU ODP legacy, duplicating a dataset that has a DOI also duplicates the DOI. This is a bug in the old system. We will ask the consortium to add a rule to the new back office that prevents DOIs from being copied.
Our advice: before duplicating a dataset, check whether it has a DOI. If it has one, do not duplicate it; create the new dataset from scratch instead.
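Existing duplicates can be found by grouping datasets by DOI. The sketch below assumes the DOIs have already been extracted into a mapping from dataset identifier to DOI (or `None`). The function name `datasets_sharing_a_doi` and the dataset identifiers in the sample are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def datasets_sharing_a_doi(
    dataset_dois: dict[str, Optional[str]],
) -> dict[str, list[str]]:
    """Map each DOI used by more than one dataset to those dataset ids."""
    by_doi = defaultdict(list)
    for dataset_id, doi in dataset_dois.items():
        if doi:
            # DOIs are case-insensitive, so compare them in lower case.
            by_doi[doi.strip().lower()].append(dataset_id)
    return {doi: sorted(ids) for doi, ids in by_doi.items() if len(ids) > 1}


# Hypothetical identifiers: a dataset, a copy that inherited its DOI,
# and an unrelated dataset without a DOI.
dataset_dois = {
    "dataset-a": "10.2830/75445",
    "dataset-a-copy": "10.2830/75445",
    "dataset-b": None,
}
print(datasets_sharing_a_doi(dataset_dois))
```

Every DOI in the result points to a group of datasets, of which at most one should keep it.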
12. Change of publisher
- First we need to verify if the dataset changed ownership. If the same dataset moved from one publisher to another, the best practice is to modify the value for the publisher. Following the creation of sub-catalogues by publisher for the former ODP catalogue, the following steps have to be followed in order to avoid creating duplicates.
  - Save the dataset as 'DRAFT'. This will result in the dataset being removed from the portal upon re-harvesting (c. 2 hours).
  - Verify that the dataset is no longer visible on the public portal.
  - In the back office, change the publisher to the new one and save as published (allow time for re-harvesting for changes to show on portal).
- If the dataset did not change ownership but the original publisher no longer exists, you may leave the dataset with the original publisher, preferably adding '[DEPRECATED]' in the title field. Note that the authority table for corporate bodies maintains the values for deprecated bodies and these are listed with an end date and their status is set to deprecated.
13. One dataset should in fact be several datasets
See https://github.com/SEMICeu/DCAT-AP/blob/master/releases/2.1.0/usageguide-dataset-distribution-dataservice.md: 'there might be need for a granularity clarification between datasets and distributions. Commonly, at first sight, it is expected that all distributions of a dataset are indentical [sic] in content, only differing in the representation of the data.'
14. Duplication of distributions
In https://data.europa.eu/data/datasets/latest-asylum-trends?locale=en, the same distribution appears twice, once as a plain 'distribution' and once as a 'visualisation'. Both point to the same link.
It is therefore better to keep only one visualisation (less duplicate data for the end user).
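Such duplicates can be detected by grouping the distributions of a dataset by their link. This is a minimal sketch under stated assumptions: each distribution is represented as a dictionary with hypothetical `type` and `access_url` keys, the function name `duplicated_links` is invented for illustration, and the sample URLs are placeholders rather than the real links of the dataset mentioned above.

```python
from collections import defaultdict


def duplicated_links(distributions: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map each link used by several distributions to their types."""
    by_url = defaultdict(list)
    for distribution in distributions:
        by_url[distribution["access_url"].strip()].append(distribution["type"])
    return {url: kinds for url, kinds in by_url.items() if len(kinds) > 1}


# Placeholder data: the first two entries share one link.
distributions = [
    {"type": "distribution", "access_url": "https://example.org/asylum-trends"},
    {"type": "visualisation", "access_url": "https://example.org/asylum-trends"},
    {"type": "distribution", "access_url": "https://example.org/other.csv"},
]
print(duplicated_links(distributions))
```

Each link in the result is a candidate for keeping a single entry and removing the others.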