Harvesting
EU institutions, agencies and other bodies, and the Member States (the 'data providers') are autonomous in publishing their open data. Harvesting is the recommended method for publishers who manage their data in a data catalogue.
Checklist for your portal
Data.europa.eu harvests openly shared information available on public-sector open data portals. If you want your portal or website to be harvested by data.europa.eu, please share your answers to the following questions via the contact form. When submitting the form, please select 'Get harvested by data.europa.eu' when answering the question 'Please choose an issue type'. Once we receive your request, we will assess it and keep you informed about its status.
Do you consent to data.europa.eu sending emails to the catalogue's publisher about harvesting activities?
Please provide us with the following information about your catalogue.
- Uniform resource locator (URL) to interface (REST, CSW...).
- URL to homepage.
- Title of the catalogue.
- Description of the catalogue.
- Publisher of the catalogue.
- Email address of the catalogue.
- Default language of the catalogue's datasets.
- How often can/should the site be harvested (e.g. once a week)?
- Are there any times when the site should not be harvested (e.g. scheduled maintenance)?
Technical requirements/constraints
The harvester accesses the endpoints of the catalogues mostly on a daily basis, depending on the size of the catalogue. We process the collected data overnight. We transform every incoming format to DCAT-AP 2.1.1, and a hash is built over every harvested dataset. This hash value is compared with the existing hash value before a dataset is updated in our triplestore; an update takes place only when the two values differ. The harvester is configured specifically for each harvested portal.
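The hash comparison described above can be illustrated with a minimal sketch. The hash algorithm (SHA-256), the in-memory `stored_hashes` dictionary and the function names are assumptions made for illustration; the actual harvester implementation is not documented here.

```python
import hashlib


def dataset_hash(serialised: str) -> str:
    """Return a hex digest over the serialised metadata of one dataset.

    SHA-256 is an assumption; the text does not name the algorithm used.
    """
    return hashlib.sha256(serialised.encode("utf-8")).hexdigest()


def needs_update(stored_hashes: dict, dataset_id: str, serialised: str) -> bool:
    """Return True (and remember the new hash) only if the metadata changed."""
    new_hash = dataset_hash(serialised)
    if stored_hashes.get(dataset_id) == new_hash:
        return False  # identical hash: leave the stored dataset untouched
    stored_hashes[dataset_id] = new_hash
    return True
```

With this approach an unchanged dataset costs one hash computation and one lookup, and the store is written to only when the metadata actually differs.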
Access to harvested sites
Authentication
Some source sites require authentication, meaning that the harvester (here, data.europa.eu) needs a login name and password before it can access the data.
If this applies to your portal, please state this in your message when using our contact form.
API access to harvested site
For harvesting to take place, the source site needs to have in place one of the interfaces described in detail in the 'Interfaces supported for harvesting' section.
FTP access to harvested site
Data.europa.eu does not support FTP for downloading datasets from a source site.
Interfaces supported for harvesting
The following sections describe the list of interfaces that data suppliers (e.g. national portals, public data portals in the Member States, portals from international organisations etc.) must have in place in order to be harvested by data.europa.eu.
The main supported interfaces are the following:
- DCAT-AP / Comprehensive Knowledge Archive Network (CKAN) compliant sites (for 'normal' datasets);
- CSW/INSPIRE catalogue services (for geospatial datasets);
- OpenSearch (GEO/EOP) (for geospatial datasets).
DCAT-AP
Providing data via a DCAT-AP interface is the officially recommended method and will always be preferred for harvesting.
General remarks
DCAT-AP is a metadata specification for describing public sector datasets in Europe. It is based on the Data Catalog Vocabulary (DCAT) 1. The datasets are provided as linked data and can be represented in multiple ways. For the harvesting process, any common representation such as RDF/XML, N-Triples or Turtle is allowed.
Metadata model
For general information on the metadata model, please refer to the official documentation 2. The respective qualifiers (mandatory, recommended and optional) need to be adhered to. The following is an example dataset in RDF/XML that contains the mandatory properties together with a number of recommended and optional ones.
<?xml version="1.0"?>
<rdf:RDF
xmlns:edp="https://europeandataportal.eu/voc#"
xmlns:dct="http://purl.org/dc/terms/"
xmlns:spdx="http://spdx.org/rdf/terms#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dqv="http://www.w3.org/ns/dqv#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:schema="http://schema.org/"
xmlns:dcat="http://www.w3.org/ns/dcat#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dcatapde="http://dcat-ap.de/def/dcatde/">
<dcat:CatalogRecord rdf:about="http://data.europa.eu/88u/record/ded24b58-a5ab-4d34-8603-2e5b2131a6a2">
<edp:transStatus rdf:resource="https://europeandataportal.eu/voc#TransInProcess"/>
<foaf:primaryTopic>
<dcat:Dataset rdf:about="http://data.europa.eu/88u/dataset/ded24b58-a5ab-4d34-8603-2e5b2131a6a2">
<dct:temporal>
<dct:PeriodOfTime>
<schema:endDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-29T11:06:06.094165</schema:endDate>
<schema:startDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-05-29T11:05:39.811259</schema:startDate>
</dct:PeriodOfTime>
</dct:temporal>
<dct:publisher>
<foaf:Organization rdf:about="https://opendata.schleswig-holstein.de/organization/5b6d12d7-09c0-4bfc-b026-587d2a7d282e">
<foaf:name>Kreis Rendsburg-Eckernförde</foaf:name>
</foaf:Organization>
</dct:publisher>
<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-29T11:06:06.093824</dct:modified>
<dcat:keyword>corona</dcat:keyword>
<dct:title>Corona-Daten Rendsburg-Eckernförde</dct:title>
<dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-29T11:06:06.093824</dct:issued>
<dcat:keyword>covid-19</dcat:keyword>
<dct:language rdf:resource="http://publications.europa.eu/resource/authority/language/DEU"/>
<dcatapde:contributorID rdf:resource="http://dcat-ap.de/def/contributors/schleswigHolstein"/>
<dcat:distribution>
<dcat:Distribution rdf:about="http://data.europa.eu/88u/distribution/a5be938b-637e-48a2-84c4-cabf323af6ee">
<dcat:downloadURL rdf:resource="https://opendata.schleswig-holstein.de/dataset/86178d63-3d83-4dc6-8e0a-98315ebdfadb/resource/d0d1e71b-824a-4b59-bf09-7cb18adb8fef/download/corona-rendsburg-eckernfoerde.json"/>
<dcat:mediaType>application/json</dcat:mediaType>
<spdx:checksum>
<spdx:Checksum>
<spdx:checksumValue rdf:datatype="http://www.w3.org/2001/XMLSchema#hexBinary"
>623cdad43e99e1d3c2bb9ba6df8ff489</spdx:checksumValue>
<spdx:algorithm rdf:resource="http://dcat-ap.de/def/hashAlgorithms/md/5"/>
</spdx:Checksum>
</spdx:checksum>
<dct:format rdf:resource="http://publications.europa.eu/resource/authority/file-type/JSON"/>
<dct:title>corona-rendsburg-eckernfoerde.json</dct:title>
<dcatapde:licenseAttributionByText>Kreis Rendsburg-Eckernförde</dcatapde:licenseAttributionByText>
<dct:identifier>https://opendata.schleswig-holstein.de/dataset/86178d63-3d83-4dc6-8e0a-98315ebdfadb/resource/d0d1e71b-824a-4b59-bf09-7cb18adb8fef</dct:identifier>
<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-29T09:06:06.400937</dct:modified>
<dcat:accessURL rdf:resource="https://opendata.schleswig-holstein.de/dataset/86178d63-3d83-4dc6-8e0a-98315ebdfadb/resource/d0d1e71b-824a-4b59-bf09-7cb18adb8fef/download/corona-rendsburg-eckernfoerde.json"/>
<dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-29T09:06:06.457996</dct:issued>
<dcat:byteSize rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
>18006</dcat:byteSize>
<dct:rights rdf:resource="http://dcat-ap.de/def/licenses/cc-by/4.0"/>
<dct:license rdf:resource="http://dcat-ap.de/def/licenses/cc-by/4.0"/>
</dcat:Distribution>
</dcat:distribution>
<dct:isVersionOf rdf:resource="https://opendata.schleswig-holstein.de/dataset/f1bfb6ac-6ca9-426d-880c-e7a1257bb0d1"/>
<dcat:theme rdf:resource="http://publications.europa.eu/resource/authority/data-theme/HEAL"/>
<dct:spatial>
<dct:Location rdf:about="http://dcat-ap.de/def/politicalGeocoding/districtKey/01058">
<skos:prefLabel>Kreis Rendsburg-Eckernförde</skos:prefLabel>
</dct:Location>
</dct:spatial>
<dct:accessRights rdf:resource="http://publications.europa.eu/resource/authority/access-right/PUBLIC"/>
<dct:identifier>ded24b58-a5ab-4d34-8603-2e5b2131a6a2</dct:identifier>
<dct:description>CORONA - Aktuelle Situation im Kreis Rendsburg-Eckernförde

Pro Gemeinde sind folgende Daten verzeichnet:

- Positiv Getestete gesamt 
- Aktuell Infizierte 
- Aktuell Infizierte pro 1.000 Einwohner
- Genesene 
- Verstorbene

Der Eintrag für eine Gemeinde sieht folgendermaßen aus:

`'010585833054': { amount_pt: 2.699698269017, amount_t: 149, amount_i: 17, amount_d: 1, amount_h: 131 },`

Als Schlüssel wird der [Regionalschlüssel](https://www.dcat-ap.de/def/politicalGeocoding/regionalKey/) verwendet. Die Properties enthalten folgende Daten:

- `amount_pt` - Aktuell Infizierte pro 1.000 Einwohner
- `amount_t` - Positiv Getestete gesamt
- `amount_i` - Aktuell Infizierte 
- `amount_d` - Verstorbene
- `amount_h` - Genesene

Interaktiv und grafisch sind die Daten auf dem [Corona-Dashboard des Kreises](https://covid19dashboardrdeck.aco/) zu sehen.</dct:description>
</dcat:Dataset>
</foaf:primaryTopic>
<dqv:hasQualityMetadata rdf:resource="http://data.europa.eu/88u/metrics/ded24b58-a5ab-4d34-8603-2e5b2131a6a2"/>
<dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-31T00:06:28Z</dct:issued>
<dct:identifier>ded24b58-a5ab-4d34-8603-2e5b2131a6a2</dct:identifier>
<edp:originalLanguage>de</edp:originalLanguage>
<dct:creator rdf:resource="http://piveau.io"/>
<edp:transIssued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-31T00:06:28Z</edp:transIssued>
<spdx:checksum>
<spdx:Checksum>
<spdx:algorithm rdf:resource="http://spdx.org/rdf/terms#checksumAlgorithm_md5"/>
<spdx:checksumValue>ef0676bea69a09053ac2ba52e23f271a</spdx:checksumValue>
</spdx:Checksum>
</spdx:checksum>
<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2022-07-31T00:06:28Z</dct:modified>
</dcat:CatalogRecord>
</rdf:RDF>
Categorisation
The data.europa.eu categories are based on the EU controlled data theme vocabulary. The following are the categories used on data.europa.eu.
Code | Category |
---|---|
AGRI | Agriculture, fisheries, forestry and food |
ECON | Economy and finance |
EDUC | Education, culture and sport |
ENER | Energy |
ENVI | Environment |
GOVE | Government and public sector |
HEAL | Health |
INTR | International issues |
JUST | Justice, legal system and public safety |
REGI | Regions and cities |
SOCI | Population and society |
TECH | Science and technology |
TRAN | Transport |
When providing data, publishers should always use these terms to thematically categorise the datasets. If a different vocabulary is used, it should be aligned (i.e. mapped) to these categories.
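As an illustration of such an alignment, the following sketch maps local category names to the theme codes listed above and expands them to URIs of the form used in the RDF/XML example (`.../authority/data-theme/HEAL`). The local category names and the function name are hypothetical.

```python
from typing import Optional

# Base URI as it appears in the dcat:theme value of the RDF/XML example.
DATA_THEME_BASE = "http://publications.europa.eu/resource/authority/data-theme/"

# Hypothetical local categories mapped to the data theme codes listed above.
LOCAL_TO_EU_THEME = {
    "public-health": "HEAL",
    "mobility": "TRAN",
    "climate": "ENVI",
}


def theme_uri(local_category: str) -> Optional[str]:
    """Return the data theme URI for a local category, or None if unmapped."""
    code = LOCAL_TO_EU_THEME.get(local_category.strip().lower())
    return DATA_THEME_BASE + code if code else None
```

Returning `None` for unmapped categories makes gaps in the mapping visible, so they can be reported rather than silently assigned to a wrong theme.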
Requests
The harvester currently supports three ways of obtaining DCAT-AP metadata: harvesting from a source that complies with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) 2, reading a dump file that contains the RDF/XML representation of the datasets, or reading DCAT-AP directly from a SPARQL endpoint. If datasets are provided as a dump file, it is recommended to split the file into pages, for example by using the Hydra Core Vocabulary 3.
For OAI-PMH-compliant sources, only the verb 'ListRecords' is used.
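A minimal sketch of how 'ListRecords' request URLs can be built is shown below. The endpoint URL and the metadata prefix value (`dcat_ap`) are placeholders, since the prefix offered by a given source is not specified here. In OAI-PMH, a follow-up request carries only the `resumptionToken` returned with the previous page.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords URL for the first or a follow-up page."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no metadataPrefix alongside it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and prefix, for illustration only.
first_page = list_records_url("https://example.org/oai", "dcat_ap")
```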
Responses
As indicated above, the response must be DCAT-AP-compliant to be understood by the harvesting component.
Error handling
The OAI-PMH protocol provides methods for error handling that the harvester can understand. When using this protocol, these error methods should be used.
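OAI-PMH reports errors as `error` elements with a `code` attribute inside the response document. The sketch below shows how a client might read them; the sample response is made up for illustration and does not come from a real endpoint.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up response illustrating an expired resumption token.
SAMPLE_ERROR_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2022-07-31T00:06:28Z</responseDate>
  <request verb="ListRecords">https://example.org/oai</request>
  <error code="badResumptionToken">The resumption token has expired</error>
</OAI-PMH>"""


def oai_errors(response_xml: str) -> list:
    """Return (code, message) pairs for every error element in a response."""
    root = ET.fromstring(response_xml)
    return [(err.get("code"), (err.text or "").strip())
            for err in root.findall(OAI_NS + "error")]
```

An empty list means the response carried no protocol-level error.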
Service information for integration
As stated above, a categorisation mapping should be provided. Apart from that, the URL for the OAI-PMH endpoint or the dump file is needed.
CKAN API
The open-source data portal platform CKAN 4 is widely used for building open data platforms. Its RPC-style 5 API (action API) is supported as an interface for data suppliers of data.europa.eu. There are two ways of using this interface.
- The data supplier uses CKAN to provide its open data metadata. It is important that the CKAN version in use supports the action API 6. The legacy APIs of CKAN are not supported.
- The data supplier offers a CKAN-compliant API, where the necessary endpoints reproduce the exact API behaviour.
Requests and responses
Only the 'package_search' API endpoint is needed in order to harvest the metadata. Its specifications are described in detail in the official documentation 7. This endpoint is used to get the metadata in a paginated way. Therefore it accepts query parameters in a request and returns a dictionary with datasets as a result. The high-level use of this endpoint has to be offered as specified in the CKAN documentation.
Example call: GET http://www.example.com/api/3/action/package_search?rows=50
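Paginated retrieval through 'package_search' can be sketched as follows, using the `rows` and `start` query parameters of the CKAN action API and the `count` and `results` fields of the response. The function names and the injectable `fetch` argument are illustrative choices; this is not the harvester's actual code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url: str, rows: int, start: int) -> dict:
    """Call package_search once and return the decoded JSON response."""
    query = urlencode({"rows": rows, "start": start})
    with urlopen(f"{base_url}/api/3/action/package_search?{query}") as resp:
        return json.load(resp)


def harvest_all(base_url: str, rows: int = 50, fetch=fetch_page):
    """Yield every dataset by walking through the result pages."""
    start = 0
    while True:
        payload = fetch(base_url, rows, start)
        if not payload.get("success"):
            raise RuntimeError("package_search reported a failure")
        result = payload["result"]
        datasets = result["results"]
        if not datasets:
            break  # empty page: nothing left to read
        yield from datasets
        start += len(datasets)
        if start >= result["count"]:
            break  # all datasets reported by 'count' were read
```

Passing a different `fetch` function makes it possible to exercise the paging logic without network access.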
Metadata model
Although the CKAN API can be used as is, the basic CKAN data schema was extended and modified to meet the requirements of the underlying data structure (DCAT-AP) of data.europa.eu. The response of the 'package_search' action exposes a 'results' field, which holds a list of dictised datasets (datasets serialised as dictionaries). The data structure of such a dataset differs from that of a plain CKAN installation.
Notes:
- Bold fields are CKAN standard. Further information in the official documentation.
- Type specifications according to official JSON standard (http://json.org/).
- Besides the mandatory fields, the field names and types are not strict, but a data supplier has to make sure an obvious mapping is possible.
- For a detailed explanation of each field, refer to the DCAT-AP specifications.
Dataset schema
The following fields are mandatory.
Field | Type | DCAT-AP dataset equivalent |
---|---|---|
title | string | dct:title |
notes | string | dct:description |
The following fields are optional but highly recommended.
Field | Type | DCAT-AP dataset equivalent |
---|---|---|
contact_point | array of objects (allowed members: type, name, email, resource) | dcat:contactPoint |
tags | array of objects | dcat:keyword |
publisher | object | dct:publisher |
groups | array of objects – the name of each group needs to fit the official categorisation | dcat:theme |
resources | array of objects (see distribution schema) | dcat:distribution |
The following fields are optional.
Field | Type | DCAT-AP dataset equivalent |
---|---|---|
conforms_to | array of objects (allowed members: label, resource) | dct:conformsTo |
creator | object | dct:creator |
accrucal_periodicity | object | dct:accrualPeriodicity |
identifier | object | dct:identifier |
url | string | dcat:landingPage |
language | array of objects (allowed members: label, resource) | dct:language |
other_identifier | object | adms:identifier |
issued | string | dct:issued |
dcat_spatial | array of objects (allowed members: label, resource) | dct:spatial |
temporal | array of objects (allowed members: start_date, end_date) | dct:temporal |
modified | string | dct:modified |
version_info | string | owl:versionInfo |
version_notes | string | adms:versionNotes |
provenance | array of objects (allowed members: label, resource) | dct:provenance |
source | array of strings | dct:source |
access_rights | object | dct:accessRights |
has_version | array of strings | dct:hasVersion |
is_version_of | array of strings | dct:isVersionOf |
relation | array of strings | dct:relation |
page | array of strings | foaf:page |
sample | array of strings | adms:sample |
dct_type | string | dct:type |
Distribution schema
The following fields are mandatory.
Field | Type | DCAT-AP distribution equivalent |
---|---|---|
url | string | dcat:accessURL |
The following fields are optional but highly recommended.
Field | Type | DCAT-AP distribution equivalent |
---|---|---|
description | string | dct:description |
format | string | dct:format |
license | object | dct:license |
Note that the list of licences recognised by data.europa.eu's DCAT-AP parser is available online (https://data.europa.eu/en/training/licensing-assistant). This is also used by our metadata quality assessment (MQA) tool 8 for assessing the data providers' performance in using known licences.
The following fields are optional.
Field | Type | DCAT-AP distribution equivalent |
---|---|---|
checksum | object | spdx:checksum |
mimetype | string | dcat:mediaType |
download_url | array of strings | dcat:downloadURL |
issued | string | dct:issued |
status | object | adms:status |
name | string | dct:title |
modified | string | dct:modified |
rights | object | dct:rights |
page | array of strings | foaf:page |
size | number | dcat:byteSize |
language | array of objects | dct:language |
conforms_to | array of objects | dct:conformsTo |
Example
A result of the 'package_search' action looks like this.
{
"help":"http://example.eu/data/api/3/action/help_show?name=package_search",
"success":true,
"result":{
"count":113948,
"sort":"score desc, metadata_modified desc",
"facets":{
},
"results":[
{
"issued":"2011-10-20T00:00:00Z",
"id":"525abe30-ef60-4bf9-824e-916368c1fad8",
"metadata_created":"2015-09-15T12:08:54.860742",
"metadata_modified":"2015-09-15T13:17:51.405474",
"temporal":[
{
"start_date":"2011-10-19T22:00:00Z",
"end_date":"2011-10-19T22:00:00Z"
}
],
"state":"active",
"type":"dataset",
"resources":[
{
"package_id":"525abe30-ef60-4bf9-824e-916368c1fad8",
"id":"7166a1fa-d994-4d88-8e76-3378930b1e16",
"state":"active",
"format":"XHTML",
"mimetype":"application/xhtml+xml",
"name":"Example",
"created":"2015-09-15T14:39:43.865240",
"url":"http://example.com"
}
],
"tags":[
{
"vocabulary_id":null,
"state":"active",
"display_name":"Example Tag",
"id":"06993102-a2ee-4e40-b9e4-ed3e4b86e943",
"name":"example-tag"
}
],
"groups":[
{
"display_name":"Economy and finance",
"description":"",
"title":"Economy and finance",
"id":"128d0956-4526-440e-a951-f153c190d890",
"name":"economy-and-finance"
}
],
"creator_user_id":"0ab3c2ec-c2a2-4eef-b70f-ed093e028063",
"publisher":{
"resource":"http://example.com "
},
"organization":{
"description":"Example Organization",
"created":"2015-09-15T13:56:32.985936",
"title":"Example Organization",
"name":"example-orag",
"is_organization":true,
"state":"active",
"image_url":"",
"revision_id":"ea70fb1f-29a8-4e7b-8527-809e4792a75b",
"type":"organization",
"id":"0897b420-3c3d-4a19-9c2c-a9815e2db2be",
"approval_status":"approved"
},
"name":"example-dataset",
"notes":"Example",
"owner_org":"0897b420-3c3d-4a19-9c2c-a9815e2db2be",
"modified":"2011-10-20T00:00:00Z",
"url":"",
"title":"Example Dataset",
"identifier":[
"http://example-ident.com"
]
}
],
"search_facets":{
}
}
}
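Before offering such a response for harvesting, a data supplier may want to check that each dataset carries the mandatory fields from the dataset and distribution schemas above ('title', 'notes' and, for each resource, 'url'). The following is a minimal sketch of such a check; the function name and the format of the returned list are illustrative assumptions.

```python
def missing_mandatory_fields(dataset: dict) -> list:
    """Return the names of mandatory fields that are absent or empty."""
    problems = []
    # Mandatory dataset fields: title (dct:title) and notes (dct:description).
    for field in ("title", "notes"):
        if not dataset.get(field):
            problems.append(field)
    # Mandatory distribution field: url (dcat:accessURL).
    for index, resource in enumerate(dataset.get("resources", [])):
        if not resource.get("url"):
            problems.append(f"resources[{index}].url")
    return problems
```

Applied to the dataset in the example response, the check returns an empty list, because 'title', 'notes' and the resource 'url' are all present.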
Translation
The following fields of datasets and distributions will be translated into 24 languages if translations are not provided:
- title;
- description.
CSW/INSPIRE catalogue services (for geospatial metadata)
General remarks
This interface represents an INSPIRE compliant catalogue (discovery) service 9. It is defined as a slightly extended version of the OGC CSW AP ISO 10.
The GetCapabilities operation (mandatory for all OGC services) is not needed to run the harvesting. However, it can be helpful when the catalogue service is registered with data.europa.eu, because the service's response provides information that would otherwise have to be determined during registration (e.g. the supported protocol bindings, or whether the 'modified' queryable is supported for selective harvesting).
For the harvesting process only the GetRecords operation will be called. The GetRecordById is not needed.
Operation | Operation description | data.europa.eu usage |
---|---|---|
GetCapabilities | Retrieve catalogue service metadata | Only for gathering service information upon registration |
GetRecords | Retrieval of a set of metadata items | Yes |
GetRecordById | Retrieval of single metadata items | No |
Table of OGC CSW Operations used by data.europa.eu
Metadata model
The metadata model considered is as defined in the INSPIRE Technical Guidance on Discovery Services 11 and on Metadata 12.
Within a GetRecords query (constraint) just the following metadata model elements (queryables) are used (see table).
Request parameter | Definition a | Used values in data.europa.eu | XPath b |
---|---|---|---|
Type | Provides the desired information resources. | Always the following fixed values: 'dataset', 'datasetcollection', 'series' and 'service' | /gmd:MD_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeListValue |
Modified | The metadata date stamp in case of selective harvesting (if supported), see below. | Date | /gmd:MD_Metadata/gmd:dateStamp/gco:Date |
a: 'Definition' represents the semantic meaning of the element in data.europa.eu; it is slightly different from the generic meaning in OGC CSW.
b: The element's XML path in the GetRecords request.
Table of GetRecords queryables (not parameters – see below)
Example query (constraint).
<Constraint version="1.1.0">
<ogc:Filter>
<ogc:Or>
<ogc:PropertyIsEqualTo>
<ogc:PropertyName>Type</ogc:PropertyName>
<ogc:Literal>dataset</ogc:Literal>
</ogc:PropertyIsEqualTo>
<ogc:PropertyIsEqualTo>
<ogc:PropertyName>Type</ogc:PropertyName>
<ogc:Literal>datasetcollection</ogc:Literal>
</ogc:PropertyIsEqualTo>
<ogc:PropertyIsEqualTo>
<ogc:PropertyName>Type</ogc:PropertyName>
<ogc:Literal>series</ogc:Literal>
</ogc:PropertyIsEqualTo>
<ogc:PropertyIsEqualTo>
<ogc:PropertyName>Type</ogc:PropertyName>
<ogc:Literal>service</ogc:Literal>
</ogc:PropertyIsEqualTo>
</ogc:Or>
</ogc:Filter>
</Constraint>
As defined in the INSPIRE Technical Guidance on Discovery Services 13, the operation must be able to return ISO19139 metadata aligned with the INSPIRE Technical Guidance on Metadata 14.
Requests
The mandatory GetRecords operation works as the primary means of metadata item discovery with HTTP protocol binding. It executes an inventory search and returns the metadata items. Only OGC Filter XML encoding is supported. A few additional requirements apply to GetRecords requests; they are explained below.
Bindings
One or more of HTTP POST/XML, POST/XML/SOAP1.1 and POST/XML/SOAP1.2 have to be supported as bindings.
Operation parameters
The following parameters (not the queryables) and parameter values are used in data.europa.eu for the GetRecords requests.
Request parameter | Definition a | Used values in data.europa.eu | XPath b |
---|---|---|---|
service | Indicates that this is a CSW service. | Always fixed value: 'CSW' | /GetRecords@service |
version | Indicates which version of the CSW service is requested. | Always fixed value: '2.0.2' | /GetRecords@version |
resultType | Specifies the type of result | Always fixed value: 'results' | /GetRecords@resultType |
outputFormat | Specifies the output format of the document returned by GetRecords | Always fixed value: 'application/xml' | /GetRecords@outputFormat |
outputSchema | Specifies the schema of the document returned by GetRecords | Always fixed value (namespace): 'http://www.isotc211.org/2005/gmd' | /GetRecords@outputSchema |
startPosition | Specifies the sequence number of the first returned record | Used: integer between 1 and the returned number. Default value: 1 | /GetRecords@startPosition |
maxRecords | Specifies the number of returned records | Supported: positive integer between 1 and N. Default value: 50 | /GetRecords@maxRecords |
typeNames | Specifies the query and ElementSetName type | Always fixed value: 'gmd:MD_Metadata' ('gmd' is a valid namespace prefix for 'http://www.isotc211.org/2005/gmd') | /GetRecords/Query@typeName and /GetRecords/Query/ElementSetName@typeName |
ElementSetName | Specifies the type of the document returned by GetRecords | As the harvester only requests full metadata sets, this parameter is always set to 'full'. | /GetRecords/Query/ElementSetName |
a: 'Definition' represents the semantic meaning of the element in data.europa.eu; it is slightly different from the generic meaning in OGC CSW.
b: The element's XML path in the GetRecords request.
Table of GetRecords request parameters
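The parameters in the table, combined with the example constraint shown earlier, can be assembled into a GetRecords request body as in the following sketch. The `csw` and `ogc` namespace URIs are the standard ones for CSW 2.0.2 and OGC Filter and are not taken from this page; the function name is illustrative, and a real request may need further adjustments for a particular catalogue.

```python
import xml.etree.ElementTree as ET

CSW = "http://www.opengis.net/cat/csw/2.0.2"
OGC = "http://www.opengis.net/ogc"
GMD = "http://www.isotc211.org/2005/gmd"

ET.register_namespace("csw", CSW)
ET.register_namespace("ogc", OGC)


def build_get_records(start_position: int = 1, max_records: int = 50) -> str:
    """Return a GetRecords request document using the fixed values from the table."""
    root = ET.Element(f"{{{CSW}}}GetRecords", {
        "service": "CSW",
        "version": "2.0.2",
        "resultType": "results",
        "outputFormat": "application/xml",
        "outputSchema": GMD,
        "startPosition": str(start_position),
        "maxRecords": str(max_records),
    })
    # Declare the gmd prefix by hand: it is only used inside an attribute value.
    root.set("xmlns:gmd", GMD)
    query = ET.SubElement(root, f"{{{CSW}}}Query", {"typeNames": "gmd:MD_Metadata"})
    ET.SubElement(query, f"{{{CSW}}}ElementSetName").text = "full"
    constraint = ET.SubElement(query, f"{{{CSW}}}Constraint", {"version": "1.1.0"})
    any_of = ET.SubElement(ET.SubElement(constraint, f"{{{OGC}}}Filter"), f"{{{OGC}}}Or")
    for value in ("dataset", "datasetcollection", "series", "service"):
        comparison = ET.SubElement(any_of, f"{{{OGC}}}PropertyIsEqualTo")
        ET.SubElement(comparison, f"{{{OGC}}}PropertyName").text = "Type"
        ET.SubElement(comparison, f"{{{OGC}}}Literal").text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `build_get_records(51, 50)` would produce the request for the second page of 50 records.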
Partitioning
For partitioning (pagination) the following parameters are used (see table on GetRecords):
- startPosition;
- maxRecords.
Selective harvesting
Selective harvesting allows harvesters to limit harvest requests to just those portions of the metadata available from a repository which have been changed within a specified time frame.
Selective harvesting often makes sense because usually only a few metadata records change within a day, so only those few records need to be harvested daily.
For selective harvesting the predefined queryable (usually 'modified' -- see table of GetRecords) is used.
Responses
As defined in the INSPIRE Technical Guidance on Discovery Services 15, the operation must be able to return ISO19139 metadata aligned with the INSPIRE Technical Guidance on Metadata 16.
Partitioning
For partitioning (pagination), the search response must include the total count of matching metadata items, even if the metadata for these items is not contained in the search response. This count, coupled with the ability to specify the startPosition and the number of desired records (maxRecords) in the request, allows data.europa.eu to implement results paging and to reduce the load on both the data.europa.eu system and the data partners.
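In CSW 2.0.2 the total count and paging information are carried as attributes of the `SearchResults` element (`numberOfRecordsMatched`, `numberOfRecordsReturned`, `nextRecord`). The following sketch shows how a client might derive the next `startPosition` from them. The sample response is made up and the function name is illustrative.

```python
from typing import Optional
import xml.etree.ElementTree as ET

CSW_NS = "{http://www.opengis.net/cat/csw/2.0.2}"

# Made-up first page of a result set with 120 matching records.
SAMPLE_RESPONSE = """<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2">
  <csw:SearchStatus timestamp="2022-07-31T00:06:28Z"/>
  <csw:SearchResults numberOfRecordsMatched="120" numberOfRecordsReturned="50"
                     nextRecord="51" elementSet="full"/>
</csw:GetRecordsResponse>"""


def next_start_position(response_xml: str) -> Optional[int]:
    """Return the startPosition for the next request, or None when finished."""
    results = ET.fromstring(response_xml).find(CSW_NS + "SearchResults")
    matched = int(results.get("numberOfRecordsMatched"))
    returned = int(results.get("numberOfRecordsReturned"))
    next_record = int(results.get("nextRecord", "0"))
    # nextRecord="0" signals that all matching records have been returned.
    if returned == 0 or next_record == 0 or next_record > matched:
        return None
    return next_record
```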
Error handling
Useful status and error messages help data.europa.eu manage client sessions effectively. Any limitations on submitted search requests to the inventory systems should be noted in the response (e.g. 'too many records requested', 'search timed out') so that predictable error handling can be managed by data.europa.eu.
Service information for integration
To integrate an INSPIRE Discovery Service/CSW, the data supplier needs to provide the following information.
Service information | Definition | Obligation (M=Mandatory, O=Optional, C=Conditional) | Datatype |
---|---|---|---|
GetRecords URL | URL of the CSW GetRecords operation | M | URL |
GetRecords Binding | Protocol binding of the CSW GetRecords operation | M | Codelist (one of): 'POST/XML', 'POST/XML/SOAP1.1', 'POST/XML/SOAP1.2' |
Modified a | Name of the queryable (if supported) for filtering on the metadata date stamp (for selective harvesting) | Possibly for future use | String: [Namespace":"]QueryableName |
MaxRecordsMax | Specifies the maximum number of returned records | Possibly for future use (currently always set to '50') | Integer |
a: The value in the CSW filter will be formatted as 'MM-DD-YYYY'. The operators '>=' and '<=' will be used.
Table of Service information needed for integration
Operational requirements
Harvesting frequency
Due to the high volume of metadata that will be harvested from a growing list of data suppliers and the required runtime for the harvesting processes, each data supplier site will probably not be harvested on a daily basis. The harvesting processes have to be clustered and scheduled on a fixed time schedule (e.g. during the night) in order to avoid any load impacts on the harvested sites during regular business hours usage.
Quality of the harvested datasets
Data source site API / endpoints
The REST API of the data source site should accept queries with, for example, startPos/maxRecs parameters for resumption/partitioning of the datasets to be harvested.
Avoiding duplicates
Duplicate datasets should be avoided by the source site.
Error reporting on harvested metadata
The MQA module provides a graphical report on the quality of the harvested datasets' metadata by providing access to a dashboard that summarises the main quality indicators, for example, availability and accessibility of distributions, compliance of datasets to metadata formats, and source of violations.
The MQA dashboard can be opened directly from the portal homepage.
User feedback on datasets
Users will be able to provide feedback on a dataset directly from the dataset detail page.
The system will make it possible to gather and extract all feedback received for all datasets and group those by data supplier, so that the feedback can be sent to the data supplier.
Checklist
The goal of this checklist is to gather and summarise all main requirements for successfully harvesting a data supplier site and assure a certain quality level of the harvested datasets.
No | Requirement | Value / comment |
---|---|---|
1 | Make sure that your portal provides metadata | Only metadata can be harvested, not the data itself |
2 | Which metadata standard is supported? | DCAT-AP/CKAN/ISO19139 (INSPIRE) |
3 | Which representation of the metadata is used? | XML/JSON or any RDF representation |
4 | Which API is used to retrieve the data? | CKAN/OAI-PMH/RDF dump file/SPARQL endpoint/CSW |
5 | Is authentication required to access your API? | yes/no |
6 | Include complete vocabulary for categorisation, or other fields that use a defined vocabulary (for example update frequency) | With translation, if applicable |
7 | Use standard date/time formats | ISO8601 |
8 | How often can/should the site be harvested? | daily/weekly/monthly/etc. |
9 | What shall be the title of the catalogue? | |
10 | What shall be the description of the catalogue? | |
11 | Who is the publisher of the catalogue (name and email address)? | |
12 | Which endpoint would you like us to harvest? | |
- http://www.w3.org/TR/vocab-dcat/
- https://www.openarchives.org/pmh/
- https://www.hydra-cg.com/spec/latest/core/
- http://ckan.org/
- Remote procedure call
- http://docs.ckan.org/en/ckan-2.4.0/api/index.html#action-api-reference
- http://docs.ckan.org/en/ckan-2.4.0/api/index.html#ckan.logic.action.get.package_search
- See https://www.europeandataportal.eu/mqa?locale=en
- Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011 (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0).
- OGC Catalogue Services Specification 2.0.2 – ISO metadata application profile: corrigendum, No 1.0.1, Open Geospatial Consortium, 7 March 2018, OGC 07-045rl (https://portal.ogc.org/files/80534).
- Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011 (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0).
- Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, INSPIRE Maintenance and Implementation Group, 1 August 2022 (https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139).
- Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011 (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0).
- Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, INSPIRE Maintenance and Implementation Group, 1 August 2022 (https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139).
- Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011 (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0).
- Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, INSPIRE Maintenance and Implementation Group, 1 August 2022 (https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139).