Harvesting

EU institutions, agencies and other bodies, and the Member States (the 'data providers') are autonomous in publishing their open data. Harvesting is the recommended method for publishers who manage their data in a data catalogue.

Checklist for your portal

Data.europa.eu harvests openly shared information available on public-sector open-data portals. If you want your portal or website to be harvested by data.europa.eu, please send us your answers to the following questions via the contact form. In the form, select 'Get harvested by data.europa.eu' under 'Please choose an issue type'. Once we receive your request, we will assess it and keep you informed about its status.

Do you consent to data.europa.eu sending emails to the catalogue's publisher to inform them about harvesting activities?

Please provide us with the following information about your catalogue.

  • Uniform resource locator (URL) to interface (REST, CSW...).
  • URL to homepage.
  • Title of the catalogue.
  • Description of the catalogue.
  • Publisher of the catalogue.
  • Email address of the catalogue.
  • Default language of the catalogue's datasets.
  • How often can/should the site be harvested (e.g. once a week)?
  • Are there any times when the site should not be harvested (e.g. scheduled maintenance)?

Technical requirements/constraints

The harvester accesses the endpoints of all catalogues mostly on a daily basis, depending on the size of a catalogue. We process the collected data overnight. We transform every incoming format to DCAT-AP 2.1.1, and a hash is built over every harvested dataset. This hash value is compared to the existing hash value before a dataset is potentially updated in our triplestore. Updates take place only when the two values differ. The harvester is configured specifically for each harvested portal.
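
The comparison described above can be sketched as follows. This is only an illustration: the source does not name the hash algorithm or the serialisation that is hashed, so SHA-256, the line-sorted N-Triples input and the function names `dataset_hash` and `needs_update` are all assumptions.

```python
import hashlib
from typing import Optional


def dataset_hash(ntriples: str) -> str:
    """Hash a dataset serialised as N-Triples (one triple per line).

    Sorting the lines first makes the hash independent of triple order.
    SHA-256 is an assumption; the source does not name the algorithm.
    """
    lines = sorted(line.strip() for line in ntriples.splitlines() if line.strip())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def needs_update(stored_hash: Optional[str], incoming: str) -> bool:
    """Return True only when the incoming dataset differs from the stored one."""
    return stored_hash != dataset_hash(incoming)


old = '<http://example.com/d1> <http://purl.org/dc/terms/title> "A" .\n'
same = '\n<http://example.com/d1> <http://purl.org/dc/terms/title> "A" .'
changed = '<http://example.com/d1> <http://purl.org/dc/terms/title> "B" .\n'

stored = dataset_hash(old)
print(needs_update(stored, same))     # False: nothing to write
print(needs_update(stored, changed))  # True: dataset would be updated
```

A dataset that has never been harvested has no stored hash, so `needs_update(None, ...)` is always true in this sketch.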

Access to harvested sites

Authentication

Some source sites require authentication, meaning that the harvesting client (here data.europa.eu) needs a login name and password before it can access the data.

If this applies to your portal, please state this in your message when using our contact form.

API access to harvested site

For harvesting to take place, the source site needs to have in place one of the interfaces described in detail in the 'Interfaces supported for harvesting' section.

FTP access to harvested site

Data.europa.eu does not support FTP for downloading datasets from a source site.

Interfaces supported for harvesting

The following sections describe the list of interfaces that data suppliers (e.g. national portals, public data portals in the Member States, portals from international organisations etc.) must have in place in order to be harvested by data.europa.eu.

The main supported interfaces are the following:

  • DCAT-AP / Comprehensive Knowledge Archive Network (CKAN) compliant sites (for 'normal' datasets);

  • CSW/Inspire catalogue services (for geospatial datasets);

  • OpenSearch (GEO/EOP) (for geospatial datasets).

DCAT-AP

Providing data via a DCAT-AP interface is the officially recommended method and will always be preferred for harvesting.

General remarks

DCAT-AP is a metadata specification for describing public sector datasets in Europe. It is based on the Data Catalog Vocabulary (DCAT) 1. The datasets are provided as linked data and can be represented in multiple ways. For the harvesting process, any common representation such as RDF/XML, N-Triples or Turtle is allowed.

Metadata model

For general information on the metadata model, please refer to the official documentation 2. The respective qualifiers (mandatory, recommended and optional) need to be adhered to. The following is an example dataset with all the mandatory properties in RDF/XML.

<?xml version="1.0"?>
<rdf:RDF
        xmlns:edp="https://europeandataportal.eu/voc#"
        xmlns:dct="http://purl.org/dc/terms/"
        xmlns:spdx="http://spdx.org/rdf/terms#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dqv="http://www.w3.org/ns/dqv#"
        xmlns:skos="http://www.w3.org/2004/02/skos/core#"
        xmlns:schema="http://schema.org/"
        xmlns:dcat="http://www.w3.org/ns/dcat#"
        xmlns:foaf="http://xmlns.com/foaf/0.1/"
        xmlns:dcatapde="http://dcat-ap.de/def/dcatde/">
    <dcat:CatalogRecord rdf:about="http://data.europa.eu/88u/record/ded24b58-a5ab-4d34-8603-2e5b2131a6a2">
        <edp:transStatus rdf:resource="https://europeandataportal.eu/voc#TransInProcess"/>
        <foaf:primaryTopic>
            <dcat:Dataset rdf:about="http://data.europa.eu/88u/dataset/ded24b58-a5ab-4d34-8603-2e5b2131a6a2">
                <dct:temporal>
                    <dct:PeriodOfTime>
                        <schema:endDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
                        >2022-07-29T11:06:06.094165</schema:endDate>
                        <schema:startDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
                        >2022-05-29T11:05:39.811259</schema:startDate>
                    </dct:PeriodOfTime>
                </dct:temporal>
                <dct:publisher>
                    <foaf:Organization rdf:about="https://opendata.schleswig-holstein.de/organization/5b6d12d7-09c0-4bfc-b026-587d2a7d282e">
                        <foaf:name>Kreis Rendsburg-Eckernförde</foaf:name>
                    </foaf:Organization>
                </dct:publisher>
                <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
                >2022-07-29T11:06:06.093824</dct:modified>
                <dcat:keyword>corona</dcat:keyword>
                <dct:title>Corona-Daten Rendsburg-Eckernförde</dct:title>
                <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
                >2022-07-29T11:06:06.093824</dct:issued>
                <dcat:keyword>covid-19</dcat:keyword>
                <dct:language rdf:resource="http://publications.europa.eu/resource/authority/language/DEU"/>
                <dcatapde:contributorID rdf:resource="http://dcat-ap.de/def/contributors/schleswigHolstein"/>
                <dcat:distribution>
                    <dcat:Distribution rdf:about="http://data.europa.eu/88u/distribution/a5be938b-637e-48a2-84c4-cabf323af6ee">
                        <dcat:downloadURL rdf:resource="https://opendata.schleswig-holstein.de/dataset/86178d63-3d83-4dc6-8e0a-98315ebdfadb/resource/d0d1e71b-824a-4b59-bf09-7cb18adb8fef/download/corona-rendsburg-eckernfoerde.json"/>
                        <dcat:mediaType>application/json</dcat:mediaType>
                        <spdx:checksum>
                            <spdx:Checksum>
                                <spdx:checksumValue rdf:datatype="http://www.w3.org/2001/XMLSchema#hexBinary"
                                >623cdad43e99e1d3c2bb9ba6df8ff489</spdx:checksumValue>
                                <spdx:algorithm rdf:resource="http://dcat-ap.de/def/hashAlgorithms/md/5"/>
                            </spdx:Checksum>
                        </spdx:checksum>
                        <dct:format rdf:resource="http://publications.europa.eu/resource/authority/file-type/JSON"/>
                        <dct:title>corona-rendsburg-eckernfoerde.json</dct:title>
                        <dcatapde:licenseAttributionByText>Kreis Rendsburg-Eckernförde</dcatapde:licenseAttributionByText>
                        <dct:identifier>https://opendata.schleswig-holstein.de/dataset/86178d63-3d83-4dc6-8e0a-98315ebdfadb/resource/d0d1e71b-824a-4b59-bf09-7cb18adb8fef</dct:identifier>
                        <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
                        >2022-07-29T09:06:06.400937</dct:modified>
                        <dcat:accessURL rdf:resource="https://opendata.schleswig-holstein.de/dataset/86178d63-3d83-4dc6-8e0a-98315ebdfadb/resource/d0d1e71b-824a-4b59-bf09-7cb18adb8fef/download/corona-rendsburg-eckernfoerde.json"/>
                        <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
                        >2022-07-29T09:06:06.457996</dct:issued>
                        <dcat:byteSize rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"
                        >18006</dcat:byteSize>
                        <dct:rights rdf:resource="http://dcat-ap.de/def/licenses/cc-by/4.0"/>
                        <dct:license rdf:resource="http://dcat-ap.de/def/licenses/cc-by/4.0"/>
                    </dcat:Distribution>
                </dcat:distribution>
                <dct:isVersionOf rdf:resource="https://opendata.schleswig-holstein.de/dataset/f1bfb6ac-6ca9-426d-880c-e7a1257bb0d1"/>
                <dcat:theme rdf:resource="http://publications.europa.eu/resource/authority/data-theme/HEAL"/>
                <dct:spatial>
                    <dct:Location rdf:about="http://dcat-ap.de/def/politicalGeocoding/districtKey/01058">
                        <skos:prefLabel>Kreis Rendsburg-Eckernförde</skos:prefLabel>
                    </dct:Location>
                </dct:spatial>
                <dct:accessRights rdf:resource="http://publications.europa.eu/resource/authority/access-right/PUBLIC"/>
                <dct:identifier>ded24b58-a5ab-4d34-8603-2e5b2131a6a2</dct:identifier>
                <dct:description>CORONA - Aktuelle Situation im Kreis Rendsburg-Eckernförde&#xD;
                    &#xD;
                    Pro Gemeinde sind folgende Daten verzeichnet:&#xD;
                    &#xD;
                    - Positiv Getestete gesamt  &#xD;
                    - Aktuell Infizierte &#xD;
                    - Aktuell Infizierte pro 1.000 Einwohner&#xD;
                    - Genesene &#xD;
                    - Verstorbene&#xD;
                    &#xD;
                    Der Eintrag für eine Gemeinde sieht folgendermaßen aus:&#xD;
                    &#xD;
                    `'010585833054': { amount_pt: 2.699698269017, amount_t: 149, amount_i: 17, amount_d: 1, amount_h: 131 },`&#xD;
                    &#xD;
                    Als Schlüssel wird der [Regionalschlüssel](https://www.dcat-ap.de/def/politicalGeocoding/regionalKey/) verwendet. Die Properties enthalten folgende Daten:&#xD;
                    &#xD;
                    - `amount_pt` - Aktuell Infizierte pro 1.000 Einwohner&#xD;
                    - `amount_t` - Positiv Getestete gesamt&#xD;
                    - `amount_i` - Aktuell Infizierte &#xD;
                    - `amount_d` - Verstorbene&#xD;
                    - `amount_h` - Genesene&#xD;
                    &#xD;
                    Interaktiv und grafisch sind die Daten auf dem [Corona-Dashboard des Kreises](https://covid19dashboardrdeck.aco/) zu sehen.</dct:description>
            </dcat:Dataset>
        </foaf:primaryTopic>
        <dqv:hasQualityMetadata rdf:resource="http://data.europa.eu/88u/metrics/ded24b58-a5ab-4d34-8603-2e5b2131a6a2"/>
        <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
        >2022-07-31T00:06:28Z</dct:issued>
        <dct:identifier>ded24b58-a5ab-4d34-8603-2e5b2131a6a2</dct:identifier>
        <edp:originalLanguage>de</edp:originalLanguage>
        <dct:creator rdf:resource="http://piveau.io"/>
        <edp:transIssued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
        >2022-07-31T00:06:28Z</edp:transIssued>
        <spdx:checksum>
            <spdx:Checksum>
                <spdx:algorithm rdf:resource="http://spdx.org/rdf/terms#checksumAlgorithm_md5"/>
                <spdx:checksumValue>ef0676bea69a09053ac2ba52e23f271a</spdx:checksumValue>
            </spdx:Checksum>
        </spdx:checksum>
        <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
        >2022-07-31T00:06:28Z</dct:modified>
    </dcat:CatalogRecord>
</rdf:RDF>
Categorisation

The data.europa.eu categories are based on the EU controlled data theme vocabulary. The following are the categories used on data.europa.eu.

AGRI Agriculture, fisheries, forestry and food
ECON Economy and finance
EDUC Education, culture and sport
ENER Energy
ENVI Environment
GOVE Government and public sector
HEAL Health
INTR International issues
JUST Justice, legal system and public safety
REGI Regions and cities
SOCI Population and society
TECH Science and technology
TRAN Transport

When providing data, publishers should always use these terms to thematically categorise the datasets. If a different vocabulary is used, it should be aligned (i.e. mapped) to these categories.
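
A mapping of this kind could look like the sketch below. The theme codes and the URI prefix are taken from the list and the RDF/XML example above; the local category names in `LOCAL_TO_THEME` and the function name `theme_uri` are hypothetical and stand in for whatever vocabulary a publisher actually uses.

```python
from typing import Optional

# URI prefix as used for dcat:theme in the RDF/XML example above.
THEME_BASE = "http://publications.europa.eu/resource/authority/data-theme/"

# Category codes as listed above.
THEME_CODES = {
    "AGRI", "ECON", "EDUC", "ENER", "ENVI", "GOVE", "HEAL",
    "INTR", "JUST", "REGI", "SOCI", "TECH", "TRAN",
}

# Hypothetical local categories of a publisher, mapped to the codes above.
LOCAL_TO_THEME = {
    "gesundheit": "HEAL",
    "verkehr": "TRAN",
    "umwelt": "ENVI",
}


def theme_uri(local_category: str) -> Optional[str]:
    """Return the data theme URI for a local category, or None if unmapped."""
    code = LOCAL_TO_THEME.get(local_category.strip().lower())
    if code is None or code not in THEME_CODES:
        return None
    return THEME_BASE + code


print(theme_uri("Gesundheit"))
# http://publications.europa.eu/resource/authority/data-theme/HEAL
print(theme_uri("Sonstiges"))  # None: no mapping defined
```

Returning `None` for unmapped categories makes gaps in the mapping visible instead of silently assigning a wrong theme.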

Requests

The harvester currently supports three kinds of source: a source compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) 2, a dump file containing the RDF/XML representation of the datasets, or a SPARQL endpoint from which DCAT-AP is read directly. If datasets are provided as a dump file, it is recommended to split the file into pages, for example, by using the Hydra Core Vocabulary 3.

For OAI-PMH-compliant sources, only the verb 'ListRecords' is used.
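
As a rough sketch, a ListRecords request URL can be assembled as shown below. The endpoint `https://example.com/oai` and the metadata prefix `dcat_ap` are placeholders; the prefix a given source advertises may differ. In OAI-PMH, a resumption token is an exclusive argument, so follow-up requests carry only the verb and the token.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords URL for the first or a follow-up page."""
    if resumption_token:
        # Follow-up page: the token replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # First page: metadataPrefix is required.
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and prefix, for illustration only.
print(list_records_url("https://example.com/oai", "dcat_ap"))
# https://example.com/oai?verb=ListRecords&metadataPrefix=dcat_ap
print(list_records_url("https://example.com/oai", "dcat_ap", "t1"))
# https://example.com/oai?verb=ListRecords&resumptionToken=t1
```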

Responses

As indicated above, the response must be DCAT-AP-compliant to be understood by the harvesting component.

Error handling

The OAI-PMH protocol provides methods for error handling that the harvester can understand. When using this protocol, these error methods should be used.
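
For illustration, OAI-PMH reports errors as `error` elements with a `code` attribute inside the response document. The sketch below shows how a client might read them; the sample response and the function name `oai_errors` are made up for this example and do not describe the harvester's internals.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def oai_errors(response_xml: str) -> list:
    """Return (code, message) pairs for every error in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall(f"{{{OAI_NS}}}error")
    ]


# Made-up response showing an expired resumption token.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2022-07-31T00:06:28Z</responseDate>
  <request verb="ListRecords">https://example.com/oai</request>
  <error code="badResumptionToken">The resumption token has expired</error>
</OAI-PMH>"""

print(oai_errors(sample))
# [('badResumptionToken', 'The resumption token has expired')]
```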

Service information for integration

As stated above, a categorisation mapping should be provided. Apart from that, the URL for the OAI-PMH endpoint or the dump file is needed.

CKAN API

The open-source data portal platform CKAN 4 is widely used for building open data platforms. Its RPC-style 5 API (action API) is supported as an interface for data suppliers of data.europa.eu. Two options for using this interface are available.

  • The data supplier uses CKAN for providing its open data metadata. It is important that the CKAN version in use supports the action API 6. The legacy APIs of CKAN are not supported.

  • The data supplier offers a CKAN-compliant API, where the necessary endpoints reproduce the exact API behaviour.

Requests and responses

Only the 'package_search' API endpoint is needed in order to harvest the metadata. Its specifications are described in detail in the official documentation 7. This endpoint is used to get the metadata in a paginated way. Therefore it accepts query parameters in a request and returns a dictionary with datasets as a result. The high-level use of this endpoint has to be offered as specified in the CKAN documentation.

Example call: GET http://www.example.com/api/3/action/package_search?rows=50
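
Paginated access can be sketched as below, using the `rows` and `start` query parameters of 'package_search'. The total would normally come from the `count` field of the first response; the page size of 50 and the function name `package_search_urls` are assumptions made for this example.

```python
from urllib.parse import urlencode


def package_search_urls(base_url: str, total_count: int, rows: int = 50):
    """Yield one package_search URL per page of `rows` datasets."""
    for start in range(0, total_count, rows):
        query = urlencode({"rows": rows, "start": start})
        yield f"{base_url}/api/3/action/package_search?{query}"


# 120 datasets with 50 per page need three requests.
for url in package_search_urls("http://www.example.com", 120):
    print(url)
# http://www.example.com/api/3/action/package_search?rows=50&start=0
# http://www.example.com/api/3/action/package_search?rows=50&start=50
# http://www.example.com/api/3/action/package_search?rows=50&start=100
```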

Metadata model

Although the CKAN API can be used as is, the basic CKAN data schema was extended and modified to meet the requirements of the underlying data structure (DCAT-AP) of data.europa.eu. The response of the 'package_search' action exposes a 'results' field, which holds a list of datasets, each serialised as a dictionary. The data structure of such a dataset differs from that of a plain CKAN installation.

Notes:

  • Bold fields are CKAN standard. Further information in the official documentation.

  • Type specifications according to official JSON standard (http://json.org/).

  • Besides the mandatory fields, the field names and types are not strict, but a data supplier has to make sure an obvious mapping is possible.

  • For a detailed explanation of each field, refer to the DCAT-AP specifications.

Dataset schema

The following fields are mandatory.

Field | Type | DCAT-AP dataset equivalent
title | string | dct:title
notes | string | dct:description

The following fields are optional but highly recommended.

Field | Type | DCAT-AP dataset equivalent
contact_point | array of objects (allowed members: type, name, email, resource) | dcat:contactPoint
tags | array of objects | dcat:keyword
publisher | object | dct:publisher
groups | array of objects – the name of each group needs to fit the official categorisation | dcat:theme
resources | array of objects (see distribution schema) | dcat:distribution

The following fields are optional.

Field | Type | DCAT-AP dataset equivalent
conforms_to | array of objects (allowed members: label, resource) | dct:conformsTo
creator | object | dct:creator
accrual_periodicity | object | dct:accrualPeriodicity
identifier | object | dct:identifier
url | string | dcat:landingPage
language | array of objects (allowed members: label, resource) | dct:language
other_identifier | object | adms:identifier
issued | string | dct:issued
dcat_spatial | array of objects (allowed members: label, resource) | dct:spatial
temporal | array of objects (allowed members: start_date, end_date) | dct:temporal
modified | string | dct:modified
version_info | string | owl:versionInfo
version_notes | string | adms:versionNotes
provenance | array of objects (allowed members: label, resource) | dct:provenance
source | array of strings | dct:source
access_rights | object | dct:accessRights
has_version | array of strings | dct:hasVersion
is_version_of | array of strings | dct:isVersionOf
relation | array of strings | dct:relation
page | array of strings | foaf:page
sample | array of strings | adms:sample
dct_type | string | dct:type

Distribution schema

The following fields are mandatory.

Field | Type | DCAT-AP distribution equivalent
url | string | dcat:accessURL

The following fields are optional but highly recommended.

Field | Type | DCAT-AP distribution equivalent
description | string | dct:description
format | string | dct:format
license | object | dct:license

Note that the list of licences recognised by data.europa.eu's DCAT-AP parser is available online (https://data.europa.eu/en/training/licensing-assistant). This is also used by our metadata quality assessment (MQA) tool 8 for assessing the data providers' performance in using known licences.

The following fields are optional.

Field | Type | DCAT-AP distribution equivalent
checksum | object | spdx:checksum
mimetype | string | dcat:mediaType
download_url | array of strings | dcat:downloadURL
issued | string | dct:issued
status | object | adms:status
name | string | dct:title
modified | string | dct:modified
rights | object | dct:rights
page | array of strings | foaf:page
size | number | dcat:byteSize
language | array of objects | dct:language
conforms_to | array of objects | dct:conformsTo

Example

A result of the 'package_search' action looks like this.

{
   "help":"http://example.eu/data/api/3/action/help_show?name=package_search",
   "success":true,
   "result":{
      "count":113948,
      "sort":"score desc, metadata_modified desc",
      "facets":{

      },
      "results":[
         {
            "issued":"2011-10-20T00:00:00Z",
            "id":"525abe30-ef60-4bf9-824e-916368c1fad8",
            "metadata_created":"2015-09-15T12:08:54.860742",
            "metadata_modified":"2015-09-15T13:17:51.405474",
            "temporal":[
               {
                  "start_date":"2011-10-19T22:00:00Z",
                  "end_date":"2011-10-19T22:00:00Z"
               }
            ],
            "state":"active",
            "type":"dataset",
            "resources":[
               {
                  "package_id":"525abe30-ef60-4bf9-824e-916368c1fad8",
                  "id":"7166a1fa-d994-4d88-8e76-3378930b1e16",
                  "state":"active",
                  "format":"XHTML",
                  "mimetype":"application/xhtml+xml",
                  "name":"Example",
                  "created":"2015-09-15T14:39:43.865240",
                  "url":"http://example.com"
               }
            ],
            "tags":[
               {
                  "vocabulary_id":null,
                  "state":"active",
                  "display_name":"Example Tag",
                  "id":"06993102-a2ee-4e40-b9e4-ed3e4b86e943",
                  "name":"example-tag"
               }
            ],
            "groups":[
               {
                  "display_name":"Economy and finance",
                  "description":"",
                  "title":"Economy and finance",
                  "id":"128d0956-4526-440e-a951-f153c190d890",
                  "name":"economy-and-finance"
               }
            ],
            "creator_user_id":"0ab3c2ec-c2a2-4eef-b70f-ed093e028063",
            "publisher":{
               "resource":"http://example.com "
            },
            "organization":{
               "description":"Example Organization",
               "created":"2015-09-15T13:56:32.985936",
               "title":"Example Organization",
               "name":"example-orag",
               "is_organization":true,
               "state":"active",
               "image_url":"",
               "revision_id":"ea70fb1f-29a8-4e7b-8527-809e4792a75b",
               "type":"organization",
               "id":"0897b420-3c3d-4a19-9c2c-a9815e2db2be",
               "approval_status":"approved"
            },
            "name":"example-dataset",
            "notes":"Example",
            "owner_org":"0897b420-3c3d-4a19-9c2c-a9815e2db2be",
            "modified":"2011-10-20T00:00:00Z",
            "url":"",
            "title":"Example Dataset",
            "identifier":[
               "http://example-ident.com"
            ]
         }
      ],
      "search_facets":{      
      }
   }
}
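
Before exposing such a response, a data supplier may want to check each dataset against the mandatory fields listed in the schemas above ('title' and 'notes' for datasets, 'url' for distributions). The helper below is a minimal sketch of such a check; its name and the format of the reported paths are illustrative assumptions.

```python
def missing_mandatory_fields(dataset: dict) -> list:
    """List mandatory dataset and distribution fields that are absent or empty."""
    missing = [field for field in ("title", "notes") if not dataset.get(field)]
    for index, resource in enumerate(dataset.get("resources", [])):
        if not resource.get("url"):
            missing.append(f"resources[{index}].url")
    return missing


complete = {
    "title": "Example Dataset",
    "notes": "Example",
    "resources": [{"name": "Example", "url": "http://example.com"}],
}
incomplete = {"title": "Example Dataset", "resources": [{"name": "Example"}]}

print(missing_mandatory_fields(complete))    # []
print(missing_mandatory_fields(incomplete))  # ['notes', 'resources[0].url']
```
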
Translation

The following fields of datasets and distributions will be translated into 24 languages if translations are not provided:

  • title

  • description.

CSW/INSPIRE catalogue services (for geospatial metadata)

General remarks

This interface represents an INSPIRE compliant catalogue (discovery) service 9. It is defined as a slightly extended version of the OGC CSW AP ISO 10.

The GetCapabilities operation (mandatory for all OGC services) is not needed for running the harvesting. It can, however, be helpful when registering the catalogue service with data.europa.eu, as the service's response provides additional information that would otherwise have to be determined during registration (e.g. the supported protocol bindings or the support of the 'modified' queryable for selective harvesting).

For the harvesting process only the GetRecords operation will be called. The GetRecordById operation is not needed.

Operation | Operation description | data.europa.eu usage
GetCapabilities | Retrieve catalogue service metadata | Only for gathering service information upon registration
GetRecords | Retrieve a set of metadata items | Yes
GetRecordById | Retrieve a single metadata item | No

Table of OGC CSW Operations used by data.europa.eu

Metadata model

The metadata model considered is as defined in the INSPIRE Technical Guidance on Discovery Services 11 and on Metadata 12.

Within a GetRecords query (constraint) just the following metadata model elements (queryables) are used (see table).

Request parameter | Definition (a) | Used values in data.europa.eu | XPath (b)
Type | Provides the desired information resources. | Always the following fixed values: 'dataset', 'datasetcollection', 'series' and 'service' | /gmd:MD_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeListValue
Modified | The metadata date stamp in case of selective harvesting (if supported), see below. | Date | /gmd:MD_Metadata/gmd:dateStamp/gco:Date
a: 'Definition' represents the semantic meaning of the element in data.europa.eu; it is slightly different from the generic meaning in OGC CSW.
b: Element's XML path in GetRecords request.

Table of GetRecords queryables (not parameters – see below)

Example query (constraint).

<Constraint version="1.1.0">
 <ogc:Filter>
  <ogc:Or>
   <ogc:PropertyIsEqualTo>
    <ogc:PropertyName>Type</ogc:PropertyName>
    <ogc:Literal>dataset</ogc:Literal>
   </ogc:PropertyIsEqualTo>
   <ogc:PropertyIsEqualTo>
    <ogc:PropertyName>Type</ogc:PropertyName>
    <ogc:Literal>datasetcollection</ogc:Literal>
   </ogc:PropertyIsEqualTo>
   <ogc:PropertyIsEqualTo>
    <ogc:PropertyName>Type</ogc:PropertyName>
    <ogc:Literal>series</ogc:Literal>
   </ogc:PropertyIsEqualTo>
   <ogc:PropertyIsEqualTo>
    <ogc:PropertyName>Type</ogc:PropertyName>
    <ogc:Literal>service</ogc:Literal>
   </ogc:PropertyIsEqualTo>
  </ogc:Or>
 </ogc:Filter>
</Constraint>

As defined in the INSPIRE Technical Guidance on Discovery Services 13 the operation must be able to return ISO19139 metadata aligned with the INSPIRE Technical Guidance on Metadata 14.

Requests

The mandatory GetRecords operation is the primary means of metadata item discovery with HTTP protocol binding. It executes an inventory search and returns the metadata items. Only OGC Filter XML encoding is supported. A few additional requirements apply to GetRecords requests; these are explained below.

Bindings

One or more of HTTP POST/XML, POST/XML/SOAP1.1 and POST/XML/SOAP1.2 have to be supported as bindings.

Operation parameters

The following parameters (not the queryables) and parameter values are used in data.europa.eu for the GetRecords requests.

Request parameter | Definition (a) | Used values in data.europa.eu | XPath (b)
service | Identifies the service as a CSW service. | Always fixed value: 'CSW' | /GetRecords@service
version | States which version of the CSW service is requested. | Always fixed value: '2.0.2' | /GetRecords@version
resultType | Specifies the type of result. | Always fixed value: 'results' | /GetRecords@resultType
outputFormat | Specifies the output format of the document returned by GetRecords. | Always fixed value: 'application/xml' | /GetRecords@outputFormat
outputSchema | Specifies the schema of the document returned by GetRecords. | Always fixed value (namespace): 'http://www.isotc211.org/2005/gmd' | /GetRecords@outputSchema
startPosition | Specifies the sequence number of the first returned record. | Used: integer between 1 and returned number. Default value is 1 | /GetRecords@startPosition
maxRecords | Specifies the number of returned records. | Supported: positive integer between 1 and N. Default value is 50 | /GetRecords@maxRecords
typeNames | Specifies the query and ElementSetName type. | Always fixed value: 'gmd:MD_Metadata' ('gmd' is a valid namespace prefix for 'http://www.isotc211.org/2005/gmd') | /GetRecords/Query@typeName and /GetRecords/Query/ElementSetName@typeName
ElementSetName | Specifies the type of the document returned by GetRecords. | As only full metadata sets will be requested by the harvester, this parameter will always be set to 'full'. | /GetRecords/Query/ElementSetName
a: 'Definition' represents the semantic meaning of the element in data.europa.eu; it is slightly different from the generic meaning in OGC CSW.
b: Element's XML path in GetRecords request.

Table of GetRecords request parameters

Partitioning

For partitioning (pagination) the following parameters are used (see table on GetRecords):

  • startPosition;

  • maxRecords.
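
Putting the request parameters and the example constraint together, a GetRecords request body for one page could be generated as in the following sketch. The namespace URIs are those of CSW 2.0.2 and OGC Filter; the function name `build_get_records` and the use of Python's ElementTree are choices made for this illustration, not a description of the harvester itself.

```python
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"
OGC_NS = "http://www.opengis.net/ogc"
GMD_NS = "http://www.isotc211.org/2005/gmd"

ET.register_namespace("csw", CSW_NS)
ET.register_namespace("ogc", OGC_NS)


def build_get_records(start_position: int = 1, max_records: int = 50,
                      types=("dataset", "datasetcollection", "series", "service")) -> str:
    """Return a GetRecords XML body requesting one page of full ISO records."""
    root = ET.Element(f"{{{CSW_NS}}}GetRecords", {
        "service": "CSW",
        "version": "2.0.2",
        "resultType": "results",
        "outputFormat": "application/xml",
        "outputSchema": GMD_NS,
        "startPosition": str(start_position),
        "maxRecords": str(max_records),
    })
    # Declare the gmd prefix by hand: it is only used inside an attribute value.
    root.set("xmlns:gmd", GMD_NS)
    query = ET.SubElement(root, f"{{{CSW_NS}}}Query", {"typeNames": "gmd:MD_Metadata"})
    ET.SubElement(query, f"{{{CSW_NS}}}ElementSetName").text = "full"
    constraint = ET.SubElement(query, f"{{{CSW_NS}}}Constraint", {"version": "1.1.0"})
    any_of = ET.SubElement(ET.SubElement(constraint, f"{{{OGC_NS}}}Filter"),
                           f"{{{OGC_NS}}}Or")
    for value in types:
        equal = ET.SubElement(any_of, f"{{{OGC_NS}}}PropertyIsEqualTo")
        ET.SubElement(equal, f"{{{OGC_NS}}}PropertyName").text = "Type"
        ET.SubElement(equal, f"{{{OGC_NS}}}Literal").text = value
    return ET.tostring(root, encoding="unicode")


# Second page of 50 records: records 51 to 100.
body = build_get_records(start_position=51)
print(body[:60])
```

The resulting string would be sent as the body of an HTTP POST with one of the bindings listed above.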

Selective harvesting

Selective harvesting allows harvesters to limit harvest requests to just those portions of the metadata available from a repository which have been changed within a specified time frame.

Selective harvesting often makes sense because usually only a few metadata records change within a day, so only those records need to be harvested daily.

For selective harvesting the predefined queryable (usually 'modified', see the table of GetRecords queryables) is used.
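
A filter restricting the harvest to records changed since a given day might be built as sketched below. The queryable name 'Modified' and the 'MM-DD-YYYY' date format follow the tables in this document; the function name `modified_since_filter` is hypothetical, and a real request would combine this filter with the 'Type' constraint shown earlier.

```python
from datetime import date
import xml.etree.ElementTree as ET

OGC_NS = "http://www.opengis.net/ogc"
ET.register_namespace("ogc", OGC_NS)


def modified_since_filter(since: date, queryable: str = "Modified") -> str:
    """Return an ogc:Filter selecting records modified on or after `since`."""
    flt = ET.Element(f"{{{OGC_NS}}}Filter")
    compare = ET.SubElement(flt, f"{{{OGC_NS}}}PropertyIsGreaterThanOrEqualTo")
    ET.SubElement(compare, f"{{{OGC_NS}}}PropertyName").text = queryable
    # Date format 'MM-DD-YYYY' as stated in the service information table.
    ET.SubElement(compare, f"{{{OGC_NS}}}Literal").text = since.strftime("%m-%d-%Y")
    return ET.tostring(flt, encoding="unicode")


print(modified_since_filter(date(2022, 7, 30)))
```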

Responses

As defined in the INSPIRE Technical Guidance on Discovery Services 15 the operation must be able to return ISO19139 metadata aligned with the INSPIRE Technical Guidance on Metadata 16.

Partitioning

For partitioning (pagination), the search response must include the total count of matching metadata items, even if the metadata for these items is not contained in the search response. This count, coupled with the ability to specify the startPosition and the number of desired records (maxRecords), allows data.europa.eu to implement results paging and to reduce the load on both the data.europa.eu system and the data partners.
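
In CSW 2.0.2 the total count is carried in the `numberOfRecordsMatched` attribute of the `csw:SearchResults` element, together with `nextRecord`. The sketch below shows how a client could derive the next startPosition from a response; the sample response and the function name `next_start_position` are assumptions made for illustration.

```python
from typing import Optional
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"


def next_start_position(response_xml: str) -> Optional[int]:
    """Return the startPosition of the next page, or None after the last page."""
    root = ET.fromstring(response_xml)
    results = root.find(f"{{{CSW_NS}}}SearchResults")
    if results is None:
        raise ValueError("no csw:SearchResults element in response")
    matched = int(results.get("numberOfRecordsMatched"))
    next_record = int(results.get("nextRecord", "0"))
    # nextRecord is 0 when all matching records have been returned.
    if next_record == 0 or next_record > matched:
        return None
    return next_record


# Made-up response: 120 matches, records 1 to 50 returned.
sample = """<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2">
  <csw:SearchStatus timestamp="2022-07-31T00:06:28Z"/>
  <csw:SearchResults numberOfRecordsMatched="120"
                     numberOfRecordsReturned="50" nextRecord="51"/>
</csw:GetRecordsResponse>"""

print(next_start_position(sample))  # 51
```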

Error handling

Useful status and error messages help data.europa.eu manage client sessions effectively. Any limitations on submitted search requests to the inventory systems should be noted in the response (e.g. 'too many records requested', 'search timed out') so that predictable error handling can be managed by data.europa.eu.

Service information for integration

To integrate an INSPIRE Discovery Service/CSW, the data supplier needs to provide the following information.

Service information | Definition | Obligation (M = Mandatory, O = Optional, C = Conditional) | Datatype
GetRecords URL | URL of the CSW GetRecords operation | M | URL
GetRecords Binding | Protocol binding of the CSW GetRecords operation | M | Codelist (one of): 'POST/XML', 'POST/XML/SOAP1.1', 'POST/XML/SOAP1.2'
Modified (a) | Name of the queryable (if supported) for filtering on the metadata date stamp (for selective harvesting) | Possibly for future use | String: [Namespace ":"] QueryableName
MaxRecordsMax | Specifies the maximum number of returned records | Possibly for future use (currently always set to '50') | Integer
a: The value in the CSW filter will be formatted as 'MM-DD-YYYY'. The operators '>=' and '<=' will be used.

Table of Service information needed for integration

Operational requirements

Harvesting frequency

Due to the high volume of metadata that will be harvested from a growing list of data suppliers and the required runtime for the harvesting processes, each data supplier site will probably not be harvested on a daily basis. The harvesting processes have to be clustered and run on a fixed schedule (e.g. during the night) in order to avoid any load on the harvested sites during regular business hours.

Quality of the harvested datasets

Data source site API / endpoints

The REST API of the data source site should accept queries with, for example, startPos/maxRecs parameters for resumption/partitioning of the datasets to be harvested.

Avoiding duplicates

Duplicate datasets should be avoided by the source site.

Error reporting on harvested metadata

The MQA module provides a graphical report on the quality of the harvested datasets' metadata by providing access to a dashboard that summarises the main quality indicators, for example, availability and accessibility of distributions, compliance of datasets to metadata formats, and source of violations.

The MQA dashboard can be opened directly from the portal homepage.

User feedback on datasets

Users will be able to provide feedback on a dataset directly from the dataset detail page.

The system will make it possible to gather and extract all feedback received for all datasets and group those by data supplier, so that the feedback can be sent to the data supplier.

Checklist

The goal of this checklist is to gather and summarise all main requirements for successfully harvesting a data supplier site and assure a certain quality level of the harvested datasets.

No | Requirement | Value | Comment
1 | Make sure that your portal provides metadata | | Only metadata can be harvested, not the data itself
2 | Which metadata standard is supported? | DCAT-AP/CKAN/ISO19139 (INSPIRE) |
3 | Which representation of the metadata is used? | XML/JSON or any RDF representation |
4 | Which API is used to retrieve the data? | CKAN/OAI-PMH/RDF dump file/SPARQL endpoint/CSW |
5 | Is authentication required for us to access your API? | yes/no |
6 | Include complete vocabulary for categorisation, or other fields that use a defined vocabulary (for example update frequency) | | With translation, if applicable
7 | Use standard date/time formats | ISO 8601 |
8 | How often can/should the site be harvested? | daily/weekly/monthly/etc. |
9 | What shall be the title of the catalogue? | |
10 | What shall be the description of the catalogue? | |
11 | Who is the publisher of the catalogue (name and email address)? | |
12 | Which end point would you like us to harvest? | |

  1. http://www.w3.org/TR/vocab-dcat/ 

  2. https://www.openarchives.org/pmh/ 

  3. https://www.hydra-cg.com/spec/latest/core/ 

  4. http://ckan.org/ 

  5. Remote procedure call 

  6. http://docs.ckan.org/en/ckan-2.4.0/api/index.html#action-api-reference 

  7. http://docs.ckan.org/en/ckan-2.4.0/api/index.html#ckan.logic.action.get.package_search 

  8. See https://www.europeandataportal.eu/mqa?locale=en 

  9. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011. (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0) 

  10. OGC Catalogue Services Specification 2.0.2 – ISO metadata application profile: corrigendum, No 1.0.1, Open Geospatial Consortium, 7 March 2018, OGC 07-045r1 (https://portal.ogc.org/files/80534) 

  11. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011. (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0) 

  12. Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, Inspire Maintenance and Implementation Group, 1 August 2022 (https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139). 

  13. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011. (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0) 

  14. Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, Inspire Maintenance and Implementation Group, 1 August 2022 (https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139). 

  15. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011. (https://inspire.ec.europa.eu/documents/technical-guidance-implementation-inspire-discovery-services-0) 

  16. Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, Inspire Maintenance and Implementation Group, 1 August 2022 (https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139).