

EU institutions, agencies and other bodies, and the Member States (the 'data providers') are autonomous in publishing their open data. Harvesting is the recommended method for publishers who manage their data in a data catalogue.

data.europa.eu harvests openly shared information available on public-sector open-data portals. If you want your portal or website to be harvested by data.europa.eu, please share your answers to the following questions via the contact form. When sharing this form, please select 'Get harvested by data.europa.eu' when answering the question 'Please choose an issue type'. Once we receive your request, we will assess it and keep you informed about its status.


Do you consent to data.europa.eu sending emails to the catalogue's publisher to inform them about harvesting activities?

Please provide us with the following information about your catalogue.

  • Uniform resource locator (URL) of the interface (REST, CSW, etc.).
  • URL of the homepage.
  • Title of the catalogue.
  • Description of the catalogue.
  • Publisher of the catalogue.
  • Email address of the catalogue.
  • Default language of the catalogue's datasets.
  • How often can/should the site be harvested (e.g. once a week)?
  • Are there any times when the site should not be harvested (e.g. scheduled maintenance)?

Technical requirements/constraints

The harvester accesses the endpoints of all catalogues, mostly on a daily basis, depending on the size of a catalogue. We process the collected data overnight. We transform every incoming format to DCAT-AP 2.1.1, and a hash is built over every harvested dataset. This hash value is compared with the existing hash value before a dataset is potentially updated in our triplestore. Updates take place only when the hashes differ. The harvester is configured specifically for each harvested portal.
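The hash comparison step can be sketched as follows. The portal does not document its exact hash algorithm, so SHA-256 and the function names here are illustrative assumptions:

```python
import hashlib
from typing import Optional

def dataset_hash(dcat_record: str) -> str:
    # Hash over the serialised DCAT-AP record (algorithm assumed: SHA-256).
    return hashlib.sha256(dcat_record.encode("utf-8")).hexdigest()

def needs_update(incoming_record: str, stored_hash: Optional[str]) -> bool:
    # The triplestore is touched only when the hashes are unequal.
    return dataset_hash(incoming_record) != stored_hash

record = '<dcat:Dataset rdf:about="https://example.org/ds/1"/>'
print(needs_update(record, None))                  # True: first harvest
print(needs_update(record, dataset_hash(record)))  # False: unchanged
```

Because only changed records trigger writes, an unchanged catalogue costs little more than the hash computation per dataset.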

Access to harvested sites


Some source sites require authentication, meaning we need a login name and password before we can access the data.

If this applies to your portal, please state this in your message when using our contact form.

API access to harvested site

For harvesting to take place, the source site needs to have in place one of the interfaces described in detail in the 'Interfaces supported for harvesting' section.

FTP access to harvested site

data.europa.eu does not support FTP for downloading datasets from a source site.

Interfaces supported for harvesting

The following sections describe the list of interfaces that data suppliers (e.g. national portals, public data portals in the Member States, portals from international organisations, etc.) must have in place in order to be harvested by data.europa.eu.

The main supported interfaces are the following:

  • DCAT-AP / Comprehensive Knowledge Archive Network (CKAN) compliant sites (for 'normal' datasets);

  • CSW/INSPIRE catalogue services (for geospatial datasets);

  • OpenSearch (GEO/EOP) (for geospatial datasets).


Providing data via a DCAT-AP interface is the official recommended method and will always be preferred for harvesting.

General remarks

DCAT-AP is a metadata specification for describing public sector datasets in Europe. It is based on the data catalogue vocabulary 1. The datasets are provided as linked data and can be represented in multiple ways. For the harvesting process, any common representation such as RDF/XML, N-Triples or Turtle is allowed.

Metadata model

For general information on the metadata model, please refer to the official documentation 2. The respective qualifiers (mandatory, recommended and optional) need to be adhered to. The following is an example dataset with all the mandatory properties in RDF/XML.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dct="http://purl.org/dc/terms/"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:adms="http://www.w3.org/ns/adms#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:vcard="http://www.w3.org/2006/vcard/ns#">
    <dcat:Dataset rdf:about="">
        <dct:title xml:lang="en">Measures in solidarity with Ukraine</dct:title>
        <dct:alternative xml:lang="en">EU solidarity with Ukraine</dct:alternative>
        <dct:description xml:lang="en">This dataset contains a list of documents published on EUR-Lex that bring together the measures the EU has taken in solidarity with Ukraine. This list includes measures of assistance to Ukraine as well as related restrictive measures. Acts amended in the above mentioned context are also included. The list is updated regularly.</dct:description>
        <dct:subject rdf:resource=""/>
        <dct:accessRights rdf:resource=""/>
        <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2022-03-17</dct:issued>
        <dct:accrualPeriodicity>
            <dct:Frequency rdf:about=""/>
        </dct:accrualPeriodicity>
        <dct:language>
            <dct:LinguisticSystem rdf:about=""/>
        </dct:language>
        <dct:publisher>
            <foaf:Agent rdf:about="">
                <foaf:name>Landesamt für Digitalisierung, Breitband und Vermessung</foaf:name>
                <foaf:mbox rdf:resource=""/>
                <foaf:homepage rdf:resource=""/>
            </foaf:Agent>
        </dct:publisher>
        <dcat:contactPoint>
            <vcard:Kind rdf:about="">
                <vcard:hasEmail rdf:resource=""/>
            </vcard:Kind>
        </dcat:contactPoint>
        <dcat:distribution>
            <dcat:Distribution rdf:about="">
                <dct:title xml:lang="en">List of measures</dct:title>
                <dct:description xml:lang="en">List of measures in csv format</dct:description>
                <dcat:accessURL rdf:resource=""/>
                <dcat:downloadURL rdf:resource=""/>
                <dct:type rdf:resource=""/>
                <dct:rights>
                    <dct:RightsStatement rdf:about=""/>
                </dct:rights>
                <dct:license>
                    <dct:LicenseDocument rdf:about=""/>
                </dct:license>
                <dcat:mediaType>
                    <dct:MediaType rdf:about=""/>
                </dcat:mediaType>
                <dct:format>
                    <dct:MediaTypeOrExtent rdf:about=""/>
                </dct:format>
            </dcat:Distribution>
        </dcat:distribution>
        <dcat:distribution>
            <dcat:Distribution rdf:about="">
                <dct:title xml:lang="en">List of measures</dct:title>
                <dct:description xml:lang="en">List of measures in html format with actionable links</dct:description>
                <dcat:accessURL rdf:resource=""/>
                <dcat:downloadURL rdf:resource=""/>
                <dct:type rdf:resource=""/>
                <dct:rights rdf:resource=""/>
                <dct:license rdf:resource=""/>
                <adms:status rdf:resource=""/>
                <dcat:mediaType>
                    <dct:MediaType rdf:about=""/>
                </dcat:mediaType>
                <dct:format>
                    <dct:MediaTypeOrExtent rdf:about=""/>
                </dct:format>
            </dcat:Distribution>
        </dcat:distribution>
    </dcat:Dataset>
</rdf:RDF>

The categories are based on the EU controlled data theme vocabulary. The following are the categories used on data.europa.eu.

AGRI Agriculture, fisheries, forestry and food
ECON Economy and finance
EDUC Education, culture and sport
ENER Energy
ENVI Environment
GOVE Government and public sector
HEAL Health
INTR International issues
JUST Justice, legal system and public safety
REGI Regions and cities
SOCI Population and society
TECH Science and technology
TRAN Transport

When providing data, publishers should always use these terms to thematically categorise the datasets. If a different vocabulary is used, it should be aligned (i.e. mapped) to these categories.
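Such an alignment can be as simple as a lookup table. In the sketch below, the EU theme codes come from the list above, while the supplier's local category names are hypothetical:

```python
from typing import Optional

# EU controlled data theme codes (from the list above).
EU_THEMES = {"AGRI", "ECON", "EDUC", "ENER", "ENVI", "GOVE", "HEAL",
             "INTR", "JUST", "REGI", "SOCI", "TECH", "TRAN"}

# Hypothetical alignment of a supplier's local vocabulary to EU themes.
LOCAL_TO_EU = {
    "farming": "AGRI",
    "budget": "ECON",
    "public-transport": "TRAN",
}

def map_theme(local_category: str) -> Optional[str]:
    # Returns the aligned EU theme code, or None when no mapping exists.
    theme = LOCAL_TO_EU.get(local_category.strip().lower())
    return theme if theme in EU_THEMES else None

print(map_theme("Budget"))   # ECON
print(map_theme("weather"))  # None
```

Unmapped categories (the None case) should be resolved with the supplier before harvesting, so that no dataset ends up without a theme.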


The harvester currently supports harvesting from an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) compliant source, from a dump file containing the RDF/XML representation of the datasets, or directly from a SPARQL endpoint. If datasets are provided as a dump file, it is recommended to split the file into pages, for example by using the Hydra core vocabulary 3.

For OAI-PMH-compliant sources, only the verb 'ListRecords' is used.


As indicated above, the response must be DCAT-AP-compliant to be understood by the harvesting component.
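A ListRecords exchange can be sketched as below. The endpoint URL and the metadata prefix are placeholders that would be agreed upon during registration; the resumptionToken mechanics follow the OAI-PMH specification, under which follow-up requests carry only the verb and the token:

```python
from typing import Optional
from urllib.parse import urlencode

ENDPOINT = "https://example.org/oai"  # placeholder endpoint

def list_records_url(resumption_token: Optional[str] = None) -> str:
    # The first request names the metadata prefix; subsequent requests
    # may only carry the verb and the resumptionToken (per OAI-PMH).
    if resumption_token is None:
        params = {"verb": "ListRecords", "metadataPrefix": "dcat_ap"}
    else:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    return f"{ENDPOINT}?{urlencode(params)}"

print(list_records_url())
print(list_records_url("page-2-token"))
```

The harvester keeps requesting pages until a response arrives without a resumptionToken, which signals the end of the record list.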

Error handling

The OAI-PMH protocol provides methods for error handling that the harvester can understand. When using this protocol, these error methods should be used.

Service information for integration

As stated above, a categorisation mapping should be provided. Apart from that, the URL for the OAI-PMH endpoint or the dump file is needed.


The open-source data portal platform CKAN 4 is widely used for building open data platforms. Its RPC-style 5 API (action API) is supported as an interface for data suppliers of data.europa.eu. Basically, the following options for using that interface are available.

  • The data supplier uses CKAN for providing its open data metadata. It is important that the CKAN version used supports the action API 6. The legacy APIs of CKAN are not supported.

  • The data supplier offers a CKAN-compliant API, where the necessary endpoints reproduce the exact API behaviour.

Requests and responses

Only the 'package_search' API endpoint is needed in order to harvest the metadata. Its specifications are described in detail in the official documentation 7. This endpoint is used to get the metadata in a paginated way. Therefore it accepts query parameters in a request and returns a dictionary with datasets as a result. The high-level use of this endpoint has to be offered as specified in the CKAN documentation.

Example call: a GET request to the 'package_search' endpoint of the catalogue's action API.
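A paginated harvest against the action API could look like this sketch. The base URL is a placeholder; the '/api/3/action/package_search' path and the 'start'/'rows' parameters come from the CKAN documentation:

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org"  # placeholder catalogue URL

def package_search_url(start: int = 0, rows: int = 100) -> str:
    # 'start' is the offset of the first dataset, 'rows' the page size.
    query = urlencode({"start": start, "rows": rows})
    return f"{BASE_URL}/api/3/action/package_search?{query}"

# Two consecutive pages of a harvest run:
print(package_search_url(0, 100))
print(package_search_url(100, 100))
```

The 'count' field of the response tells the harvester when the offset has passed the last dataset.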

Metadata model

Although the CKAN API can be used as is, the basic CKAN data schema was extended and modified to meet the requirements of the underlying data structure (DCAT-AP) of data.europa.eu. The response of the 'package_search' action exposes a 'results' field, which holds a list of dictised datasets. The data structure of such a dataset differs from that of a plain CKAN installation.


  • Bold fields are CKAN standard; further information can be found in the official documentation.

  • Type specifications are according to the official JSON standard.

  • Besides the mandatory fields, the field names and types are not strict, but a data supplier has to make sure an obvious mapping is possible.

  • For a detailed explanation of each field, refer to the DCAT-AP specifications.

Dataset schema

The following fields are mandatory.

Field Type DCAT-AP dataset equivalent
title string dct:title
notes string dct:description

The following fields are optional but highly recommended.

Field Type DCAT-AP dataset equivalent
contact_point array of objects (allowed members: type, name, email, resource) dcat:contactPoint
tags array of objects dcat:keyword
publisher object dct:publisher
groups array of objects – the name of each group needs to fit the official categorisation dcat:theme
resources array of objects (see distribution schema ) dcat:distribution

The following fields are optional.

Field Type DCAT-AP dataset equivalent
conforms_to array of objects (allowed members: label, resource) dct:conformsTo
creator object dct:creator
accrual_periodicity object dct:accrualPeriodicity
identifier object dct:identifier
url string dcat:landingPage
language array of objects (allowed members: label, resource) dct:language
other_identifier object adms:identifier
issued string dct:issued
dcat_spatial array of objects (allowed members: label, resource) dct:spatial
temporal array of objects (allowed members: start_date, end_date) dct:temporal
modified string dct:modified
version_info string owl:versionInfo
version_notes string adms:versionNotes
provenance array of objects (allowed members: label, resource) dct:provenance
source array of strings dct:source
access_rights object dct:accessRights
has_version array of strings dct:hasVersion
is_version_of array of strings dct:isVersionOf
relation array of strings dct:relation
page array of strings foaf:page
sample array of strings adms:sample
dct_type string dct:type
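Putting the tables together, a minimal harvestable dataset dictionary might look like this sketch. The values are illustrative, and the members of the 'publisher' object are an assumption, since the schema above does not fix them:

```python
dataset = {
    # Mandatory fields (dct:title, dct:description)
    "title": "Example Dataset",
    "notes": "A short description of the dataset.",
    # Recommended fields
    "tags": [{"display_name": "example"}],          # dcat:keyword
    "groups": [{"name": "ECON"}],                   # dcat:theme, official category
    "publisher": {"name": "Example Organization"},  # dct:publisher (members assumed)
}

MANDATORY = ("title", "notes")
print(all(field in dataset for field in MANDATORY))  # True
```

Everything beyond 'title' and 'notes' is optional, but the more of the recommended fields a supplier fills, the better the resulting DCAT-AP record scores in the quality assessment.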

Distribution schema

The following fields are mandatory.

Field Type DCAT-AP distribution equivalent
url string dcat:accessURL

The following fields are optional but highly recommended.

Field Type DCAT-AP distribution equivalent
description string dct:description
format string dct:format
license object dct:license

Note that the list of licences recognised by data.europa.eu's DCAT-AP parser is available online. This list is also used by our metadata quality assessment (MQA) tool 8 for assessing the data providers' performance in using known licences.

The following fields are optional.

Field Type DCAT-AP distribution equivalent
checksum object spdx:checksum
mimetype string dcat:mediaType
download_url array of strings dcat:downloadURL
issued string dct:issued
status object adms:status
name string dct:title
modified string dct:modified
rights object dct:rights
page array of strings foaf:page
size number dcat:byteSize
language array of objects dct:language
conforms_to array of objects dct:conformsTo

A result of the 'package_search' action looks like this.

{
   "success": true,
   "result": {
      "count": 1,
      "sort": "score desc, metadata_modified desc",
      "results": [
         {
            "title": "Example Dataset",
            "tags": [
               { "display_name": "Example Tag" }
            ],
            "groups": [
               {
                  "display_name": "Economy and finance",
                  "title": "Economy and finance",
                  "resource": ""
               }
            ],
            "publisher": {
               "description": "Example Organization",
               "title": "Example Organization"
            }
         }
      ]
   }
}

The following fields of datasets and distributions will be translated into 24 languages if not provided:

  • title;

  • description.

CSW/INSPIRE catalogue services (for geospatial metadata)

General remarks

This interface represents an INSPIRE-compliant catalogue (discovery) service 9. It is defined as a slightly extended version of the OGC CSW AP ISO 10.

The GetCapabilities operation (mandatory for all OGC services) is not needed for running the harvesting, but it can be helpful when registering the catalogue service with the EU Data Portal, as the service's response provides additional information that must otherwise be found out during registration (e.g. the supported protocol bindings or support for the 'modified' queryable for selective harvesting).

For the harvesting process, only the GetRecords operation will be called; GetRecordById is not needed.

Operation Description Used by the harvester
GetCapabilities Retrieval of catalogue service metadata Only for gathering service information upon registration
GetRecords Retrieval of a batch of metadata items Yes
GetRecordById Retrieval of single metadata items No

Table of OGC CSW operations used by data.europa.eu

Metadata model

The metadata model considered is as defined in the INSPIRE Technical Guidance on Discovery Services 11 and on Metadata 12.

Within a GetRecords query (constraint) just the following metadata model elements (queryables) are used (see table).

Queryable Definition a Used values XPath b
Type Specifies the desired information resources. Always the following fixed values are used: 'dataset', 'datasetcollection', 'series' and 'service' /gmd:MD_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeListValue
Modified The metadata date stamp, used for selective harvesting (if supported); see below. Date /gmd:MD_Metadata/gmd:dateStamp/gco:Date
a: 'Definition' represents the semantic meaning of the element in data.europa.eu; it is slightly different from the generic meaning in OGC CSW. b: The element's XML path in the GetRecords request.

Table of GetRecords queryables (not parameters – see below)

Example query (constraint), sketched here as a filter selecting records of type 'dataset':

<Constraint version="1.1.0">
    <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">
        <ogc:PropertyIsEqualTo>
            <ogc:PropertyName>Type</ogc:PropertyName>
            <ogc:Literal>dataset</ogc:Literal>
        </ogc:PropertyIsEqualTo>
    </ogc:Filter>
</Constraint>

As defined in the INSPIRE Technical Guidance on Discovery Services 13, the operation must be able to return ISO 19139 metadata aligned with the INSPIRE Technical Guidance on Metadata 14.


The mandatory GetRecords operation works as the primary means of metadata item discovery with HTTP protocol binding. It executes an inventory search and returns the metadata items. Only OGC Filter XML encoding is supported. For GetRecords requests, a few additional requirements exist; these are explained below.


One or more of HTTP POST/XML, POST/XML/SOAP1.1 and POST/XML/SOAP1.2 have to be supported as bindings.

Operation parameters

The following parameters (not the queryables) and parameter values are used by data.europa.eu in the GetRecords requests.

Request parameter Definition a Used values XPath b
service Tells that this is a CSW service. Always fixed value: 'CSW' /GetRecords@service
version Tells which version of the CSW service is requested. Always fixed value: '2.0.2' /GetRecords@version
resultType Specifies the type of result. Always fixed value: 'results' /GetRecords@resultType
outputFormat Specifies the output format of the GetRecords returned document. Always fixed value: 'application/xml' /GetRecords@outputFormat
outputSchema Specifies the schema of the GetRecords returned document. Always fixed value (namespace): '' /GetRecords@outputSchema
startPosition Specifies the sequence number of the first returned record. Integer between 1 and the number of matched records; default value is 1 /GetRecords@startPosition
maxRecords Specifies the number of returned records. Positive integer between 1 and N; default value is 50 /GetRecords@maxRecords
typeNames Specifies the query and ElementSetName type. Always fixed value: 'gmd:MD_Metadata' ('gmd' is a valid namespace prefix for '') /GetRecords/Query@typeNames and /GetRecords/Query/ElementSetName@typeName
ElementSetName Specifies the type of the GetRecords returned document. As only full metadata sets will be requested by the harvester, this parameter is always set to 'full'. /GetRecords/Query/ElementSetName
a: 'Definition' represents the semantic meaning of the element in data.europa.eu; it is slightly different from the generic meaning in OGC CSW. b: The element's XML path in the GetRecords request.

Table of GetRecords request parameters


For partitioning (pagination) the following parameters are used (see table on GetRecords):

  • startPosition;

  • maxRecords.
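The fixed parameter values and the two pagination parameters can be combined into a request body like this sketch. The function name is illustrative; the namespace URIs are the standard CSW 2.0.2 and ISO 19139 ones, and the outputSchema value is omitted here, as it is in the table above:

```python
def get_records_body(start_position: int, max_records: int = 50) -> str:
    # POST body using the fixed values from the parameter table;
    # only startPosition/maxRecords vary between pages.
    return (
        f'<GetRecords service="CSW" version="2.0.2" resultType="results" '
        f'outputFormat="application/xml" '
        f'startPosition="{start_position}" maxRecords="{max_records}" '
        f'xmlns="http://www.opengis.net/cat/csw/2.0.2">'
        f'<Query typeNames="gmd:MD_Metadata" '
        f'xmlns:gmd="http://www.isotc211.org/2005/gmd">'
        f'<ElementSetName typeName="gmd:MD_Metadata">full</ElementSetName>'
        f"</Query></GetRecords>"
    )

print(get_records_body(1))   # first page
print(get_records_body(51))  # second page of 50
```

Each page is requested by advancing startPosition by maxRecords until the matched-record count is reached.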

Selective harvesting

Selective harvesting allows harvesters to limit harvest requests to just those portions of the metadata available from a repository which have been changed within a specified time frame.

Selective harvesting often makes sense: since only a few metadata records change within a day, only those few records need to be harvested daily.

For selective harvesting, the predefined queryable (usually 'modified'; see table of GetRecords queryables) is used.
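The selective-harvesting constraint can be sketched as below. The 'MM-DD-YYYY' date format and the '>=' operator follow the service-information table in this section; the queryable name 'Modified' is taken from the queryables table, and the function name is illustrative:

```python
from datetime import date

def modified_constraint(since: date) -> str:
    # Restrict GetRecords to items whose date stamp is on or after
    # 'since'; the date is formatted as MM-DD-YYYY.
    stamp = since.strftime("%m-%d-%Y")
    return (
        '<Constraint version="1.1.0">'
        '<ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">'
        "<ogc:PropertyIsGreaterThanOrEqualTo>"
        "<ogc:PropertyName>Modified</ogc:PropertyName>"
        f"<ogc:Literal>{stamp}</ogc:Literal>"
        "</ogc:PropertyIsGreaterThanOrEqualTo>"
        "</ogc:Filter></Constraint>"
    )

print(modified_constraint(date(2024, 1, 15)))
```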


As defined in the INSPIRE Technical Guidance on Discovery Services 15, the operation must be able to return ISO 19139 metadata aligned with the INSPIRE Technical Guidance on Metadata 16.


For partitioning (pagination) as part of the search response, it is mandatory that the total count of matching metadata items is returned, even if the metadata for these items is not contained in the search response. This count, coupled with the ability to specify the startPosition and the number of desired records (maxRecords), allows results paging to be implemented, reducing the load on both the system and the data partners.

Error handling

Useful status and error messages help manage client sessions effectively. Any limitations on submitted search requests to the inventory systems should be noted in the response (e.g. 'too many records requested', 'search timed out') so that predictable error handling can be managed by data.europa.eu.

Service information for integration

To be able to integrate an INSPIRE discovery service/CSW, the following information needs to be provided by the data supplier.

Service information Definition Obligation (M=mandatory, O=optional, C=conditional) Datatype
GetRecords URL URL of the CSW GetRecords operation M URL
GetRecords Binding Protocol binding of the CSW GetRecords operation M Codelist (one of): 'POST/XML', 'POST/XML/SOAP1.1', 'POST/XML/SOAP1.2'
Modified a Name of the queryable (if supported) for filtering on the metadata date stamp (for selective harvesting) Possibly for future use String: [Namespace':']QueryableName
MaxRecords Specifies the maximum number of returned records Possibly for future use (currently always set to '50') Integer
a: The value in the CSW filter will be formatted as 'MM-DD-YYYY'. The operators '>=' and '<=' will be used.

Table of Service information needed for integration

Operational requirements

Harvesting frequency

Due to the high volume of metadata that will be harvested from a growing list of data suppliers and the required runtime of the harvesting processes, each data supplier site will probably not be harvested on a daily basis. The harvesting processes have to be clustered and scheduled on a fixed time schedule (e.g. during the night) in order to avoid any load impact on the harvested sites during regular business hours.

Quality of the harvested datasets

Data source site API / endpoints

The REST API of the data source site should accept queries with, for example, startPos/maxRecs parameters for resumption/partitioning of the datasets to be harvested.
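Resumption/partitioning with startPos/maxRecs-style parameters boils down to iterating fixed-size windows over the total record count. A sketch with 1-based positions, as in CSW:

```python
def pages(total: int, page_size: int = 50):
    # Yield (start_pos, max_recs) pairs covering 'total' records,
    # 1-based as in the CSW startPosition parameter.
    start = 1
    while start <= total:
        yield start, min(page_size, total - start + 1)
        start += page_size

print(list(pages(120, 50)))  # [(1, 50), (51, 50), (101, 20)]
```

The last window is truncated to the remaining record count, so the source site is never asked for positions beyond the end of the result set.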

Avoiding duplicates

Duplicate datasets should be avoided by the source site.

Error reporting on harvested metadata

The MQA module provides a graphical report on the quality of the harvested datasets' metadata by providing access to a dashboard that summarises the main quality indicators, for example, availability and accessibility of distributions, compliance of datasets to metadata formats, and source of violations.

The MQA dashboard can be opened directly from the portal homepage.

User feedback on datasets

Users will be able to provide feedback on a dataset directly from the dataset detail page.

The system will make it possible to gather and extract all feedback received for all datasets and group those by data supplier, so that the feedback can be sent to the data supplier.


The goal of this checklist is to gather and summarise all the main requirements for successfully harvesting a data supplier site and to assure a certain quality level of the harvested datasets.

Requirement Value Comment
1 Make sure that your portal provides metadata Only metadata can be harvested, not the data itself
2 Which metadata standard is supported? DCAT-AP/CKAN/ISO19139(Inspire)
3 Which representation of the metadata is used? XML/JSON or any RDF representation
4 Which API is used to retrieve the data? CKAN/OAI-PMH/RDF dump file/SPARQL endpoint/CSW
5 Is authentication required to access your API? yes/no
6 Include complete vocabulary for categorisation, or other fields that use a defined vocabulary (for example update frequency) With translation, if applicable
7 Use standard date/time formats ISO8601
8 How often can/should the site be harvested? daily/weekly/monthly/etc.
9 What shall be the title of the catalogue?
10 What shall be the description of the catalogue?
11 Who is the publisher of the catalogue (name and email address)?
12 Which end point would you like us to harvest?





  5. Remote procedure call 



  8. See the metadata quality assessment (MQA) tool.

  9. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011.

  10. OGC Catalogue Services Specification 2.0.2 – ISO metadata application profile: corrigendum, No 1.0.1, Open Geospatial Consortium, 7 March 2018, OGC 07-045r1.

  11. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011.

  12. Technical guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, INSPIRE Maintenance and Implementation Group, 1 August 2022.

  13. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011.

  14. Technical guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, INSPIRE Maintenance and Implementation Group, 1 August 2022.

  15. Technical guidance for the implementation of INSPIRE discovery services, Initial Operating Capability Task Force for Network Services, 7 November 2011.

  16. Technical guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007, INSPIRE Maintenance and Implementation Group, 1 August 2022.