
Data quality

Our team is aware that demand for high-quality data is still growing, particularly for data that is publicly available and can easily be reused for different purposes. Poor data quality is a major barrier to reuse. Some data cannot be interpreted because of ill-defined or inaccurate elements, such as missing values, mismatches, missing data types, a lack of documentation about the structure, or the format in which the data is made available (HTML, GIF or PDF). Users find poor-quality data harder to understand and may use it less often. The data provider may even appear less reliable as a result.

For those reasons, our team is involved in several data quality initiatives. One of them was the publication of the Data.europa.eu Data Quality Guidelines, a set of recommendations for delivering high-quality data. The recommendations are addressed to data providers, to support them in preparing their data, developing their data strategies and ensuring data quality.

The document is composed of the following four parts.

  1. Recommendations for providing high-quality data. The recommendations cover general aspects of quality issues regarding the findability, accessibility, interoperability and reusability of data (including specific recommendations for common file formats like CSV, JSON, RDF and XML).
  2. Recommendations for data standardisation (with EU controlled vocabularies) and data enrichment.
  3. Recommendations for documenting data.
  4. Recommendations for improving the ‘openness level’.

In the following subsections you will find tips and quick-reference material for providing high-quality data, standardisation and data enrichment, documenting data and improving the ‘openness level’.

Recommendations for providing high-quality data

Data needs to be carefully prepared before publication. Preparation is an iterative and agile process used to explore, combine, clean and transform raw data into curated, high-quality datasets. The process consists of six phases, illustrated in the image below.

Data preparation process

Source: Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095

General tip: make use of tooling and create a data management plan.

Best practices for providing high-quality data (findability, accessibility, interoperability, reusability)

Best practices

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095

Best practices for providing high-quality data – general recommendations

Practice General recommendations
Findability (red)
  • Describe your data with metadata to improve data discovery
  • Mark null values explicitly as such
Accessibility (green)
  • Publish data without restrictions
  • Provide an accessible download URL
Interoperability (blue)
  • Formatting of date and time
  • Formatting of decimal numbers and numbers in the thousands
  • Make use of standardised character encoding
  • Use uniform resource identifiers (URIs) to identify entities
Reusability (yellow)
  • Provide an appropriate amount of data
  • Consider community standards
  • Remove duplicates from your data
  • Increase the accuracy of your data
  • Provide information on byte size

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
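
Several of the general recommendations above can be applied in one small cleaning step. The following is a minimal sketch, not a definitive implementation. It assumes ISO 8601 dates and a dot as the decimal separator with no thousands separator as the target formats, and it assumes the raw values use day/month/year dates and a comma as the decimal separator. The field names (`id`, `date`, `amount`) and all function names are hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def to_iso_date(text, source_format="%d/%m/%Y"):
    """Convert a date string in a known source format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, source_format).date().isoformat()


def to_plain_decimal(text):
    """Turn '1.234,56' (dot for thousands, comma for decimals) into '1234.56'.

    Assumption: every dot in the input is a thousands separator.
    """
    return str(Decimal(text.replace(".", "").replace(",", ".")))


def clean_rows(rows):
    """Normalise each row, then drop exact duplicates while keeping the order."""
    seen = set()
    cleaned = []
    for row in rows:
        record = (
            row["id"],
            to_iso_date(row["date"]),
            # An empty cell becomes an explicit null (None), not a guess such as 0.
            to_plain_decimal(row["amount"]) if row["amount"].strip() else None,
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


raw = [
    {"id": "a1", "date": "03/02/2022", "amount": "1.234,56"},
    {"id": "a1", "date": "03/02/2022", "amount": "1.234,56"},  # exact duplicate
    {"id": "a2", "date": "04/02/2022", "amount": ""},          # missing value
]

for record in clean_rows(raw):
    print(record)
```

When the cleaned records are written to disk, opening the file with an explicit encoding such as `open(path, "w", encoding="utf-8")` covers the character-encoding recommendation.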

Format-specific recommendations

CSV

  • Use a semicolon as a delimiter
  • Use one file per table
  • Avoid white space and additional information in the file
  • Insert column headers
  • Ensure that all rows have the same number of columns
  • Indicate units in an easily processable way

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
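
As an illustration of the CSV recommendations above, the sketch below writes one table with Python's standard `csv` module, using a semicolon delimiter, a header row and a check that every row has exactly the header's columns. Putting the unit in the column name (`temperature_celsius`) is only one possible way to make units easy to process. The column names, the sample values and the `write_table` helper are hypothetical.

```python
import csv
import io

FIELDNAMES = ["station", "date", "temperature_celsius"]


def write_table(rows, fieldnames):
    """Serialise one table as semicolon-delimited CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Reject rows with missing or extra columns so every line has the same width.
        if set(row) != set(fieldnames):
            raise ValueError(f"row does not match header: {row}")
        writer.writerow(row)
    return buffer.getvalue()


rows = [
    {"station": "BE-001", "date": "2022-02-03", "temperature_celsius": "4.5"},
    {"station": "BE-002", "date": "2022-02-03", "temperature_celsius": "5.1"},
]

csv_text = write_table(rows, FIELDNAMES)
print(csv_text)
```

The output contains only the header and the data rows, with no title lines, blank lines or footnotes, which keeps the file free of additional information.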

XML

  • Provide an XML declaration
  • Escape special characters
  • Use meaningful names for identifiers
  • Use attributes and elements correctly
  • Remove program-specific data

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
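
The XML recommendations above can be sketched with Python's standard `xml.etree.ElementTree` module, which escapes special characters on serialisation and can emit the XML declaration. The element and attribute names are hypothetical. Using attributes for identifiers and units and elements for the data values is an assumption about what "correctly" means here, not a rule taken from the guidelines.

```python
import xml.etree.ElementTree as ET


def build_measurements_xml(records):
    """Build a small XML document with a declaration and meaningful names."""
    root = ET.Element("measurements")
    for record in records:
        # Attributes carry metadata about the element (identifier, unit).
        item = ET.SubElement(root, "measurement", attrib={"id": record["id"]})
        # Elements carry the data itself.
        ET.SubElement(item, "stationName").text = record["station_name"]
        value = ET.SubElement(item, "temperature", attrib={"unit": "celsius"})
        value.text = str(record["temperature"])
    data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return data.decode("utf-8")


records = [
    {"id": "m1", "station_name": "Smith & Sons <north>", "temperature": 4.5},
]

xml_text = build_measurements_xml(records)
print(xml_text)
```

Building the document through a library instead of string concatenation means `&`, `<` and `>` in the station name are escaped automatically.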

RDF

  • Use HTTP URIs to denote resources
  • Use namespaces when possible
  • Use existing vocabularies when possible

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
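
To make the RDF recommendations above concrete, the sketch below produces a short Turtle description of a dataset using only string formatting. It reuses terms from two existing vocabularies (DCAT and Dublin Core Terms), declares their namespaces as prefixes and rejects identifiers that are not HTTP URIs. The dataset URI and title are hypothetical, and a real project would more likely use a dedicated RDF library than hand-written serialisation.

```python
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def dataset_to_turtle(dataset_uri, title, language="en"):
    """Describe a dataset in Turtle with an HTTP URI and existing vocabularies."""
    if not dataset_uri.startswith(("http://", "https://")):
        raise ValueError("resources should be denoted by HTTP URIs")
    # Escape backslashes and double quotes for a Turtle string literal.
    # Assumption: the title contains no line breaks.
    escaped_title = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    lines.append("")
    lines.append(f"<{dataset_uri}> a dcat:Dataset ;")
    lines.append(f'    dct:title "{escaped_title}"@{language} .')
    return "\n".join(lines)


turtle = dataset_to_turtle(
    "https://example.org/dataset/air-quality-2022",  # hypothetical URI
    'Air quality "2022"',
)
print(turtle)
```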

JSON

  • Use suitable data types
  • Use hierarchies for grouping data
  • Only use arrays when required

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
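
The JSON recommendations above can be illustrated by converting a flat record in which every value is a string into a structured one. In this sketch, numbers become JSON numbers, the flag becomes a boolean, the empty comment becomes `null`, related fields are grouped in a nested object, and single values are not wrapped in arrays. The field names and the `to_structured` helper are hypothetical.

```python
import json

flat_record = {
    "station_id": "BE-001",
    "station_lat": "50.85",
    "station_lon": "4.35",
    "temperature": "4.5",
    "validated": "true",
    "comment": "",
}


def to_structured(record):
    """Group related fields and convert string values to suitable JSON types."""
    return {
        # Station details belong together, so they form one nested object.
        "station": {
            "id": record["station_id"],
            "location": {
                "lat": float(record["station_lat"]),
                "lon": float(record["station_lon"]),
            },
        },
        "temperature": float(record["temperature"]),   # number, not string
        "validated": record["validated"] == "true",    # boolean, not string
        "comment": record["comment"] or None,          # explicit null when empty
    }


structured = to_structured(flat_record)
print(json.dumps(structured, indent=2))
```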

APIs

  • Use correct status codes
  • Set correct headers
  • Use paging for large amounts of data
  • Document the API

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
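
The API recommendations above on status codes, headers and paging can be sketched without any web framework as a plain function that returns a status code, response headers and a JSON body. The parameter names (`page`, `page_size`), the shape of the response body and the choice of 400 and 404 for invalid and out-of-range pages are assumptions made for this illustration, not requirements from the guidelines.

```python
import json


def list_items(items, page=1, page_size=2):
    """Return (status_code, headers, body) for one page of a collection."""
    headers = {"Content-Type": "application/json; charset=utf-8"}
    if page < 1 or page_size < 1:
        # Client error: the request parameters are invalid.
        error = {"error": "page and page_size must be positive integers"}
        return 400, headers, json.dumps(error)
    # Ceiling division; an empty collection still has one (empty) page.
    total_pages = max(1, -(-len(items) // page_size))
    if page > total_pages:
        # The requested page does not exist.
        return 404, headers, json.dumps({"error": "page out of range"})
    start = (page - 1) * page_size
    body = {
        "page": page,
        "page_size": page_size,
        "total_items": len(items),
        "total_pages": total_pages,
        "items": items[start:start + page_size],
    }
    return 200, headers, json.dumps(body)


status, headers, body = list_items(["a", "b", "c", "d", "e"], page=3)
print(status, headers["Content-Type"])
print(body)
```

Returning the total number of pages alongside each page lets a client fetch a large collection in predictable steps instead of in a single large response.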

Recommendations for standardisation and data enrichment

  • Reuse unambiguous concepts from controlled vocabularies (Findability)
  • Harmonise labels (using unique identifiers) (Accessibility)
  • Dereference the translation of a label (Reusability)
  • Link and augment your data (Interoperability)

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095

Recommendations for documenting data

  • Publish your documentation (Findability)
  • Use schemas to specify data structure (Accessibility)
  • Document the semantics of data (Reusability)
  • Document data changes (Interoperability)
  • Deprecate old versions (Findability)
  • Link versions of a dataset (Accessibility)

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095

Recommendations for improving the ‘openness level’

  • Use structured data (Findability)
  • Use a non-proprietary format (Accessibility)
  • Use URIs to denote things (Reusability)
  • Use linked data (Interoperability)

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
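
The four recommendations above correspond to the upper steps of the five-star open data model, in which each star requires all the previous ones. The sketch below assumes the commonly cited criteria of that model (open licence, structured data, non-proprietary format, URIs, linked data). The function and parameter names are hypothetical, and the function does not reproduce any editorial adjustments to individual formats.

```python
def openness_stars(open_licence, structured, non_proprietary, uses_uris, linked):
    """Count stars cumulatively: a level only counts if all lower levels are met."""
    criteria = [open_licence, structured, non_proprietary, uses_uris, linked]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A structured, non-proprietary file under an open licence, without URIs.
print(openness_stars(True, True, True, False, False))
# Linked data that identifies things with URIs.
print(openness_stars(True, True, True, True, True))
```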

File formats and their achievable openness level:

File formats

* Strictly according to the five-star model, this format would have to be rated with three stars, since the data may well be designed to be machine readable. However, we only give one star because this format was not originally intended to represent machine-readable but human-readable content. Representing machine-readable content in this format does not meet best practice and is therefore not recommended by the authors.

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095

Checklist for publishing high-quality data

Make your data FAIR

Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095