Data quality
Our team is aware that demand for high-quality data keeps growing, particularly for data that is publicly available and easy to reuse for different purposes. Poor data quality is a major barrier to reuse. Some data cannot be interpreted because of ill-defined or inaccurate elements: missing values, mismatches, missing data types, a lack of documentation about the structure, or availability only in formats that are hard to process (HTML, GIF or PDF). Users find poor-quality data harder to understand and may use it less often. The data provider may even appear less reliable as a result.
For those reasons, our team is involved in several data quality initiatives. One of them was the publication of the Data.europa.eu Data Quality Guidelines. This publication contains a set of recommendations for delivering high-quality data. The recommendations are addressed to data providers, to support them in preparing their data, developing their data strategies and ensuring data quality.
The document is composed of the following four parts.
- Recommendations for providing high-quality data. The recommendations cover general aspects of quality issues regarding the findability, accessibility, interoperability and reusability of data (including specific recommendations for common file formats like CSV, JSON, RDF and XML).
- Recommendations for data standardisation (with EU controlled vocabularies) and data enrichment.
- Recommendations for documenting data.
- Recommendations for improving the ‘openness level’.
In the following subsections you will find tips and quick-reference material for providing high-quality data, standardisation and data enrichment, documenting data and improving the ‘openness level’.
Recommendations for providing high-quality data
Data needs to be carefully prepared before publication. Preparation is an iterative and agile process used to explore, combine, clean and transform raw data into curated, high-quality datasets. This process consists of six different phases, illustrated in the image below.
Source: Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
General tip: make use of tooling and create a data management plan.
Best practices for providing high-quality data (findability, accessibility, interoperability, reusability)
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
Best practices for providing high-quality data – general recommendations
Practice | General recommendations |
---|---|
Findability (red) | |
Accessibility (green) | |
Interoperability (blue) | |
Reusability (yellow) | |
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
Format-specific recommendations
CSV
- Use a semicolon as a delimiter
- Use one file per table
- Avoid white space and additional information in the file
- Insert column headers
- Ensure that all rows have the same number of columns
- Indicate units in an easily processable way
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
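The CSV recommendations above can be illustrated with a minimal sketch using Python's standard `csv` module. The column names, the sample values and the helper names `write_csv` and `column_counts` are hypothetical; units are carried in the header names so that cells stay purely numeric.

```python
import csv
import io

# Hypothetical table: one file per table, units stated in the column headers.
HEADER = ["station_id", "date", "temperature_celsius", "rainfall_mm"]
ROWS = [
    ["ST001", "2022-01-01", 3.5, 0.0],
    ["ST002", "2022-01-01", 4.1, 12.4],
]


def write_csv(header, rows):
    """Serialise one table to CSV text with a semicolon delimiter."""
    # Reject rows whose length differs from the header before writing anything.
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}")
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(header)   # column headers come first
    writer.writerows(rows)    # no blank lines or extra notes in the file
    return buffer.getvalue()


def column_counts(text):
    """Return the set of column counts found across all rows of the CSV text."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return {len(row) for row in reader}


csv_text = write_csv(HEADER, ROWS)
print(csv_text)
```

A single value in the set returned by `column_counts` indicates that every row has the same number of columns.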
XML
- Provide an XML declaration
- Escape special characters
- Use meaningful names for identifiers
- Use attributes and elements correctly
- Remove program-specific data
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
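As a rough illustration of the XML recommendations above, the following sketch builds a small document with Python's standard `xml.etree.ElementTree`. The element and attribute names are hypothetical; the point is that the serialiser emits the XML declaration and escapes special characters, while identifiers and units go into attributes and the measured value into element content.

```python
import xml.etree.ElementTree as ET

# Meaningful, hypothetical names: identifiers and units as attributes,
# the actual data as element content.
root = ET.Element("stations")
station = ET.SubElement(root, "station", attrib={"id": "ST001"})
name = ET.SubElement(station, "name")
name.text = "Smith & Sons <North>"  # contains characters that must be escaped
temperature = ET.SubElement(station, "temperature", attrib={"unit": "celsius"})
temperature.text = "3.5"

# xml_declaration=True (Python 3.8+) prepends the <?xml ...?> declaration.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
xml_text = xml_bytes.decode("utf-8")
print(xml_text)

# Parsing the result back recovers the original, unescaped text.
parsed = ET.fromstring(xml_bytes)
round_tripped_name = parsed.find("station/name").text
```

Letting a library handle escaping is usually safer than assembling XML through string concatenation.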
RDF
- Use HTTP URIs to denote resources
- Use namespaces when possible
- Use existing vocabularies when possible
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
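The RDF recommendations above can be sketched with a few lines of plain Python that emit Turtle. This is a deliberately simplified, hand-written serialiser rather than a full RDF library; the dataset URI under `example.org` and the helper names are hypothetical, while the `dcat` and `dct` namespaces are the existing DCAT and Dublin Core Terms vocabularies.

```python
# Namespaces for existing vocabularies, declared once and reused as prefixes.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def escape_literal(value):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_turtle(subject_uri, type_curie, literals):
    """Emit Turtle for one resource with string-valued properties."""
    # Resources should be denoted by HTTP(S) URIs so they can be looked up.
    if not subject_uri.startswith(("http://", "https://")):
        raise ValueError("subject must be an HTTP(S) URI")
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    statements = [f"<{subject_uri}> a {type_curie}"]
    statements += [f'    {prop} "{escape_literal(val)}"' for prop, val in literals]
    lines.append(" ;\n".join(statements) + " .")
    return "\n".join(lines)


turtle = to_turtle(
    "https://example.org/dataset/air-quality-2022",  # hypothetical URI
    "dcat:Dataset",
    [
        ("dct:title", "Air quality 2022"),
        ("dct:description", 'Hourly "PM10" readings'),
    ],
)
print(turtle)
```

In practice a dedicated RDF library would normally handle serialisation; the sketch only shows how HTTP URIs, prefixes and reused vocabularies fit together.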
JSON
- Use suitable data types
- Use hierarchies for grouping data
- Only use arrays when required
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
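To make the JSON recommendations above more concrete, here is a minimal sketch that converts a flat, all-strings record into a typed and grouped structure. The field names and the `restructure` helper are hypothetical.

```python
import json

# Hypothetical input: everything is a string and related fields are
# only grouped by a naming prefix.
flat_record = {
    "station_id": "ST001",
    "active": "true",
    "location_lat": "50.85",
    "location_lon": "4.35",
    "temperature_celsius": "3.5",
}


def restructure(record):
    """Apply suitable data types and group related fields in a nested object."""
    return {
        "station_id": record["station_id"],            # identifiers stay strings
        "active": record["active"] == "true",          # boolean, not "true"
        "location": {                                   # hierarchy groups related data
            "lat": float(record["location_lat"]),
            "lon": float(record["location_lon"]),
        },
        # A single value stays a plain number; an array would only be
        # used for a genuinely repeated series of values.
        "temperature_celsius": float(record["temperature_celsius"]),
    }


structured = restructure(flat_record)
json_text = json.dumps(structured, indent=2)
print(json_text)
```

Consumers can then read numbers and booleans directly instead of parsing strings.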
APIs
- Use correct status codes
- Set correct headers
- Use paging for large amounts of data
- Document the API
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
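The API recommendations above can be sketched as a framework-independent handler that returns a status code, headers and a body. The `/items` path, the `page` and `page_size` parameters and the 25-record collection are assumptions made for illustration; a real API would also be described in published documentation.

```python
import json
from http import HTTPStatus

ITEMS = [{"id": i} for i in range(1, 26)]  # hypothetical collection of 25 records
JSON_HEADERS = {"Content-Type": "application/json; charset=utf-8"}


def list_items(page=1, page_size=10):
    """Return (status, headers, body) for one page of the collection."""
    if page < 1 or page_size < 1:
        body = {"error": "page and page_size must be positive integers"}
        return HTTPStatus.BAD_REQUEST, dict(JSON_HEADERS), json.dumps(body)

    start = (page - 1) * page_size
    chunk = ITEMS[start:start + page_size]
    if not chunk:
        body = {"error": f"page {page} does not exist"}
        return HTTPStatus.NOT_FOUND, dict(JSON_HEADERS), json.dumps(body)

    headers = dict(JSON_HEADERS)
    links = []
    if start + page_size < len(ITEMS):
        links.append(f'</items?page={page + 1}&page_size={page_size}>; rel="next"')
    if page > 1:
        links.append(f'</items?page={page - 1}&page_size={page_size}>; rel="prev"')
    if links:
        # Link header tells clients where the neighbouring pages are.
        headers["Link"] = ", ".join(links)

    body = {"page": page, "page_size": page_size, "total": len(ITEMS), "items": chunk}
    return HTTPStatus.OK, headers, json.dumps(body)


status, headers, body = list_items(page=3)
print(int(status), headers)
```

Returning 400 and 404 for bad or out-of-range requests, rather than 200 with an error message in the body, lets clients react to failures without parsing the payload.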
Recommendations for standardisation and data enrichment
- Reuse unambiguous concepts from controlled vocabularies (Findability)
- Harmonise labels (using unique identifiers) (Accessibility)
- Dereference the translation of a label (Reusability)
- Link and augment your data (Interoperability)
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
Recommendations for documenting data
- Publish your documentation (Findability)
- Use schemas to specify data structure (Accessibility)
- Document the semantics of data (Reusability)
- Document data changes (Interoperability)
- Deprecate old versions (Findability)
- Link versions of a dataset (Accessibility)
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
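As one possible reading of "use schemas to specify data structure", the sketch below describes the expected fields of a record and checks data against that description. The schema layout, the field names and the `validate` helper are hypothetical and far simpler than an established schema language such as JSON Schema or XSD.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {
    "station_id": str,
    "date": str,
    "temperature_celsius": float,
}


def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields that the schema does not describe are undocumented.
    for field in record:
        if field not in schema:
            problems.append(f"undocumented field: {field}")
    return problems


good = {"station_id": "ST001", "date": "2022-01-01", "temperature_celsius": 3.5}
bad = {"station_id": "ST001", "temperature_celsius": "3.5"}
print(validate(good))
print(validate(bad))
```

Publishing the schema alongside the data gives reusers a machine-checkable description of its structure.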
Recommendations for improving the ‘openness level’
- Use structured data (Findability)
- Use a non-proprietary format (Accessibility)
- Use URIs to denote things (Reusability)
- Use linked data (Interoperability)
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
File formats and their achievable openness level:
* Strictly according to the five-star model, this format would have to be rated with three stars, since the data may well be designed to be machine readable. However, we give it only one star because the format was originally intended to represent human-readable rather than machine-readable content. Representing machine-readable content in this format does not meet best practice and is therefore not recommended by the authors.
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095
Checklist for publishing high-quality data
Source: based on Publications Office, Data.europa.eu Data Quality Guidelines, 2022, https://data.europa.eu/doi/10.2830/333095