ADMS (asset description metadata schema)
A vocabulary for describing interoperability assets, making it possible for ICT developers to explore and search for them. ADMS allows public administrations, businesses, standardisation bodies and academia to:
describe semantic assets in a common way so that they can be seamlessly cross-queried and discovered by ICT developers from a single access point;
search, identify, retrieve and compare semantic assets, avoiding duplication and expensive design work through a single point of access;
keep their own system for documenting and storing semantic assets;
improve indexing and visibility of their own assets;
link semantic assets to one another in cross-border and cross-sector settings.
API (application programming interface)
A way for computer programs to talk to one another. It can be understood in terms of how a programmer sends instructions between programs.
CKAN (comprehensive knowledge archive network)
A data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organisations) working to make their data open and available.
Controlled vocabulary
Organised arrangements of words and phrases used to index and/or retrieve content. A collection of controlled vocabularies is, for example, part of the multilingual metadata registry (http://publications.europa.eu/mdr/index.html). Also known as authority tables, they group concepts such as languages, currencies, interinstitutional procedures and many others.
CORDIS (Community Research and Development Information Service)
The European Commission's primary public repository and portal to disseminate information on all EU-funded research projects and their results.
Crawler
A crawler is a program that visits websites and reads their pages and other information in order to create entries for a search engine index, among other purposes. All major web search engines have such a program, which is also known as a 'spider' or a 'bot'.
When extracting data from the web, 'crawling' is often used interchangeably with 'data scraping' or 'harvesting'. There is a difference between these terms: crawling refers to dealing with datasets by developing one's own crawlers (or bots), which crawl to the deepest parts of web pages; data scraping, on the other hand, refers to retrieving information from any source (not necessarily from the web).
CSV (comma-separated values)
'Comma-separated values' is a file format often used to exchange data between dissimilar applications. The CSV file format is usable by the KSpread, OpenOffice Calc and Microsoft Excel spreadsheet applications. Many other applications support CSV to import or export data.
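As a minimal illustration, Python's standard csv module can write and read this format; the field names and values below are invented for the example.

```python
import csv
import io

# Write a header and one record to an in-memory CSV "file"
# (the example data are invented).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["country", "year", "population"])
writer.writerow(["Belgium", "2020", "11522440"])

# Read the same CSV text back into rows of strings.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])  # ['Belgium', '2020', '11522440']
```

Because CSV carries no type information, every parsed value comes back as a string; converting years or populations to numbers is left to the receiving application.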
Data citation
The practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to outputs such as journal articles, reports and conference papers.
Dataset
A collection of related sets of data that is composed of separate elements but can be processed as a whole and accessed or downloaded in one or more formats.
DCAT (Data Catalogue Vocabulary)
An RDF vocabulary for interoperability of data catalogues.
See also: http://www.w3.org/TR/vocab-dcat
DCAT-AP (DCAT application profile)
A common vocabulary for describing datasets hosted in data portals in Europe, based on DCAT, a W3C standard.
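To make the idea concrete, here is a hedged sketch of a DCAT-style dataset description expressed as JSON. The property names (dct:title, dcat:keyword, dcat:distribution, dcat:downloadURL) come from the Dublin Core and DCAT vocabularies, but the dataset itself and its URL are invented for the example.

```python
import json

# A minimal, illustrative DCAT-style dataset description.
# 'dct:' prefixes Dublin Core terms, 'dcat:' the Data Catalogue Vocabulary;
# the dataset and the download URL are invented.
dataset = {
    "@type": "dcat:Dataset",
    "dct:title": "Example population statistics",
    "dct:description": "An invented dataset used to illustrate DCAT-AP.",
    "dcat:keyword": ["population", "statistics"],
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "http://example.org/population.csv",
        "dct:format": "CSV",
    },
}

print(json.dumps(dataset, indent=2))
```

A data portal that harvests such descriptions from many catalogues can cross-query them precisely because they all use the same property names.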
Data dump
A large amount of data transferred from one system or location to another.
DCMI (Dublin Core Metadata Initiative)
An open organisation supporting innovation in metadata design and best practices across the metadata ecosystem.
ELI (European legislation identifier)
ELI makes it possible to uniquely identify and access national and European legislation online, guaranteeing easier access, exchange and use of legislation for public authorities, professional users, academics and citizens. ELI paves the way for a semantic web of legal gazettes and official journals.
FOAF (friend of a friend)
A machine-readable descriptive vocabulary of persons, their activities and their relations to other people and objects. FOAF allows groups of people to describe social networks without the need for a centralised database.
GeoDCAT-AP is an extension of DCAT-AP for describing geospatial datasets, dataset series and services. It provides an RDF syntax binding for the union of metadata elements defined in the core profile of ISO 19115:2003 and those defined in the framework of the Inspire directive. Its basic use case is to make spatial datasets, data series and services searchable on general data portals, thereby making geospatial information better searchable across borders and sectors. This can be achieved by the exchange of descriptions of datasets among data portals.
The GeoDCAT-AP specification does not replace the Inspire metadata regulation or the Inspire metadata technical guidelines based on ISO 19115 and ISO 19119. Its purpose is to give owners of geospatial metadata the possibility to achieve more by providing an additional RDF syntax binding.
More information: https://joinup.ec.europa.eu/release/geodcat-ap-v10
IMMC (Interinstitutional Metadata Maintenance Committee)
The committee that maintains the minimum set of metadata elements, the so-called IMMC core metadata, to be used in interinstitutional data exchange.
Interoperability
The ability of systems to easily exchange information and use the exchanged information.
Legal notice
An important metadata element is the legal notice for your data. Use of content catalogued in the EU Open Data Portal is permitted free of charge for commercial or non-commercial purposes. According to its copyright notice, 'Reuse is authorised provided the source is acknowledged', unless otherwise stated. This follows the principles of the reuse policy implemented through Directive 2013/37/EU and Decision 2011/833/EU. Reuse must fully respect privacy legislation, and the policy does not apply to data subject to the intellectual property rights of third parties. In limited cases, reuse can be subject to conditions (Article 2(2) of Decision 2011/833/EU).
Linked data
Linked data describes a method of publishing structured data so that they can be interlinked. It builds upon standard web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers.
Linked data is one of the core pillars of the 'semantic web', also known as the 'web of data'. The semantic web is about making links between datasets that are understandable not only to humans, but also to machines, and linked data provides the best practices for making these links possible. In other words, linked data is a set of design principles for sharing machine-readable interlinked data on the web.
Linked data principles
Linked data principles provide a common API for data on the web that is more convenient than many separately and differently designed APIs published by individual data suppliers. Tim Berners-Lee, the inventor of the web and the initiator of the linked data project, proposed the following principles upon which linked data is based:
use URIs to name things;
use HTTP URIs so that things can be referred to and looked up (dereferenced) by people and user agents;
when someone looks up a URI, provide useful information using open web standards such as RDF or SPARQL;
include links to other related things using their URIs when publishing on the web.
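The four principles above can be sketched in a few lines of Python: things are named with HTTP URIs and linked to one another as subject-predicate-object statements. The example.org URIs are invented; the two predicates are taken from the FOAF vocabulary.

```python
# Principles 1-2: name things with HTTP URIs so they can be looked up.
BASE = "http://example.org/id/"

def uri(name):
    """Mint an HTTP URI for a thing (the example.org base is invented)."""
    return BASE + name

# Principle 4: link related things using their URIs.
# The predicates foaf:knows and foaf:name come from the FOAF vocabulary.
triples = [
    (uri("alice"), "http://xmlns.com/foaf/0.1/knows", uri("bob")),
    (uri("alice"), "http://xmlns.com/foaf/0.1/name", "Alice"),
]

# Principle 3: when someone looks up a URI, provide useful information --
# here, every statement in which that URI appears as the subject.
def describe(subject):
    return [t for t in triples if t[0] == subject]

print(describe(uri("alice")))
```

In a real deployment the descriptions would be served as RDF when the URI is dereferenced, rather than held in a Python list.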
Machine-readable data
Machine-readable data are data in a format that can be interpreted by a computer program. There are two types of machine-readable data:
human-readable data that are marked up so that they can also be understood by computers, for example, microformats and RDFa;
data formats intended principally for computers, for example, RDF, XML and JSON.
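As an illustration of the second type, a record in JSON can be interpreted directly by a program, with field names and value types preserved; the record below is invented for the example.

```python
import json

# The same record a human might read in a table, expressed in a format
# intended principally for computers (the values are invented).
text = '{"country": "Belgium", "capital": "Brussels", "population": 11522440}'

record = json.loads(text)        # a program can interpret it directly
print(record["capital"])         # Brussels
print(record["population"] + 1)  # values keep their types (an int here)
```

A scanned PDF of the same table would carry identical information for a human reader but none of this structure for a program, which is exactly the distinction the definition draws.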
Mashup
The combination of multiple datasets from multiple sources to create a new service, visualisation or information.
Metadata
Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource. Metadata is often referred to as 'data about data'.
Metadata is important for many reasons, most specifically to:
enable high ranking of search results;
enable refinement of a search;
help organise electronic datasets;
provide digital identification;
support archiving and preservation;
facilitate interoperability, that is, the ability of systems to exchange information and use the exchanged information.
Why is the quality of metadata important?
Metadata is the first indicator for a qualitative assessment of a dataset, as it provides information about the content and the quality of the data. In short, metadata enables users to discover the data and understand the structure of the data, the terms under which it can be reused and its origin.
Metadata registry
The metadata registry is an important interoperability and standardisation tool. It registers and maintains definition data (metadata elements, named authority lists, schemas, etc.) used by the different European institutions.
Data mining
The practice of examining large pre-existing databases to generate added information.
Multilingual thesaurus
A standard vocabulary (e.g. EuroVoc) which can easily be translated into other languages. For international interoperability it is useful to use multilingual thesauri.
Ontology
A formal model that allows knowledge to be represented for a specific domain. An ontology describes the types of things that exist (classes), the relationships between them (properties) and the logical ways those classes and properties can be used together (axioms).
Open government data
Data collected, produced or paid for by public bodies and made freely available for use for any purpose.
Open standards
Generally understood as technical standards that are free from licensing restrictions. They can also be interpreted to mean standards that are developed in a vendor-neutral manner.
Parsing
Breaking a data block into smaller chunks by following a set of rules so that it can be more easily interpreted, managed or transmitted by a computer.
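A toy parser makes the definition concrete: the record format below (one record per line, fields separated by semicolons) and its contents are invented for the example.

```python
# A toy parser: break a block of text into records and fields by
# following a set of rules (the format here is invented).
raw = "Belgium;Brussels\nFrance;Paris"

def parse(block):
    records = []
    for line in block.splitlines():          # rule 1: one record per line
        country, capital = line.split(";")   # rule 2: fields split on ';'
        records.append({"country": country, "capital": capital})
    return records

print(parse(raw))
```

Real parsers for formats such as CSV, JSON or XML follow the same idea with far more elaborate rules (quoting, escaping, nesting), which is why it is best to use an existing library rather than splitting strings by hand.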
PDF (Portable Document Format)
A file format used to present and exchange documents independently of software, hardware or operating systems. It is an open standard maintained by the International Organization for Standardization.
PSI (public sector information)
It is the wide range of information that public sector bodies collect, produce, reproduce and disseminate in many areas of activity while accomplishing their institutional tasks. It can be made available under a variety of (not always open) licences.
Raw data
An expression that refers to data in their original state, not having been processed, aggregated or otherwise manipulated. Raw data are also known as 'primary data'.
RDF (Resource Description Framework)
A family of international standards for data interchange on the web. RDF is based on the idea of identifying things using web identifiers or HTTP URIs and describing resources in terms of simple properties and property values.
RDFa (resource description framework in attributes)
A W3C recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within web documents.
Resource
The physical representation of a dataset. Each resource can be a file of any kind, a link to a file elsewhere on the web or a link to an API. For example, if the data are supplied in multiple formats or split into different areas or time periods, each file is a different 'resource' and should be described individually.
Semantic web
An evolution or part of the web that consists of machine-readable data in RDF and the ability to query that information in standard ways (e.g. via SPARQL).
Scraping
The process of extracting data in machine-readable formats from non-pure data sources, for example webpages or PDF documents. The term is often prefixed with the source (e.g. web scraping, PDF scraping).
SDMX (statistical data and metadata exchange)
An international initiative that aims at standardising and modernising the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries.
Shapes Constraint Language (SHACL) is a W3C specification for validating graph-based data against a set of conditions. Among others, SHACL includes features to express conditions that constrain the number of values that a property may have, the type of such values, numeric ranges, string matching patterns and logical combinations of such constraints. SHACL also includes an extension mechanism to express more complex conditions in languages such as SPARQL.
SPARQL protocol and RDF query language (SPARQL) defines a query language for RDF data, analogous to the Structured Query Language (SQL) for relational databases.
SPARQL endpoint
A service that accepts SPARQL queries and returns answers as SPARQL result sets. It is best practice for dataset providers to give the URL of their SPARQL endpoint to allow access to their data programmatically or through a web interface.
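Programmatic access typically means sending the query to the endpoint over HTTP, conventionally in a `query` parameter. The sketch below only builds such a request URL without sending it; the endpoint address is invented.

```python
from urllib.parse import urlencode

# A standard SPARQL query asking for up to ten triples.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# SPARQL endpoints conventionally accept the query as a 'query'
# parameter of an HTTP GET request; the endpoint URL is invented.
endpoint = "http://example.org/sparql"
request_url = endpoint + "?" + urlencode({"query": query})

print(request_url)
```

Sending this request to a real endpoint would return the result set in a format such as SPARQL Results JSON or XML, which the client then parses.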
StatDCAT-AP (StatDCAT application profile)
StatDCAT-AP aims at providing a commonly agreed dissemination vocabulary for statistical open data. It defines a number of additions to the DCAT-AP model that can be used to describe datasets in any format, for example those published in SDMX, a standard for the exchange of statistical data.
The principal objective of the development of the StatDCAT-AP, which is funded under the ISA^2^ action of the European Commission on 'Promoting semantic interoperability amongst the European Union Member States (SEMIC)', is to facilitate a better integration of the existing statistical data portals within open data portals, thus improving the discoverability of statistical datasets across domains, sectors and borders. This will be beneficial for the general data portals, enabling enhanced services for the discovery of statistical data.
Structured data
Data that reside in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data. Although data in XML files are not fixed in location like traditional database records, they are nevertheless structured, because the data are tagged and can be accurately identified.
A triplestore is a purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject-predicate-object, like 'Bob is 35' or 'Bob knows Fred'. Much like a relational database, information is stored in a triplestore and retrieved via a query language. Unlike a relational database, a triplestore is optimised for the storage and retrieval of triples. In addition to queries, triples can usually be imported/exported using RDF and other formats.
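A toy in-memory version shows the idea, using the 'Bob' triples from the definition: a query is simply a pattern in which a wildcard matches anything. A real triplestore adds indexing, persistence and a SPARQL engine on top of this.

```python
# A toy triplestore: triples are (subject, predicate, object) tuples,
# including the examples from the definition above.
triples = {
    ("Bob", "age", "35"),
    ("Bob", "knows", "Fred"),
    ("Fred", "age", "40"),
}

def match(pattern):
    """Return all triples matching a pattern; None matches anything."""
    s, p, o = pattern
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

print(match(("Bob", None, None)))   # everything known about Bob
print(match((None, "age", None)))   # all 'age' statements
```

The pattern-with-wildcards query is the same shape as a basic SPARQL triple pattern, which is why triplestores can answer such queries efficiently.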
URI (uniform resource identifier)
A string that uniquely identifies virtually anything, including a physical building or more abstract concepts such as colours. It may or may not be resolvable on the web.
URL (uniform resource locator)
A global identifier commonly called a 'web address'. A URL is resolvable on the web. All HTTP URLs are URIs; however, not all URIs are URLs.
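The distinction can be seen by taking identifiers apart with Python's standard urllib.parse module; the example.org address is invented, and the ISBN URN is used only to illustrate a non-resolvable URI.

```python
from urllib.parse import urlparse

# A URL is a resolvable URI: its scheme and network location tell a
# client where and how to fetch it (the address below is invented).
url = "https://example.org/dataset/population?format=csv"
parts = urlparse(url)

print(parts.scheme)  # https
print(parts.netloc)  # example.org
print(parts.path)    # /dataset/population
print(parts.query)   # format=csv

# A URN-style URI identifies a thing without saying how to fetch it.
urn = "urn:isbn:0451450523"
print(urlparse(urn).scheme)  # urn
```

Both strings are valid URIs, but only the first carries the 'where and how to fetch it' information that makes it a URL.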
Vocabulary
A collection of terms for a particular purpose. Vocabularies can range from simple, such as the widely used RDF schema, FOAF and the DCMI element set, to complex vocabularies with thousands of terms, such as those used in healthcare to describe symptoms, diseases and treatments. Vocabularies play an especially significant role in linked data, specifically to help with data integration. The use of this term often overlaps with that of 'ontology'; see Section 6.29.
XML (Extensible Markup Language)
It is a markup language that defines a set of rules for encoding documents in a format which is both human readable and machine readable. See the standard here: https://www.w3.org/XML/
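Python's standard xml.etree.ElementTree module illustrates both halves of the definition: the document below is legible to a human, while the tags let a program extract values reliably. The element names and contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# A small XML document (element names and contents are invented):
# the tags make it machine readable, the text stays human readable.
text = """
<catalogue>
  <dataset id="d1"><title>Population</title></dataset>
  <dataset id="d2"><title>Employment</title></dataset>
</catalogue>
"""

root = ET.fromstring(text)
titles = [d.find("title").text for d in root.findall("dataset")]
print(titles)  # ['Population', 'Employment']
```

Because every value sits inside a named element, a program can find each title without guessing at line positions, which is exactly what the encoding rules of XML guarantee.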