SemStats

Document ID: http://semstats.org/

Published: 2015-04-28

Modified: 2019-03-13

License: CC BY 4.0

Workshop motivation

There is growing interest in linked data and the semantic web within the statistical community. A large amount of statistical data from international and national agencies has already been published on the web of data, for example Census data from Ireland, Italy and France, among others. In most cases, though, this publication is done by actors outside the statistical offices (see in particular http://270a.info/, http://eurostat.linkedstatistics.org/ or http://linkedstatistics.gr/), which raises issues such as long-term URI persistence, institutional commitment and data maintenance.

Statistical organizations also possess an important corpus of structural metadata such as concept schemes, thesauri, code lists and classifications. Some of these are already available as linked data, generally in SKOS format (e.g. FAO’s Agrovoc or UN’s COFOG). Semantic web standards useful to statisticians have now reached maturity. The best examples are the W3C Data Cube, DCAT and ADMS vocabularies. The statistical community is also working on the definition of more specialized vocabularies, especially under the umbrella of the DDI Alliance. For example, XKOS extends SKOS for the representation of statistical classifications, and Disco defines a vocabulary for data documentation and discovery. The Visual Analytics Vocabulary is a first step towards semantic descriptions of user interface components developed to visualize Linked Statistical Data, which can lead to increased linked data consumption and accessibility. We are now at the tipping point where the statistical and the Semantic Web communities need a formal exchange in order to share experiences and tools and to think ahead about the upcoming challenges.

The web of data will benefit from rich data published by professional and trustworthy data providers. It is also important that metadata maintained by statistical offices, such as concept schemes of economic or societal terms, statistical classifications and well-known codes, are available as linked data: they are of good quality and well maintained, and they constitute a corpus to which a lot of other data can refer. Statisticians have a long-standing culture of data integrity, quality and documentation. They have developed industrialized data production and publication processes, and they care about data confidentiality and, more generally, about how data can be used.

Challenges tackled by the workshop

It seems that, after a period in which the aim was to publish as many triples as possible, the focus of the Semantic Web community is now shifting towards better quality of data and metadata, more coherent vocabularies (see the Linked Open Vocabularies (LOV) initiative), good and documented naming patterns, etc. This workshop aims to contribute to these longer-term problems in order to have a significant impact.

The statistics community sometimes faces challenges when trying to adopt Semantic Web technologies, in particular:

Difficulty in creating and publishing linked data: this can be alleviated by providing methods, tools, lessons learned and best practices, by publicizing successful examples and by providing support.
Difficulty in seeing the purpose of publishing linked data: we must develop end-user tools that leverage statistical linked data and provide convincing examples of real use in applications or mashups, so that the end-user value of statistical linked data and metadata appears more clearly.
Difficulty in using external linked data in their daily activity: it is important to develop statistical methods and tools specifically tailored for linked data, so that statisticians can become accustomed to using them and be convinced of their specific utility.

Finally, statisticians know how misleading it can be to exploit semantic connections without carefully considering and weighing information about the quality of these connections, the validity of inferences, etc. A challenge for them is to determine and ensure the quality of semantic connections, which may support analysis in some circumstances but not in others, and to inform consumers about it. The workshop will enable participants to discuss these important issues.