SemStats

Statistical classifications

Author
Franck Cotton
Document ID
http://semstats.org/2016/challenge/classifications
License
CC BY 4.0

Introduction

The SemStats 2016 challenge data sets include data about the central economic classifications published by the UNSD and Eurostat, as well as some national classifications that are articulated with the central classifications. The international economic classifications form a very coherent system, which is described briefly, for example, on this page.

For the challenge data, a subset of this central system is made available as RDF, supplemented with national classifications. Previous versions of the central classifications are also included in order to provide use cases related to the evolution of the classification system over time.

More precisely:

  • The classification of economic activities at UN level is the ISIC. The last two versions of ISIC, revisions 3.1 and 4, are included, as well as the historical correspondence table between them.
  • At the European statistical system (ESS) level, a refinement of the ISIC, called NACE, is used. Here again, the last two versions are included in the challenge data: NACE revision 1.1 and NACE revision 2. The historical correspondence between these two versions is also included, as well as the hierarchical correspondences between NACE revision 2 and ISIC revision 4.
  • The national activity classifications of the following countries are also provided in their latest versions: France and Italy (NAF and Ateco, with their hierarchical correspondences to NACE) and the US (NAICS, with its correspondence with ISIC).
  • Regarding products, the UN classification is the CPC (Central Products Classification). The last two versions of the CPC (versions 2 and 2.1) are articulated with ISIC Rev. 4. The previous version (1.1) is articulated with ISIC Rev. 3.1. All these classifications and tables are included in the data sets.
  • The ESS classification for products is the CPA. Its articulation with NACE is similar to what exists between CPC and ISIC: the last two versions of the CPA (versions 2008 and 2.1) are articulated with NACE Rev. 2, and the previous version (2002) is articulated with NACE Rev. 1.1. Only versions 2008 and 2.1 of the CPA are included, with their historical correspondence and their correspondences with NACE Rev. 2.

The following picture gives an overview of the challenge data: the yellow dots represent the classifications and the yellow lines the correspondence tables.

Challenge data overview

The next sections give more details on the sources and the production processes used to create the RDF files.

UN-level classifications: ISIC and CPC

Introduction

The latest versions of ISIC and CPC (ISIC Rev.4 and CPC Ver.2.1 at the time of writing), as well as the correspondences between items of those versions, are included in the data. Also, to provide use cases related to the evolution of classifications over time, it was decided to include the previous versions of both classifications, together with the associated historical correspondences. Because of the correspondence structure between ISIC and CPC, this in fact means including the last three versions of the CPC, which align with the last two versions of the ISIC.

The authoritative source for the UNSD classifications is the UNSD classification registry, and in particular the download page at http://unstats.un.org/unsd/cr/registry/regdnld.asp. The information is available in several formats and languages. Only PDF, MS Access and online HTML have the explanatory notes, and only Access and HTML have the latest corrections. Additional sources provided by the UNSD include, amongst others, French and Spanish labels and correspondence tables (which are only available in English).

Extraction tools

MS Access is a proprietary format, but there are free and open source tools to read it, so it seems logical to use the Access files as the main data sources. Jackcess, which is licensed under the Apache License, Version 2.0, is used to read the databases. The additional labels and correspondence tables are given in CSV files and can be read easily with Apache Commons CSV. Note that the CSV files containing additional languages are encoded using the ANSI (Cp1252) character set; this is expected by the program.
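The encoding point can be illustrated with a minimal Python sketch. It is not the project's Java code. The column names `Code` and `Description` and the sample row are assumptions made up for the illustration; the actual UNSD files may use a different header.

```python
import csv
import io


def read_labels(stream, code_column="Code", label_column="Description"):
    """Return a {code: label} dict from a CSV text stream.

    The column names are hypothetical; adjust them to the real header.
    Codes are kept as strings so that leading zeros are preserved.
    """
    reader = csv.DictReader(stream)
    return {row[code_column]: row[label_column] for row in reader}


# Simulate a Cp1252-encoded file in memory (invented sample row).
# For a real file, use open(path, encoding="cp1252", newline="") instead.
raw = 'Code,Description\r\n"0111","Culture de céréales"\r\n'.encode("cp1252")
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="cp1252", newline="")
labels = read_labels(stream)
print(labels)
```

Decoding the same bytes as UTF-8 would fail on the accented characters, which is why the character set has to be stated explicitly when the files are opened.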

Details on the sources

For ISIC Rev.4, the following sources were used:

The corresponding files for ISIC Rev.3.1 are:

French labels for ISIC Rev.3.1 do not seem to be available.

For CPC, the following sources were used:

Regarding correspondence tables (all files are zipped CSV):

The following notes are copied from the readme files available in the archives:

  • The correspondence between CPC versions 2 and 2.1 does not yet include divisions 61 and 62 of the CPC.
  • The correspondence between ISIC Rev.4 and CPC Ver.2.1 does not yet include divisions 45, 46 and 47 of ISIC.
  • The correspondence between ISIC Rev.4 and CPC Ver.2 does not yet include divisions 45-47 of ISIC. Note also that waste products of the CPC (i.e. those in division 39) are not linked to an ISIC industry.
  • Regarding the correspondence between ISIC Rev.3.1 and CPC Ver.1.1, please note that certain products in the CPC (e.g. waste products in CPC division 39) are not linked to specific industries.

All files should be put in the main/resources/data folder and unzipped before running the programs.

Details on the outputs

The following Turtle files are produced by the programs:

  • isicr31.ttl, isicr4.ttl, cpcv11.ttl, cpcv2.ttl and cpcv21.ttl correspond to the classification schemes for ISIC Rev.3.1, ISIC Rev.4, CPC Ver.1.1, CPC Ver.2 and CPC Ver.2.1 respectively.
  • isicr31-isicr4.ttl, isicr31-cpcv11.ttl, isicr4-cpcv2.ttl, isicr4-cpcv21.ttl, cpcv11-cpcv2.ttl and cpcv2-cpcv21.ttl contain the correspondence tables between ISIC Rev.3.1 and ISIC Rev.4, ISIC Rev.3.1 and CPC Ver.1.1, ISIC Rev.4 and CPC Ver.2, ISIC Rev.4 and CPC Ver.2.1, CPC Ver.1.1 and CPC Ver.2, and CPC Ver.2 and CPC Ver.2.1 respectively.

Eurostat classifications: NACE and CPA

Introduction

The NACE and CPA are the central classifications of economic activities and products in the European Statistical System. They are consequently included in the project scope. More precisely, the RDF data contain the last two versions of both classifications (NACE Rev. 1.1 and Rev. 2, and CPA Ver. 2008 and Ver. 2.1), as well as the historical correspondences between the two revisions of NACE and between the two versions of CPA, and the correspondences between NACE Rev. 2 and the two versions of CPA. The correspondence between ISIC Rev.4 and NACE Rev. 2 is also included.

The authoritative source for the Eurostat classifications is RAMON. The information is generally available in HTML, CSV and XML. The latter seems to be preferable for the main files giving the structure, labels and notes, whereas CSV can be used for simpler files like correspondence tables.

Extraction tools

A simple way of processing the XML files is to use XSL transformations to produce the XML representation of the target RDF data. Apache Commons CSV can be used to process the CSV files.

Unfortunately, XML files produced by RAMON have a small defect: they start with a blank line, which makes them invalid. To avoid manual manipulation, a simple Java program was written which deletes the blank line, executes the XSL transformation and finally converts the RDF/XML result into a Turtle file. This last step produces the same format as for the other sources, and is also useful for validating the outputs of the XSL transformations.
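The cleanup step can be sketched as follows. This is a minimal Python illustration, not the Java program mentioned above: it covers only the removal of the leading blank line, followed by a parse to check that the result is well-formed. The XSL transformation and the Turtle conversion are left out. The element names `Claset` and `Item` in the sample are placeholders, not RAMON's actual schema.

```python
import xml.etree.ElementTree as ET


def parse_ramon_xml(raw: bytes) -> ET.Element:
    """Strip leading whitespace (including a blank first line), then parse.

    The XML declaration must be the very first thing in the document,
    so the parser rejects the file when a blank line precedes it.
    """
    return ET.fromstring(raw.lstrip())


# Invented sample reproducing the defect: a newline before the declaration.
sample = (
    b'\n<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<Claset><Item id="A"/></Claset>'
)

try:
    ET.fromstring(sample)
    rejected = False
except ET.ParseError:
    rejected = True

root = parse_ramon_xml(sample)
print(rejected, root.tag)
```

Working on bytes rather than decoded text lets the parser honour the encoding named in the XML declaration.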

Details on the sources

RAMON produces the downloadable files on demand and adds a timestamp to the file name (for example CPA_2_1_20160314_114049.xml). This is inconvenient for automation, where file names should be deterministic. To work around this difficulty, the loading programs select their inputs based on a file name filter (CPA_2_1_*.xml for the previous example). If several files match a filter, the most recent one is used.
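A possible selection rule is sketched below in Python. It assumes that "most recent" refers to the timestamp embedded in the file name and that this timestamp always has the fixed-width form YYYYMMDD_HHMMSS, so that the lexicographically greatest matching name is also the latest. Whether the original programs rely on the name or on the file modification time is not stated here. Apart from the CPA example above, the file names are invented.

```python
import fnmatch


def latest_match(file_names, pattern):
    """Return the matching name with the greatest timestamp, or None.

    Relies on the fixed-width timestamp in the name: with that format,
    string order and chronological order coincide.
    """
    matches = fnmatch.filter(file_names, pattern)
    return max(matches, default=None)


names = [
    "CPA_2_1_20160201_093000.xml",    # invented older download
    "CPA_2_1_20160314_114049.xml",    # example from the text
    "NACE_REV2_20160314_113000.xml",  # invented, must not match
]
print(latest_match(names, "CPA_2_1_*.xml"))
```

In practice the list of names would come from a directory listing such as `os.listdir` on the data folder.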

Using this file naming convention, the following sources were used:

  • NACE Rev. 2 in XML can be downloaded from the NACE Rev. 2 page on RAMON.
  • NACE Rev. 1.1 in XML can be downloaded from the NACE Rev. 1.1 page on RAMON.
  • CPA Ver. 2.1 in XML can be downloaded from the CPA Ver. 2.1 page on RAMON.
  • CPA Ver. 2008 in XML can be downloaded from the CPA Ver. 2008 page on RAMON.
  • The correspondence between NACE Rev. 1.1 and NACE Rev. 2 is available from this page.
  • The correspondence between CPA Ver. 2008 and CPA Ver. 2.1 is available from this page.
  • The correspondence between NACE Rev. 2 and CPA Ver. 2.1 is not explicitly available from RAMON, but it can easily be produced since both classifications are completely aligned down to the class level. The same goes for the correspondence between NACE Rev. 2 and CPA Ver. 2008.
  • The correspondence between ISIC Rev.4 and NACE Rev. 2 can be downloaded from this page on RAMON, and is also accessible on the UNSD web site as file ISIC4-NACE2.zip. The UNSD version gives more information, for example indicators of partial coverage for links.
  • The correspondence between CPA Ver. 2.1 and CPC Ver.2.1 does not seem to be available either from RAMON or from the UNSD web site. It could be a use case of the project to generate this table.
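Because NACE and CPA are aligned down to the class level, the NACE to CPA correspondence can be derived from the CPA codes alone. The Python sketch below shows one way to do it, assuming that CPA codes are written in the dotted form "dd.dd.dd" and that their first four digits are the code of the NACE class they refine. The sample codes only illustrate the format, and the function names are hypothetical.

```python
def nace_class_for_cpa(cpa_code: str) -> str:
    """Return the NACE class code ("dd.dd") refined by a CPA code.

    Assumes the first four digits of the CPA code are the NACE class.
    """
    digits = cpa_code.replace(".", "")
    if len(digits) < 4 or not digits.isdigit():
        raise ValueError(f"not a CPA code at or below class level: {cpa_code!r}")
    return f"{digits[:2]}.{digits[2:4]}"


def derive_correspondence(cpa_codes):
    """Return sorted (NACE class, CPA code) pairs for the given CPA codes."""
    return sorted((nace_class_for_cpa(code), code) for code in cpa_codes)


pairs = derive_correspondence(["01.11.12", "01.11.11", "10.71.11"])
for nace_code, cpa_code in pairs:
    print(nace_code, "->", cpa_code)
```

Codes above the class level (sections, divisions, groups) are rejected here; since the two classifications share those levels, they could be mapped one to one instead.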

Details on the outputs

The following Turtle files are produced by the programs:

  • nacer11.ttl, nacer2.ttl, cpav2008.ttl and cpav21.ttl correspond to the classification schemes for NACE Rev. 1.1, NACE Rev. 2, CPA Ver. 2008 and CPA Ver. 2.1 respectively.
  • nacer11-nacer2.ttl, cpav2008-cpav21.ttl, nacer2-cpav2008.ttl and nacer2-cpav21.ttl contain the correspondence tables between NACE Rev. 1.1 and NACE Rev. 2, CPA Ver. 2008 and CPA Ver. 2.1, NACE Rev. 2 and CPA Ver. 2008, and NACE Rev. 2 and CPA Ver. 2.1 respectively.
  • isicr4-nacer2.ttl contains the correspondence table between ISIC Rev.4 and NACE Rev. 2.

National classifications: NAICS, Ateco, NAF and CPF

Introduction

The international classifications of economic activities and products are in general refined or adapted in each country in order to fit the local needs. In other cases, local classifications may have specific structures, but are linked to UNSD classifications by correspondence tables. We include here different examples of national classifications:

  • NAICS 2012
  • Ateco 2007
  • NAF rév. 2
  • CPF rév. 2.1

The authoritative source for a national classification is generally the country's NSI (national statistical institute). For NAICS, we use the publication made by the US Census Bureau; Ateco is found on Istat's web site, and NAF and CPF on Insee's web site. Ateco, NAF and CPF are already published by Istat and Insee, so we take those "as-is" even if the modeling can differ from the one used for the rest of the classifications. All these data are available in Excel.

Extraction tools

Apache POI is used to read the Excel files and Apache Jena to produce the RDF datasets.

Details on the sources

The following sources are used to produce the RDF data:

  • The NAICS 2012 Excel spreadsheet can be downloaded at this URL and the correspondence between ISIC Rev.4 and NAICS 2012 is available here.
  • Ateco 2007, NAF rév. 2 and CPF rév. 2.1 were directly provided by Istat and Insee (files ateco2007.rdf, naf08.rdf and cpf15.rdf).
  • The hierarchical correspondence between NACE Rev. 2 and Ateco 2007 can be calculated from the files contained in this archive, and more specifically from the ateco_struttura_17dicembre_2008.xls Excel file (what is needed is just the list of Ateco codes).
  • The hierarchical correspondence between NACE Rev. 2 and NAF rév. 2 can be calculated from the list of NAF "sous-classes" available in this spreadsheet.
  • Likewise, the hierarchical correspondence between CPA Ver. 2.1 and CPF rév. 2.1 can be calculated from the list of CPF "sous-catégories" available in this spreadsheet.
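Calculating a hierarchical correspondence from a bare list of codes amounts to grouping the national codes under the European code they extend. The Python sketch below assumes that each national code begins with the five-character NACE class code "dd.dd", as in NAF sous-classes such as 01.11Z or Ateco codes such as 01.11.10. If the spreadsheets store codes without dots, the prefix length would have to be adapted. The function name is hypothetical.

```python
from collections import defaultdict


def group_by_nace_class(national_codes, prefix_length=5):
    """Map each NACE class code to the national codes that refine it.

    Assumes every national code starts with its parent NACE class code
    written as "dd.dd" (five characters).
    """
    groups = defaultdict(list)
    for code in national_codes:
        groups[code[:prefix_length]].append(code)
    return dict(groups)


naf_codes = ["01.11Z", "01.12Z", "10.71A", "10.71B"]
hierarchy = group_by_nace_class(naf_codes)
for nace_code, children in hierarchy.items():
    print(nace_code, "->", ", ".join(children))
```

The same function applies to CPA and CPF if the prefix length is set to the length of the CPA code being refined.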

Details on the outputs

The following Turtle files are produced by the programs:

  • naics2012.ttl contains the classification scheme for NAICS 2012.
  • ateco2007.rdf, nafr2.ttl and cpfr21.ttl contain the classification schemes for Ateco 2007, NAF rév. 2 and CPF rév. 2.1. These files result directly from the sources communicated by Istat and Insee through basic operations (change of base URI or of serialization format).
  • isicr4-naics2012.ttl, nacer2-nafr2.ttl and cpav21-cpfr21.ttl contain the correspondence tables between ISIC Rev.4 and NAICS 2012, NACE Rev. 2 and NAF rév. 2, and CPA Ver. 2.1 and CPF rév. 2.1 respectively.

Vocabularies

The main RDF vocabulary used for representing classifications is XKOS, published by the DDI Alliance. XKOS extends SKOS and is aligned with the Neuchâtel model and the GSIM classification model, which are the business models used by statisticians.
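To give an idea of what such data can look like, the Python sketch below prints a small Turtle fragment with one classification item described in SKOS terms and one XKOS association linking two items. It is an illustration only: the http://example.org/ URIs are placeholders, the helper names are hypothetical, and the exact set of properties used in the challenge files is not described in this section.

```python
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> .\n"
)


def turtle_literal(text: str) -> str:
    """Quote a single-line string as a Turtle literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def concept(uri, scheme_uri, code, label, lang="en"):
    """Describe one classification item as a skos:Concept."""
    return (
        f"<{uri}> a skos:Concept ;\n"
        f"    skos:inScheme <{scheme_uri}> ;\n"
        f"    skos:notation {turtle_literal(code)} ;\n"
        f"    skos:prefLabel {turtle_literal(label)}@{lang} .\n"
    )


def association(uri, source_uri, target_uri):
    """Link two items with an xkos:ConceptAssociation."""
    return (
        f"<{uri}> a xkos:ConceptAssociation ;\n"
        f"    xkos:sourceConcept <{source_uri}> ;\n"
        f"    xkos:targetConcept <{target_uri}> .\n"
    )


BASE = "http://example.org/"  # placeholder, not the challenge base URI
doc = "\n".join([
    PREFIXES,
    concept(BASE + "isicr4/A", BASE + "isicr4", "A",
            "Agriculture, forestry and fishing"),
    association(BASE + "isicr4-nacer2/A-A",
                BASE + "isicr4/A", BASE + "nacer2/A"),
])
print(doc)
```

Building Turtle by string concatenation is only suitable for a sketch; an RDF library takes care of escaping and serialization in real programs.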