.Stat Suite documentation

Indexing data

Version history:
‘GET sfs report’ query replaced by ‘GET sfs logs’ query with December 5, 2022 Release .Stat Suite JS
Index an individual dataflow since July 8, 2021 Release .Stat Suite JS 9.0.0
Introduction of the Collection concept with Solr with May 19, 2021 Release .Stat Suite JS 8.0.0
Indexation of externally defined dataflows since November 30, 2020 Release .Stat Suite JS 6.1.0
Rules of indexation enhanced with June 23, 2020 Release .Stat Suite JS 5.1.0
Delete an individual dataflow since February 28, 2020 Release .Stat Suite JS 4.0.0



Before indexing data

Before indexing your data for the first time, if you are using a Docker or on-premise installation strategy, you must create a Solr collection for each tenant.
These collections keep the data of the different tenants separate.

How to create a collection

Creating a collection is straightforward: copy the URL below and replace the name parameter with your tenant name.
For example, the following URL creates a collection that will be used by the oecd tenant.

http://localhost:8983/solr/admin/collections?action=CREATE&name=oecd&numShards=1&collection.configName=_default

Create as many collections as you have tenants. If you have a mono-tenant installation, then create only one collection and name it “default”.

Important note: Always remember that your Collection name must match your tenant name.
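
The collection creation can also be scripted. The sketch below is an indicative example only: it loops over a list of tenant names (replace them with your own) and sends the same CREATE request as the URL above to a local Solr instance, using Python and the requests library.

import requests

# Hypothetical tenant names: replace with your own list.
TENANTS = ["oecd", "default"]

SOLR_COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

for tenant in TENANTS:
    # Same call as the URL above: one collection per tenant,
    # named exactly like the tenant.
    response = requests.get(
        SOLR_COLLECTIONS_API,
        params={
            "action": "CREATE",
            "name": tenant,
            "numShards": 1,
            "collection.configName": "_default",
        },
    )
    response.raise_for_status()
    print(tenant, response.status_code)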


What is indexed

The languages used for the search features are configurable per Data Explorer instance (as part of the installation process), and at least one language must be defined. If the localised elements listed below are not available for a defined language, they are replaced by their corresponding IDs.

The information describing the dataflows is entirely retrieved from one or more SDMX endpoints (data sources) that need to be configured with:

  • ID of the data source
  • Localised names of the data source (one per language defined)
  • URI of the SDMX endpoint (if possible supporting https), e.g. https://www.mywebsite.org/sdmx/rest
  • Queries to get one or more hierarchical CategorySchemes together with the dataflows (that have been categorised in these CategorySchemes) to be indexed, e.g. ["categoryscheme/Agency/AgencyID/latest/?references=dataflow","categoryscheme/Agency/CategorySchemeID/latest/?references=dataflow","categoryscheme/EXT/all/latest/?references=dataflow"]
  • The query to get the structure information, categorisations (in Categories of the previous CategorySchemes) and content constraints (separately) for each dataflow can be derived from the URI by appending the following (see the sketch after this list): dataflow/{agencyID}/{dataflowID}/{versionID}?references=all&detail=referencepartial
  • Note that some SDMX APIs do not yet support the detail=referencepartial option.
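
As an indicative illustration of the derivation described in the list above, the following Python sketch builds the per-dataflow structure query from the SDMX endpoint URI; the endpoint and the dataflow identifiers are placeholders.

# Placeholder endpoint; replace with the URI configured for your data source.
ENDPOINT = "https://www.mywebsite.org/sdmx/rest"

def dataflow_structure_query(agency_id: str, dataflow_id: str, version_id: str) -> str:
    # Appends the dataflow path and the references/detail options to the endpoint URI.
    return (
        f"{ENDPOINT}/dataflow/{agency_id}/{dataflow_id}/{version_id}"
        "?references=all&detail=referencepartial"
    )

print(dataflow_structure_query("OECD", "AIR_EMISSIONS_DF", "1.0"))
# -> https://www.mywebsite.org/sdmx/rest/dataflow/OECD/AIR_EMISSIONS_DF/1.0?references=all&detail=referencepartial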

The following pieces of information are retrieved for each dataflow:

  • Agency, ID, version, localised names and localised descriptions of the Dataflow
  • ID and localised names of the corresponding data source
  • The date/time when the Dataflow data were last updated; currently, this is the date/time when the dataflow (re)index was triggered
  • ID and localised names of the CategorySchemes in which the dataflow is categorised
  • ID and localised names and hierarchy position of the categories in which the dataflow is categorised
  • IDs and localised names of the concepts used as dimensions, as well as the dimension IDs
  • IDs, localised names and hierarchy position of the codes used as dimension values constrained by the Actual Content Constraints defined for the dataflow. Note that this Content Constraint also contains information about the Time Periods (Time dimension values) available for the dataflow. It allows defining a specific “Time Period” range facet.

In SDMX, dataflows are uniquely identified by data source, Agency, ID and Version. However, to avoid user confusion, the search does not distinguish dataflows per Agency or per Version. Thus, if dataflows that are categorised for search indexing share the same ID but have different Agencies or different Versions, only one of them is indexed (the first version retrieved through dataflow/all/ID/latest). In such a case, separate dataflows with different IDs need to be created and categorised.

If the same dataflow (same ID, whatever the Agency or Version) is retrieved from different data sources, then each occurrence is indexed separately and appears in the search results as a different dataflow; they are distinguished by the data source, which is visible when the dataflow information is expanded.

Conditions and exceptions

  • A dataflow is indexed only if there is data associated to it.
    The data availability check is based on the Actual Content Constraint attached to the dataflow. The dataflow is indexed only if there is either:
    • a non-empty Actual Content Constraint, or
    • no Actual Content Constraint at all (for compatibility with SDMX web services not based on .Stat Suite).
  • A particular dimension of a dataflow is indexed only if the number of dimension values with available data does not exceed the limit defined in the SFS configuration parameter DIMENSION_VALUES_LIMIT, which is set to 1000 by default. This protects the search engine from overly large codelists and prevents performance issues. For more information see here.

Indexing externally defined dataflows

It is possible to index externally defined dataflows for browse and search capabilities in .Stat DE, in the case where the dataflow is stored locally only as a stub (without content, e.g. without the link to its DSD), meaning that the full definition and content of the corresponding dataflow are stored externally.
The locally stored dataflow stub therefore includes the reference (URL link) to the original external full dataflow definition, through the following artefact properties: isExternalReference=true and link {external structure link}.
Note also that the locally stored dataflow stub must be categorised in the CategoryScheme that will be used by the index process.

Example of a dataflow stub definition with external reference:

<structure:Dataflows>
  <structure:Dataflow id="DF_SDG_ALL_SDG_A871_SEX_AGE_RT" agencyID="ILO" version="1.0" isExternalReference="true" isFinal="true" structureURL="https://ilo.org/sdmx-test/rest/dataflow/ILO/DF_SDG_ALL_SDG_A871_SEX_AGE_RT/1.0">
    <common:Name xml:lang="en">SDG indicator 8.7.1 - Proportion of children engaged in economic activity</common:Name>
    <common:Description xml:lang="en">Estimates on economic activity among children aged 5-17...</common:Description>
  </structure:Dataflow>
</structure:Dataflows>

When and how to index

Currently, indexing is a manual action usually performed by a sysadmin user, who can actively manage the index of the endpoints to update the dataflows that are published in .Stat DE and available for search and visualisation.
The following individual actions are currently enabled for indexing:

  • Index all dataflows for all data sources
  • Index one individual dataflow
  • Update an already indexed dataflow
  • Delete all dataflows
  • Delete one individual dataflow

The examples provided below are made using the free version of the API platform Postman.

API format

The API is protected by an API key named API_KEY. In a Docker Compose installation, this key is defined as an environment variable named API_KEY_SFS; see an example in the Docker Compose file. In the following examples, the API key is xxx.

All requests are composed of the following elements:

  • the api URL, e.g. https://sfs-qa.siscc.org/
  • a role, in all cases /admin/
  • the target of the request, e.g. /dataflows/, /dataflow/ or /config/
  • the api key, in these examples ?api-key=xxx
  • (since May 19, 2021 Release .Stat Suite JS 8.0.0) a tenant, e.g. &tenant=test
  • action-specific variables, e.g. depending on the request, the spaceId and the dataflow &id, &agencyId and &version

Note that, if you are using a default tenant collection, then all calls without a tenant will use the value of DEFAULT_TENANT as the tenant, and the &tenant=test request parameter thus becomes optional.
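
As an indicative illustration of how these elements fit together, the following Python sketch (using the requests library) composes an /admin/ request from the pieces listed above; the instance URL, API key and tenant are placeholders to be replaced by your own values.

import requests

# Placeholders: replace with your own SFS URL, API key and tenant.
SFS_URL = "https://sfs-qa.siscc.org"
API_KEY = "xxx"
TENANT = "test"

def admin_request(method: str, target: str, **variables):
    # Compose the request: API URL + /admin/ + target (dataflows, dataflow or config)
    # + api-key + tenant + action-specific variables.
    params = {"api-key": API_KEY, "tenant": TENANT, **variables}
    return requests.request(method, f"{SFS_URL}/admin/{target}", params=params)

# For example, to index all dataflows for the tenant:
# admin_request("POST", "dataflows")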

Index all dataflows

Example:
POST https://sfs-qa.siscc.org/admin/dataflows?api-key=xxx&tenant=xxx


This request indexes all dataflows from all configured SDMX endpoints. In detail, it:

  • requests all sdmxDataSources
  • adds the dataflows to the Solr core
  • updates the existing dynamic sfs schema depending on added dataflows

Note that this POST method does not clean up the index, meaning that dataflows that were previously indexed but have since been de-categorised or deleted in the data source will not be removed from the index. To do so, first run the DELETE /admin/dataflows method to clean up the index, and then run the POST /admin/dataflows method to re-index.
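
A minimal Python sketch of this clean-then-reindex sequence is given below; the instance URL, API key and tenant values are placeholders.

import requests

SFS_URL = "https://sfs-qa.siscc.org"            # placeholder
PARAMS = {"api-key": "xxx", "tenant": "xxx"}    # placeholders

# 1. Clean up the index so that deleted or de-categorised dataflows disappear.
requests.delete(f"{SFS_URL}/admin/dataflows", params=PARAMS).raise_for_status()

# 2. Rebuild the index for all configured SDMX data sources.
requests.post(f"{SFS_URL}/admin/dataflows", params=PARAMS).raise_for_status()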

Index or update one individual dataflow

Example:
POST https://sfs-qa.siscc.org/admin/dataflow?api-key=xxx&tenant=xxx&spaceId=staging:SIS-CC-reset&id=AIR_EMISSIONS_DF&agencyId=OECD&version=1.0

This request indexes one individual dataflow so that it becomes available in the index and search.
If this dataflow was already indexed by Solr, the request updates it instead.

Note that, with this “upsert” request, the previous single PATCH https://sfs-qa.siscc.org/admin/dataflow?api-key=xxx&tenant=xxx&datasourceId=staging:SIS-CC-stable&dataflowId=DF_SDG_ALL_SDG_A871_SEX_AGE_RT&agencyId=ILO&version=1.0 query for updating an individual dataflow already indexed is no longer supported.
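
The same upsert call, written as an indicative Python sketch with the requests library; the dataflow identifiers are taken from the example above, and the instance URL, API key and tenant are placeholders.

import requests

response = requests.post(
    "https://sfs-qa.siscc.org/admin/dataflow",   # placeholder instance URL
    params={
        "api-key": "xxx",                        # placeholder API key
        "tenant": "xxx",                         # placeholder tenant
        "spaceId": "staging:SIS-CC-reset",
        "id": "AIR_EMISSIONS_DF",
        "agencyId": "OECD",
        "version": "1.0",
    },
)
response.raise_for_status()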


Delete all dataflows

Example:
DELETE https://sfs-qa.siscc.org/admin/dataflows?api-key=xxx&tenant=xxx


This request deletes all dataflows from the Solr index and search for all configured sdmxDataSources.

Delete one individual dataflow

Example:
DELETE https://sfs-qa.siscc.org/admin/dataflow?api-key=xxx&tenant=xxx&spaceId=staging:SIS-CC-reset&id=AIR_EMISSIONS_DF&agencyId=OECD&version=1.0


This request deletes one specific dataflow from the index and search. The dataflow is thus no longer available in .Stat DE for search and visualisation.
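
As an indicative sketch, the same deletion expressed with Python and the requests library; the dataflow identifiers are taken from the example above, and the instance URL, API key and tenant are placeholders.

import requests

response = requests.delete(
    "https://sfs-qa.siscc.org/admin/dataflow",   # placeholder instance URL
    params={
        "api-key": "xxx",                        # placeholder API key
        "tenant": "xxx",                         # placeholder tenant
        "spaceId": "staging:SIS-CC-reset",
        "id": "AIR_EMISSIONS_DF",
        "agencyId": "OECD",
        "version": "1.0",
    },
)
response.raise_for_status()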


Admin queries

GET search sfs logs

Example:
GET https://sfs-qa.siscc.org/admin/logs?api-key=xxx

GET all search sfs logs

By default, this request returns all the search index management logs for all datasources and dataspaces. A number of filter parameters allow restricting the returned logs. See the full API documentation for details about those query parameters.

GET https://sfs-qa.siscc.org/admin/logs?api-key=xxx&id=1652792463242

GET search sfs logs by log id

This request returns the log for a specific search index management action defined by its ID.
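
An indicative Python sketch covering both log queries; the instance URL and API key are placeholders, and the log id is taken from the example above.

import requests

SFS_URL = "https://sfs-qa.siscc.org"   # placeholder
KEY = {"api-key": "xxx"}               # placeholder

# All search index management logs (optionally narrowed with filter parameters).
all_logs = requests.get(f"{SFS_URL}/admin/logs", params=KEY)
all_logs.raise_for_status()

# A single log entry, identified by its id.
one_log = requests.get(f"{SFS_URL}/admin/logs", params={**KEY, "id": "1652792463242"})
one_log.raise_for_status()
print(one_log.text)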

GET search sfs config

Example:
GET https://sfs-qa.siscc.org/admin/config?api-key=xxx&tenant=xxx


This request returns the sfs dynamic configuration with full details on configUrl, data source(s), fields and indexed dataflows.

DELETE search sfs config

Example:
DELETE https://sfs-qa.siscc.org/admin/config?api-key=xxx&tenant=xxx
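
An indicative Python sketch for reading and deleting the sfs configuration; the instance URL, API key and tenant are placeholders, and since the effect of the DELETE call is not detailed above, check the full API documentation before using it.

import requests

SFS_URL = "https://sfs-qa.siscc.org"            # placeholder
PARAMS = {"api-key": "xxx", "tenant": "xxx"}    # placeholders

# Read the current dynamic configuration (configUrl, data sources,
# fields and indexed dataflows).
config = requests.get(f"{SFS_URL}/admin/config", params=PARAMS)
config.raise_for_status()
print(config.text)

# Remove the stored dynamic configuration for the tenant (see the API
# documentation for the exact behaviour).
requests.delete(f"{SFS_URL}/admin/config", params=PARAMS).raise_for_status()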