Indexing data
Version history:
NOT_INDEXED annotation for dimensions and dimension values introduced with April 4, 2024 Release .Stat Suite JS zoo
‘GET sfs report’ query is replaced by ‘Get sfs logs’ query with December 5, 2022 Release .Stat Suite JS spin
Index an individual dataflow since July 8, 2021 Release .Stat Suite JS 9.0.0
Introduction of the Collection concept with Solr with May 19, 2021 Release .Stat Suite JS 8.0.0
Indexation of externally defined dataflows since November 30, 2020 Release .Stat Suite JS 6.1.0
Rules of indexations enhanced with June 23, 2020 Release .Stat Suite JS 5.1.0
Delete an individual dataflow since February 28, 2020 Release .Stat Suite JS 4.0.0
Before indexing data
Before indexing your data for the first time, SOLR collection(s) need to be created. See below how to do this.
How to create a SOLR collection
SFS requires one SOLR collection per organisation (see new tenant model). The collection names must match the organisation names. These SOLR collections separate the searchable data between organisations.
Creating your SOLR collections is easy: you just need to execute the setup script documented here. It automatically creates the SOLR collections for all your organisations as previously defined in your tenants.json file.
If you are in a mono-tenant installation mode using the ‘default’ organisation, then only one ‘default’ collection is created.
What is indexed
The languages used for the search features are configurable per Data Explorer instance (part of the installation process), and there must be at least one language defined. If the below localised elements for a defined language are not available, then they are replaced by their corresponding IDs.
The information describing the dataflows is entirely retrieved from one or more SDMX endpoints (data sources) that need to be configured:
- ID of the data source
- Localised names of the data source (one per language defined)
- URI of the SDMX endpoint (if possible supporting https), e.g.
https://www.mywebiste.org/sdmx/rest
- Queries to get one or more hierarchical CategorySchemes together with the dataflows (that have been categorised in these CategorySchemes) to be indexed, e.g.
["categoryscheme/Agency/AgencyID/latest/?references=dataflow","categoryscheme/Agency/CategorySchemeID/latest/?references=dataflow","categoryscheme/EXT/all/latest/?references=dataflow"]
- The query to get the structure information, categorisations (in Categories of previous CategorySchemes) and content constraints (separately) for each dataflow can be derived from the URI adding the following:
dataflow/{agencyID}/{dataflowID}/{versionID}?references=all&detail=referencepartial
- Note that some SDMX APIs do not yet support the detail=referencepartial option.
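The derivation of the per-dataflow structure query from the endpoint URI can be sketched as follows. This is a minimal illustration, not the SFS implementation: the function name `structure_query_url` and the `reference_partial` switch are hypothetical, and dropping `detail=referencepartial` for endpoints that do not support it is an assumption about how such endpoints might be handled.

```python
from urllib.parse import quote


def structure_query_url(endpoint: str, agency_id: str, dataflow_id: str,
                        version: str, reference_partial: bool = True) -> str:
    """Append the dataflow structure query to the SDMX endpoint URI."""
    base = endpoint.rstrip("/")
    # Percent-encode each identifier so it is safe as a single path segment.
    path = "/".join(quote(part, safe="") for part in (agency_id, dataflow_id, version))
    query = "references=all"
    if reference_partial:
        # Assumption: omit this option for SDMX APIs that do not support it.
        query += "&detail=referencepartial"
    return f"{base}/dataflow/{path}?{query}"


url = structure_query_url(
    "https://www.mywebiste.org/sdmx/rest", "OECD", "AIR_EMISSIONS_DF", "1.0"
)
print(url)
```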
The following pieces of information are retrieved for each dataflow:
- Agency, ID, version, localised names and localised descriptions of the Dataflow
- ID and localised names of the corresponding data source
- The date/time when the dataflow data were last updated (CURRENT_STATE: this is the date/time when the dataflow (re)indexing was triggered)
- ID and localised names of the CategorySchemes in which the dataflow is categorised
- ID and localised names and hierarchy position of the categories in which the dataflow is categorised
- IDs and localised names of the concepts used as dimensions, as well as the dimension IDs
- IDs, localised names and hierarchy position of the codes used as dimension values constrained by the Actual Content Constraints defined for the dataflow. Note that this Content Constraint also contains information about the Time Periods (Time dimension values) available for the dataflow. It allows defining a specific “Time Period” range facet.
- Last update date taken from the validFrom property of the currently valid actual content constraint (ActualCC) of the dataflow. ‘Currently valid’ means that the datetime is in the past and the corresponding validTo property is in the future or absent. If a valid validFrom property is not available then the datetime value is taken from the actual LAST_UPDATE annotation of the dataflow. Only if that is not available then it is the last dataflow indexing time.
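The fallback order for the last update date described in the final item above can be sketched as a small function. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the values are assumed to be already parsed into timezone-aware datetimes, and a single Actual Content Constraint is considered.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_last_update(valid_from: Optional[datetime],
                        valid_to: Optional[datetime],
                        last_update_annotation: Optional[datetime],
                        indexing_time: datetime,
                        now: Optional[datetime] = None) -> datetime:
    """Pick the last update date following the documented fallback order."""
    now = now or datetime.now(timezone.utc)
    # 'Currently valid': validFrom is in the past and validTo is in the future or absent.
    currently_valid = (
        valid_from is not None
        and valid_from < now
        and (valid_to is None or valid_to > now)
    )
    if currently_valid:
        return valid_from
    # Otherwise fall back to the LAST_UPDATE annotation of the dataflow, if any.
    if last_update_annotation is not None:
        return last_update_annotation
    # Last resort: the time at which the dataflow was indexed.
    return indexing_time
```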
In SDMX, dataflows are uniquely identified by data source, Agency, ID and Version. However, to avoid user confusion, the search does not distinguish dataflows per Agency or per Version. Thus, if there are dataflows, categorised for search indexing, with the same ID but with different Agencies or different Versions, then only one of them is indexed (the first version retrieved through dataflow/all/ID/latest). In such a case, it is necessary to create and categorise separate dataflows with different IDs.
If the same dataflow (same ID, whatever Agency or Version) is retrieved from different data sources, then they are indexed separately and appear in the search results as different dataflows. They are distinguished by the data source, which is visible when the dataflow information is expanded (according to the configuration settings).
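The effect of these two rules can be illustrated with a short sketch: within one data source, only the first dataflow retrieved for a given ID is kept, while the same ID coming from another data source is kept separately. The record layout and the function name `select_indexed` are hypothetical and only serve the illustration.

```python
def select_indexed(dataflows):
    """Keep one dataflow per (data source, ID), in retrieval order.

    `dataflows` is an iterable of dicts with the keys
    'source', 'agency', 'id' and 'version' (hypothetical layout).
    """
    selected = {}
    for dataflow in dataflows:
        key = (dataflow["source"], dataflow["id"])
        # setdefault keeps the first dataflow seen for this key and ignores later ones.
        selected.setdefault(key, dataflow)
    return list(selected.values())


retrieved = [
    {"source": "SRC1", "agency": "OECD", "id": "DF_ABC", "version": "1.0"},
    {"source": "SRC1", "agency": "XYZ", "id": "DF_ABC", "version": "2.0"},
    {"source": "SRC2", "agency": "OECD", "id": "DF_ABC", "version": "1.0"},
]
for dataflow in select_indexed(retrieved):
    print(dataflow["source"], dataflow["agency"], dataflow["id"], dataflow["version"])
```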
Indexation rules
- Dataflows are indexed only if
  - there is data associated with them. The data availability check is based on the Actual Content Constraint attached to the dataflow. It must be:
    - non-empty or
    - non-present (for compatibility with SDMX web services not based on .Stat Suite).
  - they have an appropriate localised name
- Dataflow descriptions are indexed only if
  - they have an appropriate localised name
- Dimensions of a dataflow are indexed only if
  - the dimension values with available data do not exceed the limit defined in the SFS configuration parameter DIMENSION_VALUES_LIMIT, which is by default set to 1000. It protects the search engine from overly large codelists and prevents performance impacts. For more information see here.
  - the dimensions are not explicitly excluded from indexation by a NOT_INDEXED annotation set either
    - in the dimension definition of the DSD:
      "annotations": [{ "type": "NOT_INDEXED" }]
      or
    - in the dataflow definition:
      "annotations": [{ "type": "NOT_INDEXED", "title": "DIM3,DIM6,ATTR5,ATTR6" <-- These are the related dimension and attribute IDs }]
  - they have an appropriate localised name
- Dimension values of a dataflow are indexed only if
  - there are data available for these values or, if the values are hierarchical parents, their children values have data. For that purpose, the search indexing takes the current Actual Content Constraint of the dataflow, if available, into account.
  - the dimension values are not explicitly excluded from indexation by a NOT_INDEXED annotation set either
    - in the definition of the code in the underlying codelist:
      "annotations": [{ "type": "NOT_INDEXED" }]
      Note that in this case, the concept’s facet could still have this value indexed for other dataflows if those use a different codelist where the code is not marked with annotation NOT_INDEXED.
      or
    - in the dataflow definition:
      "annotations": [{ "type": "NOT_INDEXED", "title": "DIM3=VALUE2+VALUE9,DIM6=VALUE_X+VALUE_Y" <-- These are the related dimension IDs and dimension value IDs }]
      Note that in this case, the concept’s facet could still have this value indexed for other dataflows if the corresponding dimension values were not marked with annotation NOT_INDEXED.
  - they have an appropriate localised name
- CategorySchemes are indexed only if
  - they have an appropriate localised name
- Categories are indexed only if
  - they have an appropriate localised name
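As an illustration of the dataflow-level NOT_INDEXED annotation and of the dimension values limit, the sketch below parses the annotation title into excluded dimensions and excluded dimension values, then applies the two dimension rules. The function names are hypothetical, the sketch is not the SFS implementation, and accepting both forms (`DIM3` and `DIM3=VALUE2+VALUE9`) in a single title is an assumption made here for convenience.

```python
def parse_not_indexed(title: str):
    """Split a NOT_INDEXED title into excluded dimension IDs and excluded value IDs."""
    excluded_dims = set()
    excluded_values = {}
    for item in (part.strip() for part in title.split(",")):
        if not item:
            continue
        if "=" in item:
            # Form 'DIM=VALUE_A+VALUE_B': exclude individual dimension values.
            dim, values = item.split("=", 1)
            ids = {value.strip() for value in values.split("+") if value.strip()}
            excluded_values.setdefault(dim.strip(), set()).update(ids)
        else:
            # Form 'DIM': exclude the whole dimension (or attribute).
            excluded_dims.add(item)
    return excluded_dims, excluded_values


def dimension_is_indexed(dim_id, values_with_data, excluded_dims, limit=1000):
    """Apply the exclusion rule and the DIMENSION_VALUES_LIMIT rule (default 1000)."""
    return dim_id not in excluded_dims and len(values_with_data) <= limit


dims, _ = parse_not_indexed("DIM3,DIM6,ATTR5,ATTR6")
_, values = parse_not_indexed("DIM3=VALUE2+VALUE9,DIM6=VALUE_X+VALUE_Y")
print(sorted(dims))
print({dim: sorted(ids) for dim, ids in values.items()})
```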
Indexing externally defined dataflows
It is possible to index externally defined dataflows for browsing and searching in .Stat DE when the dataflows are stored locally only as stubs (without content, e.g. without the link to their DSD), meaning that the full definition and content of the corresponding dataflow are stored externally.
The locally stored dataflow stub therefore includes the reference (URL link) to the original external full dataflow definition, with the following artefact properties: isExternalReference=true, and link {external structure link}.
Note also that the locally stored dataflow stub must be categorised in the CategoryScheme that will be used by the index process.
Example of a dataflow stub definition with external reference:
<structure:Dataflows>
  <structure:Dataflow id="DF_SDG_ALL_SDG_A871_SEX_AGE_RT" agencyID="ILO" version="1.0" isExternalReference="true" isFinal="true" structureURL="https://ilo.org/sdmx-test/rest/dataflow/ILO/DF_SDG_ALL_SDG_A871_SEX_AGE_RT/1.0">
    <common:Name xml:lang="en">SDG indicator 8.7.1 - Proportion of children engaged in economic activity</common:Name>
    <common:Description xml:lang="en">Estimates on economic activity among children aged 5-17...</common:Description>
  </structure:Dataflow>
</structure:Dataflows>
When and how to index
Currently, the indexation is triggered manually. The following actions can be performed by a system administrator:
- (Re-)Index all currently relevant dataflows for all data sources (without removing entries for previously indexed dataflows)
- (Re-)Index one individual dataflow
- Delete all entries of previously indexed dataflows
- Delete the entry of one previously indexed dataflow
The examples provided below are made using the free version of the API platform Postman.
API format
The API is protected by an API key. In a Docker Compose installation, this key is defined as an environment variable named API_KEY_SFS. See an example in the Docker Compose file. In the following examples, the configured API key is assumed to be xxx.
API requests can be constructed as follows:
- the URL root, e.g., https://sfs.myorg.org/
- the API name as folder, /admin/
- the target of the request as folder, /dataflows/ or /dataflow/
- the API key as x-api-key HTTP header, e.g., x-api-key=xxx
- the tenant name as x-tenant HTTP header, e.g., x-tenant=XXXXXX
- additional parameters depending on the request in a raw JSON-formatted request body
{ "dataspaceID": "disseminate", "agencyId": "XYZ", "id": "DF_ABC", "version": "1.0", "mode": "async" }
Notes
- If you are using the default tenant collection, then all calls without a tenant will use the value of DEFAULT_TENANT as a tenant, and thus specifying the tenant becomes optional.
- Instead of HTTP headers and parameters in a raw JSON-formatted request body, it is still possible to use the corresponding URL query parameters for development or testing purposes in isolated environments, e.g., api-key, tenant, dataspaceID, agencyId, id and version. However, for greater security, it is recommended to use the HTTP header parameters and (encrypted) body in all other circumstances.
For more details please see here.
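For readers who prefer scripting over Postman, the request format above can be assembled with Python's standard library. This is a minimal sketch, not an official client: the helper name `build_admin_request` is hypothetical, and sending the body with a `Content-Type: application/json` header is an assumption based on the "raw JSON-formatted request body" wording. The request is only built here, not sent.

```python
import json
from urllib.request import Request


def build_admin_request(root, target, method, api_key, tenant=None, body=None):
    """Assemble an SFS admin request: URL root + /admin/ + target, headers, JSON body."""
    headers = {"x-api-key": api_key}
    if tenant is not None:
        # Optional when the default tenant is used.
        headers["x-tenant"] = tenant
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"  # assumption, see lead-in
    url = f"{root.rstrip('/')}/admin/{target.strip('/')}"
    return Request(url, data=data, headers=headers, method=method)


request = build_admin_request(
    "https://sfs.myorg.org/", "dataflow", "POST", api_key="xxx", tenant="XXXXXX",
    body={"dataspaceID": "disseminate", "agencyId": "XYZ", "id": "DF_ABC", "version": "1.0"},
)
print(request.get_method(), request.full_url)  # POST https://sfs.myorg.org/admin/dataflow
```

Passing the resulting object to `urllib.request.urlopen` would send it; in a real setup the API key should come from configuration rather than from source code.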
(Re-)Index all dataflows
POST /admin/dataflows
This request re-indexes all dataflows from all configured SDMX endpoints. In detail, it:
- requests all SDMX data sources
- adds dataflows to the Solr index
- updates the dynamic SFS schema depending on the added dataflows
Note that this action does not clean up the index, meaning that dataflows that were previously indexed but have since been de-categorised or deleted in the data source are not removed from the index. To remove them, first execute the Delete all dataflows action to clean up the index, and then run this action to re-index all dataflows.
Example:
POST https://sfs.myorg.org/admin/dataflows
x-api-key=xxx
x-tenant=XXXXXX
For more details please see here.
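The clean re-index described above (delete everything, then index everything) can be sketched as a two-step sequence. The `send` callable is a hypothetical helper that performs the HTTP call and returns the status code; treating any 2xx status as success and stopping otherwise is an assumption, not documented SFS behaviour.

```python
from typing import Callable, List, Tuple


def clean_reindex(send: Callable[[str, str], int]) -> List[Tuple[str, str, int]]:
    """Run 'Delete all dataflows' followed by '(Re-)Index all dataflows'."""
    steps = [("DELETE", "/admin/dataflows"), ("POST", "/admin/dataflows")]
    performed = []
    for method, path in steps:
        status = send(method, path)
        performed.append((method, path, status))
        if not 200 <= status < 300:
            # Do not re-index if the clean-up step did not succeed.
            break
    return performed


# Stand-in for a real HTTP call, used only to show the order of the calls.
def fake_send(method: str, path: str) -> int:
    return 200


for step in clean_reindex(fake_send):
    print(step)
```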
Index or update one individual dataflow
POST /admin/dataflow
This request re-indexes (inserts or updates) one individual dataflow.
Example:
POST https://sfs.myorg.org/admin/dataflow
x-api-key=xxx
x-tenant=XXXXXX
{
"dataspaceID": "staging:SIS-CC-reset",
"agencyId": "OECD",
"id": "AIR_EMISSIONS_DF",
"version": "1.0"
}
For more details please see here.
Delete all dataflows
DELETE /admin/dataflows
This request deletes all entries for previously indexed dataflows for all configured SDMX data sources.
Example:
DELETE https://sfs.myorg.org/admin/dataflows
x-api-key=xxx
x-tenant=XXXXXX
For more details please see here.
Delete one individual dataflow
DELETE /admin/dataflow
This request deletes one specific dataflow from the index and search. It is thus no longer available in .Stat DE for search and visualisation.
Example:
DELETE https://sfs.myorg.org/admin/dataflow
x-api-key=xxx
x-tenant=XXXXXX
{
"dataspaceID": "staging:SIS-CC-reset",
"agencyId": "OECD",
"id": "AIR_EMISSIONS_DF",
"version": "1.0"
}
For more details please see here.
Other admin queries
API requests for logs and reports can be constructed as explained in API format, with the following changes:
- the target of the request as folder, /logs/, /report/ or /config/
- additional parameters depending on the request in a raw JSON-formatted request body
{ "id": 1669717223969, "userEmail": "test@siscc.org", "dataspaceID": "disseminate", "agencyId": "XYZ", "submissionStart": "2021-11-29T10:20:23.985Z", "submissionEnd": "2023-11-29T10:20:23.985Z", "executionStatus": "completed", "executionOutcome": "success", "artefact": { "resourceType": "dataflow", "agencyID": "XYZ", "id": "DF_ABC", "version": "1.0" } }
For more details please see here and here.
Get search logs
GET /admin/logs or POST /admin/logs
Without filter parameters, this request returns all the search index management logs for all data sources and dataspaces. A number of filter parameters allow restricting the returned logs. See the full API documentation for details about those query parameters.
Examples:
- Get all logs
GET https://sfs.myorg.org/admin/logs
x-api-key=xxx
- Get the logs for a specific index management action, identified by id
POST https://sfs.myorg.org/admin/logs
x-api-key=xxx
{
"id": 1652792463242
}
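A filter body for POST /admin/logs can be assembled from only the parameters that are actually needed. The sketch below is illustrative: the helper names are hypothetical, the parameter names are taken from the body example in "Other admin queries", and how SFS interprets each filter (for instance submissionStart) is not specified here, so the full API documentation remains the reference. Timestamps are formatted like the ones shown in that example.

```python
from datetime import datetime, timezone


def to_iso_millis(moment: datetime) -> str:
    """Format a timezone-aware datetime as UTC with milliseconds and a 'Z' suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def logs_filter(**criteria) -> dict:
    """Build the JSON body, skipping parameters that were not provided (None)."""
    body = {}
    for name, value in criteria.items():
        if value is None:
            continue
        body[name] = to_iso_millis(value) if isinstance(value, datetime) else value
    return body


body = logs_filter(
    dataspaceID="disseminate",
    executionStatus="completed",
    submissionStart=datetime(2021, 11, 29, 10, 20, 23, 985000, tzinfo=timezone.utc),
    userEmail=None,
)
print(body)
```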
Get in-memory indexing statuses
GET /admin/report
This request returns the in-memory indexing statuses.
Example:
GET https://sfs.myorg.org/admin/report
x-api-key=xxx