hdx-cli-toolkit – a tool to examine and update metadata in the Humanitarian Data Exchange

[Image: the HDX and CKAN logos]

This post introduces the hdx-cli-toolkit – a tool written in Python to examine and update metadata in the Humanitarian Data Exchange (HDX), which is based on the CKAN data catalogue.

HDX is a project of the UN Office for the Coordination of Humanitarian Affairs (UNOCHA). It is a data catalogue focussed on sharing data relevant to relief efforts. The data come from a wide range of providers who can upload data and the related metadata either manually or using automated pipelines. A team at the Centre for Humanitarian Data manage HDX: approaching providers for new data, working on data quality, enhancing HDX, and writing pipelines to automatically add data. I was involved at the start of the HDX project in 2014.

The hdx-cli-toolkit grew out of my recent time as a consultant on a return to the HDX project. The team I was in received numerous internal queries relating to metadata on HDX that couldn’t easily be answered using the HDX user interface or the CKAN API, but were amenable to solution using relatively small pieces of Python code. Rather than scatter around small pieces of one-off code, I decided to write a command-line tool to collect this code together and make it reusable by others.

I also use hdx-cli-toolkit as a point of reference for doing various operations in the Python language (configuring a Python project, command-line interfaces with the click library, tests with mocks, publishing to PyPI, configuring Visual Studio Code for Python development) as well as snippets of code for interacting with HDX which may be used in data pipeline code.

As an aside, my post on understanding Python project setup is the most popular one I have written by a large margin.

Overview

hdx-cli-toolkit supports the following commands:

  • configuration – Print configuration information to terminal
  • download – Download dataset resources (files) from HDX
  • get_organization_metadata – Get an organization id and other metadata
  • get_user_metadata – Get user id and other metadata
  • list – List datasets in HDX
  • print – Print datasets in HDX to the terminal
  • quickcharts – Upload QuickChart JSON description to HDX
  • remove_extras_key – Remove extras key from a dataset
  • scan – Scan all of HDX and perform an action
  • showcase – Upload showcase to HDX
  • update – Update datasets in HDX
  • update_resource – Update a resource in HDX

Installation

hdx-cli-toolkit is a Python application published to the PyPI package repository. It can be installed easily with:

pip install hdx_cli_toolkit

Users may prefer to make a global, isolated installation using pipx which will make the hdx-cli-toolkit commands available across all of their projects:

pipx install hdx_cli_toolkit

hdx-cli-toolkit can then be updated with:

pipx install --force hdx_cli_toolkit

hdx-cli-toolkit uses the hdx-python-api library, which requires the following to be added to a file called .hdx_configuration.yaml in the user’s home directory:

hdx_key_stage: "[an HDX API token from the staging HDX site]"
hdx_key: "[an HDX API token from the prod HDX site]"
default_organization: "[your organization]"

The default_organization is required for the configuration command; it can also be supplied using the --organization= command-line parameter. If it is not defined, it defaults to hdx.
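A quick way to confirm that the configuration file is in place is to check it for the three keys shown above. The following is a minimal sketch, not part of hdx-cli-toolkit: the function names are hypothetical, and it assumes the file holds only flat `key: "value"` lines, so it uses a few lines of string handling rather than a full YAML parser.

```python
from pathlib import Path

# Keys named in the post. default_organization falls back to "hdx" when
# absent, so a missing entry for it is a warning rather than an error.
EXPECTED_KEYS = ("hdx_key_stage", "hdx_key", "default_organization")


def read_flat_yaml(path: Path) -> dict[str, str]:
    """Parse top-level `key: "value"` lines (assumption: no nesting)."""
    settings = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key: value.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def missing_keys(path: Path) -> list[str]:
    """Return the expected keys that are absent or empty in the file."""
    if not path.exists():
        return list(EXPECTED_KEYS)
    settings = read_flat_yaml(path)
    return [key for key in EXPECTED_KEYS if not settings.get(key)]
```

Calling `missing_keys(Path.home() / ".hdx_configuration.yaml")` returns an empty list when all three keys are present and non-empty.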

A user agent (hdx_cli_toolkit_*) is specified in the ~/.useragents.yaml file with the * replaced with the user’s initials.

hdx-cli-toolkit:
    preprefix: [YOUR_ORGANIZATION]
    user_agent: hdx_cli_toolkit_ih

Usage

Details of the currently implemented commands can be revealed by running hdx-toolkit --help, and details of the arguments for a command can be found using hdx-toolkit [COMMAND] --help.

Here are a couple of simple invocations. This first one prints out all the metadata for a dataset in a readable JSON format:

hdx-toolkit print --dataset_filter=geoboundaries-admin-boundaries-for-nepal --with_extras

This second one shows the value of a particular metadata element for a set of datasets. The list command works in conjunction with the update command, in which case the --value option provides the value to update to:

hdx-toolkit list --organization=healthsites --dataset_filter=*al*-healthsites --hdx_site=stage --key=private --value=True
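To get a feel for which datasets a wildcard filter such as `*al*-healthsites` would select, the pattern can be tried against some names in Python. This sketch assumes the filter behaves like a shell-style glob, which is a reading of the example above rather than a statement about how the tool matches internally. The dataset names are illustrative only.

```python
from fnmatch import fnmatchcase


def filter_datasets(names, pattern):
    """Return the names matching a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Illustrative dataset names, not a listing taken from HDX.
names = [
    "mali-healthsites",
    "nepal-healthsites",
    "kenya-healthsites",
    "geoboundaries-admin-boundaries-for-nepal",
]

matches = filter_datasets(names, "*al*-healthsites")
print(matches)  # ['mali-healthsites', 'nepal-healthsites']
```

When typing such a pattern in a shell, putting quotes around it prevents the shell from trying to expand the asterisks against local file names first.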

A detailed guide including many example invocations can be found in the USERGUIDE.md file.
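The invocations in this section can also be driven from a Python script, for example inside a pipeline. The sketch below is a hypothetical wrapper, not part of the toolkit: `build_command` and `run_toolkit` are made-up names, and it assumes hdx-toolkit is installed and on the PATH. Only the command names and options already shown in this post are used.

```python
import subprocess


def build_command(command: str, *flags: str, **options: str) -> list[str]:
    """Build the argument list for an hdx-toolkit invocation."""
    args = ["hdx-toolkit", command]
    # Options take the --key=value form used in the examples above.
    args += [f"--{key}={value}" for key, value in options.items()]
    # Flags are bare switches such as --with_extras.
    args += [f"--{flag}" for flag in flags]
    return args


def run_toolkit(command: str, *flags: str, **options: str) -> str:
    """Run hdx-toolkit and return whatever it writes to standard output."""
    result = subprocess.run(
        build_command(command, *flags, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For instance, `run_toolkit("print", "with_extras", dataset_filter="geoboundaries-admin-boundaries-for-nepal")` mirrors the first invocation above. Because the arguments are passed as a list, no shell is involved, so wildcard filters reach the tool unchanged.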

Get in touch!

Currently the hdx-cli-toolkit only works on HDX but with some modification it should work with any CKAN instance.

If you are interested in learning more about hdx-cli-toolkit or CKAN then please get in touch.
