Local index workflow

NistChemPy no longer ships a prebuilt NIST Chemistry WebBook index. The local index is a user-generated cache that helps with broad compound discovery, property-availability filtering, and local search.

The local index is not an official NIST product and is not covered by the NistChemPy software license. It is a local artifact created by the user from NIST Chemistry WebBook pages.

What the index contains

A completed local index is a CSV table, usually named index.csv. It contains compound identifiers, basic metadata, structure-file links, and WebBook section availability URLs. Typical columns include:

  • ID
  • name
  • synonyms
  • formula
  • mol_weight
  • inchi
  • inchi_key
  • cas_rn
  • mol2D and mol3D
  • section columns such as Mass spectrum (electron ionization) and Gas Chromatography

The tiny example_index.csv fixture used in the documentation has the same kind of column layout, but it is not a replacement for a locally generated index.
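Because the index is a plain CSV table, it can also be inspected without NistChemPy. The following is a minimal sketch using only the Python standard library. The two sample rows, the example.invalid URLs, and the helper name compounds_with_section are illustrative, and the sketch assumes that an empty section cell means the section is not available for that compound.

```python
import csv
import io

# Hypothetical two-row sample in the documented column layout. The values are
# illustrative; a real index.csv has more columns and many more rows.
SAMPLE = """ID,name,formula,cas_rn,Mass spectrum (electron ionization),Gas Chromatography
C71432,Benzene,C6H6,71-43-2,https://example.invalid/ms,
C64175,Ethanol,C2H6O,64-17-5,,https://example.invalid/gc
"""


def compounds_with_section(csv_file, section):
    """Return the IDs of rows whose `section` column is non-empty.

    Assumption: an empty cell means the section is not available.
    """
    reader = csv.DictReader(csv_file)
    return [row["ID"] for row in reader if (row.get(section) or "").strip()]


print(compounds_with_section(io.StringIO(SAMPLE), "Gas Chromatography"))
```

For a real index, pass an open file handle for index.csv instead of the in-memory sample.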

Where the index is stored

By default, NistChemPy stores the local index in the platform-specific user cache directory under nistchempy/webbook-index. The exact path depends on the operating system.

Print the resolved default path with:

nistchempy index path

The same path is available from Python:

import nistchempy as nist

nist.WebBookIndex.default_path()

Use --path to build or read an index at a project-local location:

nistchempy index build --path ./webbook-index --accept-data-terms
nistchempy index search benzene --path ./webbook-index

Project-local index directories should normally be added to .gitignore. Do not commit generated full indexes, raw page caches, or other large WebBook-derived artifacts to public repositories.
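One way to make sure a project-local index directory stays out of version control is to add it to .gitignore from a setup script. The helper below is a hypothetical sketch, not part of NistChemPy; the name ensure_ignored and the entry webbook-index/ are assumptions chosen to match the example path above.

```python
import tempfile
from pathlib import Path


def ensure_ignored(repo_root, entry):
    """Append `entry` to repo_root/.gitignore unless it is already listed.

    Returns True if the file was modified, False otherwise.
    """
    gitignore = Path(repo_root) / ".gitignore"
    existing = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    if entry in (line.strip() for line in existing.splitlines()):
        return False
    # Keep the existing content newline-terminated before appending.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with gitignore.open("a", encoding="utf-8") as fh:
        fh.write(f"{prefix}{entry}\n")
    return True


# Demo in a throwaway directory so no real repository is touched.
with tempfile.TemporaryDirectory() as repo:
    print(ensure_ignored(repo, "webbook-index/"))  # first call adds the entry
    print(ensure_ignored(repo, "webbook-index/"))  # second call is a no-op
```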

How to check status

Use the status command if you built an index earlier and forgot where it is, or if you want to check whether a build completed:

nistchempy index status
nistchempy index status --path ./webbook-index

How the index is formed

A full local-index build has two conceptual stages:

discovery strategy -> seeds.csv -> compound-page enrichment -> index.csv

The discovery stage finds candidate WebBook compounds and writes seeds.csv. The enrichment stage visits compound pages and extracts metadata, structure links, and section availability URLs into index.csv.
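The two stages can be pictured with a small, self-contained sketch. It is not NistChemPy's implementation: write_seeds, enrich, and the fetch_record callable are hypothetical names, and the stand-in fetcher replaces the real step of downloading and parsing a compound page. The point is the data flow, plus one plausible way that enrichment can skip IDs already present in index.csv.

```python
import csv
import tempfile
from pathlib import Path


def write_seeds(seeds_path, compound_ids):
    """Stage 1 (discovery): record candidate compound IDs, one per row."""
    with open(seeds_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ID"])
        writer.writerows([cid] for cid in compound_ids)


def enrich(seeds_path, index_path, fetch_record, fieldnames):
    """Stage 2 (enrichment): append one index row per seed not yet indexed.

    `fetch_record` is a hypothetical callable mapping an ID to a dict of
    column values. Returns the number of rows written by this call.
    """
    index_path = Path(index_path)
    needs_header = not index_path.exists() or index_path.stat().st_size == 0
    done = set()
    if not needs_header:
        with index_path.open(newline="", encoding="utf-8") as fh:
            done = {row["ID"] for row in csv.DictReader(fh)}
    with open(seeds_path, newline="", encoding="utf-8") as fh:
        pending = [r["ID"] for r in csv.DictReader(fh) if r["ID"] not in done]
    written = 0
    with index_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        for cid in pending:
            writer.writerow(fetch_record(cid))
            fh.flush()  # keep progress on disk so a later run can skip this ID
            written += 1
    return written


# Demo with a stand-in fetcher; no network access is involved.
with tempfile.TemporaryDirectory() as cache:
    seeds, index = Path(cache, "seeds.csv"), Path(cache, "index.csv")
    write_seeds(seeds, ["C71432", "C64175"])
    fake_fetch = lambda cid: {"ID": cid, "name": f"compound {cid}"}
    print(enrich(seeds, index, fake_fetch, ["ID", "name"]))  # first run
    print(enrich(seeds, index, fake_fetch, ["ID", "name"]))  # rerun: nothing pending
```

Writing seeds.csv before any enrichment keeps discovery and enrichment independent, so a slow or interrupted enrichment pass does not require repeating discovery.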

Supported discovery strategies are:

  • formula-browser: Traverses the WebBook formula browser and is the default general-purpose strategy.

  • sitemap: Reads WebBook sitemap files when available. It is useful as an audit or supplementary source.

  • formula-search: Uses bounded formula-search subdivision. This strategy requires explicit formula bounds and records unresolved query regions for later inspection.
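As a rough illustration of what bounded subdivision with unresolved regions can look like, the sketch below splits an integer range until every piece returns no more hits than a cap, and reports pieces that cannot be split further. This is a generic technique shown under stated assumptions, not a description of NistChemPy's code: the function subdivide, the count_hits callable, the cap value, and the idea of bounding a single integer parameter are all hypothetical.

```python
def subdivide(lo, hi, count_hits, cap):
    """Split the integer range [lo, hi] until each piece has at most `cap` hits.

    Returns (resolved, unresolved), two lists of (lo, hi) tuples. A
    single-value range that still exceeds `cap` cannot be split further
    and is reported as unresolved.
    """
    resolved, unresolved = [], []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if count_hits(a, b) <= cap:
            resolved.append((a, b))
        elif a == b:
            unresolved.append((a, b))
        else:
            mid = (a + b) // 2
            # Push the upper half first so the lower half is processed first.
            stack.append((mid + 1, b))
            stack.append((a, mid))
    return resolved, unresolved


# Hypothetical hit counts per value of the bounded parameter.
HITS = {1: 2, 2: 3, 3: 10, 4: 1}


def count_hits(a, b):
    """Stand-in for a search query: total hits for values a..b inclusive."""
    return sum(HITS.get(v, 0) for v in range(a, b + 1))


resolved, unresolved = subdivide(1, 4, count_hits, cap=5)
print(resolved)    # [(1, 2), (4, 4)]
print(unresolved)  # [(3, 3)]
```

Explicit bounds are what make the loop terminate: every split shrinks a finite range, and a range of width one is either accepted or recorded for later inspection.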

Build the default index with:

nistchempy index build --accept-data-terms

Warning

A full section-availability index can require visiting one WebBook page per compound.

With a polite 3-second delay between requests and roughly 100,000-150,000 pages, the delays alone add up to about 3.5-5+ days for an initial build, before retries and network overhead.

Use --path to set an explicit cache location. If a build is interrupted, rerun the same command to resume the enrichment work.
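The time estimate above follows from simple arithmetic: one request per page, each followed by a fixed delay. The helper below reproduces it as a lower bound. The function name build_days is illustrative, and the calculation ignores request latency, retries, and parsing time.

```python
def build_days(pages, delay_seconds=3.0):
    """Lower-bound build time in days: one fixed delay per visited page."""
    return pages * delay_seconds / 86_400  # 86,400 seconds per day


for pages in (100_000, 150_000):
    print(f"{pages:,} pages -> {build_days(pages):.1f} days")
# 100,000 pages -> 3.5 days
# 150,000 pages -> 5.2 days
```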

Importing an existing local CSV

If you already have a local index CSV, import it into the cache layout:

nistchempy index build \
  --from-csv /path/to/index.csv \
  --path ./webbook-index \
  --accept-data-terms

Importing does not make the CSV redistributable. The command only registers the file as a local user artifact in NistChemPy’s current cache layout.

Using the index from Python

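import nistchempy as nist

# Open the local index stored at a project-local path
index = nist.get_local_index('./webbook-index')
# Search the index locally by name
index.search('benzene')
# Check which properties are available; C71432 is the WebBook ID of benzene
index.available_properties('C71432')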