RDKit Conformer Generation

This skill provides practical command patterns for generating 3D conformers with RDKit (with an automatic 2D fallback) through the standardized CLI wrapper: <skill_path>/scripts/rdkit_conf_helper.py.

Key behaviors (important for Agents):

The script prints environment detection (Python/RDKit/Pandas) by default.
Multi-conformer sampling: embeds --num-confs conformers (default 10) per molecule via EmbedMultipleConfs, optimizes each with the chosen force field, and keeps the lowest-energy one. Set --num-confs 1 to revert to single-conformer behavior.
2D fallback: if all 3D embedding attempts fail, Compute2DCoords is used instead and a [WARN] line is printed to stderr for that molecule.
Invalid SMILES are skipped and logged to *.skipped.csv; they do not abort the run.
Molecules that fell back to 2D are additionally logged to *.fallback.csv.
Each run ends with a summary line and absolute output paths:
- [INFO] Done: <N_3d> 3D, <N_2d> 2D-fallback, <N_skip> skipped (total input: <N>)
- [RESULT] conf_sdf=/abs/path.sdf
- [RESULT] conf_xyz=/abs/path.xyz
- [RESULT] fallback_csv=/abs/path.fallback.csv (only if any 2D fallbacks occurred)
- [RESULT] skipped_csv=/abs/path.skipped.csv (only if any SMILES were skipped)
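
The summary and [RESULT] lines above are easy to pick out of captured output. The following is a minimal sketch, not part of the helper itself: the function name parse_helper_output is hypothetical, and it assumes the lines appear in the captured text exactly in the formats listed above, without any list-marker prefix.

```python
import re

# Matches the documented summary line, e.g.
# "[INFO] Done: 8 3D, 1 2D-fallback, 1 skipped (total input: 10)"
_SUMMARY_RE = re.compile(
    r"\[INFO\] Done: (\d+) 3D, (\d+) 2D-fallback, (\d+) skipped "
    r"\(total input: (\d+)\)"
)
# Matches "[RESULT] key=/abs/path" lines.
_RESULT_RE = re.compile(r"^\[RESULT\] (\w+)=(.+)$")


def parse_helper_output(text):
    """Return (summary, results) extracted from captured helper output.

    summary is a dict of counts, or None if no summary line was found.
    results maps keys such as "conf_sdf" to the reported path.
    """
    summary = None
    results = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SUMMARY_RE.search(line)
        if match:
            n_3d, n_2d, n_skip, total = (int(g) for g in match.groups())
            summary = {
                "n_3d": n_3d,
                "n_2d": n_2d,
                "n_skipped": n_skip,
                "total": total,
            }
            continue
        match = _RESULT_RE.match(line)
        if match:
            results[match.group(1)] = match.group(2).strip()
    return summary, results
```

A caller can then test for "fallback_csv" or "skipped_csv" in the returned results to decide whether follow-up inspection is needed.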

Quick Start

Check CLI help:

uv run <skill_path>/scripts/rdkit_conf_helper.py --help
uv run <skill_path>/scripts/rdkit_conf_helper.py conf --help

Disable environment printing (optional):

uv run <skill_path>/scripts/rdkit_conf_helper.py --no-env conf --smiles "CCO" --output out.sdf

Core Tasks

1) Generate 3D conformers (SDF output, default)

Single SMILES:

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --smiles "CCO" \
    --output /tmp/CCO.sdf

Single SMILES with a custom molecule name:

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --smiles "c1ccccc1" \
    --name benzene \
    --output /tmp/benzene.sdf

From CSV (default SMILES column: smiles):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv \
    --smiles-col smiles \
    --output data.sdf

From CSV with a name column:

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv \
    --smiles-col smiles \
    --name-col compound_id \
    --output data.sdf

From SMI (second token per line is used as name automatically):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file molecules.smi \
    --output molecules.sdf
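
To preview how a .smi file maps to SMILES and names before running the helper, the token rule can be sketched as below. This is an illustration only: parse_smi_line is a hypothetical name, and it assumes tokens are separated by whitespace. How the helper treats comment lines or extra tokens is not specified here.

```python
def parse_smi_line(line):
    """Split one .smi line into (smiles, name).

    The first whitespace-separated token is the SMILES and the second,
    if present, is the name. Returns None for a blank line.
    """
    tokens = line.split()
    if not tokens:
        return None
    smiles = tokens[0]
    name = tokens[1] if len(tokens) > 1 else None
    return smiles, name


if __name__ == "__main__":
    for example in ["CCO ethanol", "c1ccccc1"]:
        print(parse_smi_line(example))
```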

2) Control conformer sampling count

Default (10 conformers sampled, lowest-energy kept):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --output data.sdf

Single conformer (fastest, least thorough):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --num-confs 1 --output data.sdf

Increase sampling for flexible or macrocyclic molecules:

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --num-confs 50 --output data.sdf

3) Choose force-field minimization

MMFF94s (default, falls back to UFF if unavailable):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --ff mmff94s --output data.mmff.sdf

UFF (universal force field):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --ff uff --output data.uff.sdf

Skip force-field optimization (raw ETKDG geometry only):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --ff none --output data.etkdg_raw.sdf

4) XYZ output

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv \
    --format xyz \
    --output data.xyz

5) Tuning embedding for difficult molecules

Large or macrocyclic molecules sometimes fail standard ETKDG; try random initial coordinates:

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file macrocycles.csv \
    --use-random-coords \
    --max-attempts 500 \
    --output macrocycles.sdf

Set an explicit random seed for reproducible embedding:

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --seed 123 --output data.seed123.sdf

Non-deterministic embedding (seed = -1):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --seed -1 --output data.sdf

6) Suppress hydrogen addition

By default, explicit H atoms are added before embedding, which gives more accurate 3D geometry. Use --no-hs to keep the molecule as-is (heavy atoms only):

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv --no-hs --output data.noh.sdf

7) Custom log file paths

uv run <skill_path>/scripts/rdkit_conf_helper.py conf \
    --file data.csv \
    --output data.sdf \
    --error-log logs/skipped.csv \
    --fallback-log logs/used_2d.csv
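
When driving the helper from Python rather than a shell, the flags shown in the tasks above can be assembled into an argument list. The sketch below is one possible way to do that; build_conf_command and its keyword names are hypothetical, and it only emits flags that appear in this document.

```python
def build_conf_command(script_path, input_file, output, *, num_confs=None,
                       ff=None, fmt=None, seed=None, no_hs=False):
    """Build an argv list for the `conf` subcommand.

    Optional flags are only added when a value is given, so the helper's
    own defaults apply otherwise.
    """
    cmd = ["uv", "run", script_path, "conf", "--file", input_file]
    if num_confs is not None:
        cmd += ["--num-confs", str(num_confs)]
    if ff is not None:
        cmd += ["--ff", ff]
    if fmt is not None:
        cmd += ["--format", fmt]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if no_hs:
        cmd.append("--no-hs")
    cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    # The script path is a placeholder; substitute the real <skill_path>.
    print(build_conf_command("scripts/rdkit_conf_helper.py",
                             "data.csv", "data.sdf", num_confs=50, ff="uff"))
```

Passing the list directly to subprocess.run (without shell=True) sidesteps shell quoting of file names entirely.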

3D Embedding Pipeline Details

For each molecule, the script runs the following steps in order:

Parse SMILES via Chem.MolFromSmiles.
Add hydrogens (Chem.AddHs) — skipped with --no-hs.
Multi-conformer 3D embedding (EmbedMultipleConfs, --num-confs candidates, default 10): tries ETKDGv3, ETKDGv2, ETDG, and finally ETDG with useRandomCoords, stopping at the first method that embeds at least one conformer.
Force-field minimization (if --ff is not none): each successfully embedded conformer is individually optimized. MMFF94s transparently falls back to UFF if parameters are unavailable for that molecule.
Lowest-energy selection: the conformer with the minimum post-optimization energy is retained; all others are discarded. With --ff none, no energies are computed, so the first embedded conformer is kept without ranking.
2D fallback (if all 3D attempts yield zero conformers): generates a flat 2D layout via Compute2DCoords (Z=0 for all atoms), prints a [WARN] to stderr, and records the molecule in the fallback log.
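
The lowest-energy selection step can be illustrated in isolation. The sketch below assumes the optimizer results arrive as one (not_converged, energy) pair per conformer, which is the shape RDKit's MMFFOptimizeMoleculeConfs and UFFOptimizeMoleculeConfs return. Whether the helper uses those exact calls is an assumption, and pick_lowest_energy is a hypothetical name.

```python
def pick_lowest_energy(opt_results):
    """Return (conformer_index, energy) for the minimum-energy entry.

    opt_results is a list of (not_converged, energy) pairs, one per
    embedded conformer. Returns None when the list is empty, which
    corresponds to the case where the pipeline falls back to 2D.
    """
    if not opt_results:
        return None
    best_idx = min(range(len(opt_results)), key=lambda i: opt_results[i][1])
    return best_idx, opt_results[best_idx][1]


if __name__ == "__main__":
    # Made-up energies for three conformers.
    print(pick_lowest_energy([(0, 12.5), (1, 9.1), (0, 10.3)]))
```

This sketch ranks on energy alone, as the step description does; filtering on the convergence flag first would be a stricter variant.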

Output Format Notes

SDF output (--format sdf, default):

Standard V2000 multi-molecule SDF, one conformer per molecule.
Molecule name (from --name, --name-col, or auto-generated mol_<i>) is written to the SDF header line.
Compatible with most cheminformatics tools (RDKit, OpenBabel, Schrödinger, etc.).

XYZ output (--format xyz):

Concatenated XYZ blocks (element, x, y, z per atom).
Molecule name is written as the comment line (second line of each block).
Coordinates are in Angstroms.
Note: if --no-hs is used, hydrogen atoms are absent from the XYZ.
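
For quick sanity checks on XYZ output, a small reader is often enough. The sketch below assumes each block follows the usual XYZ layout: an atom-count line, then the comment line carrying the molecule name, then one "element x y z" row per atom. The atom-count line is an assumption based on the standard format, and read_xyz_blocks and is_flat are hypothetical names.

```python
def read_xyz_blocks(text):
    """Parse concatenated XYZ blocks into a list of (name, atoms) tuples.

    Each atom is (element, x, y, z) with coordinates as floats.
    """
    lines = text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between blocks
            continue
        n_atoms = int(lines[i])
        name = lines[i + 1].strip()
        atoms = []
        for row in lines[i + 2:i + 2 + n_atoms]:
            element, x, y, z = row.split()[:4]
            atoms.append((element, float(x), float(y), float(z)))
        blocks.append((name, atoms))
        i += 2 + n_atoms
    return blocks


def is_flat(atoms, tol=1e-6):
    """True when every z coordinate is zero within tol (a 2D layout)."""
    return all(abs(z) <= tol for _, _, _, z in atoms)
```

is_flat gives a cheap way to spot 2D-fallback entries in an XYZ file, since those have Z=0 for all atoms. A genuinely planar 3D structure could also pass the check, so the fallback log remains the authoritative record.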

Fallback log (*.fallback.csv):

Written only when at least one molecule fell back to 2D.
Columns: idx, smiles, name, dim (always 2), ff (always 2d_fallback), note.

Skipped log (*.skipped.csv):

Written only when at least one SMILES was skipped.
Columns: idx, smiles, error.
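
Both logs are plain CSV, so they can be read with the standard library. The sketch below assumes each file starts with a header row using the column names listed above. read_log and smiles_to_retry are hypothetical names, and the sample row in the usage is made up.

```python
import csv
import io


def read_log(handle):
    """Read a fallback or skipped log from an open text handle.

    Returns a list of dicts keyed by the header row's column names.
    """
    return list(csv.DictReader(handle))


def smiles_to_retry(rows):
    """Collect the SMILES strings from log rows for a follow-up run."""
    return [row["smiles"] for row in rows]


if __name__ == "__main__":
    # Made-up fallback log content for illustration.
    demo = io.StringIO(
        "idx,smiles,name,dim,ff,note\n"
        "3,C1CCCCCCCCCCC1,ring12,2,2d_fallback,example note\n"
    )
    print(smiles_to_retry(read_log(demo)))
```

The collected SMILES could then be written to a new input file and rerun with --use-random-coords or a higher --max-attempts.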

Agent Checklist

When using this skill for users:

Confirm input format:
- .csv requires a SMILES column (default smiles)
- .smi uses the first token per line as SMILES, second token (if present) as name
Quote SMILES that contain shell-special characters (brackets, parentheses, backslashes):
- Example: --smiles "[C@@H](O)(F)Cl"
For CSV workflows, verify column names:
- --smiles-col for the SMILES column
- --name-col (optional) for molecule identifiers to embed in SDF/XYZ headers
Check the [INFO] Done: summary line for the 3D/2D/skip breakdown.
If 2D fallbacks occurred, inspect *.fallback.csv:
- Consider --use-random-coords or --max-attempts tuning for the affected SMILES.
- 2D conformers have Z=0 and are not suitable for 3D-based applications (docking, 3D QSAR).
Always capture absolute output paths:
- Look for [RESULT] ...=/abs/path in stdout.
If debugging is needed, enable full traceback:
- RDKIT_CONF_HELPER_TRACE=1 uv run <skill_path>/scripts/rdkit_conf_helper.py ...
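
For the quoting item in the checklist, Python's shlex.quote can produce a shell-safe argument when a command line is being assembled as text. This is a small sketch, and smiles_arg is a hypothetical helper name.

```python
import shlex


def smiles_arg(smiles):
    """Return a shell-safe `--smiles <value>` fragment."""
    return "--smiles " + shlex.quote(smiles)


if __name__ == "__main__":
    print(smiles_arg("[C@@H](O)(F)Cl"))
    print(smiles_arg("CCO"))
```

shlex.quote wraps the value in single quotes only when it contains characters the shell would interpret, so simple strings such as CCO pass through unchanged.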

References

RDKit conformer generation guide: https://www.rdkit.org/docs/GettingStartedInPython.html#working-with-3d-molecules
ETKDG paper: Riniker & Landrum, J. Chem. Inf. Model. 2015, 55, 2562
ETKDGv3: Wang et al., J. Chem. Inf. Model. 2020, 60, 2044

rdkit-conf

Installation