RDKit Molecular Featurization
This skill provides practical command patterns for RDKit descriptor and fingerprint extraction
using the standardized CLI wrapper: <skill_path>/scripts/rdkit_helper.py.
Key behaviors (important for Agents):
- The script prints environment detection (Python/RDKit/NumPy/Pandas) by default.
- Bad/illegal SMILES are skipped and logged to *.skipped.csv (no crash).
- Each run ends by printing absolute output paths like:
[RESULT] desc_csv=/abs/path.csv
[RESULT] fp_npy=/abs/path.npy
[RESULT] fp_csv=/abs/path.csv
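An agent can pick these paths out of captured stdout with a small parser. The sketch below assumes every result line has the form [RESULT] key=path shown in the examples above; the name parse_result_paths is illustrative and is not part of rdkit_helper.py.

```python
import re

# Assumed line shape, taken from the examples above: "[RESULT] <key>=<absolute path>"
RESULT_RE = re.compile(r"^\[RESULT\]\s+(\w+)=(.+)$")


def parse_result_paths(stdout):
    """Collect key -> path pairs from [RESULT] lines; all other lines are ignored."""
    paths = {}
    for line in stdout.splitlines():
        match = RESULT_RE.match(line.strip())
        if match:
            paths[match.group(1)] = match.group(2).strip()
    return paths
```

Passing the captured stdout of a run returns a dictionary such as {"desc_csv": "/abs/path.csv"}, which can be handed to the next step without guessing file locations.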
Quick Start
Check CLI help:
uv run <skill_path>/scripts/rdkit_helper.py --help
Check subcommand help:
uv run <skill_path>/scripts/rdkit_helper.py desc --help
uv run <skill_path>/scripts/rdkit_helper.py fp --help
uv run <skill_path>/scripts/rdkit_helper.py list-desc --help
Disable environment printing (optional):
uv run <skill_path>/scripts/rdkit_helper.py --no-env desc --smiles "CCO" --output out.csv
Core Tasks
1) Compute physicochemical descriptors → .csv
Single SMILES (default preset: physchem, 25 descriptors):
uv run <skill_path>/scripts/rdkit_helper.py desc \
--smiles "CCO" \
--output /tmp/CCO.desc.csv
From CSV (default SMILES column is smiles):
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv \
--smiles-col smiles \
--output data.desc.csv
From SMI:
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file molecules.smi \
--output molecules.desc.csv
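For a quick pre-check of a .smi file before running the helper, the first-token convention (the first token of each line is the SMILES) can be mirrored in a few lines of Python. This is a sketch, not the helper's actual parser: read_smi is a hypothetical name, and skipping blank lines is an assumption.

```python
def read_smi(text):
    """Return the first whitespace-separated token of each non-blank line."""
    smiles = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:  # assumption: blank lines carry no molecule and are skipped
            smiles.append(tokens[0])
    return smiles
```

Comparing len(read_smi(...)) with the number of rows in the output CSV gives a rough count of how many entries were skipped.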
Choose a descriptor preset:
# Lipinski drug-likeness (6 descriptors: MolWt, MolLogP, NumHDonors, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv --preset lipinski --output data.lipinski.csv
# Extended physicochemical (25 descriptors, default)
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv --preset physchem --output data.physchem.csv
# Topological / graph indices (56 descriptors: BalabanJ, BertzCT, Chi*, PEOE_VSA*, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv --preset topological --output data.topo.csv
# All RDKit descriptors (~200 descriptors)
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv --preset all --output data.all_desc.csv
Select specific descriptors (overrides --preset):
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv \
--descriptors "MolWt,MolLogP,TPSA,NumHDonors,NumHAcceptors" \
--output data.custom.csv
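The custom selection above contains the four descriptors used by Lipinski's rule of five, so the resulting CSV can be filtered downstream. The sketch below assumes the output has a smiles column plus columns named exactly after the requested descriptors; filter_ro5 and ro5_violations are illustrative names, and allowing one violation is a common convention rather than something the helper enforces.

```python
import csv
import io

# Rule-of-five limits: each exceeded limit counts as one violation.
RO5_LIMITS = {
    "MolWt": 500.0,
    "MolLogP": 5.0,
    "NumHDonors": 5.0,
    "NumHAcceptors": 10.0,
}


def ro5_violations(row):
    """Count how many rule-of-five limits a descriptor row exceeds."""
    return sum(1 for name, limit in RO5_LIMITS.items() if float(row[name]) > limit)


def filter_ro5(csv_text, max_violations=1):
    """Return the SMILES whose rows have at most max_violations violations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["smiles"] for row in reader if ro5_violations(row) <= max_violations]


# Usage (path taken from the command above):
# with open("data.custom.csv", newline="") as handle:
#     drug_like = filter_ro5(handle.read())
```

Rows with empty descriptor values would raise a ValueError in float(); handle them explicitly if the input may contain gaps.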
Suppress merging back original CSV columns (output only smiles + descriptors):
uv run <skill_path>/scripts/rdkit_helper.py desc \
--file data.csv --preset physchem --no-merge --output data.desc_only.csv
2) Compute molecular fingerprints → .npy or .csv
Available fingerprint types:
| Type | Description | Default bits |
|---|---|---|
| morgan2 | Morgan circular FP radius 2 (ECFP4-like), bit vector | 2048 |
| morgan3 | Morgan circular FP radius 3 (ECFP6-like), bit vector | 2048 |
| morgan2_count | Morgan radius-2 count vector | 2048 |
| rdkit | RDKit path-based FP, bit vector | 2048 |
| maccs | MACCS structural keys (166 keys stored in a 167-bit vector, --nbits ignored) | 167 |
| topological | Topological torsion FP (count vector, hashed to --nbits) | 2048 |
| atompair | Atom-pair FP (count vector, hashed to --nbits) | 2048 |
| layered | Layered substructure FP, bit vector | 2048 |
| pattern | SMARTS pattern FP, bit vector | 2048 |
Single SMILES, output as NumPy array (.npy):
uv run <skill_path>/scripts/rdkit_helper.py fp \
--smiles "CCO" \
--type morgan2 \
--output /tmp/CCO.morgan2.npy
From CSV, Morgan ECFP4 (2048 bits):
uv run <skill_path>/scripts/rdkit_helper.py fp \
--file data.csv \
--smiles-col smiles \
--type morgan2 \
--nbits 2048 \
--output data.morgan2.npy
From SMI, MACCS keys (always 167 bits):
uv run <skill_path>/scripts/rdkit_helper.py fp \
--file molecules.smi \
--type maccs \
--output molecules.maccs.npy
Output as CSV (smiles + bit_0 … bit_N-1 columns):
uv run <skill_path>/scripts/rdkit_helper.py fp \
--file data.csv \
--type rdkit \
--nbits 1024 \
--format csv \
--output data.rdkfp.csv
Atom-pair fingerprint, 4096 bits:
uv run <skill_path>/scripts/rdkit_helper.py fp \
--file data.csv \
--type atompair \
--nbits 4096 \
--output data.atompair.npy
3) List available descriptors
List all descriptors and built-in presets:
uv run <skill_path>/scripts/rdkit_helper.py list-desc
List descriptors in a specific preset group:
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group lipinski
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group physchem
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group topological
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group all
Descriptor Presets Reference
| Preset | Count | Typical Use |
|---|---|---|
| lipinski | 6 | Quick drug-likeness screening (Ro5 filter) |
| physchem | 25 | General ML features: MW, logP, TPSA, ring counts, charge stats, … |
| topological | 56 | Graph/topology indices: Balaban J, Kappa, Chi, PEOE_VSA, EState_VSA, … |
| all | ~200 | Full RDKit descriptor set (includes fragment counts, MQN, etc.) |
Output Format Notes
desc output (CSV):
- Columns: smiles, then one column per descriptor.
- When --file is a .csv and --no-merge is not set, original CSV columns are appended.
- Rows only contain valid SMILES (invalid ones are logged to *.skipped.csv).
fp output:
- .npy (default): NumPy array of shape (N_valid, nbits), dtype uint8 (bit) or int32 (count).
- .csv: smiles column followed by bit_0 … bit_{nbits-1} columns.
- MACCS keys always produce 167 bits regardless of --nbits.
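The CSV layout makes fingerprints easy to consume without NumPy. As a minimal sketch, assuming the column names smiles and bit_0 … bit_{nbits-1} described above, the following reads a fingerprint CSV and computes binary Tanimoto similarity between two rows. The function names are illustrative, and count fingerprints are reduced to on/off bits here, which discards the counts.

```python
import csv
import io


def load_fp_csv(csv_text):
    """Return (smiles, on_bits) pairs; on_bits holds indices of non-zero bit_i columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fingerprints = []
    for row in reader:
        on_bits = {
            int(name[len("bit_"):])
            for name, value in row.items()
            if name.startswith("bit_") and float(value) != 0
        }
        fingerprints.append((row["smiles"], on_bits))
    return fingerprints


def tanimoto(a, b):
    """Binary Tanimoto: shared on-bits divided by the union of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

Two molecules that share one on-bit out of three distinct on-bits score 1/3; two empty fingerprints are defined as 0.0 here to avoid division by zero.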
Agent Checklist
When using this skill for users:
- Confirm input format:
  - .csv requires a SMILES column (default smiles)
  - .smi uses the first token of each line as SMILES
- Quote SMILES containing special shell characters (brackets/parentheses):
  - Example: --smiles "[C@@H](O)(F)Cl"
- For CSV workflows, verify column names:
  - desc: --smiles-col
  - fp: --smiles-col
- Choose the right preset or fingerprint type for the downstream task:
  - Drug screening / Ro5: --preset lipinski
  - General ML featurization: --preset physchem or --type morgan2
  - Structural similarity search: --type morgan2 or --type rdkit
  - Substructure screening (pre-filter before exact matching): --type maccs or --type pattern
- Watch for skipped SMILES:
  - Check *.skipped.csv and decide whether to fix or permanently drop them
- Always capture absolute output paths:
  - Look for [RESULT] ...=/abs/path in stdout
- If debugging is needed, enable full traceback:
RDKIT_HELPER_TRACE=1 uv run <skill_path>/scripts/rdkit_helper.py ...
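When the helper is driven from Python rather than an interactive shell, passing arguments as a list sidesteps the quoting issue entirely, and the trace variable can be set per call. The sketch below uses only flags shown in this document; build_desc_command and run_with_trace are hypothetical names, and script_path stands in for <skill_path>/scripts/rdkit_helper.py.

```python
import os
import shlex
import subprocess


def build_desc_command(script_path, smiles, output, no_env=False):
    """Build the argv list for a single-SMILES descriptor run."""
    cmd = ["uv", "run", script_path]
    if no_env:
        cmd.append("--no-env")  # global flag, placed before the subcommand
    cmd += ["desc", "--smiles", smiles, "--output", output]
    return cmd


def run_with_trace(cmd):
    """Run the command with RDKIT_HELPER_TRACE=1 and capture stdout/stderr as text."""
    env = dict(os.environ, RDKIT_HELPER_TRACE="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# shlex.join renders a copy-pasteable shell line with the SMILES safely quoted.
example = build_desc_command("rdkit_helper.py", "[C@@H](O)(F)Cl", "out.csv")
shell_line = shlex.join(example)
```

Because no shell is involved in subprocess.run with a list, brackets and parentheses in the SMILES reach the script unchanged.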
References
- RDKit documentation: https://www.rdkit.org/docs/
- RDKit descriptor list: https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors
- RDKit fingerprint guide: https://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity