The official OpenMM scripts generate the trajectories that we featurize with PyEMMA:
simulatePdb.py: minimal input to validate the pipeline with a pentapeptide demo.simulateAmber.py: production with prmtop/inpcrd for the protein-ligand complex.simulateGromacs.py and simulateCharmm.py: multi-format support when the system comes from other packages.All trajectories are converted to DCD and paired with energy reports to verify stability before discretization.
We denote by $C_{ij}$ the count of transitions from microstate $i$ to $j$ at lag time $\tau$. The normalized transition matrix estimate
\[T_{ij} = \frac{C_{ij}}{\sum_j C_{ij}}\]is the core of the model. Its eigenvalues $\lambda_k$ give timescales via
\[\tau_k = -\frac{\tau}{\ln \lambda_k},\]and allow interpreting slow dynamics as near-invariant processes. The flux density is analyzed by combining stationary probabilities $\pi_i$ and PCCA transitions, while observables $(O)$ are evaluated as
\[\langle O \rangle = \sum_i \pi_i O_i.\]The Chapman-Kolmogorov test ensures that $T(\tau)^n \approx T(n\tau)$ within statistical error.
PyEMMA exposes a config object that controls logging, progress bars, trajectory caching, parallelism, and even automatic version checks. Use it interactively (see runtime configuration) to peek at the defaults and tweak them:
Every attribute listed under pyemma.util._config.Config in the API index can be inspected (e.g., .mute, .traj_info_max_entries, .use_trajectory_lengths_cache) and saved for reproduction.
Save your tuned configuration with config.save("/path/to/pyemma.cfg"), then point PYEMMA_CFG_DIR at the directory that contains the file so every import reads the same settings. The configuration loader searches:
./pyemma.cfg$HOME/.pyemma/pyemma.cfg~/pyemma.cfg$PYTHONPATH/pyemma/pyemma.cfg (always the default)This lets you version-control different logging settings (the .logging_config attribute points to the logging.yml file) and reuse them across the simple and complex notebooks.
PyEMMA honors PYEMMA_NJOBS and OMP_NUM_THREADS (along with SLURM_CPUS_ON_NODE when available) to cap cluster usage. The tutorials for MSM estimation and clustering spawn both threads and processes, so set PYEMMA_NJOBS before launching Jupyter if you need to limit the total worker count.
pyemma.coordinates: handles trajectory loading, caching, and transformation pipelines (see pyemma.coordinates.api). Always feed it the same normalized data used in the notebooks so the coordinate cache is coherent.pyemma.msm: powers the transition matrix, MSM, and PCCA/TPT routines. The methods there (estimate_markov_model, .timescales, .pcca) are referenced in both scripts, so compare their docstrings with the API page to understand additional arguments (lag time, reversible, etc.).pyemma.analysis and pyemma.plots: supply helpers for implied timescale plotting, ITS, and Chapman-Kolmogorov validation; follow the API links when you extend the example workflows.The official tutorial demonstrates the same featurization → tICA → clustering → MSM steps that the pentapeptide example covers. Pay attention to the explanation of pyemma.coordinates.source (for loading DCD/PDB pairs) and the bootstrap/ITS sections so you can mirror the same plots without errors.
For supplementary recipes and worked examples on MSM construction and visualization, see the msmhelper tutorials <https://moldyn.github.io/msmhelper/tutorials/>__ provided by MolDyn; they revisit the same ideas (featurization, dimension reduction, clustering, observables) and occasionally feature alternative observables you can compare against the PyEMMA output.
The advanced tutorial (“Protein-ligand transitions”) walks through building MSMs on larger datasets with pyemma.msm.markov_model and pyemma.plots.plot_cktest. Use it to validate the complex notebook’s clustering parameters and to create the same diagnostic figures with the stored transition matrix.
We now follow the official PyEMMA 00 – Showcase pentapeptide: a PyEMMA walkthrough (PyEMMA 2.5.5+9.ge9257b9.dirty) as our guided notebook. The walkthrough loads the pentapeptide dataset from mdshare, adds backbone torsions/positions/distances, scores the feature sets with VAMP2, applies TICA, clusters the projected data, builds a Bayesian MSM, validates it, and finally inspects PCCA/TPT, observables, and manuscript-quality figures.
PyEMMA internally inspects mdtraj.version.version. When you install MDTraj 1.9+ the mdtraj module does not expose a version attribute, which triggers AttributeError. The first cell in the notebook (and our helper script) patches the module so PyEMMA can access the already-installed version module:
import importlib
import mdtraj
if not hasattr(mdtraj, 'version'):
mdtraj.version = importlib.import_module('mdtraj.version')
import mdshare
import pyemma
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')
Three feature groups are built with pyemma.coordinates.featurizer: backbone torsions (with cossin expansion), backbone heavy atom positions, and backbone-heavy atom distances, all computed with periodic=False after a global alignment that removes the simulation box.
Each feature list is ranked with a cross-validated VAMP2 score that splits the 25 trajectories into training/validation, fits pyemma.coordinates.vamp, and scores the held-out half. Across lags of 0.1–0.5 ns the backbone torsions consistently return the highest kinetic variance, so the notebook continues with that featurization.
Time-lagged independent component analysis compresses the torsion data into a few kinetic components, and k-means discretizes that low-dimensional projection into microstates.
Time-lagged Independent Component Analysis (TICA) pulls out the kinetically slow degrees of freedom by searching for linear combinations of features with maximum time-lagged autocorrelation; the resulting ICs (IC1, IC2…) form the low-dimensional space where MSM transitions are best resolved.
The torsion-based data are projected with pyemma.coordinates.tica(..., lag=5) (0.5 ns) using the default kinetic map scaling and enough components to capture 95% of the kinetic variance. Histograms and densities on the first four ICs show well-separated metastable basins, and one trajectory is plotted over time to illustrate discrete jumps.
k-means clustering is applied to the TICA coordinates. A VAMP2 scan over [5, 10, 30, 75, 200, 450] cluster centers reveals saturation near k=75, which is therefore fixed for the remainder of the tutorial (fixed_seed=1 for reproducibility). The centers and discrete trajectories are shown against the first two TICA dimensions.
Implied timescales are computed with pyemma.msm.its(cluster.dtrajs, lags=50, nits=10, errors='bayes') and plotted in ns. The timescales converge above 0.5 ns, which justifies a lag time of 5 steps for pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=5, dt_traj='0.1 ns'). Active-state and active-count fractions are printed, and a Chapman–Kolmogorov test with mlags=6 over five metastable sets confirms Markovian consistency.
The Chapman–Kolmogorov (CK) test checks whether the transition probabilities propagate consistently across multiples of the lag time. By comparing estimates obtained at kτ directly to the k-step propagation of the original lag-τ transition matrix, we verify the Markov assumption: a passing CK test indicates the kinetics remain invariant under coarse-graining, lending confidence to the MSM’s lag choice.
The five curves sit on top of each other across all mlags, so raising the lag-5 matrix to powers faithfully reproduces the directly estimated longer-lag transition matrices; this strong agreement confirms τ=0.5 ns is already outside the memory regime and the implied timescales capture real slow processes rather than being influenced by the discretization.
The sample mean/std of the first 15 timescales are plotted together with their separations, showing the gap between the 4th and 5th process. Stationary distributions, free energies, and the first four right eigenvectors are visualized as contour plots on the first two TICA coordinates, revealing how slow processes connect the dense clusters.
msm.pcca(5) produces fuzzy memberships that are contoured across IC1/IC2, and the argument-wise argmax provides crisp metastable assignments. Sample structures are saved via pyemma.coordinates.save_traj and visualized with nglview, while the tutorial also prints state free energies, MFPT tables, and direction-specific MFPTs, showing metastable state 1’s short lifetime. Transition path theory between states 2 and 4 uses pyemma.msm.tpt, and the committor is plotted as a contour map.
Transition Path Theory (TPT) is the mathematical framework that uses these metastable sets to define reactive pathways, generate committor probabilities, and compute net fluxes between states. The committor tells us the chance that a trajectory starting in a configuration reaches the designated product set before returning to the reactant set, while the reactive flux highlights which microstate transitions actually carry probability current under equilibrium assumptions1. Together they transform the MSM into a pathway-level picture of the slow dynamics and guide which states to sample for experimental observables.
The committor gives the probability that a configuration will reach a specified final macrostate before returning to the initial macrostate, so contour maps of the committor reveal the transition pathways’ location, while the reactive fluxes computed by TPT count the net probability current between states. Examining both the committor surface and the fluxes ensures we understand the preferred transition routes in addition to the equilibrium populations1.
Samples of 20 frames per Markov state are converted to MDTraj objects for SASA (shrake_rupley) and radius of gyration (compute_rg). The Bayesian MSM evaluates ensemble expectations, standard deviations, and confidence intervals for the radius of gyration, while Trp-1 SASA is projected on the TICA plane to produce contour plots, autocorrelation functions, and relaxation curves for ML and Bayesian MSMs. Comparing metastable state expectations against the global average highlights state 1 as the most distinct ensemble.
The notebook notes that hidden Markov models, initialized from the PCCA macrostates, further reduce discretization error and provide a compact kinetic description; readers are directed to Notebook 07 for hands-on HMM exercises.
The final chapters reconfigure Matplotlib defaults and reproduce Figures 2–6 from the Showcase notebook, including the system overview, ITS + CK panels, free-energy and state maps, flux sketches, and Trp-1 autocorrelation/relaxation plots. Saved PDFs live in data/figure_[2-6].pdf. A Wrapping up section gestures to the remaining notebooks for deeper theory, TPT, HMMs, observables, and debugging tips.
Use PyEMMA Tutorial 1: simple MSM workflow to cross-check the coordinate pipelines and diagnostics before moving on to the protein complex.
mdshare’s pentapeptide dataset.Before the session ends, launch a project that covers the entire PyEMMA workflow and submit a ZIP archive containing the script/notebook you ran, any required trajectory/topology files (or download instructions), and the key plots that demonstrate your MSM analysis. Choose one of the following:
Deliverables:
The goal is to get comfortable with the full simulation‑to‑MSM pipeline and produce evidence (scripts + plots) you can share by this afternoon.