Week 8: Electronic Structure and Computational Chemistry II

Overview and Objectives

This week we will learn the basic concepts of molecular dynamics (MD) simulations and how to run MD simulations in Python.

Molecular dynamics simulation is broadly used to study a great diversity of compounds and states of matter. One especially significant application area is biomolecular systems, in particular proteins and their chemical environments (lipid membranes, carbohydrates, nucleic acids, water, ions, etc.). Similar to quantum chemistry, MD simulation is a field in itself, with hundreds of scientists in academia and industry dedicated to the development of MD methods and their application to problems in chemistry and biochemistry.

Generally speaking, MD simulation is concerned with simulating larger systems for longer times compared to the systems we studied last week. It also involves the generation and processing of large data sets. Learning how to run MD simulations and process their results requires combining the knowledge that you learned in all the previous weeks of the course: probability and statistics, integrating differential equations, cheminformatics, and working with external libraries using their APIs. Therefore, in a way MD is a nice “capstone subject” that summarizes much of what we’ve learned in this course.

Background

Put simply, MD simulates the motions of atoms and molecules using Newton’s second law. We first define a potential energy function of the atomic positions that represents the energy of an electronic state, usually the ground state; the energy is assumed to be a function of the nuclear positions only and unaffected by the nuclear velocities (this is called the Born-Oppenheimer approximation). Given the potential energy function of a molecular system, and a set of initial atomic positions and velocities, the atomic forces are calculated as a function of the positions. Newton’s second law converts the atomic forces into accelerations, which in turn are used to calculate the atomic positions and velocities one time step into the future (the size of the time step is usually around 1 fs, or \(10^{-15}\) s). The positions and velocities are then updated, and the procedure is repeated a large number of times to produce a trajectory that represents the simulated dynamics of the system over a time interval. Simulating a 1 microsecond trajectory therefore requires about one billion force evaluations.

Force Fields

The vast majority of MD simulations use an empirical model of the potential energy surface called a force field. Force fields are millions of times more efficient than quantum chemistry, which allows them to be applied to much larger system sizes and time scales. MD simulations that use force fields are often called “classical MD”, and they are routinely applied to simulate systems containing tens of thousands of atoms, such as proteins in explicit solvent, on timescales of microseconds or even longer. On the other hand, force fields are not based on “first principles” but instead are based on assumptions about the functional form of the potential energy surface together with fitted parameters. Therefore, classical MD simulations have more limitations in what systems, transformations, and conditions can be simulated.

There are many functional forms that one could use to model the potential energy surface, but the most widely used kind is rather simple and intuitive. The typical functional form of a force field can be broken down as follows:

\[ E = E_{\mathrm{bond}} + E_{\mathrm{angle}} + E_{\mathrm{dihedral}} + E_{\mathrm{nonbonded}} \]

where the four pieces represent bond stretching, angle bending, dihedral angles, and nonbonded interactions. The bond stretching and angle bending terms look like:

\[ E_{\mathrm{bond}} = \sum_{(i,j) \in \mathrm{bonds}} \frac{1}{2} k_{ij} (r_{ij} - r_{ij}^0)^2 ; \quad E_{\mathrm{angle}} = \sum_{(i,j,k) \in \mathrm{angles}} \frac{1}{2} k_{ijk} (\theta_{ijk} - \theta_{ijk}^0)^2 \]

In these two equations, the sums go over all of the bonded atom pairs and bond angles in the system, respectively. The potential energy for an individual bond stretch or angle bend is modeled as a harmonic oscillator, which is valid for small displacements around the equilibrium value. \( r_{ij} \) and \( \theta_{ijk} \) are the values of the bond length and bond angle calculated from the current atomic positions. \( r_{ij}^0 \) and \( \theta_{ijk}^0 \) are equilibrium bond and angle parameters, which stay constant throughout the simulation. Likewise, \( k_{ij} \) and \( k_{ijk} \) are force constant parameters; they were most likely fitted by the researchers who published the force field to reproduce molecular geometries and vibrational frequencies from QM calculations.
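To make these formulas concrete, here is a minimal NumPy sketch that evaluates the bond and angle energies for a toy three-atom molecule. The geometry and parameter values are invented for illustration and are not taken from any published force field.

```python
import numpy as np

def bond_energy(coords, bonds, kb, r0):
    """Sum of harmonic bond terms: 0.5 * k * (r - r0)**2."""
    energy = 0.0
    for (i, j), k, req in zip(bonds, kb, r0):
        r = np.linalg.norm(coords[i] - coords[j])
        energy += 0.5 * k * (r - req) ** 2
    return energy

def angle_energy(coords, angles, ka, theta0):
    """Sum of harmonic angle terms: 0.5 * k * (theta - theta0)**2."""
    energy = 0.0
    for (i, j, l), k, teq in zip(angles, ka, theta0):
        u = coords[i] - coords[j]
        v = coords[l] - coords[j]
        cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
        energy += 0.5 * k * (theta - teq) ** 2
    return energy

# Toy water-like geometry (nm); parameter values are illustrative only.
coords = np.array([[ 0.000, 0.000, 0.0],   # O
                   [ 0.100, 0.000, 0.0],   # H1
                   [-0.030, 0.095, 0.0]])  # H2
bonds, angles = [(0, 1), (0, 2)], [(1, 0, 2)]
print(bond_energy(coords, bonds, kb=[250000.0, 250000.0], r0=[0.096, 0.096]))
print(angle_energy(coords, angles, ka=[400.0], theta0=[np.deg2rad(104.5)]))
```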

At this point we should be aware of one important difference with QM calculations: In order for the potential energy to be defined we need to know which atoms are bonded. That information is contained within the topology, which is the list of the atoms and chemical bonds in the system. (Remember that in QM, it is through solving Schrodinger’s equation that we get all of the properties, without assuming which atoms are bonded.)

The dihedral energy terms go as:

\[ E_\mathrm{dihedral} = \sum_{(i,j,k,l) \in \mathrm{dihedrals}} \sum_{n=1}^6 k_{ijkl,n}\left(1+\cos(n\phi_{ijkl} - \phi_{ijkl,n}^0)\right) \]

The sum goes over the dihedral angles in the system, which includes all bonded atom quartets connected as \( i-j-k-l \). These energy terms model the steric and electronic contributions to the potential energy as the groups \(i\) and \(l\) are rotated about the central bond \(j-k\). Unlike bond stretching and angle bending, the dihedral potential energy is periodic in the dihedral angle \(\phi_{ijkl}\), and is given by a sum over \(n\)-fold sinusoidal components, where the parameters \(k_{ijkl,n}\) and \(\phi_{ijkl,n}^0\) give the amplitude and phase shift of each component. Dihedral terms can also be added to improper dihedral angles, which involve other atom quartets such as three atoms bonded to a central one; these are intended to model out-of-plane distortions of systems that are planar at equilibrium, such as aromatic rings.
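A similar sketch for the dihedral term: the function below computes the dihedral angle from four atomic positions and sums the periodic energy terms. The example torsion parameters are illustrative only.

```python
import numpy as np

def dihedral_angle(p0, p1, p2, p3):
    """Dihedral angle (radians) for atoms bonded as p0-p1-p2-p3."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1 = np.cross(b1, b2)                    # normal to plane (p0, p1, p2)
    n2 = np.cross(b2, b3)                    # normal to plane (p1, p2, p3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))

def dihedral_energy(phi, terms):
    """Sum of periodic terms k_n * (1 + cos(n*phi - phi0_n))."""
    return sum(k * (1.0 + np.cos(n * phi - phi0)) for n, k, phi0 in terms)

# Example: a 3-fold torsion (like rotation about an sp3-sp3 bond).
phi = np.deg2rad(60.0)
terms = [(3, 5.0, 0.0)]   # (periodicity n, amplitude k in kJ/mol, phase phi0)
print(dihedral_energy(phi, terms))
```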

Finally, the nonbonded energy goes as:

\[ E_\mathrm{nonbonded} = \sum_{(i,j) \in \mathrm{nonbonded}} \left[ 4 \epsilon_{ij} \left( \frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}}\right) + \frac{q_iq_j}{4\pi \epsilon_0 r_{ij}} \right] \]

This term describes intermolecular and intramolecular non-bonded interactions, and the sum goes over all non-bonded pairs of atoms that are at least three bonds apart. The first part is called the Lennard-Jones potential and models both the dispersion forces that exist between all pairs of non-bonded atoms, as well as the strong repulsive forces when the atoms approach too closely. The \( r^{-6} \) term has its origins in the quantum mechanical theory of dispersion interactions, but the \( r^{-12} \) term is a computational convenience that does not model the short-range repulsion very well. The parameters \( \sigma_{ij} \) and \( \epsilon_{ij} \) represent the “sum of atomic radii” and “well depth” respectively.

(pic of LJ potential)

The second part of the nonbonded term is the electrostatic interaction, which consists of Coulomb interactions between point charges on nonbonded atom pairs. The point charges \( q_i, q_j \) are again parameters. Commonly, atom pairs that are exactly three bonds apart have their nonbonded interactions scaled by a factor less than 1, because they are too close to be described purely by a nonbonded interaction and there is some overlap with the dihedral energy term.
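To make the nonbonded term concrete, here is a minimal sketch of the Lennard-Jones plus Coulomb energy for a single atom pair. OpenMM-style units are assumed (nm, kJ/mol, elementary charges), in which \( 1/(4\pi\epsilon_0) \approx 138.935 \) kJ mol\(^{-1}\) nm e\(^{-2}\); the distance, \(\sigma\), \(\epsilon\), and charge values are illustrative only.

```python
COULOMB = 138.935  # 1/(4*pi*eps0) in kJ/mol * nm / e^2

def lj_energy(r, sigma, epsilon):
    """Lennard-Jones: 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb_energy(r, qi, qj):
    """Point-charge Coulomb interaction."""
    return COULOMB * qi * qj / r

# Illustrative values: two atoms 0.35 nm apart with small opposite charges.
r = 0.35
print(lj_energy(r, sigma=0.32, epsilon=0.65) + coulomb_energy(r, qi=-0.3, qj=0.3))
```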

So in principle, if you are given a set of initial coordinates, a topology, and a complete set of parameters for every individual interaction in the system, you’re off to the races. But in practice a typical biomolecular system contains tens of thousands of atoms, which leads to hundreds of thousands of interactions or more. Therefore, the standard inputs to a classical MD simulation consist of the following components:

(1) The user needs to come up with the initial coordinates and the topology, which could be taken from the Protein Data Bank in .pdb format, generated by modeling software, or a combination of both. The atoms in a topology are grouped into residues, which are individual molecules or units of a polymer (e.g. amino acid residues in a protein). Each atom is labeled by the sequential number and name of the residue it belongs to, and by its own sequential atom number and atom name within the residue.

(2) The force field definition (often just called the “force field”) that comes with the MD simulation software contains a number of pre-defined residue templates, each of which contains the list of atoms in the expected residue, how the atoms are bonded, and the parameters to be assigned to each interaction. When the topology is read in, each residue is matched up with a template defined in the force field, and the individual bond, angle, dihedral, and nonbonded parameters are assigned accordingly. In practice, each atom in a residue template may have an atom type assigned to it (the atomic symbol plus its local chemical environment, such as an \( sp^2 \) carbon), and the non-electrostatic parameter values are looked up from a dictionary whose keys are one or more atom types. The electrostatic parameters tend to be defined individually for atoms in a residue template, so they are more fine-grained than the others.

So in summary, to start a classical MD simulation, the user needs the initial coordinates, the topology, and a choice of force field. The parameters are assigned to individual interactions by matching up the residues in the topology with the templates in the force field definition; this is called parameterizing the system. On the other hand, if a force field does not contain a residue template for a molecule (for example, if you wish to simulate a candidate drug molecule that the force field does not recognize), the template and parameters for that molecule must first be derived, which is a whole other subject. We will leave that alone for now.
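As a preview of how this looks in OpenMM (the package featured later in this unit), the following sketch reads a structure, picks a force field, and parameterizes the system. The input file name is a placeholder; the force field files named here are ones distributed with OpenMM.

```python
from openmm.app import PDBFile, ForceField, PME, HBonds
from openmm.unit import nanometer

# Initial coordinates + topology (placeholder file name).
pdb = PDBFile('protein_in_water.pdb')

# Force field definition: residue templates + parameters shipped with OpenMM.
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Match each residue against a template and assign all parameters.
system = forcefield.createSystem(pdb.topology,
                                 nonbondedMethod=PME,
                                 nonbondedCutoff=0.9*nanometer,
                                 constraints=HBonds)
print('Number of particles:', system.getNumParticles())
```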

Initial conditions

In order to start a simulation, the initial positions of the atoms are required. For a protein this could come from the .pdb file (remember that a .pdb file contains the coordinate and topology information of the atoms in the system), but there are many situations where this information is missing. For example, a .pdb file may be missing the positions of key residues that are disordered in the crystal, or you may want to simulate a protein that doesn’t have a deposited structure, or a material or liquid for which no structure exists. In these situations, the missing parts of the structure need to be modeled. Liquids are the easiest case because they are unstructured, and several programs exist for “packing” molecules randomly into a cubic box, e.g. PackMol or genbox (part of Gromacs). For simulations in aqueous solvent, protein simulation packages contain methods for modeling a large number of explicit water molecules around the protein structure. Non-crystalline solids and interfaces are much more difficult than liquids, and the modeling will depend strongly on the application. Proteins that are missing part or all of their structure can be modeled using software packages such as Rosetta, which uses statistical distributions of structures from the Protein Data Bank (in addition to other sources) to generate reasonably accurate structural models; this is also an active field of research.
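For example, OpenMM provides a Modeller class that can add missing hydrogens and surround a protein with a box of explicit water; a minimal sketch, with a placeholder input file:

```python
from openmm.app import PDBFile, ForceField, Modeller
from openmm.unit import nanometer

pdb = PDBFile('protein.pdb')                              # placeholder file name
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

modeller = Modeller(pdb.topology, pdb.positions)
modeller.addHydrogens(forcefield)                         # add any missing hydrogens
modeller.addSolvent(forcefield, padding=1.0*nanometer)    # water box with 1 nm padding
print('Atoms after solvation:', modeller.topology.getNumAtoms())
```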

The initial velocities are much simpler; these are almost always drawn from the Maxwell-Boltzmann distribution, which describes how atomic velocities are distributed in a classical ideal gas. It turns out that the M-B distribution is valid for any classical system of particles regardless of their interactions. Each Cartesian velocity component follows a Gaussian distribution that becomes wider at higher temperatures, and the initial atomic velocities are assigned by drawing random samples from this Gaussian.
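A minimal NumPy sketch of this velocity assignment: each Cartesian component for atom \(i\) is drawn from a Gaussian with zero mean and standard deviation \(\sqrt{k_B T/m_i}\). OpenMM-style units (nm, ps, amu, kJ/mol) are assumed here, so \(k_B \approx 0.00831\) kJ/(mol K) and \(\frac{1}{2}mv^2\) comes out directly in kJ/mol.

```python
import numpy as np

KB = 0.008314462  # Boltzmann constant in kJ/(mol*K), matching nm/ps/amu units

def maxwell_boltzmann_velocities(masses, temperature, rng=None):
    """Draw initial velocities (nm/ps) for atoms with the given masses (amu)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(KB * temperature / masses)        # per-atom Gaussian width
    return rng.normal(size=(len(masses), 3)) * sigma[:, None]

masses = np.array([15.999, 1.008, 1.008])             # e.g. one water molecule
v = maxwell_boltzmann_velocities(masses, temperature=300.0)
ke = 0.5 * np.sum(masses[:, None] * v**2)             # kinetic energy in kJ/mol
print(v, ke)
```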

Integration method

The integration method is the algorithm for how to generate the future positions and velocities from the current positions, velocities, and accelerations. At first glance this might seem very simple because it’s just Newton’s Second Law; in fact, there are multiple ways to update the positions and velocities, with major differences in their accuracy. The accuracy of an integration method can be defined in terms of how closely a trajectory using a finite time step (e.g. 1 fs) matches the ideal limit of infinitely small time steps, or more commonly, how well the algorithm conserves the total energy.

The simplest integration method is called “forward Euler” and involves the following:

1) At the current positions, evaluate the force \(F(x)\), then compute the acceleration as \( a = F(x)/m \).

2) Compute the next set of positions as \( x_+ = x + v(\delta t) + \frac{1}{2} a (\delta t)^2\)

3) Compute the next set of velocities as \( v_+ = v + a(\delta t)\)

4) Step forward in time by setting \( x = x_+; v = v_+ \) and go back to step 1.

Unfortunately, forward Euler is a terrible method and does not conserve the energy well, even for systems with analytic solutions such as the harmonic oscillator or the 2-body problem. It is easy to show that the energy will steadily increase and eventually diverge using this integrator. A much better way of integrating the equations of motion is the velocity Verlet approach, which involves only a simple modification of the above algorithm:

1) At the current positions, evaluate the force \(F(x)\), then compute the acceleration as \( a = F(x)/m \).

2) Compute the next set of positions as \( x_+ = x + v(\delta t) + \frac{1}{2} a (\delta t)^2\)

3) Compute the next set of accelerations as \(a_+ = F(x_+)/m \)

4) Compute the next set of velocities as \( v_+ = v + \frac{1}{2}(a + a_+)(\delta t)\)

5) Step forward in time by setting \( x = x_+; v = v_+ ; a = a_+ \) and go back to step 2.

The key improvement in velocity Verlet is that the velocity update is computed by taking the averaged acceleration between the current and future time, instead of just the current time. That way, it takes into account the changes in the force / acceleration over the time step. Using velocity Verlet, it is possible to achieve very good energy conservation and it is a commonly used approach in research-grade simulations.
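The difference is easy to demonstrate on a 1D harmonic oscillator, which has an exact solution. The sketch below integrates the same system with both schemes and compares the final total energy to the initial value; the mass, force constant, and time step are arbitrary illustrative choices.

```python
import numpy as np

def force(x, k=1.0):
    return -k * x                          # harmonic oscillator, F = -k x

def energy(x, v, m=1.0, k=1.0):
    return 0.5 * m * v**2 + 0.5 * k * x**2

def integrate(method, n_steps=10000, dt=0.05, m=1.0):
    x, v = 1.0, 0.0
    a = force(x) / m
    for _ in range(n_steps):
        x_new = x + v * dt + 0.5 * a * dt**2
        if method == 'euler':
            v = v + a * dt                 # velocity update uses current acceleration only
            x = x_new
            a = force(x) / m
        else:                              # velocity Verlet
            a_new = force(x_new) / m
            v = v + 0.5 * (a + a_new) * dt
            x, a = x_new, a_new
    return energy(x, v)

print('initial energy :', energy(1.0, 0.0))
print('forward Euler  :', integrate('euler'))    # energy drifts upward and diverges
print('velocity Verlet:', integrate('verlet'))   # energy stays close to the initial value
```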

There are a few other MD integration methods in the literature, and many of them are based on velocity Verlet. In principle, any numerical integration algorithm for ODEs (such as Runge-Kutta) could be applied to MD simulations. The reason this is rarely done is that it is desirable for MD integrators to possess some of the same properties as exact classical dynamics, such as being time-reversible and preserving the volume of phase space (an integrator with the latter property is called symplectic). The development of MD integration methods, primarily for the purpose of taking larger time steps (thus enabling the simulation of longer time scales), remains an active area of research.

Periodic boundary conditions

Most MD simulations are performed on systems in the condensed (i.e. not gas) phase. This presents a problem because the simulation can only handle a finite number of atoms (usually less than 100,000), and an isolated system of that size would essentially be a “nano-droplet” with very different properties compared to aqueous solution (primarily because the solvent structure will change at the vacuum boundary and affect the interior). The standard way to handle this in MD simulation is by using periodic boundary conditions, in which the atomic positions are defined in a “unit cell” and the positions are replicated infinitely along all three lattice vectors. This unit cell may contain a protein surrounded by 10,000 water molecules, or a liquid consisting of 1,000 or more molecules.

Computational tricks make it relatively straightforward to evaluate intermolecular interactions across cell boundaries. Typically one needs to define a “cut-off” distance for Lennard-Jones interactions, which is around 9 Angstrom (0.9 nanometers) in practice. Electrostatic interactions, on the other hand, are long-range and do not work well with cut-offs. The usual approach for electrostatics is the “particle-mesh Ewald” (PME) method, which splits the electrostatic interaction into a short-range part that can be evaluated with cutoffs and a long-range part that can be Fourier-transformed and summed using a cutoff in reciprocal space. The “particle-mesh” part comes from a step in which the atomic charges are interpolated onto a grid before taking the Fourier transform.
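The most basic of these tricks is the minimum-image convention: when computing the distance between two atoms, each component of the displacement is shifted by a whole number of box lengths so that the nearest periodic image is used. A minimal sketch for an orthorhombic box, with arbitrary example coordinates:

```python
import numpy as np

def minimum_image_distance(ri, rj, box):
    """Distance between two atoms under periodic boundary conditions
    for an orthorhombic box (box = array of the three box edge lengths)."""
    dr = ri - rj
    dr -= box * np.round(dr / box)   # shift each component into [-box/2, box/2)
    return np.linalg.norm(dr)

box = np.array([3.0, 3.0, 3.0])      # 3 nm cubic box
ri = np.array([0.1, 0.2, 2.9])
rj = np.array([2.9, 0.1, 0.2])
print(minimum_image_distance(ri, rj, box))   # ~0.37 nm, not ~3.9 nm
```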

Temperature and pressure control

A good MD integrator such as velocity Verlet will preserve the total energy of the system (KE + PE). However, the systems that we’re actually interested in simulating are usually at constant temperature (NVT) or constant temperature and pressure (NPT). A system at constant temperature does not have a constant total energy; instead, the total energy follows a probability distribution called the Boltzmann distribution. MD simulations at constant temperature use an algorithm called a “thermostat” to ensure that the energy obeys the Boltzmann distribution over long periods of time. The flip side is that the thermostat always adjusts the total energy by perturbing the velocities, which can adversely affect the MD simulation’s ability to reproduce the time-dependent properties of the system. Thermostats have a set point which determines the average kinetic energy that the atoms should attain; at any instant, the instantaneous temperature may be computed from the atomic velocities as: \[ T_\mathrm{instant} = \frac{2\,\mathrm{KE}}{3 N k_B} \] where KE is the total kinetic energy and \(N\) is the number of atoms. The instantaneous temperature will not be exactly equal to the set point, but over long times it will reach a distribution that averages to the set point.
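A minimal sketch of this calculation, in the same nm/ps/amu/kJ-per-mol units as before (a real MD package would also subtract constrained degrees of freedom from \(3N\)):

```python
import numpy as np

KB = 0.008314462  # kJ/(mol*K)

def instantaneous_temperature(masses, velocities):
    """T = 2*KE / (3*N*kB), with masses in amu and velocities in nm/ps."""
    ke = 0.5 * np.sum(masses[:, None] * velocities**2)   # kinetic energy, kJ/mol
    n_dof = 3 * len(masses)                              # ignoring constraints
    return 2.0 * ke / (n_dof * KB)

masses = np.array([15.999, 1.008, 1.008])
velocities = np.array([[ 0.1, -0.2,  0.3],
                       [ 1.0,  0.5, -0.8],
                       [-0.6,  0.9,  0.4]])              # nm/ps, arbitrary values
print(instantaneous_temperature(masses, velocities))
```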

Perhaps the simplest thermostat is the Andersen thermostat, in which the velocities are simply reset to random values drawn from the Maxwell-Boltzmann distribution at regular intervals (for example, every 1000 time steps). Because this method can adversely affect the kinetic properties of the system, a number of improvements have been proposed; for example, the Bussi-Donadio-Parrinello thermostat randomly rescales the magnitudes of the velocities but not their directions, resulting in better reproduction of kinetic properties. A Langevin thermostat achieves the Boltzmann distribution by way of a random force (which tends to speed up particles at rest) and a friction force (which tends to slow down fast particles). There are several other thermostats out there; it is important to mention the Berendsen thermostat, which gives the correct average temperature but not the right distribution. Therefore it is not recommended to use the Berendsen thermostat in production simulations.

At constant pressure, the system volume is not constant but obeys a Boltzmann distribution. A barostat is an algorithm that adjusts the system volume according to the distribution, similar to how the thermostat adjusts the velocities. Barostats are relatively finicky compared to thermostats, and the mathematics of their implementation can be more complicated. The Monte Carlo barostat is a simple and relatively well-behaved barostat that randomly adjusts the system volume at regular intervals in order for the long-time simulation results to match the Boltzmann distribution.

Minimization, equilibration, production

When using MD simulations in research, the most common problem is that the system “blows up”. More subtle and far worse, the simulation might be defective in a small but significant way (e.g. a key hydrogen bond or salt bridge in a protein is broken); such a simulation can run normally and produce a large data set, but the defect makes all of the results questionable. Typically, a minimization and equilibration procedure is carried out to reduce these problems, although the details are not standardized across research groups:

1) The initial coordinates are “relaxed” by energy minimization. This is important in order to remove any high-energy interactions present in the original structure, which may come from atoms being placed too close together in the experimental structure or by the modeling software. Optionally, some of the atoms (e.g. the protein atoms in the crystal structure) may be restrained to their original positions by harmonic restraining potentials to prevent the minimization procedure from disrupting the structure too much.

2) The minimized coordinates are subjected to an “equilibration” procedure in which the simulation is brought up to the target temperature (often 298 or 300 K). At its simplest, this is an NVT or NPT simulation that lasts for 10 ns or so, and its purpose is to move the minimized structure out of any high-energy local minima so that it becomes a reasonable sample from the Boltzmann distribution. The system can be heated up in steps (for example, in ten steps of 30 K lasting 1 ns each), and the simulation may start with harmonic restraints to the original structure that are gradually weakened and eventually removed. At the end of this procedure, the instantaneous temperature should be close to the set point and the structure is assumed to be a typical random sample from the Boltzmann distribution.

3) The “production” simulation is then started, which involves collecting trajectory data at regular intervals; this is the main segment of the simulation and usually the most computationally demanding portion of the project. Because of the long simulation times (often more than 1 billion time steps), trajectory data is usually not saved at every time step, but at regular intervals (once every 50,000 time steps or so is common). The trajectory is usually saved into a high-efficiency compressed format to save space; despite this, typical trajectory files can reach into the tens or hundreds of GB.
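A sketch of this three-stage workflow in OpenMM: the system is parameterized as before, a Langevin integrator serves as both integrator and thermostat, a Monte Carlo barostat gives NPT conditions, and reporters write the production trajectory and a log file. All file names, step counts, and reporting intervals here are illustrative.

```python
from openmm.app import (PDBFile, ForceField, Simulation, PME, HBonds,
                        DCDReporter, StateDataReporter)
from openmm import LangevinMiddleIntegrator, MonteCarloBarostat
from openmm.unit import nanometer, picosecond, femtosecond, kelvin, bar

pdb = PDBFile('protein_in_water.pdb')                      # placeholder input
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME,
                                 nonbondedCutoff=0.9*nanometer, constraints=HBonds)
system.addForce(MonteCarloBarostat(1.0*bar, 300*kelvin))   # NPT via a barostat

integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 2*femtosecond)
simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)

# 1) Minimization: relax any high-energy contacts in the initial structure.
simulation.minimizeEnergy()

# 2) Equilibration: assign Maxwell-Boltzmann velocities and run briefly at 300 K.
simulation.context.setVelocitiesToTemperature(300*kelvin)
simulation.step(50_000)                                    # 100 ps, illustrative

# 3) Production: attach reporters, then run the main data-collection segment.
simulation.reporters.append(DCDReporter('traj.dcd', 5_000))
simulation.reporters.append(StateDataReporter('log.csv', 5_000, step=True,
                                              temperature=True, potentialEnergy=True))
simulation.step(500_000)                                   # 1 ns, illustrative
```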

Analysis of simulation results

The purpose of an MD study is to simulate a property of interest and gain insights by studying the simulation trajectory. Often the initial step is to open the trajectory in a visualization program, such as VMD, and “play the movie” to get a sense for what has happened in the simulation. Sometimes interesting processes happen during the simulation, such as a protein folding or a liquid freezing; other times, the main findings come from statistical analysis of the trajectory, such as the distribution of distances between two key residues in a protein. Specialized trajectory analysis programs, such as cpptraj and mdtraj, are designed for high-performance analysis of trajectories, and they can be combined with plotting software such as Matplotlib to produce insights into these large data sets.
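For example, with the mdtraj library one could load a trajectory (placeholder file names below), compute the distance between two atoms of interest over all frames, and histogram it with Matplotlib; the atom indices are arbitrary.

```python
import mdtraj as md
import matplotlib.pyplot as plt

# Placeholder file names; the topology tells mdtraj what the atoms are.
traj = md.load('traj.dcd', top='protein_in_water.pdb')

# Distance (in nm) between two atoms of interest over all frames.
pair = [[10, 200]]                          # arbitrary atom indices
distances = md.compute_distances(traj, pair)[:, 0]

plt.hist(distances, bins=50)
plt.xlabel('distance (nm)')
plt.ylabel('count')
plt.savefig('distance_histogram.png')
```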

MD software (OpenMM in particular)

There are about a dozen widely used software packages for molecular dynamics simulation. Some of the most widely used ones include AMBER, CHARMM, GROMACS, OpenMM, TINKER, and LAMMPS. Each of these software packages has a different licensing model, which governs how one can obtain and share the program and its source code. With a few exceptions, MD software tends to be free for academic researchers to use.

OpenMM is the MD simulation package that we will feature in this course. It is open source, meaning that it is free to use and the source code is free to obtain and share. OpenMM is perhaps unique among these packages in that its primary user interface is an API that is accessible from several programming languages, including Python. The benefit of this approach is that the setup, execution, and analysis of OpenMM simulations can all be done from within Python scripts or Jupyter notebooks. OpenMM is also a high-performance package that includes GPU implementations of the most computationally intensive parts, thus achieving much higher speeds than purely CPU-based simulations. A Jupyter notebook that demonstrates how OpenMM works will be provided as part of this unit.

When working with OpenMM, it is important to be aware of the unit system it uses: length is measured in nanometers (nm), time in picoseconds (ps), mass in amu / Dalton, and energy in kJ/mol.
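OpenMM attaches explicit units to quantities through its openmm.unit module, which makes unit conversions explicit; a small illustration:

```python
from openmm.unit import nanometer, angstrom, picosecond, femtosecond

length = 0.9 * nanometer
print(length.value_in_unit(angstrom))      # 9.0

dt = 2 * femtosecond
print(dt.value_in_unit(picosecond))        # 0.002
```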

Additional notes

There are countless ways in which MD can be extended beyond what is described here. I will touch on some of these directions very briefly.

An ab initio MD (AIMD) simulation is one that uses quantum chemistry in place of the force field, i.e. it uses quantum mechanics to compute the electronic wavefunction, potential energies, and nuclear forces, then uses Newton’s Second Law to propagate the atoms. AIMD has broader predictive power due to making fewer assumptions, but because it is millions of times more expensive, it generally cannot be used for the same applications as force fields. AIMD can be applied to study systems under extreme conditions out of reach of experiments, or to predict reactivity in systems where the reaction mechanism is unknown.

Perhaps confusingly, quantum dynamics is not the same as ab initio MD. Quantum dynamics simulations take into account the quantum mechanical behavior of the nuclei: the time-dependent Schrodinger equation is used for time evolution in place of Newton’s Second Law. Because time evolution under QM is much more complicated than classical mechanics, a number of approximations need to be made. One common quantum dynamics method is path-integral molecular dynamics, which takes the form of running multiple copies of an MD simulation in which the different copies of an atom are connected by harmonic springs in a loop; this approach is mainly used to study quantum effects on thermodynamic properties. Another branch of quantum dynamics aims to study the properties of electronic excited states, which can involve transitions between different states (a breakdown of the Born-Oppenheimer approximation). Methods in this area have names such as surface hopping, Ehrenfest dynamics, and multiple spawning. Quantum dynamics simulations can use either quantum chemistry or force fields to describe the potential energy, i.e. the “quantum” in quantum dynamics refers purely to how the nuclear dynamics are treated.

Even though classical MD can simulate tens of thousands of atoms on microsecond or longer timescales, there are all sorts of interesting problems that are far beyond these limits (for example, simulating the formation of an amyloid fibril, important for understanding neurodegenerative disease, requires second timescales or even longer). To reach these longer timescales, researchers have developed a wide range of creative methods. One branch of method development focuses on enhancing the sampling efficiency by applying biases or random perturbations along specified degrees of freedom; these methods have names such as umbrella sampling, metadynamics, and replica exchange. Another field of research involves coarse-graining the MD simulation so that individual particles no longer represent atoms but entire functional groups or even whole amino acids. There are also hybrid methods that combine experiment and simulation, for example by applying experimental signals as energy terms in a simulation to force the results to be consistent with the signal. In short, the development of MD methods is a very active field with many promising applications for those who are interested.

Further Reading

(add a link to OpenMM documentation page)