GCMC titration and grand canonical integration

Here, we'll go through the basics of setting up running a GCMC simulation to calculate the absolute binding free energy for a small network of waters. The example we use is a small binding site in BPTI that can bind 3 water molecules, which has been taken from the recent ProtoMS paper: G. A. Ross et. al, J. Am. Chem. Soc., 2015. Although it's a small site, exactly the same procedure can be followed to calculate the binding free energy of a large network.

To come back to the tutorial index, click here.


The protein has already been protonated and the atoms names have translated to the ProtoMS naming scheme. You can read more about setting up structures here.

Simple setup

Before running GCMC, you must carefully select the region where you'd like to perform GCMC. During the simulations, waters will be inserted and deleted within that region, and the free energy to add or remove waters can be calculated using grand canonical integration. Therefore, you should make sure the GCMC region covers the volume you're interested in.

In protein_pms.pdb, a box centred at (32.0, 7.0, 2.0) whose sides have lengths (3.5, 4.0, 8.0) encompasses a small cavity. The aim is to use GCMC to calculate the total affinity of water for that site. We can create a box to these specifications by typing

python $PROTOMSHOME/tools/make_gcmcbox.py -b 32.0 7.0 2.0 3.5 4.0 8.0 -o gcmc_box.pdb

This creates a file called gcmc_box.pdb while outputting:

Volume of GCMC box: 112.0
Bequil: -9.15

We need to know the volume of the GCMC box later for analysis, but will use Bequil to set up the simulation. Bequil is calculated using:

Bequil = βμ'hyd + ln(V/Vo),

where Bequil is the Adams value that produces the equilibrium number of waters for the system. μ'hyd is the excess chemical potential of bulk water, which is the hydration free energy of a single water molecule. In ProtoMS, previous analysis has found that μ'hyd equals -6.2 kcal/mol. V here is the volume of the gcmc region and Vo is the volume occupied by a single water molecule at bulk density and has a value of 30 3. As the region specified by gcmc_box.pdb has a volume of 112 3, the above equation gives Bequil to be -9.15.

The box we've just created and the small cavity in the protein look like:

The figure shows a slice through the surface of the protein, and gcmc_box.pdb encompassing the small, sock-like cavity we're interested in.

As we wish to completely bind water to the volume specified by gcmc_box.pdb, we must run a series of GCMC simulations at different chemical potentials (Adams value in ProtoMS), within which the average number of inserted waters ranges from 0 to the equilibrium number of waters for the system. As we know Bequil is -9.15, we want to chose a range of waters which encompasses this value towards its higher end. We will chose a range of -6.8 to -29.8. As we would like this to run on a single, 16 processor node, we will run 16 B values between these two points. If you would like to set this simulation up for a different machine, change the value to 12 or 24, or any appropriate value.

python $PROTOMSHOME/protoms.py -s gcmc -sc protein_pms.pdb --gcmcbox gcmc_box.pdb --adamsrange -29.8 -6.8 16
This has automatically solvated our protein in a droplet of water (water.pdb) by randomly placing waters up to bulk density. Any solvent water that was placed inside gcmc_box.pdb has been removed to create water_clr.pdb.

For this system, convergence can sometimes be difficult. For this reason, open the run_gcmc. cmd file in a text editor. The last line should start with "chunk simulate 40000000". Change this to say "chunk simulate 80000000", which will double the number of production steps. Sa ve the changes before starting the simulation.

To run the simulation, you need 16 cores (or whatever number of B values you have chosen) and MPI. Execute by typing
mpirun -np 16 $PROTOMSHOME/protoms3 run_gcmc.cmd

It is vital that you check the simulations by eye. If you have Pymol, you can check the output structures from one of the GCMC simulations by typing
 pymol out_gcmc/b_-8.333/all.pdb 
Check your warning files as well to make sure nothing untoward has happened.

Before calculating occupancies and free energies with grand canonical integration, we should check to see if the simulations are approximately equilibrated. For one simulation, we can see the average number of inserted GCMC waters for each snapshot by typing

python $PROTOMSHOME/tools/calc_series.py -f out_gcmc/b_-8.333/results -s solventson 

This shows the average number of waters in the simulation at B = -8.333. The system is well equilibrated at 3 waters in this simulation. This would have produced an estimate for the start of the equilibrated period. Check to see if that value matches what you see in the graph that gets automatically plotted. This plot looks well equilibrated from the beginning, but this may vary between repeats and different B values. Look at a few other B values and check if they are equilibrated around a similar point.

For the rest of this analysis, we'll focus on the script calc_gci.py which contains a lot of functionality. With this script, we can see both how the number of waters changes with the chemical potential, and how the binding free energy of the network changes with the number of water molecules. We run this script with:

python $PROTOMSHOME/tools/calc_gci.py -d out_gcmc -v 112.0  
where the -v flag is used to specify the volume of the GCMC region, which is required for the free energy calculation. Further, depending on the observation made with calc_series.py, a portion of the simulation can be discarded as equilibration with the -l flag. One of the plots will look like the following:

This titration plot shows how the average number of water molecules observed varies with the Adams value of each replica. You will likely notice that the increases in the number of water molecules occur in 'steps', which is characteristic of GCMC titrations. The area under this curve can be used to calculate the binding free energy of the water network, as described by Ross et al., JCTC, 2017. An important point is that these free energies calculate will include any cooperative interactions between the water molecules in the network. If the curve automatically fitted to the plot does not appear suitable, the complexity of the function used can be controlled manually with the -n flag, supplying to this the number of 'steps' that you observe in the plot.

The free energy data calculate will be printed to the screen and will look something like:

All free energy quantities are given in kcal/mol

Number of Waters   Insertion Free Energy   Network Binding Free Energy   Water Binding Free Energy   
0                    0.000 +- 0.000          0.000 +- 0.000                0.000 +- 0.000         
1                  -17.775 +- 0.000        -11.575 +- 0.000              -11.575 +- 0.000         
2                  -29.450 +- 0.000        -17.050 +- 0.000               -5.475 +- 0.000         
3                  -36.677 +- 0.000        -18.077 +- 0.000               -1.027 +- 0.000         
4                  -42.349 +- 0.000        -17.549 +- 0.000                0.528 +- 0.000 

The first column lists the different numbers of water molecules sampled in the simulation and shows how the various thermodynamic parameters changes as the number of water molecules increases. The second column contains the insertion free energy of a given number of water molecules, which corresponds to the transfer of the waters from an ideal gas to the GCMC region. The third column shows the binding free energy of the network, which is the insertion free energy, with the free energy penalty associated with removing the water molecules from bulk solvent accounted for (thus corresponding to a more physical measure). Finally, the fourth column shows how the network free energy is affected by the addition of each water molecule. It should be noted that the results quoted above have no errors associated as the simulations were not repeated here (in practice, they should be).

As can be seen in the table above, the first three water molecules stabilise the network (as the binding free energy decreases). Interestingly, the fourth destabilises the network, with a binding free energy of +0.528 kcal/mol in this example, and therefore does not likely bind. This analysis is reflected in the rest of the output to the screen:

The equilibrium B value is -9.147
Most similar simulated B value is -9.867
Occupancy at -9.867 is    2.6487 +/- 0.0000

Occupancy at binding PMF minimum is 3.000

This shows that the most stable number of water molecules is 3, and compares this to the number of water molecules observed at the B value closest to equilibrium.

The information discussed above is also displayed in a graph like the one below:

Here it is clear that the free energy minimum corresponds to 3 water molecules, and we can more easily see the slightly destabilising effect of the fourth water molecule.

Advanced Analysis

As already discussed, there are many options that can be passed to calc_gci.py, allowing a significant amount of control over the analysis performed. Some of these are briefly discussed in this section.

If multiple repeats of a simulation have been carried out (this is highly recommended), these can be analysed together using a wildcard:

python $PROTOMSHOME/tools/calc_gci.py -d out_gcmc* -c 112.0

This will result in better free energy results, with more reliable statistics. Additionally, the titration plot will show the data points from each of the repeats in a different colour, in order to identify if there are consistency issues between repeats of the same simulation.

If you are only interested in a small portion of the titration curve, you can analyse only the free energy change over a given range. For example, to analyse the region between 2 and 3 water molecules, type the following

python $PROTOMSHOME/tools/calc_gci.py -d out_gcmc -v 112.0 --nmin 2 --nmax 3 -n 1

A similar analysis to that discussed previously will be printed to the screen, focusing only on the specified region.

The full list of options available with the calc_gci.py script can be found using the -h flag.

Written by Gregory A. Ross, 2015. Revised by Hannah Bruce Macdonald, Jan 2017