Clustering frames with GROMACS

The gmx cluster command

included in category GROMACS

01/28/2023 01/28/2023 1769 words 9 minutes

Contents

Molecular Dynamics (MD) simulations are a powerful tool for understanding the behavior of biological systems, such as proteins, at the atomic level. However, one of the “issues” with MD simulations is that they can generate a large amount of data, making it difficult to identify key structural conformations for your system.

For this reason, GROMACS has a built-in clustering technique that can be used to reduce the amount of data generated by MD simulations and make it more manageable.

In this blog post, we will explore how clustering can be performed in GROMACS via the gmx cluster module to identify the most representative structures for your system that you can then use for further computational studies. I will also provide examples of how clustering can be used to analyze the trajectory and identify key conformations in your system.

Why use clustering for MD simulations?

A typical MD simulation can produce millions of frames, each one containing information about the positions and velocities of all atoms in the system at a specific point in time. This data can be difficult to analyze and interpret, especially for large or complex systems.

So how can you find the dominant conformation of the protein along the course of the simulation?

We can accomplish this via the so-called clustering analysis.

Clustering is a process used to group similar data points together in order to reduce the amount of data and make it more manageable. In the case of trajectories, clustering is used to reduce the large number of frames of a typical trajectory file to a smaller set of distinct frames that can be seen as an overall representation of the motion of the system.

It involves grouping similar frames together into clusters, with each cluster representing a distinct structural conformation.

By reducing the number of frames to a smaller set of representative structures, clustering can make it easier to identify key conformations that are important for understanding the behavior of the system.

Example

If you performed a long simulation and you want to find the most representative structure to subsequently perform a docking analysis, clustering can be used to identify the most populated conformations of the protein and then use them as inputs for your experiments.

Clustering in GROMACS

Clustering itself is just a procedure, not an algorithm. For this reason, in GROMACS you will find different algorithms to perform the clusterization of your trajectory. It’s up to you to decide the ones that best fit your needs.

One of the most commonly used algorithms for clustering (and the one we will consider in this article) is the Gromos Clustering. This method performs a clusterization procedure based on the values of RMSD.

How does it work?

The Gromos Clustering Algorithm works by counting the number of neighbors for each structure using an RMSD cut-off value. The structure with the largest number of neighbors is then chosen as the first cluster and all of its neighbors are included in the cluster. This structure and its neighbors are then eliminated from the pool of clusters.

The process is then repeated for the remaining structures until all structures have been assigned to a cluster.

To perform this type of analysis in GROMACS you can use a command-line tool, the gmx cluster module, which can be exploited to perform clustering on trajectory data.

This module can use different algorithms for clustering (gromos, Monte Carlo, …) that you can select via the -method flag, and also provides options to select different output files for your analysis.

Let’s see how to use it in more detail.

The gmx cluster module

Description

This command is used to cluster the frames in your simulation.

Let’s consider that I performed a 200ns MD simulation of my protein and I want to cluster the entire trajectory to find out the dominant conformations of my protein.

In the clusterization process, you don’t want to include the initial part of the trajectory that is not well-equilibrated. Therefore, the first step is to compute the RMSD and find out which is the part of your simulation that has equilibrated.

For the sake of this tutorial, we will consider that the protein has equilibrated after 50ns and that the rest of the trajectory can be used for the clustering analysis.

Now you have two options:

The first one is to extract the part of the trajectory (50-200ns) that is already equilibrated via the gmx trjconv module as showed here.
The second one is to directly work with the gmx cluster module via the -b 50000 option (starting the analysis from 50 000ps = 50ns).

Here we will consider the second option.

To cluster the equilibrated trajectory you can use the gmx cluster module as follows:

1

gmx cluster -f md.xtc -s md.tpr -method gromos -cutoff 0.15 -b 50000 -g -cl (-e -dt 10  -n index.ndx)

Syntax

Call the program with gmx
Select the cluster module
Call the -f flag and provide the entire trajectory file (md.xtc)
The -s flag is used to provide the tpr file of the run (md.tpr)
The -method flag is used to select the method we want to use (gromos in our case)
We can choose the cutoff value in nm (0.15) using the -cutoff option.
As explained above, we can select the starting point from the analysis depending on the equilibration of your system via -b
Call the -g flag and decide how you want to name the file listing all clusters and their members (by default cluster.log)
The -cl option writes the representative structure for each cluster in a pdb file (clusters.pdb)

Optional

You can pass an index file containing some particular groups you need (index.ndx) via the -n flag
You can choose the ending simulation time for the analysis with the -e flag
The -dt flag allows us to reduce the number of frames used for the analysis. In our example, we will only consider one frame every 10ps (-dt 10).

First, you will be asked to select a group for least squares fit and RMSD calculation. It is generally a good practice to choose the C-alpha group for the RMSD calculation, in most cases it corresponds to Group 3.

1

Group     3 (        C-alpha) has   XXX elements

Then you will also be prompted to select the part of the system that you want to include in the output file clusters.pdb. If your goal is to obtain the most representative protein conformation for each cluster, you can select Group 1 ( Protein) has XXXX elements. Alternatively, you can choose to include any other group that you desire.

It’s important to note that the clustering analysis can be computationally intensive and time-consuming, especially if you have a large system or a long simulation trajectory. Depending on the complexity of your system and the desired level of detail, the analysis may take several hours or even longer to complete. Therefore, it’s recommended to allocate sufficient computational resources and plan accordingly.

Output files of the analysis

As a first output, you will get a cluster.log file reporting information about the clusterization. The file is organized as follows.

In the upper part, you will find general information about the run such as the cutoff value you selected (just in case you forget it) and the number of clusters found (N).

1
2
3
4
5
6
7


Using gromos method for clustering
Using RMSD cutoff 0.15 nm
The RMSD ranges from XXXXX to XXXXX nm
Average RMSD is XXXXX
Number of structures for matrix XXX 
Energy of the matrix is XXXX
Found N clusters

Below that, you will find something resembling the table reported in this code block.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


 cl. | #st  rmsd | middle rmsd | cluster members
   1 | 19044  0.130 | 190000 .109 |  50000  50020  50030  50040  50060  50090  50210
     |           |             |  50250  50270  50450  50460  50480  50490  50500
     |           |             |  50520  50530  50540  50550  50560  50570  50730
     |           |             |  50760  50770  51080  51190  51200  51210  51230
     |           |             |  51270  51280  51290  51300  51310  51320  51330
     |           |             |  51340  51350  51370  51380  51390  51400  51430
     |           |             |  51440  51450  51460  51470  51480  51490  51520
     |           |             |  51530  51540  51560  51570  51580  51590  51600
     |           |             |  51610  51620  51630  51640  51650  51660  51680
     |           |             |  51690  51700  51710  51720  51730  51740  51750
 .
 .
 .
 2 | 816  0.136 |  56140 .120 |  50010  50070  50080  50100  50110  50120  50130
     |           |             |  50140  50150  50160  50170  50180  50190  50200
     |           |             |  50240  50260  50280  50290  50300  50310  50320
     |           |             |  50330  50360  50380  50390  50400  50430  50440
     |           |             |  50470  50510  50580  50590  50640  50650  50660
     |           |             |  50670  50690  50710  50720  50740  50780  50790
     |           |             |  50810  50820  50840  50850  50870  50890  50900
     |           |             |  50920  50930  50940  50950  50960  50970  50980
     |           |             |  50990  51000  51010  51020  51030  51040  51050
.
.
.
 N |   5  0.142 |  69040 .118 |  50230  69020  69040  69200  72790

The cluster.log is organized in a tabular format and each cluster is grouped for easy reading.

The first column (cl.) specifies the number of clusters which are ordered from the most populated one (19044 frames) to the lowest (5)
The second column shows the number of frames (#st) belonging to each cluster
The third column (middle) reports the representative frame for each cluster, namely the one with the lowest average RMSD to all other structures within the cluster. (in our example frame 190000 would be the representative structure for the trajectory)
Finally, the last column shows all the frames belonging to each cluster

Note

Choosing the appropriate cutoff value is crucial in the clustering process and it will depend on the specific system and research question being studied. The ideal cutoff should result in a manageable number of clusters, a significant portion of the trajectory being grouped into a smaller number of clusters, and a minimal number of clusters with only one member. This is important to ensure meaningful and accurate clustering.

The clusters.pdb is the second output file we will consider here. It contains the most representative structure for each cluster in the PDB format. The contents of the file will depend on the selection made at the above-mentioned second prompt, where you can choose to include different parts of your system.

You can then proceed to visualize the structure in PyMOL and use it for your subsequent analysis.