GROMACS File Formats: Understanding topology, itp, and gro files

 

GROMACS, a widely used molecular dynamics simulation software, relies on specific file formats to communicate with the user and define the properties of molecules.

Understanding these file formats is crucial for setting up and analyzing simulations accurately.

In this blog post, we will explore three essential file formats in GROMACS and provide practical tips for working with them.

 

 

A gro file is a plain text file that stores the spatial coordinates and, if available, the velocities of the atoms in a molecular dynamics simulation. It follows a fixed-column format that is worth understanding if you plan to work with GROMACS.

A typical file starts with two lines like this:

Title t= 156000.00000 step= 156000000
231983

 

The first line is a free-form title. When the file is written by GROMACS, for example with the gmx trjconv command, the title also carries the time and step of the frame: the t= entry gives the time in picoseconds (ps), and the step= entry gives the step number.

The second line gives the number of atoms in the system as an integer. It must match the number of atom lines that follow.

Note
When you edit a gro file by hand, always remember to update this number so that it matches the total number of atoms in the topology file; otherwise you will get errors when setting up the simulation.
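
As a quick sanity check after a manual edit, the declared atom count can be compared with the number of atom lines actually present. The sketch below is only an illustration: the function name check_gro_atom_count is made up for this post, and it assumes a single-frame gro file (one title line, one count line, the atom lines, and one box line).

```python
def check_gro_atom_count(text):
    """Return (declared, actual) atom counts for the contents of a gro file.

    Assumes a single frame: title, atom count, atom lines, box line.
    """
    lines = text.strip("\n").splitlines()
    declared = int(lines[1])      # second line: declared number of atoms
    actual = len(lines) - 3       # everything except title, count and box line
    return declared, actual


# Tiny two-atom example (coordinates only, no velocities)
sample = "\n".join([
    "Glycine fragment",
    "2",
    "  243GLY      N 3982   5.064   4.383   7.880",
    "  243GLY      H 3983   5.071   4.375   7.780",
    "  10.37454  10.37454  15.63914",
])

declared, actual = check_gro_atom_count(sample)
if declared != actual:
    print(f"Mismatch: header says {declared}, file has {actual} atom lines")
else:
    print(f"OK: {declared} atoms")
```

If the two numbers differ, the second line of the file is the first thing to fix.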

The rest of the file works much like a pdb file: each line corresponds to one atom in the system and contains several columns with different information.

Here is how the simplest amino acid, glycine, might appear in a gro file:

  243GLY      N 3982   5.064   4.383   7.880 -0.1954  0.1966  0.0028
  243GLY      H 3983   5.071   4.375   7.780 -0.6142  2.7367 -0.2744
  243GLY     CA 3984   5.106   4.514   7.926  0.4334 -0.0836  0.2363
  243GLY    HA2 3985   5.172   4.511   8.013 -3.5828 -1.5010  3.3792
  243GLY    HA3 3986   5.160   4.566   7.847  2.4291 -0.7782  1.0870
  243GLY      C 3987   4.991   4.608   7.959  0.4018 -0.1763  0.3836
  243GLY      O 3988   5.019   4.699   8.036  0.2410  0.2285 -0.0364

 

Let’s take the first row and break it down to see what each component means:

  243GLY      N 3982   5.064   4.383   7.880 -0.1954  0.1966  0.0028
Format
  1. Residue number (5 positions, integer): the number of the residue to which the atom belongs, 243 in our example.

  2. Residue name (5 positions, characters): the name of the residue to which the atom belongs, GLY in our example.

  3. Atom name (5 positions, characters): the name of the atom, such as N for the backbone nitrogen or CA for the alpha carbon.

  4. Atom number (5 positions, integer): the sequential index of the atom in the system, 3982 in our example.

  5. Position (in nm, x y z in 3 columns, each 8 positions with 3 decimal places): the x, y, and z coordinates of the atom in nanometers (nm).

  6. Velocity (in nm/ps, x y z in 3 columns, each 8 positions with 4 decimal places): the x, y, and z components of the atom's velocity in nanometers per picosecond (nm/ps), which is numerically the same as kilometers per second (km/s). If velocities are not available, these columns can be omitted from the file.
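
Because the columns have fixed widths, an atom line is best read by slicing at character positions rather than splitting on whitespace. The following is a minimal sketch under the column widths listed above (5, 5, 5, 5, then 8 per coordinate and 8 per velocity component). The function name parse_gro_atom_line is hypothetical, and the sketch assumes the default precision; files written with a different number of decimals use wider columns.

```python
def parse_gro_atom_line(line):
    """Parse one fixed-width atom line of a gro file (default precision)."""
    atom = {
        "resnr": int(line[0:5]),            # residue number
        "resname": line[5:10].strip(),      # residue name
        "atomname": line[10:15].strip(),    # atom name
        "atomnr": int(line[15:20]),         # atom number
        # x, y, z in nm: three columns of 8 characters each
        "position": tuple(float(line[20 + 8 * i:28 + 8 * i]) for i in range(3)),
        "velocity": None,
    }
    # Velocities (nm/ps) are optional: three more columns of 8 characters
    if len(line.rstrip()) >= 68:
        atom["velocity"] = tuple(
            float(line[44 + 8 * i:52 + 8 * i]) for i in range(3)
        )
    return atom


line = "  243GLY      N 3982   5.064   4.383   7.880 -0.1954  0.1966  0.0028"
atom = parse_gro_atom_line(line)
print(atom["resname"], atom["atomname"], atom["position"], atom["velocity"])
```

Slicing also keeps working when a residue name or atom name fills all of its positions and touches the neighbouring column, a case where a whitespace split would merge two fields.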

 

The last line of a gro file describes the simulation box. For a rectangular box it contains three numbers: the box lengths in nanometers (nm) along the x, y, and z directions, respectively. For a triclinic box the line holds nine numbers, the components of the three box vectors.

For example, a line that looks like this:

  10.37454  10.37454  15.63914

 

specifies a simulation box with a length of 10.37454 nm in the x and y directions and a height of 15.63914 nm in the z-direction.
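
The box line is whitespace-separated, so reading it is straightforward. Here is a small sketch that handles only the three-number rectangular case shown above and also computes the box volume; the function name parse_gro_box and the dictionary keys are made up for illustration.

```python
def parse_gro_box(line):
    """Parse the last line of a gro file, assuming a rectangular box."""
    values = [float(v) for v in line.split()]
    if len(values) != 3:
        raise ValueError("expected three box lengths (rectangular box)")
    x, y, z = values
    # Lengths are in nm, so the volume is in cubic nanometers
    return {"lengths_nm": (x, y, z), "volume_nm3": x * y * z}


box = parse_gro_box("  10.37454  10.37454  15.63914")
print(box["lengths_nm"], box["volume_nm3"])
```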

 

Visualize a gro file
Finally, since a gro file stores atomic coordinates as plain text, you can always open it in molecular visualization software such as PyMOL to inspect the structure.

 

 

If you already have some experience with molecular simulations, you will surely have heard about the topology of a system. But what exactly is it?

You can think of the topology file as the molecular equivalent of a resume. It contains all the important information about the system you’re studying.

Explained in more rigorous terms, the topology file is where you define the parameters for how the atoms in your molecule interact with each other. That includes bonded interactions and non-bonded interactions, but also constraints or exclusions.

So, you can see that it is an essential component of any molecular simulation as it defines the interactions between atoms, which ultimately dictate the motion of the system under study.

In GROMACS, the topology file is a simple text file with the .top extension (usually topol.top), which can be generated with the gmx pdb2gmx command.

 

The next logical question that arises is where these parameters come from.

If you have gone through my blog carefully, the answer should be quite straightforward: they come from the force field. That’s why, when you launch the pdb2gmx command, you are asked to select one, so that GROMACS can retrieve the corresponding parameters that will be used in your simulation.

 

 

Now let’s have a look at what a typical topol.top file may look like. To inspect the contents of the file, you can simply open it with a plain text editor (vi, nano, …). A sample topology file is shown below.

Sample topology file
;   File 'topol.top' was generated
;   By user: user
;   On host: XXX
;   At date: 
;
;   This is a standalone topology file
;
;   Created by:
;                   :-) GROMACS - gmx pdb2gmx, 2020.5-MODIFIED (-:
;
;   Executable:   /usr/local/bin/gromacs-2020.5+plumed-2.7.1+PyTorch/bin/gmx
;   Data prefix:  /usr/local/bin/gromacs-2020.5+plumed-2.7.1+PyTorch
;   Working dir:  /home/user/
;   Command line:
;     gmx pdb2gmx -f ...
;   Force field was read from the standard GROMACS share directory.
;

; Include forcefield parameters
#include "amber99sb.ff/forcefield.itp"

[ moleculetype ]
; name  nrexcl
Protein         3

[ atoms ]
; nr    type    resnr   residu  atom    cgnr    charge  mass
; residue   1 GLY rtp GLY  q  0.0
1          N      1    GLY      N     61    -0.4157      14.01
2          H      1    GLY      H     62     0.2719      1.008
3         CT      1    GLY     CA     63    -0.0252      12.01
4         H1      1    GLY    HA1     64     0.0698      1.008
5         H1      1    GLY    HA2     65     0.0698      1.008
6          C      1    GLY      C     66     0.5973      12.01
7          O      1    GLY      O     67    -0.5679         16   ; qtot 2

[bonds]
.
.
.

[pairs]
.
.
.

[angles]
.
.
.

[dihedrals]
.
.
.


; Include Position restraint file
#ifdef POSRES
#include "posre.itp"
#endif


; Include water topology
#include "amber99sb.ff/tip3p.itp"

#ifdef POSRES_WATER
; Position restraint for each water oxygen
[ position_restraints ]
;  i funct       fcx        fcy        fcz
   1    1       1000       1000       1000
#endif

; Include topology for ions
#include "amber99sb.ff/ions.itp"

[ system ]
; Name
Protein

[ molecules ]
; Compound        #mols
Protein             1

 

The file generally starts with several lines beginning with a semicolon (;), which are comments.

After the comments, you’ll see the line that calls the parameters within the force field you selected (amber99sb). This line indicates that all subsequent parameters are derived from this force field.

#include "amber99sb.ff/forcefield.itp" 

 

The next important directive is [ moleculetype ], which defines the name of the molecule and its number of exclusions. In this example, the molecule is named Protein and nrexcl is 3, meaning that non-bonded interactions are excluded between atoms that are no more than 3 bonds apart.

[ moleculetype ]
; name  nrexcl
Protein         3

 

The [ atoms ] section lists all of the atoms in the protein, one per row. The columns give the atom number, atom type, residue number, residue name, atom name, charge group number, charge, and mass. The example shows a glycine residue.

[ atoms ]
; nr    type    resnr   residu  atom    cgnr    charge  mass
; residue   1 GLY rtp GLY  q  0.0
1          N      1    GLY      N     61    -0.4157      14.01
2          H      1    GLY      H     62     0.2719      1.008
3         CT      1    GLY     CA     63    -0.0252      12.01
4         H1      1    GLY    HA1     64     0.0698      1.008
5         H1      1    GLY    HA2     65     0.0698      1.008
6          C      1    GLY      C     66     0.5973      12.01
7          O      1    GLY      O     67    -0.5679         16   ; qtot 2
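
Since every row of [ atoms ] has the same columns, it is easy to read the section programmatically, for instance to check the net charge of a residue. The sketch below is a simplified reader, not a full topology parser: parse_atoms_section is a hypothetical helper, it ignores preprocessor lines such as #include and #ifdef, and it assumes that all eight columns are present in every row.

```python
def parse_atoms_section(text):
    """Collect the rows of every [ atoms ] section in a topology string."""
    atoms = []
    in_atoms = False
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()    # drop comments after ';'
        if not line or line.startswith("#"):   # skip blanks and preprocessor lines
            continue
        if line.startswith("["):               # a new directive begins
            in_atoms = line.strip("[] ").lower() == "atoms"
            continue
        if in_atoms:
            nr, atype, resnr, residue, atom, cgnr, charge, mass = line.split()[:8]
            atoms.append({
                "nr": int(nr), "type": atype, "resnr": int(resnr),
                "residue": residue, "atom": atom, "cgnr": int(cgnr),
                "charge": float(charge), "mass": float(mass),
            })
    return atoms


topology = """
[ moleculetype ]
; name  nrexcl
Protein         3

[ atoms ]
; nr    type    resnr   residu  atom    cgnr    charge  mass
1          N      1    GLY      N     61    -0.4157      14.01
2          H      1    GLY      H     62     0.2719      1.008
3         CT      1    GLY     CA     63    -0.0252      12.01
4         H1      1    GLY    HA1     64     0.0698      1.008
5         H1      1    GLY    HA2     65     0.0698      1.008
6          C      1    GLY      C     66     0.5973      12.01
7          O      1    GLY      O     67    -0.5679         16   ; qtot 2

[ bonds ]
"""

atoms = parse_atoms_section(topology)
net_charge = sum(a["charge"] for a in atoms)
print(len(atoms), "atoms, net charge close to zero:", abs(net_charge) < 1e-6)
```

For these seven rows the partial charges cancel out, which agrees with the q 0.0 noted in the residue comment line.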

 

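After that come further sections that specify the bonded interactions, namely [bonds], [pairs], [angles], and [dihedrals].

The remaining parts of topol.top pull in other topologies that are useful or necessary. For example, the posre.itp file defines position restraints, with a force constant used to keep atoms in place during the equilibration phase. Because the include is wrapped in #ifdef POSRES, the restraints are only applied when POSRES is defined, for example with define = -DPOSRES in the mdp file.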

; Include Position restraint file
#ifdef POSRES
#include "posre.itp"
#endif

 

Finally, the [ system ] directive gives the name of the system that will be written to output files during the simulation, while the [ molecules ] directive lists all of the molecules in the system.

[ system ]
; Name
Protein

[ molecules ]
; Compound        #mols
Protein             1

 

Note

It’s crucial that the molecules listed in the [ molecules ] directive appear in exactly the same order, and in the same numbers, as in the coordinate file (i.e., the gro file).

For instance, if your gro file contains a protein (Protein), followed by a ligand (LIG), and a cholesterol membrane (CHL) composed of X molecules, then the [ molecules ] directive should be as follows:

[ molecules ]
; Compound        #mols
Protein             1
LIG                 1
CHL                 X

Even a slight mismatch between the [ molecules ] directive and the gro file will make gmx grompp complain: a different total number of atoms is a fatal error, and atom names that do not match produce warnings.

Also, make sure that each name listed matches a [ moleculetype ] name defined in the topology or in one of the included itp files; otherwise GROMACS will report that no such moleculetype exists.
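
One way to catch such mismatches early is to read the [ molecules ] directive and compute how many atoms the topology expects, then compare that with the second line of the gro file. The sketch below only illustrates the idea: parse_molecules and expected_atom_count are hypothetical helpers, and the molecule count for CHL and the atoms-per-molecule numbers for Protein and LIG are invented for the example.

```python
def parse_molecules(text):
    """Return the (name, count) pairs of the [ molecules ] directive, in order."""
    molecules = []
    in_section = False
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()    # drop comments after ';'
        if not line or line.startswith("#"):   # skip blanks and preprocessor lines
            continue
        if line.startswith("["):
            in_section = line.strip("[] ").lower() == "molecules"
            continue
        if in_section:
            name, count = line.split()[:2]
            molecules.append((name, int(count)))
    return molecules


def expected_atom_count(molecules, atoms_per_molecule):
    """Total atoms implied by [ molecules ]; raises KeyError for unknown names."""
    return sum(count * atoms_per_molecule[name] for name, count in molecules)


topology = """
[ molecules ]
; Compound        #mols
Protein             1
LIG                 1
CHL                85
"""

# Hypothetical atom counts per moleculetype, for illustration only
atoms_per_molecule = {"Protein": 3000, "LIG": 40, "CHL": 74}

molecules = parse_molecules(topology)
print(molecules)
print(expected_atom_count(molecules, atoms_per_molecule))
```

A name in [ molecules ] that is missing from the dictionary raises a KeyError here, which loosely mirrors what happens when a listed name has no matching [ moleculetype ].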

 

 

If you looked closely at the previous files, you may be wondering why exactly the itp files are passed via the include statement.

In molecular dynamics simulations, complex systems often involve a large number of molecules with various properties and interactions.

Specifying all the necessary parameters for such systems in a single file can quickly turn into something difficult to manage. For this reason, it’s considered more practical to use the include mechanism to add parameters/moleculetypes using itp files.

Therefore, an itp (which stands for Include Topology) file is simply another text file that contains molecular topology information, such as bond lengths, bond angles, dihedral angles, and force constants, for a specific molecule or group of molecules. These files can then be included in the main topology file using the #include directive.

Suppose you have a complex system with a protein, ligands, and various lipids that make up your membrane. It becomes apparent that if you attempt to incorporate all of the parameters we previously showed into a single file, it will rapidly become disorganized.

So you may encounter a topology where the parameters for different components are grouped into different itp files:

; Include forcefield parameters
#include "toppar/forcefield.itp"
#include "toppar/PROA.itp"
#include "toppar/LIG.itp"
#include "toppar/CHL.itp"
#include "toppar/DOPC.itp"
#include "toppar/POPE.itp"
#include "toppar/Na+.itp"
#include "toppar/Cl-.itp"
#include "toppar/TP3.itp"

[ system ]
; Name
Title

[ molecules ]
; Compound  #mols
PROA               1
LIG                1
CHL1              85
DOPC             144
POPE              32
Na+              106
Cl-               72
TP3            26449

 

You can immediately see that the use of the include statements is useful for making the topology compact, rather than writing out all parameters explicitly. As a result, you will get a much cleaner topology file.
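
When a topology is split up like this, it can be handy to list which files it pulls in, for example to check that they all exist before running a simulation. Below is a minimal sketch that extracts the paths from the #include lines of a topology string. The helper name list_includes is made up, and the sketch neither evaluates #ifdef blocks nor resolves where GROMACS would actually look for each file.

```python
import re

# Matches lines of the form: #include "some/path.itp"
INCLUDE_RE = re.compile(r'^\s*#include\s+"([^"]+)"')


def list_includes(text):
    """Return the paths named in #include lines, in order of appearance."""
    paths = []
    for line in text.splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            paths.append(match.group(1))
    return paths


topology = """
; Include forcefield parameters
#include "toppar/forcefield.itp"
#include "toppar/PROA.itp"
#include "toppar/LIG.itp"

[ system ]
; Name
Title
"""

for path in list_includes(topology):
    print(path)
```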

 

 

In conclusion, understanding the topology, itp, and gro files is crucial for setting up and running molecular dynamics simulations using GROMACS. However, there are other file formats you need to be comfortable with for various purposes.

The following posts discuss them in more detail: