What is a PDB file?

 

When it comes to computational chemistry, file formats are extremely important.

They give researchers a way to store complicated molecular structures in files that can easily be transferred to other machines or shared with colleagues.

If you are a scientist or a student working in computational chemistry or structural biology, you have probably heard of a format generally referred to as a “PDB file”. But what is it?

In this blog post, we will dive into the details of what a PDB file is, where you can find it, how it is formatted, and finally how you can open and visualize it.

This is all information you should be familiar with to carry out your research in chemistry or biology successfully.

 

 

A PDB file, or Protein Data Bank file, is a text file that stores information about the 3D structure of biological macromolecules such as proteins and nucleic acids.

The file format is widely used in computational chemistry because it is easy to read and understand, and it can be used with many different software programs.

You can recognize it by its .pdb extension, and it has long been the standard way to store and transfer information about proteins and nucleic acids.

In many cases, you will receive PDB files as output from different programs, so it pays to be familiar with them.

However, the primary source of PDB files is the Protein Data Bank (PDB), a public database where researchers can deposit their experimentally determined structures of biological macromolecules.

I will not go into the details of how the databank works. For now, it is enough to know that this is where you can find and download the experimentally determined structures you are interested in, and that each entry is identified by a unique four-character code made of digits and letters (the PDB ID).
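As a small illustration, the PDB ID is enough to build a download link. The sketch below assumes the RCSB file server URL pattern https://files.rcsb.org/download/<ID>.pdb; the function names and the example ID are chosen here for illustration and are not part of the databank's own tooling.

```python
import re
import urllib.request

# Assumed URL pattern of the RCSB file server (an assumption of this sketch).
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id: str) -> str:
    """Return the download URL for a four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    # Classic PDB IDs: one digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a valid four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id)


def download_pdb(pdb_id: str, destination: str) -> str:
    """Download the entry to `destination` (needs network access)."""
    urllib.request.urlretrieve(pdb_download_url(pdb_id), destination)
    return destination


# Usage (commented out because it needs a network connection):
# download_pdb("1crn", "1crn.pdb")
```

Validating the ID before building the URL catches typos early, before any request is sent.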

 

 

The PDB file format contains a lot of structural information such as the name and coordinates of each atom, the corresponding residues, and much more. In some cases, PDB files may also include metadata such as the authors of the original research, and the experimental method used to obtain the structure.

When you open a PDB file (you can do it with any text editor) you will find that it is composed of different lines, where each line is referred to as a record. Different types of records are available and they contain different information about the system.
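To make the idea of records concrete, here is a minimal sketch that groups the lines of a PDB text by record name (the first six columns) and reads the free text of metadata records such as EXPDTA and AUTHOR. The sample text and the helper names `group_records` and `record_text` are made up for illustration.

```python
from collections import defaultdict

# Made-up sample text; only the layout mimics a real PDB file.
SAMPLE_PDB = """\
EXPDTA    X-RAY DIFFRACTION
AUTHOR    A.N.EXAMPLE,B.SAMPLE
ATOM     21  N   TYR A  36      50.550  51.010  47.480  1.00 25.00           N
TER
END
"""


def group_records(pdb_text):
    """Map each record name (columns 1-6) to the list of its lines."""
    records = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.strip():
            records[line[:6].strip()].append(line)
    return dict(records)


def record_text(records, name):
    """Join the free-text part (column 11 onwards) of all `name` records."""
    return " ".join(line[10:].strip() for line in records.get(name, []))


records = group_records(SAMPLE_PDB)
print(sorted(records))
print(record_text(records, "EXPDTA"))
```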

Three record types are the most important to know, as they carry the information about the atoms in your system:

  • The ATOM record: atoms belonging to standard residues (amino acids and nucleotides)
  • The HETATM record: atoms belonging to non-standard residues, e.g. ligands, ions, and water molecules
  • The TER record: signals the end of a chain of residues.
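The three record types listed above can be told apart by the first six columns of each line. The following sketch counts them and lists the residue names that appear in HETATM records; the sample lines are made up and the function names are hypothetical.

```python
from collections import Counter

# Made-up lines in PDB layout: two protein atoms, a chain terminator, one water.
SAMPLE_PDB = """\
ATOM      1  N   GLY A   1      10.000  10.000  10.000  1.00 20.00           N
ATOM      2  CA  GLY A   1      11.400  10.000  10.000  1.00 20.00           C
TER       3      GLY A   1
HETATM    4  O   HOH A 101      15.000  12.000   9.000  1.00 30.00           O
END
"""


def count_atom_records(pdb_text):
    """Count ATOM, HETATM and TER records in a PDB text."""
    counts = Counter(line[:6].strip() for line in pdb_text.splitlines())
    return {name: counts.get(name, 0) for name in ("ATOM", "HETATM", "TER")}


def hetero_residue_names(pdb_text):
    """Return the sorted residue names (columns 18-20) found in HETATM records."""
    names = {
        line[17:20].strip()
        for line in pdb_text.splitlines()
        if line.startswith("HETATM")
    }
    return sorted(names)


print(count_atom_records(SAMPLE_PDB))
print(hetero_residue_names(SAMPLE_PDB))
```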

Let’s analyze more in-depth how atoms in a typical PDB file are formatted.

 

 

Here is a snippet of a file reporting a certain residue of a protein (Tyrosine 36).

ATOM     21  N   TYR A  36      50.550  51.010  47.480  1.00 25.00           N
ATOM     22  HN  TYR A  36      50.740  50.630  46.580  1.00 16.00           H
ATOM     23  CA  TYR A  36      51.280  50.370  48.530  1.00 16.00           C
ATOM     24  HA  TYR A  36      50.600  50.140  49.330  1.00 16.00           H
ATOM     25  CB  TYR A  36      51.880  49.050  48.020  1.00 13.00           C
ATOM     26  HB1 TYR A  36      50.990  48.400  47.860  1.00 35.00           H
ATOM     27  HB2 TYR A  36      52.530  48.520  48.740  1.00 25.00           H
ATOM     28  CG  TYR A  36      52.520  49.090  46.670  1.00 24.00           C
ATOM     29  CD1 TYR A  36      51.810  48.970  45.500  1.00 46.00           C
ATOM     30  HD1 TYR A  36      50.730  48.930  45.570  1.00 12.00           H
ATOM     31  CE1 TYR A  36      52.420  48.780  44.290  1.00 12.00           C
ATOM     32  HE1 TYR A  36      51.840  48.590  43.390  1.00 12.00           H
ATOM     33  CZ  TYR A  36      53.780  48.970  44.170  1.00 47.00           C
ATOM     34  OH  TYR A  36      54.400  48.680  42.940  1.00 12.00           O
ATOM     35  HH  TYR A  36      55.270  48.990  43.170  1.00 10.00           H
ATOM     36  CD2 TYR A  36      53.880  49.040  46.540  1.00 10.00           C
ATOM     37  HD2 TYR A  36      54.460  49.130  47.450  1.00 15.00           H
ATOM     38  CE2 TYR A  36      54.520  49.090  45.320  1.00 17.00           C
ATOM     39  HE2 TYR A  36      55.580  49.300  45.230  1.00 37.00           H
ATOM     40  C   TYR A  36      52.380  51.190  49.130  1.00 49.00           C

Each line in this snippet represents a single atom in the molecule, and the fields on each line describe that atom’s properties.

 

Note

You may have noticed that the file looks well organized. This is not by chance.

Each entry needs to sit exactly within its specified column range, otherwise the PDB file will not be read correctly by molecular visualization software.

If you cannot visualize your PDB file correctly, always double-check that the file is properly formatted.
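One way to keep every entry in its column range is to let a fixed-width format string do the alignment. The sketch below is a simplified illustration: `format_atom_line` is a hypothetical helper, it leaves the optional fields blank, and it assumes a one-letter element symbol so that atom names shorter than four characters start in column 14.

```python
def format_atom_line(serial, name, res_name, chain, res_seq,
                     x, y, z, occupancy=1.0, b_factor=0.0, element=""):
    """Build a fixed-width ATOM line (78 characters, charge field omitted)."""
    # Simplification: names shorter than 4 characters start in column 14.
    padded_name = name if len(name) == 4 else " " + name
    return (
        f"{'ATOM':<6}{serial:>5} {padded_name:<4} {res_name:>3} {chain:1}"
        f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
        f"{occupancy:6.2f}{b_factor:6.2f}{'':10}{element:>2}"
    )


line = format_atom_line(21, "N", "TYR", "A", 36,
                        50.550, 51.010, 47.480, 1.00, 25.00, "N")
print(line)
print(len(line))
```

Because each field has a fixed width, a value that is too long for its field (for example a coordinate above 9999.999) pushes everything after it out of place, which is exactly the kind of misalignment that breaks visualization software.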

 

Let’s break down the entry in the first line:

ATOM     21  N   TYR A  36      50.550  51.010  47.480  1.00 25.00           N
Columns   Data
1-6       Record name, which is ATOM for atoms in standard residues.
7-11      Atom serial number, a unique identifier for each atom in the molecule (21).
13-16     Atom name, which identifies the atom within its residue. In this case, it is N, the backbone nitrogen.
17        Alternate location indicator, missing in this case.
18-20     Residue name, which identifies the residue to which the atom belongs. In this case, it is TYR for tyrosine.
22        Chain identifier, which identifies the chain to which the atom belongs. In this case, it is chain A.
23-26     Residue sequence number, which identifies the position of the residue in the chain. In this case, it is residue number 36.
27        Insertion code, missing in this case.
31-38     x coordinate of the atom in three-dimensional space (50.550 Å).
39-46     y coordinate of the atom in three-dimensional space (51.010 Å).
47-54     z coordinate of the atom in three-dimensional space (47.480 Å).
55-60     Occupancy, the fraction of molecules in the crystal in which the atom occupies this position. In this case, it is 1.00.
61-66     Temperature factor or B-factor, which indicates the mobility of the atom. In this example, it is 25.00.
73-76     Segment identifier, missing in this case.
77-78     Element symbol (N).
79-80     Charge, missing in this case.
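These column ranges translate directly into string slices: column n of the file is index n-1 in Python. Below is a minimal parsing sketch; the `Atom` class and `parse_atom_line` function are hypothetical names, and the sample line is assembled field by field so that each piece lands in its column range.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    record: str
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM line using the fixed column ranges."""
    line = line.ljust(80)  # pad short lines so every slice exists
    return Atom(
        record=line[0:6].strip(),       # columns 1-6
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain=line[21],                 # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        occupancy=float(line[54:60]),   # columns 55-60
        b_factor=float(line[60:66]),    # columns 61-66
        element=line[76:78].strip(),    # columns 77-78
    )


# The first line of the snippet, built piece by piece to show the field widths.
SAMPLE_LINE = (
    "ATOM  " + "   21" + " " + " N  " + " " + "TYR" + " " + "A" + "  36"
    + " " * 4 + "  50.550" + "  51.010" + "  47.480" + "  1.00" + " 25.00"
    + " " * 10 + " N"
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom.name, atom.res_name, atom.res_seq, atom.x, atom.y, atom.z)
```

Slicing by position, rather than splitting on whitespace, matters because adjacent fields are allowed to touch (for example a five-digit serial number or a large negative coordinate), in which case a whitespace split would merge two fields into one.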

Similar rules apply to the other atom-related record types (HETATM, TER). You can find more info on formatting and common errors here

 

 

What if you want to open a PDB file? There are two ways to go about it. Let’s dive into both of them.

 

 

The first option is to open the file with a good old text editor (less, nano, vi, …). This can be useful for researchers who want to manipulate the data in the file or extract specific pieces of information.

However, PDB files can be quite large and complex, and it is easy to mess everything up if you don’t know what you are doing. That is why they can be difficult to work with in a text editor.
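For simple extractions, a short script can be less error-prone than editing by hand, because the original file is never modified. As a minimal sketch, the hypothetical `extract_chain` function below keeps only the ATOM and HETATM lines of one chain, using the chain identifier in column 22. The third sample line (chain B) is made up for illustration.

```python
# Two lines from the tyrosine example plus one made-up atom in chain B.
SAMPLE_PDB = """\
ATOM     21  N   TYR A  36      50.550  51.010  47.480  1.00 25.00           N
ATOM     23  CA  TYR A  36      51.280  50.370  48.530  1.00 16.00           C
ATOM    301  N   GLY B   1      12.000  14.000  16.000  1.00 20.00           N
END
"""


def extract_chain(pdb_text, chain_id):
    """Return the ATOM/HETATM lines whose chain identifier (column 22) matches."""
    kept = []
    for line in pdb_text.splitlines():
        is_atom = line.startswith(("ATOM", "HETATM"))
        if is_atom and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    return "\n".join(kept)


print(extract_chain(SAMPLE_PDB, "A"))
```

The result is a new string that can be written to a separate file, leaving the input untouched.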

In addition, a simple text editor does not give you a visual representation of the structure, which makes it difficult to get an idea of what exactly is going on in the protein.

Why settle for text files when you can see your molecule in its entirety?

 

 

The most common use of a PDB file is for molecular visualization. Scientists can use specialized software to generate 3D models of the molecule based on the information contained in the PDB file.

That’s why, unless you have very specific needs, the second option is probably the better one: opening the PDB file with a molecular visualization program.

You can do this with a variety of software packages, such as PyMOL (following the procedure we discussed here), VMD, or Chimera. These programs turn the PDB text file into a visual representation, so that you can explore the structure of the macromolecule in 3D, manipulate it, and analyze its properties.

These tools can help you gain insights into the structural and functional features of the macromolecule and also create visual representations to help you share your research with the rest of the scientific community.

PyMOL, for instance, offers a wide range of tools that allow you to do pretty much anything you want with your molecule.

If you want to read further, here are some useful articles showing how you can play with your PDB file using PyMOL: