bioinformatics - Compressing structural information in PDB files

Tuesday, 27 March 2007

There seems to be a lot of redundancy in PDB files. These files can of course be compressed with general-purpose compression programs like gzip, but I can't help but imagine that these tools are overlooking a significant amount of redundancy in PDB files. Are there compressors that specifically target PDB files? If not, what are some aspects of PDB files that are ripe for compression?

Looking at a typical PDB file, some redundancies are immediately apparent. Other redundancies are less obvious. Consider this excerpt of two residues from 1MOB (myoglobin):

ATOM    332  N   LYS A  42      16.481  27.122 -10.033  1.00 11.15           N  
ATOM    333  CA  LYS A  42      15.926  28.134  -9.159  1.00  8.64           C  
ATOM    334  C   LYS A  42      16.970  29.081  -8.512  1.00 16.74           C  
ATOM    335  O   LYS A  42      16.687  30.075  -7.799  1.00 11.84           O  
ATOM    336  CB  LYS A  42      15.093  27.489  -8.043  1.00 18.03           C  
ATOM    337  CG  LYS A  42      13.731  26.888  -8.502  1.00 19.65           C  
ATOM    338  CD  LYS A  42      12.679  27.912  -8.953  1.00 17.94           C  
ATOM    339  CE  LYS A  42      11.438  27.406  -9.703  1.00 24.82           C  
ATOM    340  NZ  LYS A  42      10.474  28.567  -9.803  1.00 19.81           N  
ATOM    341  N   PHE A  43      18.218  28.599  -8.544  1.00 12.28           N  
ATOM    342  CA  PHE A  43      19.311  29.318  -7.919  1.00 11.81           C  
ATOM    343  C   PHE A  43      20.223  30.024  -8.949  1.00 10.95           C  
ATOM    344  O   PHE A  43      21.201  29.462  -9.450  1.00 10.08           O  
ATOM    345  CB  PHE A  43      20.138  28.301  -7.137  1.00  9.30           C  
ATOM    346  CG  PHE A  43      19.494  27.689  -5.877  1.00  9.53           C  
ATOM    347  CD1 PHE A  43      19.572  28.376  -4.679  1.00 12.01           C  
ATOM    348  CD2 PHE A  43      18.837  26.465  -5.923  1.00 10.54           C  
ATOM    349  CE1 PHE A  43      18.993  27.861  -3.536  1.00  9.59           C  
ATOM    350  CE2 PHE A  43      18.261  25.959  -4.775  1.00  8.62           C  
ATOM    351  CZ  PHE A  43      18.341  26.666  -3.597  1.00  7.89           C

These two residues occupy 1,638 bytes as plain text; when compressed with gzip, they occupy 467 bytes. For reference, the format of ATOM records in PDB files is defined at wwpdb.org/documentation/format33/sect9.html#ATOM.
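A measurement like this can be reproduced with Python's standard library. The following is a minimal sketch; the helper name `compressed_sizes` and the two-line sample are illustrative assumptions, and the numbers it prints depend on the exact bytes fed in (line endings, trailing spaces), so they need not match the figures quoted above.

```python
import bz2
import gzip
import lzma


def compressed_sizes(text: str) -> dict:
    """Return the byte size of `text` raw and under three stdlib compressors."""
    raw = text.encode("ascii")
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw, compresslevel=9)),
        "bz2": len(bz2.compress(raw)),
        "lzma": len(lzma.compress(raw)),
    }


# Illustrative input: the first two ATOM records of the excerpt.
SAMPLE = (
    "ATOM    332  N   LYS A  42      16.481  27.122 -10.033  1.00 11.15           N\n"
    "ATOM    333  CA  LYS A  42      15.926  28.134  -9.159  1.00  8.64           C\n"
)

print(compressed_sizes(SAMPLE))
```

On inputs this small, the fixed header overhead of each container format dominates, so a fair comparison would use whole PDB files.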

Almost all of the data in the above excerpt seems redundant. Counting whitespace-separated fields, the first field (ATOM), second field (atom serial number, e.g. 332 in the first row), sixth field (residue sequence number, e.g. 42), tenth field (occupancy, e.g. 1.00) and last field (element symbol, e.g. N) seem clearly extraneous. The fourth field (residue name) could be shortened from three characters to one character, or simply an integer. I'm not a data compression expert, but I imagine gzip picks up most of this redundancy.
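Whether those fields really are derivable can be checked per file before dropping them. Here is a rough sketch that slices ATOM records by the fixed columns of the PDB format and reports which fields carry no information for a given set of lines. The function names and the report keys are made up for illustration, and the "element equals first letter of the atom name" rule is an assumption that holds for the C, N and O atoms in this excerpt but not in general (e.g. metals or selenium).

```python
def parse_atom(line: str) -> dict:
    """Slice one ATOM record by fixed columns (0-based Python slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20],
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "occupancy": line[54:60].strip(),
        "b_factor": line[60:66].strip(),
        "element": line[76:78].strip(),
    }


def redundancy_report(lines) -> dict:
    """Report which fields could be regenerated instead of stored."""
    atoms = [parse_atom(line) for line in lines]
    pairs = list(zip(atoms, atoms[1:]))
    return {
        "record_constant": len({a["record"] for a in atoms}) == 1,
        "serial_consecutive": all(b["serial"] - a["serial"] == 1 for a, b in pairs),
        "occupancy_constant": len({a["occupancy"] for a in atoms}) == 1,
        # Assumption: only valid for one-letter elements such as C, N, O.
        "element_from_name": all(a["element"] == a["name"][:1] for a in atoms),
    }


SAMPLE = [
    "ATOM    332  N   LYS A  42      16.481  27.122 -10.033  1.00 11.15           N",
    "ATOM    333  CA  LYS A  42      15.926  28.134  -9.159  1.00  8.64           C",
    "ATOM    334  C   LYS A  42      16.970  29.081  -8.512  1.00 16.74           C",
]

print(redundancy_report(SAMPLE))
```

Any field whose check fails for a particular file would have to be stored explicitly for the compression to stay lossless.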

Slightly less obviously, the atom names for each residue also seem unnecessary. To my understanding, the atomic composition of every residue's backbone is always the same, and is represented in PDB files as "N", "CA", "C", "O". The same holds for each residue type's sidechain: a lysine sidechain will always be "CB", "CG", "CD", "CE", "NZ" and a phenylalanine sidechain will always be "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ".
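The template idea could look something like the sketch below. It only contains the two residue types from the excerpt, and the function names are hypothetical. Since real files can deviate from the template (missing atoms, a terminal OXT, hydrogens, alternate locations), the sketch stores nothing when the observed names match and falls back to the explicit list otherwise, so the scheme stays lossless.

```python
BACKBONE = ["N", "CA", "C", "O"]

# Only the two residue types shown in the excerpt; a full table would
# need an entry for every residue type.
SIDECHAIN = {
    "LYS": ["CB", "CG", "CD", "CE", "NZ"],
    "PHE": ["CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"],
}


def expected_atom_names(res_name: str) -> list:
    """Atom names implied by the residue name alone."""
    return BACKBONE + SIDECHAIN[res_name]


def encode_names(res_name: str, observed: list):
    """Return None if the template covers the residue, else the explicit names."""
    if res_name in SIDECHAIN and observed == expected_atom_names(res_name):
        return None
    return list(observed)


def decode_names(res_name: str, stored) -> list:
    """Inverse of encode_names."""
    return expected_atom_names(res_name) if stored is None else list(stored)


lys = ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"]
print(encode_names("LYS", lys))        # template match, nothing to store
print(encode_names("LYS", lys[:-1]))   # NZ missing, names stored explicitly
```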

A subtler redundancy, but one that might increase compressibility a lot, could lie in the atomic coordinates themselves. For example, in the backbone, would it be possible to deduce the X, Y and Z coordinates of each residue's atoms (12 data points: 4 atoms * 3 coordinates) given only the residue's phi, psi and omega dihedral angles (3 data points)? Could applying dihedral angles to atoms within sidechains similarly remove the need to explicitly list the 3D coordinates there?
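As a starting point for experimenting with this, here is a minimal sketch of the forward direction: computing a dihedral angle from four atom positions with the usual atan2 formula. The function name and the choice of pure Python are assumptions for illustration. Note that going the other way (angles back to coordinates) would also need bond lengths and bond angles; if those are taken as fixed ideal values rather than stored, the round trip would presumably only be approximate.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3) -> float:
    """Dihedral angle in degrees defined by four points (0 = cis, 180 = trans)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Coordinates copied from the excerpt (LYS 42 and PHE 43).
CA_42 = (15.926, 28.134, -9.159)
C_42 = (16.970, 29.081, -8.512)
N_43 = (18.218, 28.599, -8.544)
CA_43 = (19.311, 29.318, -7.919)
C_43 = (20.223, 30.024, -8.949)

omega = dihedral(CA_42, C_42, N_43, CA_43)  # peptide bond between 42 and 43
phi_43 = dihedral(C_42, N_43, CA_43, C_43)  # phi of PHE 43
print(omega, phi_43)
```

Psi of residue 43 cannot be computed from the excerpt alone, because it needs the N atom of residue 44.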

Could "temperature factor" (the second to last field in the excerpt) be losslessly removed, or compressed in some non-obvious way? What are some other possible optimizations that could be used to more efficiently compress PDB files? Are there any obvious grave performance implications of these various compression techniques on the speed of a hypothetical decompressor to convert back to the official PDB format? Have these questions been answered in the literature or an existing PDB-specific compression program?
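One lossless option for the numeric columns, sketched here under the assumption that temperature factors always carry two decimals and coordinates three (as in the excerpt), is to store them as integers and delta-encode consecutive values before handing them to a general-purpose compressor. The helper names are illustrative, and whether this actually shrinks the output would have to be measured on real files.

```python
def to_fixed(fields, decimals: int) -> list:
    """Convert decimal strings such as '11.15' to integers (1115 for decimals=2)."""
    scale = 10 ** decimals
    return [round(float(field) * scale) for field in fields]


def delta_encode(values) -> list:
    """Replace each value by its difference from the previous one."""
    out, previous = [], 0
    for value in values:
        out.append(value - previous)
        previous = value
    return out


def delta_decode(deltas) -> list:
    """Inverse of delta_encode (running sum)."""
    out, total = [], 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


def to_text(values, decimals: int, width: int) -> list:
    """Format integers back into right-aligned fixed-width PDB fields."""
    scale = 10 ** decimals
    return [f"{value / scale:{width}.{decimals}f}" for value in values]


# Temperature factors of the first four atoms in the excerpt.
b_fields = ["11.15", "8.64", "16.74", "11.84"]
deltas = delta_encode(to_fixed(b_fields, 2))
restored = to_text(delta_decode(deltas), decimals=2, width=6)
print(deltas, restored)
```

The same helpers apply to coordinates with `decimals=3` and `width=8`. For the decompressor this is just a running sum and a format call per value, so the speed cost should be small.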

Thanks in advance for any answers or feedback.

Edit:

Given that no PDB-specific file compressors seem to be available, I suppose my specific goal is to develop one. One potential application I see for this is in significantly decreasing fresh times-to-render in certain use cases of browser-based molecular visualization programs, e.g. Jmol, ChemDoodle Web Components or GLmol. Another application could be decreasing the time and size of data needed to download archives of PDB files like those described here.

This would of course require a way to efficiently decompress the packed PDB files, but this trade-off between decompression time and download time seems like it could be useful in at least some niche applications.

Edit 2:

In a comment, nico asks "How would compressing the file decrease render time?". Decreasing gzipped PDB file size (e.g. by half or more) and thus decreasing time needed to download the file would decrease the time between when the PDB file was requested from a remote server and when the structure was rendered by a molecular visualization program running on a client machine. Apologies if that use of "fresh time-to-render" in that context was unclear.

A lossless compression could also involve encoding the PDB file as an object (e.g. JSON) that is faster for the visualization program to parse, decreasing render times that way. Looking around further, if the application only required displaying the 3D structure and not also retaining data about specific atoms and residues, then using a binary mesh compression (e.g. webgl-loader) seems like it would probably decrease time-to-render even more.
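To make the JSON idea concrete, here is one possible encoding, sketched with a column-per-field layout and integer coordinates. The key names (`name`, `res`, `seq`, `xyz`, `b`) and the layout are assumptions, not an existing format, and the sketch keeps only the fields discussed above, so it is not a full round trip of the original file.

```python
import json


def atoms_to_json(lines) -> str:
    """Encode ATOM records as a compact, column-oriented JSON object."""
    cols = {"name": [], "res": [], "seq": [], "xyz": [], "b": []}
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        cols["name"].append(line[12:16].strip())
        cols["res"].append(line[17:20])
        cols["seq"].append(int(line[22:26]))
        # Coordinates as integer thousandths, flattened x, y, z, x, y, z, ...
        cols["xyz"].extend(round(float(line[i:i + 8]) * 1000) for i in (30, 38, 46))
        # Temperature factor as integer hundredths.
        cols["b"].append(round(float(line[60:66]) * 100))
    return json.dumps(cols, separators=(",", ":"))


SAMPLE = [
    "ATOM    332  N   LYS A  42      16.481  27.122 -10.033  1.00 11.15           N",
    "ATOM    333  CA  LYS A  42      15.926  28.134  -9.159  1.00  8.64           C",
]

print(atoms_to_json(SAMPLE))
```

A flat integer array like `xyz` can be copied straight into a typed array on the browser side, which is the kind of parsing shortcut this approach is after.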
