There seems to be a lot of redundancy in PDB files. These files can of course be compressed with general-purpose compression programs like gzip, but I can't help but imagine that these tools are overlooking a significant amount of redundancy in PDB files. Are there compressors that specifically target PDB files? If not, what are some aspects of PDB files that are ripe for compression?
Looking at a typical PDB file, some redundancies are immediately apparent. Other redundancies are less obvious. Consider this excerpt of two residues from 1MOB (myoglobin):
ATOM    332  N   LYS A  42      16.481  27.122 -10.033  1.00 11.15           N
ATOM    333  CA  LYS A  42      15.926  28.134  -9.159  1.00  8.64           C
ATOM    334  C   LYS A  42      16.970  29.081  -8.512  1.00 16.74           C
ATOM    335  O   LYS A  42      16.687  30.075  -7.799  1.00 11.84           O
ATOM    336  CB  LYS A  42      15.093  27.489  -8.043  1.00 18.03           C
ATOM    337  CG  LYS A  42      13.731  26.888  -8.502  1.00 19.65           C
ATOM    338  CD  LYS A  42      12.679  27.912  -8.953  1.00 17.94           C
ATOM    339  CE  LYS A  42      11.438  27.406  -9.703  1.00 24.82           C
ATOM    340  NZ  LYS A  42      10.474  28.567  -9.803  1.00 19.81           N
ATOM    341  N   PHE A  43      18.218  28.599  -8.544  1.00 12.28           N
ATOM    342  CA  PHE A  43      19.311  29.318  -7.919  1.00 11.81           C
ATOM    343  C   PHE A  43      20.223  30.024  -8.949  1.00 10.95           C
ATOM    344  O   PHE A  43      21.201  29.462  -9.450  1.00 10.08           O
ATOM    345  CB  PHE A  43      20.138  28.301  -7.137  1.00  9.30           C
ATOM    346  CG  PHE A  43      19.494  27.689  -5.877  1.00  9.53           C
ATOM    347  CD1 PHE A  43      19.572  28.376  -4.679  1.00 12.01           C
ATOM    348  CD2 PHE A  43      18.837  26.465  -5.923  1.00 10.54           C
ATOM    349  CE1 PHE A  43      18.993  27.861  -3.536  1.00  9.59           C
ATOM    350  CE2 PHE A  43      18.261  25.959  -4.775  1.00  8.62           C
ATOM    351  CZ  PHE A  43      18.341  26.666  -3.597  1.00  7.89           C
These two residues occupy 1,638 bytes as plain text; when compressed with gzip, they occupy 467 bytes. For reference, the format of ATOM records in PDB files is defined at wwpdb.org/documentation/format33/sect9.html#ATOM.
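For anyone who wants to repeat this kind of measurement, here is a minimal Python sketch. The helper name `sizes` is my own, and the sample is just two records from the excerpt in fixed-width form, so the numbers it prints are not the ones quoted above.

```python
import gzip

def sizes(text: str) -> tuple[int, int]:
    """Return (raw_bytes, gzip_bytes) for a block of PDB text."""
    raw = text.encode("ascii")
    return len(raw), len(gzip.compress(raw, compresslevel=9))

# Two records from the excerpt, one per line (trailing padding omitted).
sample = (
    "ATOM    332  N   LYS A  42      16.481  27.122 -10.033  1.00 11.15           N\n"
    "ATOM    333  CA  LYS A  42      15.926  28.134  -9.159  1.00  8.64           C\n"
)
raw_size, gz_size = sizes(sample)
print(raw_size, gz_size)
```

On inputs this small the fixed overhead of the gzip container is a large share of the output, so a fair comparison needs a whole file.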
Almost all of the data in the above excerpt seems redundant. The first field (ATOM), second field (atom index, e.g. 332 in the first row), sixth field (residue index, e.g. 42), tenth field (occupancy, e.g. 1.00) and last field (element name, e.g. N) seem clearly extraneous. The fourth field (residue name) could be shortened from three characters to one character, or to a small integer code. I'm not a data compression expert, but I imagine gzip picks up most of this redundancy.
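To make the idea concrete, here is a rough Python sketch of dropping those fields and rebuilding them. All names (`Atom`, `parse`, `pack`, `unpack`) are hypothetical. It assumes whitespace-separated fields (which works for this excerpt but not for every PDB file, since the real columns can run together), consecutive atom indices, an occupancy of 1.00, and an element equal to the first letter of the atom name. When an assumption fails it raises, so the field would have to be stored explicitly.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: str
    y: str
    z: str
    occupancy: str
    temp: str
    element: str

def parse(line: str) -> Atom:
    f = line.split()  # assumption: fields are whitespace-separated
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]), *f[6:12])

def pack(atoms):
    """Run-length encode residue-level fields and drop the predictable ones."""
    residues, rows = [], []  # [res_name, chain, res_seq, atom_count], (name, x, y, z, temp)
    for i, a in enumerate(atoms):
        if (a.serial != atoms[0].serial + i or a.occupancy != "1.00"
                or a.element != a.name[0]):
            raise ValueError(f"assumption violated at atom {a.serial}")
        key = [a.res_name, a.chain, a.res_seq]
        if not residues or residues[-1][:3] != key:
            residues.append(key + [0])
        residues[-1][3] += 1
        rows.append((a.name, a.x, a.y, a.z, a.temp))
    return {"first_serial": atoms[0].serial, "residues": residues, "rows": rows}

def unpack(packed):
    """Rebuild the full records from the packed form."""
    atoms, serial, rows = [], packed["first_serial"], iter(packed["rows"])
    for res_name, chain, res_seq, count in packed["residues"]:
        for _ in range(count):
            name, x, y, z, temp = next(rows)
            atoms.append(Atom(serial, name, res_name, chain, res_seq,
                              x, y, z, "1.00", temp, name[0]))
            serial += 1
    return atoms

lines = [
    "ATOM 339 CE LYS A 42 11.438 27.406 -9.703 1.00 24.82 C",
    "ATOM 340 NZ LYS A 42 10.474 28.567 -9.803 1.00 19.81 N",
    "ATOM 341 N PHE A 43 18.218 28.599 -8.544 1.00 12.28 N",
]
atoms = [parse(line) for line in lines]
packed = pack(atoms)
assert unpack(packed) == atoms  # round trip is lossless for this sample
```

Whether this beats gzip on the original text is a separate question; the packed form would still be passed through a general-purpose compressor afterwards.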
Slightly less obviously, the atom names for each residue also seem unnecessary. To my understanding, the atomic composition of every residue's backbone is always the same, and is represented in PDB files as "N", "CA", "C", "O". The same holds for the side chains: a lysine side chain will always be "CB", "CG", "CD", "CE", "NZ" and a phenylalanine side chain will always be "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ".
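A minimal sketch of that template idea in Python follows. The table only covers the two residues from the excerpt, and the function names are hypothetical. Since real entries can deviate from the template (unresolved side-chain atoms, a terminal OXT, hydrogens, alternate locations), the sketch falls back to storing the names explicitly whenever they do not match.

```python
# Heavy-atom order as it appears in the excerpt. A real table would need
# every standard residue plus a fallback for anything non-standard.
BACKBONE = ["N", "CA", "C", "O"]
SIDECHAIN = {
    "LYS": ["CB", "CG", "CD", "CE", "NZ"],
    "PHE": ["CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"],
}

def encode_names(res_name, names):
    """Return None when names match the template, else the explicit list."""
    template = SIDECHAIN.get(res_name)
    if template is not None and list(names) == BACKBONE + template:
        return None
    return list(names)

def decode_names(res_name, stored):
    """Inverse of encode_names."""
    if stored is None:
        return BACKBONE + SIDECHAIN[res_name]
    return stored

full_lys = ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"]
truncated_lys = ["N", "CA", "C", "O", "CB"]  # e.g. side chain not fully resolved

assert encode_names("LYS", full_lys) is None
assert decode_names("LYS", encode_names("LYS", truncated_lys)) == truncated_lys
```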
A subtler redundancy, but one that might improve compressibility considerably, may lie in the atomic coordinates themselves. For example, in the backbone, would it be possible to deduce the X, Y and Z coordinates of each residue's atoms (12 data points: 4 atoms * 3 coordinates) given only the phi, psi and omega dihedral angles (3 data points)? Could applying dihedral angles to atoms within side chains similarly remove the need to explicitly list the 3D coordinates there?
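As a starting point for experimenting with this, here is a small Python sketch that goes in the easy direction, from coordinates to dihedral angles, using the standard atan2 formulation. The helper names are my own. Going the other way (angles back to coordinates) would additionally assume idealized bond lengths and bond angles, so matching the deposited coordinates to three decimals would presumably also require storing small residuals.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Backbone atoms copied from the excerpt (LYS 42 and PHE 43).
n42, ca42, c42 = (16.481, 27.122, -10.033), (15.926, 28.134, -9.159), (16.970, 29.081, -8.512)
n43, ca43, c43 = (18.218, 28.599, -8.544), (19.311, 29.318, -7.919), (20.223, 30.024, -8.949)

psi_42 = dihedral(n42, ca42, c42, n43)       # N-CA-C-N(next)
omega_42 = dihedral(ca42, c42, n43, ca43)    # CA-C-N(next)-CA(next)
phi_43 = dihedral(c42, n43, ca43, c43)       # C(prev)-N-CA-C
print(round(psi_42, 1), round(omega_42, 1), round(phi_43, 1))
```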
Could the temperature factor (the second-to-last field in the excerpt) be losslessly removed, or compressed in some non-obvious way? What other optimizations could compress PDB files more efficiently? Would any of these techniques seriously slow down a hypothetical decompressor that converts back to the official PDB format? Have these questions been answered in the literature or by an existing PDB-specific compression program?
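One simple lossless option I can think of for the numeric fields is to store them as fixed-point integers and then delta-encode them. The sketch below (hypothetical helper names, Python) does this for the first five temperature factors in the excerpt. It converts through the decimal text rather than through floats to avoid rounding, and it assumes every value has the same number of decimals; a value written as "-0.00" would lose its sign.

```python
from itertools import accumulate

def to_fixed(text: str, decimals: int) -> int:
    """'-10.033' with decimals=3 becomes -10033, without float rounding."""
    whole, frac = text.strip().split(".")
    if len(frac) != decimals:
        raise ValueError(f"expected {decimals} decimals in {text!r}")
    return int(whole + frac)

def from_fixed(value: int, decimals: int) -> str:
    """Inverse of to_fixed."""
    sign = "-" if value < 0 else ""
    digits = str(abs(value)).rjust(decimals + 1, "0")
    return f"{sign}{digits[:-decimals]}.{digits[-decimals:]}"

def delta_encode(values):
    """Keep the first value, then store successive differences."""
    return [v - prev for prev, v in zip([0] + values[:-1], values)]

def delta_decode(deltas):
    return list(accumulate(deltas))

b_factors = ["11.15", "8.64", "16.74", "11.84", "18.03"]  # LYS 42, first five atoms
fixed = [to_fixed(b, 2) for b in b_factors]
deltas = delta_encode(fixed)
restored = [from_fixed(v, 2) for v in delta_decode(deltas)]
assert restored == b_factors
print(fixed, deltas)
```

Whether the deltas are actually smaller than the raw values depends on the data. I would guess the trick pays off more for coordinates, where consecutive atoms are usually close in space, than for temperature factors.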
Thanks in advance for any answers or feedback.
Edit:
Given that no PDB-specific file compressors seem to be available, I suppose my specific goal is to develop one. One potential application I see for this is in significantly decreasing fresh times-to-render in certain use cases of browser-based molecular visualization programs, e.g. Jmol, ChemDoodle Web Components or GLmol. Another application could be decreasing the time and size of data needed to download archives of PDB files like those described here.
This would of course require a way to efficiently decompress the packed PDB files, but this trade-off between decompression time and download time seems like it could be useful in at least some niche applications.
Edit 2:
In a comment, nico asks "How would compressing the file decrease render time?". Decreasing gzipped PDB file size (e.g. by half or more) and thus decreasing time needed to download the file would decrease the time between when the PDB file was requested from a remote server and when the structure was rendered by a molecular visualization program running on a client machine. Apologies if that use of "fresh time-to-render" in that context was unclear.
A lossless compression could also involve encoding the PDB file as an object (e.g. JSON) that the visualization program can parse faster, decreasing render times that way. Looking around further, if the application only needed to display the 3D structure, without retaining data about specific atoms and residues, then a binary mesh compression (e.g. webgl-loader) seems like it would decrease time-to-render even more.
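To illustrate what such a JSON encoding might look like, here is a small Python sketch that turns ATOM records into a column-oriented object. The layout and key names are my own invention, not an existing format, and it again assumes whitespace-separated fields. The coordinates go into one flat array on the assumption that a client could copy it straight into a typed array; whether this actually parses faster than PDB text would need to be measured.

```python
import json

def to_columns(lines):
    """Convert ATOM lines to a column-oriented dict (hypothetical layout)."""
    cols = {"name": [], "resName": [], "chain": [], "resSeq": [],
            "xyz": [], "bfactor": []}
    for line in lines:
        f = line.split()  # assumption: fields are whitespace-separated
        cols["name"].append(f[2])
        cols["resName"].append(f[3])
        cols["chain"].append(f[4])
        cols["resSeq"].append(int(f[5]))
        cols["xyz"].extend(float(v) for v in f[6:9])  # flat x, y, z, x, y, z, ...
        cols["bfactor"].append(float(f[10]))
    return cols

records = [
    "ATOM 332 N LYS A 42 16.481 27.122 -10.033 1.00 11.15 N",
    "ATOM 333 CA LYS A 42 15.926 28.134 -9.159 1.00 8.64 C",
]
encoded = json.dumps(to_columns(records), separators=(",", ":"))
decoded = json.loads(encoded)
print(encoded)
```

Formatting the floats back with three decimals for coordinates and two for temperature factors would restore the original text fields.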