Thursday 23 October 2014

codon usage - How are the various classes of E. coli genes determined?

I read through the paper. The authors start by stating that, at the time of writing, two different classes of codon usage profiles were known (or at least putatively so). All 782 unique CDSs were subjected to a two-step classification method. In step one, each CDS was reduced to a 61-dimensional vector, with one component for each of the 61 sense codons. A factorial cluster analysis (the categorical, multivariate equivalent of principal component analysis) was run on these vectors, condensing the 61 dimensions down to 2. In step two, with the data reduced to 2D and therefore more manageable, a k-means algorithm partitioned the genes. In the end, the genes were clustered into 3 distinct groups (classes I, II and III, with 502, 191 and 89 CDSs, respectively).
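To make the two steps concrete, here is a minimal sketch in Python. It is not the authors' code: the names `codon_vector` and `kmeans` are my own, I am assuming the vectors hold relative codon frequencies, and the factorial analysis that reduces 61 dimensions to 2 is left out entirely. The k-means shown is the plain Lloyd's algorithm with caller-supplied starting centroids, which may differ from the variant used in the paper.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 sense codons, in a fixed order so that all vectors are comparable.
SENSE_CODONS = [
    "".join(c) for c in product(BASES, repeat=3) if "".join(c) not in STOP_CODONS
]


def codon_vector(cds):
    """Return the relative frequency of each of the 61 sense codons in a CDS."""
    cds = cds.upper()
    counts = dict.fromkeys(SENSE_CODONS, 0)
    # Walk the sequence in frame, three bases at a time.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skips stop codons and ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in SENSE_CODONS]


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In the paper's workflow, `kmeans` would be run on the 2D coordinates produced by the factorial analysis, with three starting centroids, rather than on the raw 61-dimensional vectors.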



Only after clustering the gene set did the authors go back and look at the established annotation of each gene. It so happened that each class had a strong bias toward particular subsets of cellular function (e.g., metabolism, protein biosynthesis, transport). They did not use proteome data, but they were able to assign a role to a large number of these genes based on the body of literature available at the time.
