I have an observation and a correlation that are strong enough to present in preliminary, semi-quantitative form before the full study ends.  If one calculates the distance in nucleotides between stop codons within each of the six reading frames of an entire bacterial genome, one finds that this "non-stop" DNA (nsDNA) can be partitioned into two categories.  Short nsDNA is highly correlated with pathogens, while long nsDNA seems to be correlated with bacteria that live in a community (such as soil or gut).  The difference is easily observed in histograms, especially if only the ten longest nsDNAs are plotted.  One then sees that the communal nsDNA is about five to ten times longer than the pathogen nsDNA.  (Nonribosomal peptide synthetase (NPS) is an outlier at 15,000+ amino acid residues.)
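The measurement described above can be sketched roughly as follows.  This is a minimal illustration, not the code used for the numbers in this post: the function names are made up for the example, the genome is assumed to be one linear string of A, C, G, and T, and a run is counted in whole codons between stops.  Whether the flanking stop codons are included in the length, and how circular chromosomes or ambiguous bases are handled, are assumptions that would shift the values slightly.

```python
from typing import List

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def nonstop_lengths(seq: str) -> List[int]:
    """Lengths (in nucleotides) of stop-free codon runs in the three frames of one strand."""
    lengths = []
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if run:
                    lengths.append(run)
                run = 0
            else:
                # Codons with ambiguous bases are treated as non-stop (an assumption).
                run += 3
        if run:
            lengths.append(run)
    return lengths


def top_nsdna(seq: str, n: int = 10) -> List[int]:
    """The n longest stop-free runs over all six reading frames."""
    seq = seq.upper()
    lengths = nonstop_lengths(seq) + nonstop_lengths(reverse_complement(seq))
    return sorted(lengths, reverse=True)[:n]
```

Calling `top_nsdna(genome_string)` on a sequence read from a FASTA file would give a ten-element list that can be plotted or compared across genomes.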

In addition, we find that about 50% of all nsDNA has an opposing frame that is also nsDNA.  Usually, this would be taken to be a Complementary Protein (CP) pair.  However, taking the reverse complement of a protein-encoding sequence seems an unlikely way to make a second functional protein.  Only the second position of the codon remains at the second position: hydrophobic NTN becomes hydrophilic NAN, and NCN, which is tightly constrained in both hydropathy and molar volume (HMV), becomes NGN, which encodes a set of amino acid residues that are extreme outliers in HMV.  The wobble of position 3 (supporting redundancy) becomes the stringency of position 1, which is important in determining amino acid identity.  This is a highly counterintuitive method for creating a new protein.  It suggests more of a regulatory nature for most CP nsDNA.  In fact, we would like to suggest that this RNA-RNA regulation could in part be intercellular and support the coordinated metabolism of the community.
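The codon argument above can be checked with a few lines of code.  The sketch below (function names are illustrative only) reads each of the 64 codons on the opposite strand over the same three base pairs and records what happens to the middle base.  The middle base stays in the middle but is complemented, while the third base becomes the complement in the first position.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def opposing_codon(codon: str) -> str:
    """Codon read on the opposite strand over the same three base pairs."""
    return codon.upper().translate(COMPLEMENT)[::-1]


def middle_base_map() -> dict:
    """For each middle base, the set of middle bases seen on the opposite strand."""
    mapping = {}
    for first, middle, third in product(BASES, repeat=3):
        opposite = opposing_codon(first + middle + third)
        mapping.setdefault(middle, set()).add(opposite[1])
    return mapping


print(middle_base_map())       # every NTN pairs with NAN, every NCN with NGN
print(opposing_codon("CTG"))   # CTG (Leu) is read as CAG (Gln) on the other strand
print(opposing_codon("GCC"))   # GCC (Ala) is read as GGC (Gly) on the other strand
```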

Other than the length measurements, our analyses are limited to correlation, not causation. Therefore, to go forward with this hypothesis, one would have to find a molecular mechanism for these mRNAs to interact between pairs or among groups of different species in this putative multicellular "organism".

If one added computational analyses for a starting methionine and for transcriptional and translational signals, the CP nsDNA would be slightly shortened and could be aptly named a DORF because it is part of a double ORF.
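As a rough illustration of the shortening step, the sketch below trims a stop-free run to its first in-frame ATG.  It is deliberately simplified: the function name is hypothetical, only ATG is accepted as a start (bacteria also use alternative start codons), and no transcriptional or translational signals are examined.

```python
def trim_to_first_atg(nsdna: str) -> str:
    """Return the part of a stop-free run that begins at its first in-frame ATG.

    The input is assumed to start on a codon boundary.  An empty string is
    returned when no in-frame ATG is present.
    """
    seq = nsdna.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] == "ATG":
            return seq[i:]
    return ""
```

Applying this to both members of a CP pair and keeping only the region where the two trimmed runs still overlap would give one possible definition of a DORF.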

I would be very pleased if my preeminent graduate student of 25 years ago (Adam Arkin) redid these calculations.

Note that the length differences between communal and pathogenic nsDNA must be normalized to account for the much smaller genome size of the latter.
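One possible normalization, offered only as an assumption and not as the procedure used here, is to divide each observed length by the longest stop-free run expected in a random sequence of the same size and GC content.  The sketch below uses a first-order extreme-value estimate for single-strand frames: stops are treated as independent with probability p per codon, so run lengths are roughly geometric and the longest of about N·p runs is near ln(N·p) / (-ln(1 - p)) codons.  All function names are illustrative.

```python
import math


def stop_probability(gc: float) -> float:
    """Chance that a random codon is TAA, TAG, or TGA at the given GC fraction."""
    at = (1.0 - gc) / 2.0   # P(A) = P(T)
    g = gc / 2.0            # P(G) = P(C)
    return at ** 3 + 2.0 * at * at * g   # TAA, plus TAG and TGA


def expected_longest_run_nt(genome_bp: int, gc: float) -> float:
    """Rough expected length (nt) of the longest stop-free run over six frames."""
    p = stop_probability(gc)
    codons = 2 * genome_bp   # six frames with genome_bp / 3 codons each
    runs = codons * p        # about one run per stop codon
    return 3.0 * math.log(runs) / -math.log(1.0 - p)


def normalized_length(observed_nt: float, genome_bp: int, gc: float) -> float:
    """Observed length as a multiple of the random-sequence expectation."""
    return observed_nt / expected_longest_run_nt(genome_bp, gc)
```

Under this model genome size enters only through a logarithm, whereas GC content changes the expectation much more strongly, so both would need to be supplied for each genome being compared.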

Random 50% GC 4M bp {404, 368, 347, 332, 332, 326, 317, 317, 317, 314}

Random 70% GC 4M bp {314, 272, 266, 254, 254, 251, 248, 248, 245, 242}

Random 30% GC 4M bp {272, 230, 221, 212, 206, 203, 200, 197, 194, 194}

Rerun 30% {239, 227, 221, 215, 212, 212, 209, 206, 197, 191}

Human Chromosome 21q 29M bp {3041, 2633, 1718, 1682, 1589, 1478, 1445, 1361, 1211, 1154}

Elephant shark 18M bp {1865, 1535, 1223, 1211, 1175, 1121, 989, 980, 932, 932}

Azotobacter vinelandii {13022, 4466, 4292, 4211, 4145, 4022, 4007, 3962, 3929, 3917}

Pseudomonas aeruginosa PA96 {7931, 6584, 6365, 4775, 4760, 4454, 4325, 4271, 4235, 4211}

Rhodobacter capsulatus SB1003 {6209, 6146, 5513, 4343, 4259, 4205, 3914, 3824, 3752, 3650}

Stenotrophomonas maltophilia {5843, 5675, 5021, 4919, 4499, 4274, 4259, 4214, 3902, 3878}

Synechococcus sp. WH 5701 {4094, 2855, 2750, 2552, 2528, 2519, 2489, 2423, 2396, 2357}

Halobacterium salinarum R1 {3482, 3131, 2936, 2783, 2756, 2729, 2720, 2681, 2576, 2576}

E. coli K-12 MG1655 {2975, 2897, 2534, 2411, 2258, 2147, 2096, 1817, 1793, 1778}

Pichia sorbitophila CBS7064 chr N {1184, 1094, 983, 935, 875, 872, 872, 857, 848, 839}

Bacillus cereus NC7401 {1148, 1058, 980, 920, 920, 914, 797, 746, 731, 713}

Streptococcus pneumoniae PCS8235 {1007, 947, 779, 746, 737, 737, 731, 704, 701, 701}

Vibrio cholerae {983, 914, 869, 860, 854, 827, 800, 785, 758, 743}

Spathaspora passalidarum {935, 872, 710, 629, 614, 590, 587, 581, 581, 578}

Staphylococcus aureus MRSA252 {782, 767, 521, 509, 479, 467, 455, 425, 398, 392}

Helicobacter pylori 2018 {704, 551, 530, 497, 473, 458, 458, 455, 455, 452}

Sulfolobus acidocaldarius DSM 639 {617, 419, 371, 341, 341, 335, 326, 323, 323, 320}

Clostridium botulinum {518, 497, 452, 413, 413, 410, 389, 383, 380, 377}

Borrelia burgdorferi B31 {494, 449, 446, 431, 407, 389, 389, 383, 371, 368}

Prochlorococcus marinus MED4 {410, 386, 338, 335, 320, 320, 314, 308, 305, 296} (verified twice)

Mycoplasma genitalium G-37 {410, 344, 332, 302, 293, 293, 266, 257, 254, 242}

{Sequence Position (end), CP Length} Top 10
