Is my concept, based on which I intend to predict the function of genes, which are not yet known, feasible and if so how?

It seems to work most reliable for the Gene Ontology term

“Mitochondrial Large Ribosomal Subunit” (MLRS)

Somebody told me at www.researchgate.net that the correlation of gene expression patterns is highest for the GO term “Mitochondrial Large Ribosomal Subunit (MLRS).

I finally succeeded in visualizing this hypothesis with the plotting function in Python. Please see the high correlations between the trajectories of the gene expression time series plots between all approximately 40 member genes, which belong to the GO term MLRS in the graph below on attached word file:

It worked beautifully. It is very easy to tell by simple visual inspection that all the gene members, which have been grouped together in the GO Term MLRS are very highly co-expressed, i.e. much higher than could be expected by random chance alone. This implies that all the member genes seem to work very well together in a highly cooperative manner.

The Microarray data for this from which I have plotted the gene expression trajectories above come from a yeast dataset consisting of 81 time points, at which the mRNA levels were measured in approximately three-minute intervals over six hours spanning about three yeast cell cycles.

It would be interesting to analyze the similarities between the promoters of all the member genes, which belong to the MLRS GO term, because it seems to be reasonable to assume that the more homogeneous the gene expression pattern between all of the member genes belonging to the same GO term the more correlated must be their gene expression trajectories to one another as well as the overall impact that their Transcription Factor Binding Sites (TFBS) distribution must have on the overall transcriptional regulation of the same group of genes because TFBS distribution and time series trajectory similarities must describe and define exactly the same information and therefore, they can be considered as two very different dimensions to describe exactly the same phenomenon, i.e. transcription,. If this is true, then the relative ranking based on their similarities, whose exact definition, determination, quantification and calculation remains to be defined.

Any GO term, for which it can be shown that the trajectories of the gene expression patterns for all its member-genes are much higher correlated to one another than to the remaining genome, i.e. much higher than could be expected by random chance alone,, could help us to infer the function of genes, which have not yet been discovered. This cannot be assumed automatically for all the GO terms because some of them have too many gene-members to reveal a nicely predictable and reproducible, highly GO-term-specific gene expression patterns with which we could try to match to the most correlated trajectories of the gene expression pattern of genes, which functions are still unknown at this time.

I read some interesting papers, which challenge the currently applied definition for those GO terms that have a few member genes, which trajectories deviate highly from the majority of its member genes by claiming that they must have been assigned to the GO-term, with which their trajectories don’t share higher similarities than could be expected by random chance alone, by mistake and hence make a strong case for such kind of GO terms to be revised. This might be an easy way to get a publication before I will be graduating because my intuition tells me that if I perform the same analysis as I have been for the MLRS gene members, I have a high chance finding a few outliers, which GO-term membership could be questions for the first time and which could therefore be considered as novelty.

Furthermore, I hypothesize that we will find the disproportionally highest percentages of homogeneously expressed genes in the Molecular Function (MF) GO Term ontology when compared to the cellular component (CC) and biology process (BP) ontology because MFs can be conceptualized as smallest and most indivisible functional units, which would lose their overall functionality, if their member-genes would be subdivided any further into even smaller groups because they would no longer contain all member genes, which most work together to form a more complex function unit and whose expression pattern must therefore be highly coordinated and regulated because any of its subgroups working in isolation would contain insufficient member-genes to retain complex functionalities, which can achieve much more when working well together than could be expected when simply adding together the contributions of each of its gene-members considered in isolation.

SPELL (Serial Pattern of Expression Levels Locator) (see http://spell.yeastgenome.org/) uses this approach to infer the function of those genes, which could not yet experimentally verified, in a very similar way as I am proposing, but I think that if I can get my new Linux computer to do what I am envisioning, we have very good chances to predict gene functions that SPELL cannot. SPELL uses 11,449 microarray chips from 386 publish studies. But SPELL appears to consider each of its member studies, on which its conclusions are based, in isolation from one another.

When looking at the thousandths of plots of the time series trajectories, which my many R scripts can generate in a few minutes I noticed that for many of the larger GO terms there seem to be 2 to 3 very distinct expression pattern, which remind me of a decision tree because their expression is very homogeneous within the same expression them but very heterogeneous between distinct expression themes.. That is why I hypothesis that every GO term, which we can easily divide into 2 distinct gene expression sub-themes, has not yet been subdivided into its smallest possible functional units because one homogeneous expression theme indicates that its member genes contribute to the same functional unit, whose expression pattern of its member genes must be very well coordinated and regulated in a similar way to maintain complex functions, which are more than the sum of its parts.

I think we’ll have very good chances to outperform the predictive power of SPELL because instead of only consider each dataset in isolation, we could concatenate each microarray ,CEL file by treating it as a data point of a time series dataset no matter how different the experimental conditions, phenotypes, genotypes and external stressors may be to outperform SPELL, because all we need to care about is the similarity of the trajectory, which connects all the data points no matter how unrelated the studies, from which they have been taken, may be. The more data points we have the more opportunities we can provide any pair of genes to differ from one another under certain conditions. This would increase our sensitivity to detect conditionally functional units, which can only be observed very rarely, because the genes, which must be expressed in a highly coordinated manner, are only required to form functional units in extreme conditions, such as CR, metal and other toxicities, external stressors, media composition, etc. Exploring very rarely conditionally functional units and determining their member genes will be very informative because it can tell us more about still undiscovered options cell may have to respond, which we can only exploit after we have discovered that they exist.

I consider such kind of rarely observed conditional functional units as one extreme end of the spectrum, which can teach us about the probably still very unexpectedly underestimated options of which life can be capable, which is opposed by the imperatively essential functional units at the other extreme of the spectrum, implying that life would be impossible if the expression of their member genes would not be well coordinated because it depends on the MFs to which their member-genes belong. I would even go so far to consider replicates as independent time series experiments and consider them sequentially instead of as a group because I found that datasets without replica but more time points have more predictive power than datasets with fewer data points but lots of replica. This is because when considering replica as a group they cause trajectories to be much shorter, fuzzier and defuse than datasets with no replica but lots of data points. Let’s say I have a dataset with 50 time points and 3 replica my temporal resolution would be much less than if I took the 50 data points from each of its 3 replicates and concatenate them sequentially to one another because that would give me 150 data points of which I could think like 3 cell cycle periods since they must be very similar to each other because actually they were replica. The advantage of this approach is that I can consider periodic and cyclic parameters, which would not be available to characterize differences in trajectories unless periodic parameters, such as period length, amplitude and oscillation patterns could be compared. It is kind of like the same that – instead of considering and evaluating three lives taking place at the same time, I consider them sequentially, i.e. one after another instead of a single unit. This has the advantage that I can triple the opportunities for expression pattern to differ between closely related genes and reduce ambiguous fuzziness if I don’t consider the replica simultaneously.

In a nutshell my thinking is like this:

Replica tend to disguise differences between closely related and are only possible at the expense of having fewer time points of measurements. But the fewer time pints I have less opportunities I can give to my genes to differ from one another. I have not observed that the increase in statistical power and increase in the options to apply statistical calculations and methods can compensate for the benefits, which I have observed from increasing the temporal resolution by a factor, which is equal to the number of replica I could have for each time point.

Let’s take the Tsu dataset from 2005. It’s one of my most favorite yeast microarray dataset because it spans three consecutive cell cycles in 25 minutes intervals. It provides 12 time points per cell cycle and 36 time points all together. This dataset has shown me the power of adding periodic parameters, such as phase shift, amplitude and period lengths to distinguish between closely related groups of genes (or GO terms), between which we would not been able to distinguish if instead of only one replica we’d choose to have three, but at the expense of the temporal resolution, which would cause our time interval between consecutive measurements to rise from 25 minutes for no replica to 75 minutes if we’d choose to have three replica for each time point. I claim this because we know for example that mitosis in yeast can be as short as 30 minutes or even shorter than that. We also know that most of the lipids a daughter bud inherits from its mother must be synthesized within the short time frame of mitosis because there is not enough space in the mother to store all the lipids, its needs to give to its daughter unless it can place it within the rapidly growing bud, which is in the process of developing into the daughter. We also know that transcription and translation must work together to accomplish the very steep spike in lipid synthesizing genes, which cannot last any longer than the 30 minutes of mitosis and cytokinesis. If we’d measure gene expression every 25 minutes, we still have pretty good chances of detecting the very steep but also very brief spikes of lipid synthesizing genes during mitosis. But if we – instead of measuring gene expression every 25 minutes, we’d choose to measure it only every 75 minutes because we’d much rather want to have 3 replicates for each of our time points, we’d face a very high risk of missing the steep spikes in the expression of lipid synthesizing genes, which are limited to mitosis only. If we’d favor to have more replicates for each time points instead of more time points even at the expense of not having any replica we might still not even aware of many essential spikes, which only happen during mitosis. This would prevent us from being able to use gene expression pattern to distinguish between genes, which have too similar gene expression trajectories to distinguish them from one another during G1, S and G2. This would result in the misconception to consider those genes as part of the same MF because we’d mistakenly assign then them to the same GO term because we would not be able to detect the huge difference in their expression if it is limited to the short time period of mitosis only.

The main advantage of the approach to treat transcriptome measurement like a single time point in a cell cycle dataset is that we would not even have to spend the time to read each paper that goes with each dataset because we no longer need to worry about specifics of each study because we are only interested in maximizing the opportunities between the differences of closely related and very similar looking genes but we would not need to worry about the very specifics, which have caused the differences as long as our goal is to group genes based on their expression pattern similarities into indivisible groups of functional units as long as we don’t need to be concerned about the causes, which would explain those differences, which we intend to exploit as a convenient way to quickly create very reliable and highly sensitive similarity measures to redefine our GO terms. The only thing, we must make sure for this shortcut to work as intended is to ensure that for each and every gene the temporal order of data points does not change. We even don’t need to worry about extreme high variations between the time intervals separating subsequent transcriptome measurements because no matter whether we measure every 3 minutes or every 10 hours all what is changing is the slope connecting 2 data points, which tends to be steeper for measurements lying 10 hours apart than only 3 minutes. But since changes in time intervals affect all the genes in the same way they cannot adversely affect our thus created similarity measure generating shortcut at the expense of understanding them but not at the expense of correctly inferring still unknown gene functions as long as we know the function of at least one gene for each similarity group that we want to expand into a GO term. The worst case scenario of such an approach might be that we may not be able to extract any additional information if the time points are too far apart from one another for detecting the very impressively steep but also very brief deviation in gene expression of otherwise almost identically expressed genes if their time points are too far apart from one another for measuring them but it would not harm or counteract the insights, we may extracts from time series datasets with very high temporal resolution and extremely brief time intervals between subsequent transcriptome and proteome measurements.

The same applies to gene knockout experiments consisting only of the mutant and WT because all on which our function predicting power depends on is the variation between the steepness of the slops connecting experimental and control group for each gene. This puts us into a position, which allows us to be very successful while being lazy at the same time.

I’d love to perform the following analysis as proof of principle that this approach is promising no matter how counterintuitive it may look at the first glance.

Affymetrix is making two types of microarray chips for yeast, i.e.

1.) The Yeast 2 chip and

2.) The Yeast 98 chip.

For both of these microarray chips there are about 400 studies about which approximately 100 of them are time series studies.

I’d love to train a supervised machine learning algorithm with training data of time series experiments from the yeast 2 chip and use the yeast 98 chip as testing set to see how many time series studies can be identified correctly for the following reasons:

1.) the predictive power of time series datasets tends to be higher than two-contrast group datasets because we can employ cyclical and periodic parameters, such as period lengths, amplitude and phase shift to distinguish between very similar looking yet functionally different genes that we might not be able to detect without having the additional discriminatory power of these periodicity-depending parameters.

2.) If we can reliably automatically detect the boundaries separating time series from other kinds of datasets, then we’d have the advantage of letting our machine learning algorithm to apply periodic and cyclical parameters only time series but not to other datasets because applying periodicity and cyclical parameters to datasets, which are not time series datasets, we may actually hurt our gene-function predictive power because we would try to read something into our data that cannot exist. This would put us at too high risk of over-fitting and making wrong conclusions. I am confident that it will work based on preliminary results. The challenging part is to get it to work for lots of datasets at the same time.

I’ll probably need to redo all my past analysis, which I have done in R so far, in Python again because when I was still working in R, I deleted all the genes, for which I had duplicates in their systematic name because I did not know of any better way to deal with this problem. But since most other students, who have been advised to handle this problem the same way I did, we’d all fail in discovering any affect on aging, which might be caused by some but not all transcript variant for the same gene. That is why I sort of need to build a unique row identifier, i.e. unique key to identify any tuple (i.e. row) that is obtained from automatically merging gene name with its associated probe-set ID.

Another challenge is that the yeast 2 chip and yeast 89 chip have different probe-set IDs and gene names. I have not yet been able to succeed in coming up with a programmatic matching strategies between yeast 98 and yeast 2 Affymetrix microarray gene transcripts other than deleting any genes, for which I’d encounter duplicates in the systematic gene names. But I feel reluctant to do that because I feel that would defeat the purpose of my analysis since that would prevent it from being any better of the extensive work on these issues that has been completed by so many much faster and smarter students than me.

I am just wondering how this gene name duplication problem, which refers to different splicing variants, is handled in human and cancer genome analysis since humans make much more use of alternative splicing than yeast. I still cannot understand we nobody seems to propose to stop adhering to the obviously obsolete and misleading misconception, which is inherently inevitable when making a gene the smallest indivisible unit to which a single name identifier can refer. Conceptually, for me, each exon or transcript ID should be treated as if it were a totally independent gene because the fact that 2 exons where spliced out from a single connected strand of immature mRNA does not make them any more similar to one another when compared to two transcripts, which originated from different chromosomes. I guess the long overdue reluctance to break away from obviously wrong and obsolete concepts may be due to historical reasons because probably we discovered genes long before we could suspect any concept of exons and alternative splicing variants. This resulted in the development of a nomenclature, which allowed referring easily to genes, but not to each individual exon or splicing variants, into which each gene could be divided upon as soon as humanity discovered that each gene could be further subdivided into independently functioning units, to which we now refer to as “exon”.

I strongly feel that exons should be treated like and referred to in the same way as we deal with adjacently coded neighboring genes because the only difference between them, which - by the way - should not affect gene function, is that adjacently coded genes have not been transcribed onto the same strand of premature mRNA, whereas exons sharing the same gene identifier have been transcribed initially onto the same immature mRNA strand during the very early steps of transcriptional modifications.

The following line of reasoning tells me that holding on to the obsolete concept of a gene being the smallest relevant functional unit, which must be accounted for in our analyses is raising our risk for missing essential details unnecessarily, is that I have not yet met any rationally thinking scientist claiming that the 10th and 11th gene, which is coded for on the same chromosome III are functionally more related and similar to each other than the very first, i.e. gene 1 on chromosome III and very last gene, e.g. gene 526 on chromosome III, which are coded for as far apart from one another as they can possibly be while still being associated with the same chromosome as long as the two adjacently coded genes, e.g. gen 10 and 11 on chromosome III, don’t share promoter elements. But, if it is true, that the intra-chromosomal distance between genes has no effect on similarities between nucleotide sequences and gene functions, why should exons, which originated from the same immature mRNA transcripts should be more similar in their nucleotide sequence and function to one another than genes separated from one another by the maximum number of nucleotides, which is possible any chromosome? The only exception to this argument could be gene duplication, which is much more likely to have occurred between adjacently coded genes than between genes coded for at the very extreme end points of the same chromosome. Genes, which have arisen from one another by gene duplication events tend to be more similar to one another in their promoter and nucleotide sequences than genes coded for at the most extreme opposite ends of the same chromosome. But for all other adjacently coded genes, which have not arisen from gene duplication, it can safely be assumed that they are no more similar to one another than genes, which are coded for as far apart as they can be on both extreme chromosomal ends of the same chromosome.

It is true that exons of the same gene, which originated from the same immature mRNA transcript, must have been initially transcribed while under the control of the same promoter region, but the fact that the selection of exons and their order has been determined by a totally different and unrelated mechanism, i.e. alternative splicing, it is reasonable to expect that alternatively spliced exons would cause their co-occurrence on the same mature mRNA, which gets translated by the same ribosome, to be no more likely than could be expected by random chance. In fact, the term “mutually exclusive exons” refers to any pair of exon, which has never ever been observed to be translated by the same ribosome despite having originated from the same immature mRNA transcript. This tells me that our long outdated nomenclature to refer to transcripts must be updated until it allows us to easily and unambiguously distinguish between all possible imaginable differences between transcriptional fragments affecting any kind of ratios between protein abundances.

Furthermore, I’d like to propose a new nomenclature and method for ranking GO-terms based on similarities between their member genes because this would help us in distinguishing between GO terms, which are well-suited for inferring still unknown gene functions and other GO terms, which should never be used for making functional inferences for any gene, for which no function could be proposed so far. In a nutshell, any GO term, whose members share much more correlated gene expression patterns, which could be expected by random chances alone, a very good candidates to help us to infer novel gene functions based on much higher similarities between the trajectories of the gene expression pattern between all of the genes belonging to the same GO term. On the other hand, I have also seen a few GO terms, whose gene expression patterns differ more from one another than could be explained by random chance alone. We must be aware of such kind of GO terms, which are composed of genes that appear to have much less in common with one another than could be explained by random chance alone.

Finally, the selection of datasets, based on which the time series curve trajectories have been plotted, may be used to make a final decision whether a particular GO term should be used for making prediction based on trajectory similarities of its member genes, because – for example – the genes involved in oxidative respiration, i.e. TCA and ETC, might be good candidates to infer novel gene functions as long as oxygen is present, but maybe completely meaningless as soon as no more oxygen is available, i.e. under fermenting conditions, because then all genes needed for oxidative respiration have been turned off; thus, causing their gene expression trajectories to resemble flat horizontal lines, which are parallel to the X axis, indicating that those genes are no longer expressed under fermenting conditions. Such kind of horizontal trajectories, which are parallel to the X-axis, could easily be mistaken for being even more highly correlated to one another than biological replicates because 0 = 0 for all genes, which have been turned off. Unfortunately, any GO term, whose member-genes have been turned off completely, must never be used for inferring any kind of similarities because they would almost inevitably contribute to the illusionary misconception that the trajectories for all GO terms, for which all member genes have been completely turned off, would be best suited for making functional inferences because mathematically their time series trajectories would inevitably lead to the erroneous illusion of perfect correlation when analyzed numerically without understanding that any kind of similarity prediction is only possible for genes, which expression changes significantly over the time span of observation, because without such kind of changes time series trajectories lack the opportunity to reflect differences in the gene expression pattern between genes based on which they can be grouped into different functional units. This implies that any GO terms having two or more adjacent time points with no gene expression should never be used as a basis for making functional and other prediction based on similarities between because a slope of zero connecting the two or more adjacent time points below the mRNA detection limit could misrepresent perfect correlation between any pair of genes, which share two or more adjacent time points of no gene expression at al. This is because all adjacent time points with no gene expression will inevitably be connected by a horizontal line, which is parallel to the X axis, i.e. which has a slope of 0 and which is below the mRNA detection limit. If two or more genes share the same adjacent time points of no gene expression, they would inevitably give the false impression of perfect correlation; and thus, could trick data scientists into selecting exactly the wrong GO terms for making functional inferences, which is inevitably doomed to fail for the following reasons:

To better understand it you must know that the yeast 2 microarray chip from Affymetrix has a little over 10,200 probe-set IDs. However, only 5,117 of these probe-set IDs are for genes of the baker’s yeast and the remaining 5,000+ genes are for the fission yeast plus a few extras, which are Affymetrix quality control probe-set IDs. Most of the microarray datasets, which I have analyzed so far, were for baker’s yeast only. For the fission yeast, no gene expression is possible, because no fission yeast DNA was available to serve as a template from which fission east mRNA could be transcribed. Unfortunately, gene expression analysis of microarray and RNA Seq. data is still far away from being an exact quantitative and reproducible science because its data is still very noisy and seems to depend on money still unidentified confounding variables, which make it still very challenging to correctly discriminate between noise in the system and legitimately true changes in mRNA abundances, which can be clearly unambiguously and confidently attributed to changes in experimental conditions. This adds to the challenges encountered when comparing gene expression between different studies, especially the mRNA abundance has been determined with different microarray chips from different manufactures. Some authors claim to have found a solution, which allows to properly account for the differences in mRNA measuring methods, whereas other authors warn from the risk of getting tricked into making false conclusions when calculating expression ratios between different genes from the same study based on their intensities because their nucleotide sequence tends to have more impact on the measured intensity than the abundance of mRNA strands, which has formed hydrogen bridges between complementary nucleotide sequences for the about 50 base-pair long probe-sets and the mRNA fragments, which has annealed to them. This group of authors claims that ratio-based comparisons can only be safely made between different time points or experimental conditions for the same gene, but not across different genes even if they came from the same study. But if this was true, then my proposed approach to infer gene functions based on trajectory similarities between different genes from the same dataset would fail for reasons explained above. But if this was true I should not have been able to obtain the nicely looking and extremely highly correlated trajectories for the approximately 40 genes, which below to the MLRS GO term. This can be tricky to figure out, because sometimes certain approaches, which seem to completely confirm our experimentally observed observation sometimes may utterly fail in other circumstances for reasons, which still remain to be discovered.

Maybe the reason why we can observe such confidence boosting time series trajectory correlation plots for all of the approximately 40 member-genes of the MLRS GO term could be due to the assumed fact that the measured intensities can much better reflect differences in mRNA abundance ratios between all of its MLRS abundance ratios because their probe-set nucleotide sequences are so similar to each other that they hardly disturb the true mRNA abundance ratios. However, this may not be the case of GO terms, which are made up of gene-members, which possess much more diverse nucleotide sequences, which tend to numerically over-shadow the mRNA abundance ratios because of their much more disturbing effects, which their much more diverse nucleotide sequences may have on the intensity measurements, on which any numerical data analysis method must depend.

I remember reading in some papers that the current accuracy of RNA Seq nucleotide sequences is only between 95% and 99%. This implies that RNA Seq can be expected to report a wrong nucleotide at least one time for every 100 bases. The Affymetrix Yeast 2 microarray chip has about 5,117 probe-sets for the baker’s yeast. To make it simple let’s assume that each yeast gene is 1,000 base pairs long on average. For the approximately 5,000 yeast genes with an average assumed length of 1,000 base pairs yields 5,000 genes times 1,000 bases = 5,000,000 protein-coding nucleotides. Assuming that my sequencer only gets one out of 100 bases wrong, I’d have 5,000,000 / 100 =50,000 errors. In my opinion this is way too much when considering the following analysis

Let’s imagine I am assigned the task to figure out why one programming code is running whereas the other, which is supposed to be the same is not running at all. Let’s assume both programming codes consist of exactly 5,000,000 coding characters. Let’s further assume that the only method, by which I can compare both codes to find the bug can only be viewed side by side if I make a copy of the existing code because it cannot be directly observed by any method. Lets further assume that when making a copy of the running and defective code in order to view and compare them side-by-side to fix the bug, one out of every 100 code characters randomly changes into another code character without any option for tracing, which of the coding characters has changed. I chose this example because it closely resembles the lowest possible RNA sequencing error rate of about 1%. I chose the constraint of the running code to be completely inaccessible and that the only way to view it is to make a copy of it but that copying it randomly chances one out of 100 characters of the programming code, which consists of exactly 5,000,000 string type characters because our only option to get an idea of what the truly existing DNA sequence may look like, we must amplify it by making lots of copies of it, but for every 100 bases we copy, one changes randomly, but we have no way of telling which one.

Now comes the scary part, which confuses me because I have not met anyone, who seems to worry about the implications and conclusions of this little mathematical example and its analogy to programming code.

Let’s say that I can only get paid for my work if I can find and fix the bug in the no longer running code, I must make a copy of the functioning and dysfunctional code, both of which consists of exactly 5,000,000 characters and that the lowest possible error rate, which I can hope for, when copying the functional and dysfunctional programming code to be viewed side by side is at least 1%, then I have at least 50,000 errors caused by random coding character substitutions after I made a copy of my functional and dysfunctional code. This implies that I am confronted with the ridiculous task to find and correct the mistake, which prevents the dysfunctional code from running properly, considering that I am fully aware that at least 100,000 random errors must have occurred by random character substitution in the process of making copies of both, the functional and dysfunctional programming code?

Can you imagine anyone, who would accept a $10,000,000 contract to find the error in the dysfunctional code after at least 100,000 errors were induced by random coding character substitution, which has already taken place before even being given the chance to look at the programming code? Anyone, who is not insane, or who strives when being challenged by very complex problems, for which the chances to find the right solution is much slimmer than winning $10,000,000 in a lottery. I’d play the lottery for a chance of winning $10,000,000 because it’s not much work but I would refuse to find the bug after 100,000 errors were randomly introduces, even if someone would offer to pay me $100,000,000 if I could fix the bug in the dysfunctional programming code. But this is exactly what we are doing according to my understanding of RNA Seq. Lets imagine that instead of being assigned the task to fix the bug in the code, I am assigned to determine the difference in the nucleotide sequences, which causes a dysfunctional individual yeast cell to die early, whereas another long-lived mutant is thriving on a 30 times longer lifespan. Can you imagine anyone to take up this challenge? Apparently, most bioinformaticists do exactly that when drawing conclusions between cause and effect from large genomic datasets. However, if my little analogy above is not flawed, how confident could one feel about any conclusions drawn from data given the constraints described above?

I wish I’d know RNA Seq. because I could better understand the reasons for the analogy described above and I’d have much better chances to find a real job, because almost everyone expects experiences in analyzing RNA Seq. datasets, but yet, there is no real course, which teaches how to do such kind of analysis properly. I learned microarray analysis by going through countless iterations of trial and error. The same seems to apply for analyzing time series data, promoters and for whatever job I might be able to get because many of the tasks listed in the job-description I have done yet. But I also found that when learning from others microarray analysis instead of going through all the trouble of reading to find out on my own, everyone seems to have a slightly different way of analyzing genomic data. This could cause different results and different conclusions drawn based on such kind of differences in analytical methods. But isn’t the objective of science to discover differences, which are not due to variations in the analytical techniques?

I have been struggling through lots of iterations of trial and error since I started learning about analyzing time series datasets because the very first paper, which I was assigned in August 2016, was published in 2015 by Janssen et al claiming that aging in yeast is caused by the divergence between transcriptome and proteome as the yeast ages. It came with a time series dataset consisting of 12 time points connected by 11 slopes for 4,902 genes at about 1,500 proteins. For many months I was hunting a phantom because I thought I could find the right method of clustering my time series plot trajectories correctly according to their similarities to each other. That was much easier said than done because I could not decide on any method of determining trajectory similarities if I were given the task to find a way to rank them according to the similarities of the gene expression pattern. My very first approach was to misuse GEDI (The Gene Expression Dynamics Inspector) (see https://apps.childrenshospital.org/clinical/research/ingber/GEDI/gedihome.htm) as time series trajectory clustering tool without understanding the exact steps, GEDI was taking to determine similarities between the trajectories of the uploaded time series datasets. For example, I was wondering why GEDI kept suggesting much higher similarities between fission yeast genes than baker’s yeast gene considering that no fission transcripts have ever been measured or reported for all the baker’s yeast datasets, which I have analyzed so far. It took me some time to realize that any group of genes, which share at least to adjacent data points of no expression must be excluded for the analysis because when considering that fission yeast mRNA was never detected or reported at any time of my analysis of baker’s yeast mRNA, then it would be reasonable to assume a perfect correlation between the time series trajectories of all the fission yeast gene because ideally their trajectories would resemble even higher correlations than can be expected for biological or technological replicates because 0 should be exactly equal to 0, but this obvious seeming assumption could not be confirmed by my analysis yet.

What makes me seriously wonder is the unexpectedly huge range between no gene expression when can be as low as 3 log transformed 2 but as high as 8 log transformed 2. The log transformed 2 calculation results in a fold change of two between consecutive integer values. So if my log transformed 2 values increase by one, let’s say from 3 to 4, then this implies that my gene expression has doubled. Increasing from 3 to 5 would have quadrupled it, from 3 to 6 would have raised it by a factor of 8, from 2 to 7 would be equivalent to a 16-fold fold change (FC) and 3 to 8 would still mean absolutely no gene expression despite the unexpectedly high variation between instances of no gene expression, which seem to range from no variation up to a 32 FC variation between all fission yeast data points, which share with one another that no gene expression was ever observed, recorded or measured for them. What could caused such a high variation ranging from no FC up to an FC of 32 between all fission yeast genes, which must be considered as being more than 5,000 (because the fission yeast has a little more than 5,000 genes) very diverse instances and yet describe the same conditions, i.e. absolutely no fission yeast gene expression whatsoever.? But what in the world could cause 32 times higher gene expression values to mean exactly the same thing as 1/32nd of the maximum still zero expression indicating numerical values below the minimum mRNA detection limit of about 8 log transformed 2? Answer: INTENSITY DIFFERENCES DUE TO VARIATIONS IN PROBE-SET ID NUCLEOTIDE SEQUENCES!!!! I finally figured it out! I was astonished because even so there cannot be any differences in fission yeast mRNA abundances, the unexpectedly wide variance in FC by at least a factor of 32 must then be caused by differences in probe-set ID sequences because we could safely rule out the possibility that differences in mRNA concentration could have possibly contributed to the unexpectedly high FC of at least a factor of 32. But if differences in probe-set nucleotide sequences could change the measurements by a factor of 32, then we must clearly expects to face the risk that FC below 32 in mRNA abundance could be over-shadowed by the noise in our data, which is not related to mRNA FCs, in which we are interested, due to the unexpectedly high differences by which different nucleotide sequences can affect the measured intensities.

But this could also imply that our numerical values can only resemble the objective truth very vaguely because a transformed 5 log transformed 2 could imply implies that the true mRNA concentration could be up to 4 times less than our measurements would indicate or even 8 times more than our measurements would make us believe. But if differences in the probe-set ID sequences, can impact intensity in such unexpectedly powerful profound ways, how can we ever succeed in figuring out, which portion of FC is not due to difference probe-set nucleotide sequences, but instead reflect true transcriptional FCs, in which we are actually interested?

From this train of thought I conclude the concerns of data scientists, who advise against numerically calculation ratios between different genes from the same dataset, are apparently very real and legitimate, because I would not trust any data, from which I know is so noisy that it can disguise true mRNA FC of up to a factor of 32. Given that, who would still feel confident about being able to legitimately draw scientifically sound, robust, reproducible and valid conclusion from such noisy data? But now I have an idea how this problem could be easily overcome!

It’s so simple that I am worried about missing some important aspect. We could simply make a microarray analysis without ever adding baker’s yeast mRNA to hybridize with their complementary probe-set ID samples. Then we’d measure the intensities for each never expressed probe-set ID. This would allow us to score and rank the impact, which differences in nucleotide sequences, may have on the measured microarray intensities.

Let’s say that the measured expression for Osh6 is 8 times higher than for Osh5, despite the fact that neither of both genes has ever been expressed, we could safely conclude that the measured intensities, due to differences between the nucleotide sequences between Osh6 and Osh5, induce 8 times more noise for Osh6 than for Osh5. Lets further assume that this factor of an 8 times FC changes a lot when analyzing real Osh6 and Osh5 mRNA, knowing that the probe-set ID sequence of Osh6 increases the measured intensity 8 times more than 5 Osh5, we could correct for this now exactly numerically quantified the 8 times stronger affects, which the sequence of the Osh6 probe-set ID has on the measured intensities, and deviation in FC between Osh5 and Osh6, which is not equal to 8, must be caused by differences in mRNA abundances because we have already succeeded in quantifying and accounting for the 8 times higher noise in Osh6 verses Osh5 probe-set ID nucleotide sequences, which allows us to confidently attribute any FC between Osh5 and Osh6, which is not equal to 8, to be caused by differences in mRNA concentrations but not by nucleotide sequence-specific effects on the measured intensities since we have already accounted for them by measuring the intensities for all of the baker’s yeast probe-sets without adding any mRNA.

Is this kind of background noise canceling procedure already commonly implemented in standard normalization procedures, such as Min-Max normalization, quantile-quantile normalization, row-sum normalization, RMA normalization, etc. without me being aware about it because so far I got by with selecting the same normalization methods are my peers to replicate their results without understanding every single step of the different kinds of normalization methods, which I decided to use because much better students than me told that they used them for the same reason as me, somebody told them without understanding why?

More Thomas Hahn's questions See All
Similar questions and discussions