How can I use similarities between the trajectories of gene expression time series plots to infer the same similarities between gene functions as if I had used TFBS distribution information or gene expression distance matrices derived from it? How are similarities between mRNA gene expression patterns properly determined? What is the best way to predict or infer still-unknown gene functions based on similarity or correlation measures between time series plot trajectories?
TFBS stands for Transcription Factor Binding Sites, which are promoter-specific for each gene. A promoter controls gene expression. Gene expression data measure the mRNA concentration at each time point. The mRNA changes over time can be defined or inferred by two methods, which convey the same information along 2 different dimensions:
1.) The time series plot trajectories use the Y value for each time point to infer how much mRNA was present at that time.
2.) The TFBSs determine the level of transcription because it is regulated by the Transcription Factors (TFs), which bind to the TFBSs.
So actually the TFBSs are the cause and the trajectory of the time series plot is the consequence. Both refer to the same event, but they describe and quantify it by completely different means / dimensions.
TFBS distributions must be defined by the nucleotide sequence (A, T, C, G) to which a particular TF (mostly a protein) binds, and by spatial information giving the distances between binding sites, measured in the number of nucleotides separating them. Some TFBSs overlap one another, though. The exact TFBS distribution is probably still a subject of debate for many promoters.
Some extremely smart genomic scientists must have agreed on a method to rank the similarities of gene expression patterns (i.e. transcriptional similarities) by converting the TFBS distributions of promoters into distance matrices, following a complex algorithm I cannot understand. However, it allows one to put the genes in an order such that their ranking reflects the similarity of the impact that the different promoter-specific TFBS distributions have on transcription. I refer to this order as the true similarity between genes.
Now, since the same information is encoded in the trajectories of the time series plots of transcription levels, I was trying to find a way to rank those trajectories so that the relative similarity order between genes is the same as if I had used TFBS distribution information.
If I had succeeded, I could have used the trajectories of the time series plots, which are much easier to work with, to predict the same kind of similarities between genes as if I had used TFBS distribution information or distance matrices derived from TFBS information.
There was an R package that claimed to be able to achieve this, but despite trying for a long time I could not get it to work. Are you good at R and Python? These are the only two programming languages I know.
The task my adviser gave me as part of my dissertation was to predict the function of genes whose function we don't know yet, using similarities between the trajectories of time series plots. This rests on the assumption that the more similar the expression patterns of two genes are, i.e. the higher the correlation between their expression patterns, the more similar the functions of those genes are.
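To make that assumption concrete, here is a minimal sketch of the kind of prediction I mean: the unknown gene simply inherits the function of the annotated gene whose expression profile correlates best with it. Everything in it is an illustration only. The gene names, expression values, function labels and the helper `predict_function` are made up, and Pearson correlation is just one of the possible similarity measures.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def predict_function(unknown_profile, annotated):
    """Return (gene, function, r) for the annotated gene that correlates
    best with the unknown profile (hypothetical helper)."""
    best = max(annotated, key=lambda g: pearson(unknown_profile, annotated[g][0]))
    profile, function = annotated[best]
    return best, function, pearson(unknown_profile, profile)

# Made-up example data: gene -> (expression at 5 time points, known function).
annotated = {
    "gene1": ([1, 2, 3, 4, 5], "cell cycle"),
    "gene2": ([5, 4, 3, 2, 1], "stress response"),
    "gene3": ([1, 3, 1, 3, 1], "metabolism"),
}
unknown = [2, 4, 5, 8, 10]  # profile of a gene with unknown function

gene, function, r = predict_function(unknown, annotated)
print(gene, function, round(r, 3))
```

In a real analysis one would probably look at several nearest neighbours rather than a single one, but the principle stays the same.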
My problem is that there are so many different mathematical methods to determine, compare and rank the similarities and to calculate the correlations between the trajectories of time series plots, each of which yields a different relative similarity order between genes. For a long time I was trying to figure out which way of calculating similarity would be best, until I realized that there is no inherently right or wrong, or better or worse, way to calculate the similarities / correlations between time series plot trajectories. Instead, the relative order of similarity rankings between genes must be the same as if these genes were ranked based on their true relative similarities, which I am supposed to infer from the trajectories of the time series plots. Since TFBS distributions are the most direct measure of expression patterns, the sorting method I should use is whichever one puts my time series plots into the same relative order that I would have gotten if I had based my similarity sorting on TFBS information.
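As far as I understand it, this selection step could be sketched as follows: compute the pairwise gene distances with each candidate measure, then check how well their rank order agrees with the rank order of the reference (TFBS-based) distances, for example with Spearman's rank correlation. This is only a sketch under my own assumptions. The expression profiles and the reference distances are invented numbers, not real TFBS-derived values, and the two candidate measures are just examples.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def correlation_distance(x, y):
    return 1 - pearson(x, y)

def euclidean_distance(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(profiles, dist):
    """Distances for all gene pairs, in a fixed (sorted) pair order."""
    return [dist(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)]

# Made-up expression profiles (one value per time point).
expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [1, 3, 2, 4, 3],
}
# Hypothetical reference distances for the pairs
# (A,B), (A,C), (A,D), (B,C), (B,D), (C,D); NOT real TFBS data.
reference = [0.1, 0.9, 0.4, 0.9, 0.4, 0.7]

candidates = {"correlation": correlation_distance,
              "euclidean": euclidean_distance}
scores = {name: spearman(pairwise(expr, dist), reference)
          for name, dist in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best agreement:", best)
```

The measure with the highest rank agreement would then be the one to use on genes for which no TFBS-based distances are available.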
This is how far I was able to follow the train of thought of my adviser. What I could not figure out is how to actually determine the similarities between temporal gene expression patterns, because again there are many ways. One can use Pearson correlation, Dynamic Time Warping, phase shift, periodicity parameters and many other parameters by which certain types of trajectory properties can be described, compared and ranked. Now the problem is that the expression patterns of some genes are considered periodic whereas others are not. So if I used the period length, I would suddenly have N/A values for all the genes that lack a periodic expression pattern. That is where I got stuck and why I could not include any of this in my dissertation. Nobody could explain to me how to measure, compare and rank time series plot trajectories.
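For reference, this is roughly what I understand by a shape-based measure such as Dynamic Time Warping: it compares the whole trajectory, so it gives a value for every pair of genes, periodic or not, and no N/A appears. The code is a minimal textbook-style sketch, not the implementation of any particular package. The function names and the three example profiles are my own assumptions.

```python
from math import sqrt

def zscore(profile):
    """Standardize a profile so that shape, not amplitude, is compared."""
    n = len(profile)
    mean = sum(profile) / n
    sd = sqrt(sum((v - mean) ** 2 for v in profile) / n)
    return [(v - mean) / sd for v in profile]

def dtw_distance(x, y):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of x
    # with the first j points of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch y
                                    cost[i][j - 1],      # stretch x
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def pointwise_distance(x, y):
    """Sum of absolute differences without any warping, for comparison."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Made-up profiles: a periodic gene, the same pattern shifted by one
# time point, and a non-periodic, steadily increasing gene.
periodic = [0, 1, 0, -1, 0, 1, 0, -1]
shifted = [1, 0, -1, 0, 1, 0, -1, 0]
rising = [0, 1, 2, 3, 4, 5, 6, 7]

print("periodic vs shifted, pointwise:", pointwise_distance(periodic, shifted))
print("periodic vs shifted, DTW:      ", dtw_distance(periodic, shifted))
print("periodic vs rising,  DTW (z):  ",
      dtw_distance(zscore(periodic), zscore(rising)))
```

The phase-shifted copy ends up much closer under DTW than under the pointwise comparison, while the non-periodic gene still gets an ordinary numeric distance. Period length could then be kept as an optional extra descriptor for the periodic genes only, instead of being the basis of the ranking.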
Do you have any idea? Does this task even make sense? This is how far I understand it, but I cannot figure out how to solve it in such a way that people would be happy with my solution. Please help if you can.
Please, somebody help fast with a detailed explanation, because if I can figure this out within the coming week I can still include it in my dissertation. If I can, I have a realistic chance to graduate this fall semester. I must either graduate or die, because my GA funding did not get extended into this academic year, leaving me to starve because I have no income and don't know where to go.