The key to the keys to immortality and eternal youth lies in correctly answering one main question: how can we naively discover new, essential, but still hidden features required for properly training novel adaptive supervised machine learning algorithms (SMLA)?

Chapter 1: Introduction to feature discovery for training supervised machine learning algorithms, also referred to as Artificial Intelligence (AI)

Feature discovery and selection for training supervised machine learning algorithms: an analogy to a two-story building

Imagine a building named “Aging”. It consists of two stories: the ground floor, called “feature selection”, and the second floor, called “developing, optimizing and training the machine learning algorithm”.

Before any algorithm can be trained properly, feature selection must be complete and correct. Otherwise, the machine learning algorithm may learn irrelevant patterns caused by the ambiguity that missing features introduce. Little time and effort should be invested in optimizing, training and improving the algorithm until all features are properly selected. As long as feature selection is incomplete, one must focus on finding the missing features instead of tuning the algorithm.

In other words, using our building analogy, here is the most important advice: do not try to complete and perfect the second floor, called training, tuning and optimizing the machine learning algorithm, before you are certain that the ground floor, i.e. feature selection, has been fully and properly completed. If it has not, one must focus on discovering the missing input features first.

A lot of research has been dedicated to perfecting algorithms before completing feature selection. As a result, our algorithms have gradually improved, whereas our feature selection has not.

=========================================

How can missing features be discovered?

If the algorithm cannot be trained to make perfect predictions, this indicates that essential input features are still missing. When the predicted values fail to match the observed measurements despite tuning the algorithm, feature selection is not yet complete. This is the case when the error between predicted and observed values approaches an asymptote that is not equal to 0. The prediction error is most likely caused by a still hidden object. This hidden object is the cause of the error. We cannot see the hidden cause yet, but we can see its consequence, i.e. the error. And since every consequence must have a cause, we must start looking for it.
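To make this error asymptote concrete, here is a minimal sketch in Python (synthetic data, hypothetical feature names, not taken from any real experiment): a linear model fitted without an essential feature plateaus at a nonzero error no matter how it is tuned, while the same model with the formerly hidden feature included drives the error toward zero.

    # Minimal sketch with synthetic data: a hidden feature causes an error floor.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    visible = rng.normal(size=n)      # feature we have already selected
    hidden = rng.normal(size=n)       # essential feature still missing
    y = 2.0 * visible + 3.0 * hidden  # ground truth depends on both

    def fit_and_rmse(features, target):
        """Least-squares fit; returns the root-mean-square error on the training data."""
        X = np.column_stack([features, np.ones(len(target))])  # add intercept
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        return float(np.sqrt(np.mean((X @ coef - target) ** 2)))

    print("RMSE without the hidden feature:", fit_and_rmse(visible[:, None], y))
    print("RMSE with the hidden feature:   ", fit_and_rmse(np.column_stack([visible, hidden]), y))
    # No amount of tuning the first model removes its residual error;
    # only discovering the hidden feature does.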

Let’s take protein folding prediction as an example. Reportedly, only about 30% of the predicted folding patterns are correct. We must then go back to the last step at which our prediction still matched reality. As soon as we get deviations, we must scrutinize the deviating object, because most likely it is not a single object but 2 or more objects, which still look so similar to us that we cannot yet distinguish between them. In order for an object to no longer remain hidden, it must have at least one feature by which it differs from its background.

Let’s take air as an example. Air in front of a background of air still looks like nothing. Even though air is a legitimate object, it remains hidden as long as the surrounding background, to which we compare it, is also air. Air has no feature that could distinguish it from background air. Legitimate real objects, which lack any feature by which they could be distinguished from their background, are called imperatively hidden objects, because no kind of measurement can help to distinguish them as an object, i.e. as something other than their background. If objects like air, a uniform magnetic field or a gravitational force are omnipresent and uniformly distributed, they remain imperatively hidden objects, because we have no phase that is not the object and that would allow us to distinguish it as an object. An object in front of a background of the same object remains an imperatively hidden object unless we find an instance that varies from the object, maybe in strength, because we need something to compare it with in order to identify a difference.

The only way an omnipresent, uniform, hidden object can be discovered is if there is some variation in strength or an instance of non-object. Otherwise it remains imperatively hidden, because it cannot be distinguished from itself. Therefore, in order to uncover imperatively hidden objects, we must intentionally induce variation in the surrounding environment, in our methods and in our measurements until we find a feature by which the object can be distinguished from its environment and/or from other objects.

Wind can advance our conceptual understanding that air is not the same as its background, because wind causes a variation in resistance by which air slows down our motion; yet air still looks like air. We know, however, that air consists of at least 4 very different groups of objects: roughly 78% nitrogen, 21% oxygen, about 1% argon and trace gases such as carbon dioxide. These 4 objects are no longer imperatively hidden, but they still appear like a single object. When trying to predict molecular behavior, we will get errors, because what we think of as one object is actually at least 4. By looking at air, sonar sounding it, irradiating it, magnetizing it or shining light on it, we cannot distinguish the four objects from one another. But if we start cooling air down gradually, we can suddenly distinguish them from one another by their different condensation temperatures.

Looking, sonar sounding, irradiating, magnetizing, shining light and changing temperature are all variations of measuring methods. When air is cooled below about -78 °C, the feature “aggregate state” of carbon dioxide becomes solid, which differs from the remaining, still gaseous objects; cooling further, oxygen condenses at about -183 °C, argon at about -186 °C and nitrogen at about -196 °C. In each case we can conclude that the condensed component must be a separate object from the still gaseous rest. Thus, we have found a feature by which it differs from the rest. Hence, we removed the imperatively hidden nature by changing an environmental condition, i.e. temperature, until one component differed in one of its features from the rest. No data could have told us in advance that gradually lowering the temperature would expose features of difference between the objects. That is why we must become creative in finding ways to vary environmental conditions and measurement methods until hidden objects differ from one another, or from their environment, in at least one feature that we can measure.

=========================================

Why are data and numbers only crutches?

Data and numbers are only crutches, on which most of us depend far too much when determining our next steps. We tend to refuse to explore new variations unless data points us to them. But the naïve discoverer of imperatively hidden objects has no data or information from which to infer that changing the temperature would expose a distinguishing feature, whereas shining light, irradiating, sonar sounding, fanning and looking would not. The new feature hunter must simply use trial and error. He must follow his intuition, because directing data are lacking. It will not help him to perfect the resolution of his measuring techniques as long as he has not changed the environmental conditions such that objects differ in at least one feature from one another and/or from their surrounding environment. In such cases, heuristic search options are lacking. There is no method that tells us how and what to vary in order to expose new features for novel distinctions between formerly hidden objects.

===========================================

What are the great benefits of highly hypothetical and speculative assumptions?

It is not justified to refuse to consider even the most hypothetical and speculative theory or assumption for testing, because what is the alternative? The alternative is not to vary features, and without variation no improvement in feature selection is possible. Even a 0.00001% chance that the most hypothetical assumption changes the conditions such that a distinguishing feature gets exposed is better than nothing. “Hypothetical” and “speculative” are therefore good rather than bad attributes, because they imply changes in feature selection, which is always better than keeping the status quo of the selected training features. That is why even highly hypothetical assumptions should not be frowned upon; instead, they should be tested very seriously. Even if most of them eventually get disproven, this means progress, because disproven hypothetical assumptions and causal relationships without distinctive features get eliminated, until eventually only those variations remain that cause distinctive features to emerge.

Instead of being punished, researchers and students should be encouraged and rewarded for developing and testing speculative, hypothetical assumptions, because these require inducing variations that have the potential to expose distinguishing object features. If data-driven approaches direct our attention to varying specific conditions in certain directions, well and good. But if not, we must not stop varying conditions just because no data can give us directions.

Numbers and calculations are only one of many tools for uncovering imperatively hidden objects. They tend to work well when they are available. But that does not mean we should refuse to exploit alternative discovery methods, such as intuition, imagination, visions, dreams, trends, internal visualizations, analogies and other irrationally rejected feature-discovery methods, which are completely independent of numbers. We should explore these non-data-driven, numerically independent methods for choosing the direction of environmental change at least as seriously as the numeric ones. Otherwise, we unnecessarily deprive ourselves of the chance to keep making progress in feature selection in the absence of numerical data. Of course, intuition and data should not contradict one another. But no option should be discarded because it is erroneously believed to be “too hypothetical or speculative”.

For example, in order to turn the magnetic field from a hidden into an observable object, it takes a lot of trial-and-error variation of the kind described above. One must have magnets and iron objects before one can observe the consequences of the still hidden magnetic-field cause. Once we have the consequences, we can use feature and measurement variation methods analogous to those outlined above to hunt for the still hidden cause.

Let’s apply these variation methods to protein folding prediction. If our prediction accuracy is only around 30%, we must scrutinize the product, because most likely it is not one object but several different objects that still look the same to us.

Apparently, although they all look like the same protein, they cannot be the same objects, because they differ in one very significant function-determining feature, i.e. their overall three-dimensional folding. This makes them imperatively different objects. Objects are considered imperatively different when it has become impossible to devise a method of measurement or distinction that could erroneously still mistake them for a single object.

In the case of proteins, we unnecessarily limit ourselves to the dimension “protein”, because the low folding-prediction accuracy actually implies that, despite sharing the same primary amino acid sequence, they must be considered different versions of a protein, which differ from one another in the feature “three-dimensional folding”. If objects differ in at least one feature, they must no longer be considered the same, but distinctly different, objects.

As we know, protein folding affects protein function. So what could have been the evolutionary advantage that caused proteins with different folding patterns to evolve? Here we have no data. All we have is our intuition and insight, which work surprisingly well, if we stop refusing to develop and apply this insight-based method readily and confidently, and stop viewing it as inferior, less valuable and less reliable than data-driven prediction. If I can tell a difference between two objects, they can no longer be the same, but must be counted as two different objects, which should no longer be considered as one. For example, a blue and a red die are two different objects of the kind “dice”, but they can never be the same object when observed by the human eye. They are then inferred to be imperatively different objects. The same applies even more strongly to proteins with different folding shapes, because they differ not only in the feature “3-D folding” but also in the feature “function”. Hence, they can no longer be considered one and the same kind.

As seems to be the case for all initially hidden objects, we observe the consequence (the folding difference) before its still hidden cause. To find the cause, there must be a hidden object or transition during translation that makes identical mRNA molecules fold up in different ways after translation. Where could this be useful?

For example, too high a concentration of geronto-proteins shortens lifespan, but too low a concentration of the same geronto-gene products could interfere with maintaining life-essential functions. That is why their concentration must remain within a very narrow window, too narrow to be maintained by transcriptional regulation alone. There are too many variables that can affect how much protein gets translated from a given mRNA concentration. Hence, instead of a front-end mechanism (i.e. the transcriptome), we need a back-end mechanism, i.e. protein-folding-dependent functional adjustment. Such a much more sensitive and much more autonomously functioning mechanism for adjusting enzymatic reaction speed could work like this:

If the substrate concentration is high, the nascent protein can bind a substrate molecule in its active site even before it has detached from the ribosome. While it is still being translated, however, no activator can bind to the protein’s allosteric binding site. This time is long enough for the protein-substrate complex to behave thermodynamically like one molecule and fold to reach its lowest energetic state. After the protein detaches from the ribosome, the co-factor can bind and the protein cleaves its substrate, but it remains locked in the same folding state it was in while it still formed a molecular folding unit with its substrate.

If the protein concentration becomes toxically high, the yeast wants to turn off all the mRNA coding for the protein that is about to rise to toxic levels. Degrading mRNA takes too long. It is much easier to make the excess toxic proteins fold into an enzymatically inactive state. This can easily be achieved, because enzymatic hyper-function lowers the substrate concentration. The nascent protein then fails to bind a substrate while it is still being translated. The missing substrate gives this protein a different lowest-energy state, and accordingly it folds differently than it would have if it had bound its substrate. But this is an enzymatically non-functional folding arrangement. The co-factor cannot bind to its allosteric site. Thus, the upward trend of the toxically rising protein is already being reversed. To make this directional concentration change even stronger, the enzymatically inactive folding state allows a repressor co-factor to bind at its other allosteric binding site. This causes the protein to change its conformational shape again, so that it can use its now exposed DNA-binding domain to bind to the promoter of exactly the gene that codes for it, thus inhibiting its further transcription.

As you can see, I have not used any data, only intuition, to develop a hypothesis that can be tested. And no matter how hypothetical, and hence unlikely, this crazy and highly speculative hypothesis may seem, in order to test it I must change the environment and the measurement methods in many ways. This increases my odds of discovering, by chance, an environmental condition and measurement method pair that allows a totally unexpected distinguishing feature to emerge. That is exactly what we want.

Different scientists have different gifts, which are great in some situations but completely worthless in others. Most researchers feel much more comfortable employing data-driven, numerically reproducible analytical methods. However, a few, like me, enjoy intuition-based prediction built on storytelling, imagination and vision-like scenarios, which is the best option we have for any domain in which we cannot yet generate numerically reproducible data.

But since we tend to favor numerical over intuition-based prediction methods, dimensions along which we can distinguish objects qualitatively, but not yet quantitatively, remain underexplored or even completely ignored, because no researcher dares to admit that his conclusions are not based on numbers.

We tend to have the misconception that numbers cannot lie. Nothing could be further from the truth! Numerical methods, if employed under pressure to produce results, can be deceiving.

======================================

Chapter 2: What is the ideal life cycle of a new machine learning algorithm?

But I want to talk about something more important. I want to answer the question: what is the ideal life cycle of developing, training, tuning and optimizing a machine learning algorithm?

There is always a need for better algorithms. As we discover more relevant features, following the methodology described in the previous chapter, we indeed need better and more comprehensive algorithms to account for them. So we will use trial and error, and hopefully also some intuition and parameter tuning, to improve our F-score. Eventually we will again approach an error asymptote greater than 0. But even if we achieve perfect prediction, this should not be our final objective; it is only a means to an end, namely unraveling still hidden objects. Our work is not done when we have reached perfect prediction, although perfect prediction implies proper feature selection. We are never satisfied. As soon as we have the ideal machine learning solution, we want to create conditions that will cause our tool to fail. Why? We are interested in forcing our algorithm to fail because we want to explore situations in which its assumptions are no longer met. For such situations, we will need more or different essential features, which must account for new circumstances, conceptual innovations and changes of perspective adequate for addressing a more complex situation that had not been considered before.

For example, when I started my dissertation I thought that there were only 3 kinds of aging-regulating genes:

  • Life-span-extending genes (i.e. aging suppressors)
  • Life-span-shortening genes (i.e. geronto-genes)
  • Genes that do not affect lifespan.
Dr. Matt Kaeberlein’s lab kindly gave me lifespan data for most of the possible gene knockout mutants. Caloric restriction (CR) extended lifespan in the wild type (WT), but shortened it in the Erg6 and Atg15 knockouts. The generalization that CR is a lifespan-extending intervention suddenly no longer held true for these two knockouts. Tor1 and Sch9 knockouts lived about as long as WT under CR. Hence, on normal 2% glucose media (YEPD) they function like aging-suppressor genes, but under CR they function like non-aging genes. This would inevitably cause any machine learning algorithm that assumes only that an intervention can shorten, lengthen or not change lifespan to fail, if the genotype feature is not given as part of the training data too. This makes genotype and intervention an imperative pair, whose members must not be considered in isolation when training a more predictive algorithm.

Let’s say I train my algorithm only on WT data to classify interventions into 3 broad categories: lifespan-extending, lifespan-shortening or lifespan-neutral. Then CR would always extend lifespan. But if I apply CR to the Atg15 knockout instead of WT, its lifespan shortens under CR. Our algorithm would fail because it was not trained on knockout data. This kind of failure is not at all a bad thing, but a blessing in disguise, because it teaches us that apart from the feature intervention there is also the feature genotype, which affects lifespan and which must be considered together with the intervention like an indivisible unit-pair of atomic data, whose components must never be evaluated in isolation. We could only notice this because our AI, trained only on WT data, imperatively failed to predict the impact of CR on Atg15 knockouts. From then onwards we know that, for correct prediction, genotype and intervention must be given together as a pair to train our artificial intelligence (AI). This allows us to establish that, apart from intervention, genotype is another essential feature for correctly predicting lifespan. So far, we have only trained our AI on glucose media. Since this feature was the same for all training samples, it was not yet essential, because it could only take on a single value. But when testing on galactose, tryptophan-deficient or methionine-deficient media, our algorithm will imperatively fail again, because now we need to consider a triplet as one piece of information: intervention, genotype and media. Only if we train our AI on indivisible triplet units can it succeed. We have just shown how intentionally creating variation in the conditions can reveal new hidden objects, but only when a naively perfectly working AI suddenly starts failing. Without naïve AIs having failed, we could never have discovered this new feature. Hence, causing perfectly scoring AIs to fail is a very good method of choice for discovering new features.
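Here is a minimal sketch of that failure mode in Python, using hypothetical toy records rather than the actual Kaeberlein measurements: a lookup-style “AI” trained only on WT interventions memorizes that CR extends lifespan and is wrong for the Atg15 knockout, whereas a model trained on the indivisible (intervention, genotype) pair predicts correctly.

    # Minimal sketch with hypothetical toy data: intervention alone vs. the
    # indivisible (intervention, genotype) pair as the training feature.
    from collections import defaultdict

    # ((intervention, genotype), observed lifespan effect)
    training_data = [
        (("CR",   "WT"),    "extends"),
        (("YEPD", "WT"),    "no_change"),
        (("CR",   "atg15"), "shortens"),   # CR shortens lifespan in this knockout
        (("CR",   "erg6"),  "shortens"),
    ]

    def train(samples, feature_fn):
        """Majority-vote lookup model over whatever feature representation we choose."""
        votes = defaultdict(lambda: defaultdict(int))
        for (intervention, genotype), effect in samples:
            votes[feature_fn(intervention, genotype)][effect] += 1
        return {key: max(counts, key=counts.get) for key, counts in votes.items()}

    wt_only_model = train(training_data[:2], lambda i, g: i)   # sees only WT data
    pair_model    = train(training_data, lambda i, g: (i, g))  # sees (intervention, genotype) pairs

    print(wt_only_model.get("CR"))          # 'extends'  -> wrong for the Atg15 knockout
    print(pair_model.get(("CR", "atg15")))  # 'shortens' -> correct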

However, if what I have been writing so far is all true, how come I cannot remember a single peer-reviewed paper discussing these issues from a similar perspective? Since I assumed that all these other sighted people are much better at conducting research than I could ever be, I assumed for a long time that there must be something wrong with my way of thinking; because if it were a beneficial innovation, somebody long before me would have discovered it, especially since it was so much easier to write down than all the other number-crunching analytical tasks, for which I had to cite methods properly. But to understand the importance of feature selection and feature discovery for our conceptual imagination, I did not have to read a single paper, because it all follows logically, like a movie or story in which the chain of events leading to new feature discoveries is so intuitive that it feels absolutely obvious and self-evident. I can imagine being a yeast cell getting old under stress and trying to adapt to prevent death: what would I need to do to survive?

For the protein folding prediction, I could make up plenty of regulatory scenarios, like the one a few paragraphs above, which could be tested in the wet-lab. For example, we know that the speed of translation depends on the ratios and concentrations of charged tRNAs in the cytoplasm and the endoplasmic reticulum (ER). We also know that three tryptophan codons in a row can cause translation to stop prematurely, because the concentration of tryptophan-charged tRNAs is too low for translation to continue on time. Using our newly derived machine learning (ML) feature-selection toolbox, we would note that we see a consequence, i.e. premature abortion of translation, for which we must now start looking for the still hidden cause. In this particular case, however, the reason for the abortion is not even hidden, because the mRNA nucleotides coding for the 3 tryptophans can clearly and easily be measured and observed. But this tryptophan triplet, i.e. these 3 identical yet still distinct objects, forms a kind of conceptual super-object possessing completely novel properties and features that none of its 3 individual units possesses, even in small part, on its own. This qualitatively novel dimension, which is completely absent from any of its parts, is like a gain-of-function effect: it terminates translation. Hence, these 3 termination-causing tryptophans form a new, shapeless super-object on a whole different level, which cannot be accounted for by simply adding up the properties of the 3 tryptophans. Their mode of action in stopping translation is of a very different nature than complementary codon-based mRNA/tRNA binding during translation. The 3 tryptophans possess a new quality that cannot be attributed to any single tryptophan member.

It is a bit like us humans, who keep adding dimensionless, shapeless and matter-independent virtual features, based on which we distinguish between one another and which may be hard for an AI to grasp. For example, based on our first, middle and last names, SSN, citizenship, job, family role, etc., we make big distinctions between ourselves, which affect lifespan. Unfortunately, an AI could not discover those unless it gains the feature of perceiving spoken and written communication, which is the only way our virtual, self-imposed, physically dimensionless features can be distinguished from one another.

Most people look ahead based on what has been achieved in the past. But I am shifting perspectives, in which I am a master: I am trying to look back on 2018 from the viewpoint of my “history of immortality” presentation of December 25th, 2030, about the many unnecessarily long and hard struggles to meet this life-essential objective. Looking back in hindsight, what could we have done differently that, retrospectively, looks so obvious that it would have given us immortality five years sooner? Often, truly groundbreaking scientific solutions remain unnoticed for many years, because everyone seems to believe that it cannot be so easy, since otherwise somebody else, much smarter, would have discovered it long ago.

=========================================

What makes this dissertation so very important?

Although this part of my dissertation contains no numbers or measurements, I feel it is more important, because it is a conceptual guide for those much more gifted in number crunching and numerical method integration than I could ever be, to refocus their attention on completing new feature selection and discovery first. That way, their subsequent calculations and analyses are built on a much more solid foundation, on which the desired level of prediction can actually be accomplished.

========================================

Why will feature discovery never stop?

I guess we may never stop this cycle of new feature discovery starting over and over again, because as soon as we think we have got it to work, we hope to succeed in creating an exception that causes our newly trained AI to fail, since this allows us to discover yet another new relevant feature.

We started out with the feature “intervention” and discovered “genotype” and “media type”. The next Kaeberlein dataset had additional features like temperature, salinity, mating type, yeast strain, etc., which also affect lifespan. Now, for one knockout, we could have more than 10 different reported lifespans. In my understanding, this makes the concept of a purely aging-suppressing gene or a pure geronto-gene obsolete. It also raises the number of components that must be considered together as an indivisible atomic unit, none of whose parts may be considered in isolation, to already 7 components, which must be given with every supervised training sample for our AI. If this trend keeps growing, then the number of components that form a data-point-like entry grows by one new component for every new feature discovered. But would this not cause our data points to become too clumsy? Even if it does, for every new feature we decide to consider, our indivisible data unit must grow by one component. This means that 10 essential features would create data points of 10 dimensions. Driving this to the extreme, 100 features would give us 100-dimensional data points. But this would almost connect everything we can measure into a single point. It would do away with independent features, because their dimensions would all get linked together. Is there something wrong with my thinking here? I have never heard anybody complain about this kind of problem.
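A minimal sketch of such a growing, indivisible training record, with hypothetical field names: every sample must carry all currently known essential feature components together, and each newly discovered feature simply adds one more mandatory field.

    # Minimal sketch: one indivisible training record whose components must
    # always be supplied together (hypothetical field names and values).
    from dataclasses import dataclass, fields

    @dataclass(frozen=True)
    class LifespanSample:
        intervention: str      # e.g. "CR" or "YEPD"
        genotype: str          # e.g. "WT", "atg15", "erg6"
        media: str             # e.g. "glucose", "galactose"
        temperature_c: float   # discovered later
        salinity: float        # discovered later
        mating_type: str       # discovered later
        strain: str            # discovered later
        lifespan_effect: str   # supervised label: "extends", "shortens", "no_change"

    sample = LifespanSample("CR", "atg15", "glucose", 30.0, 0.0, "MATa", "BY4742", "shortens")
    print(len(fields(LifespanSample)) - 1, "feature components per indivisible data point")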

From this chapter we can conclude that the best AIs are those which fail in a way that allows us to discover a new feature.

==================================

Chapter 3: About the amazingly deceptive power of numerical analysis

When I worked on my master’s thesis, I firmly believed that there must be one particular method that is the right method for analyzing my dataset. However, different papers proposed different clustering methods. This was irritating, because it seemed to me that the normalization and other method choices changed the results. I thought that this must not be. Only during my PhD did I learn that, depending on which question I am asking, I need different numerical methods.
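A minimal sketch in Python of why such choices matter, on synthetic expression profiles with hypothetical gene names: the very same three profiles group differently depending on whether we compare them by raw Euclidean distance or by correlation, so the “right” method really does depend on the question asked.

    # Minimal sketch: metric choice changes which profiles cluster together.
    import numpy as np

    gene_a = np.array([1.0, 2.0, 3.0, 4.0])      # low expression, rising trend
    gene_b = np.array([10.0, 20.0, 30.0, 40.0])  # high expression, same trend
    gene_c = np.array([4.0, 3.0, 2.0, 1.0])      # low expression, opposite trend

    def euclidean(x, y):
        return float(np.linalg.norm(x - y))

    def correlation_distance(x, y):
        return float(1.0 - np.corrcoef(x, y)[0, 1])

    for name, dist in [("Euclidean", euclidean), ("1 - correlation", correlation_distance)]:
        closer = "gene_b" if dist(gene_a, gene_b) < dist(gene_a, gene_c) else "gene_c"
        print(f"{name:>15}: gene_a is closer to {closer}")
    # Euclidean distance pairs gene_a with gene_c (similar magnitude), whereas
    # correlation pairs it with gene_b (same trend).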

My PI gave me a CD with an Affymetrix microarray dataset to analyze, containing the transcriptome for 6 conditions: wild type (WT), the Erg6 knockout and the Atg15 knockout, each under caloric restriction (CR) and on normal media (YEPD).

By training I am a molecular biologist and geneticist. But after my master’s, my eyesight declined too much to keep doing wet-lab work, so I was lucky to be given the opportunity to switch to bioinformatics. But it was hard. When I started, I had no concept of normalization, and for about the first 6 months I did not normalize anything. I tried to do all my analyses in Excel until I found a tutor to teach me R.

Actually, my bioinformatics learning curve resembles the feature-discovery trajectory described earlier. I worked very hard on an R script for gene enrichment analysis. I never really bought into the enrichment concept, because doubling the expression of one rate-limiting factor might be sufficient for a pathway of 50 genes, or it might require all 50 genes to be up-regulated in order to double enzymatic function. Evolution has had need for both extremes. I am just worried that pathways with only one rate-limiting gene will never have their up-regulation detected by gene enrichment methods.

I wrote a fancy gene enrichment R script with pretty colors, interactive data points, flexible fold-change cutoffs, quadrant selection, a threshold for non-differentially expressed genes, and whatever else I could reproduce from what I read. My master’s script automatically produced more than 100 data output files and is probably the fanciest enrichment script ever written in R. It has more than 5,000 lines of code. Unfortunately, no matter how many new features I added to my R script, I kept facing an embarrassing problem: about half of the time its enrichment results were consistent with the literature, and the other half of the time they were not. So I felt that, with my over 5,000 lines of code, I had created an R script for a fancy die, which about half of the time told me what I wanted to hear and the other half did not. This worried me, because all my coding had no more effect than rolling a die or flipping a coin. I tried to add more and more features hoping to remedy this problem, but in vain; the bothersome problem persisted. I was afraid to present, because if people shared my perception that my script had the same effect as flipping a coin, nobody would let me graduate. It did not get any better no matter what I tried. Hence, I started adjusting the fold change, the false discovery rate (FDR), the number of affected genes, the quadrant selection, the differential expression threshold, etc., to maximize the overlap between my script and the results reported in the literature. I was not happy with what I was doing, since I felt I was kind of cheating: whenever I did not like the outcome of my analysis, I changed my analytical methods. For example, some papers list fold changes above 1.5 as differentially expressed and others above 2. I tried both and got totally different results. I was told to define a pathway as up-regulated if at least 60% of its member genes were above the fold-change threshold. But just to see what happened, I changed it to 40% or 50%, and my results were totally different again. I could not trust my methods, until I was one year behind with my master’s and had to present. At the time, Patrick, a German post-doc from Harvard, tutored me. He was aware of my problem and told me that this had even happened to him at Harvard.
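The instability described above can be reproduced with a minimal sketch on synthetic fold changes for one hypothetical pathway of 50 genes: the call “up-regulated or not” flips depending on two arbitrary cutoffs, the fold-change threshold and the required fraction of member genes above it.

    # Minimal sketch: the same synthetic pathway is or is not "enriched"
    # depending on two arbitrary cutoffs.
    import itertools

    # 50 hypothetical member genes: 30 modestly induced, 20 strongly induced
    pathway_fold_changes = [1.6] * 30 + [2.2] * 20

    def pathway_up(fold_changes, fc_cutoff, min_fraction):
        """Call the pathway up-regulated if enough genes exceed the fold-change cutoff."""
        above = sum(fc >= fc_cutoff for fc in fold_changes)
        return above / len(fold_changes) >= min_fraction

    for fc_cutoff, min_fraction in itertools.product([1.5, 2.0], [0.4, 0.5, 0.6]):
        call = pathway_up(pathway_fold_changes, fc_cutoff, min_fraction)
        print(f"fold change >= {fc_cutoff}, fraction >= {min_fraction:.0%}: up-regulated = {call}")
    # With these numbers, a 1.5 cutoff calls the pathway up at every fraction,
    # while a 2.0 cutoff calls it up only at the 40% fraction.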

He told me that the only way to deal with such disturbing problems is to start picking cherries. I had never heard this term before. He explained that everything consistent with the literature gets reported and cited, and everything that is not gets omitted. At first I did not want to do it, because I felt that if everyone worked like that, nothing new could ever be discovered. Unfortunately, I was forced to shift my objective towards graduating and away from discovering something that could bring us closer to immortality.

Since every time I ran my fancy R script it produced more than 100 data files, scatter plots, pathway plots, gene tracking, intervals, fold changes, and up- and down-regulation files for all molecular function (MF), biological process (BP) and cellular component (CC) Gene Ontology (GO) terms, I had plenty of material to search for in the literature. I meticulously cited every instance I could find that was in agreement with my R script analysis. That is why my thesis took me so long. However, I felt I was unfairly ignoring all the results that I could not find in my literature search. I felt I should at least list them, somewhat like errors, but I was told not to. I felt I had spent two years writing a really fancy R script in vain, because I filtered out every enrichment result that I could not find in any peer-reviewed paper. I could have accomplished the same thing by randomly picking a GO term: if I happened to find it in another paper, I would report it, and if not, I would never even mention it. I had spent so much time writing a really fancy R script, but only allowed it to generate results that I could find in peer-reviewed journals.

=========================================

Why was my bioinformatics master’s degree a relief?

I was glad when I graduated with my master’s, because it meant that I no longer had to feel guilty about all the enrichment results I had omitted because I could not find them in the literature. I also felt that if this is the way they do research at Harvard, then I could take the risk of working like that at UALR, because if Harvard post-docs could not figure it out, how could I? Unfortunately, if all scientists worked like that, we could never make any progress towards immortality, because especially young and still inexperienced researchers seem to avoid reporting results that have not already been validated by somebody else.

At that time I had not yet developed my passion for preaching about feature discovery through variation of conditions and methods. I noticed that I was told to use RMA normalization because everyone used it, without anyone even knowing the differences between RMA and other normalization methods. Now, however, I know that if we are too reluctant to vary the environment, methods, measurements and external background conditions, we cannot discover new features for training our AI algorithms.

My master’s R script was probably the most numerically intense R script I have ever written. I was proud of my histograms and box plots created with the ggplot2 R package. I could print the name of every gene in its histogram column. I plotted the divergence between transcriptome and proteome with advancing age, because Janssens et al. (2015) claimed that this divergence is a driver of replicative aging in yeast. My fancy R script gave the impression that the results it generated must be good, since they were based on lots of fancy calculations from various R libraries like limma, yeast2, yeast98, ggplot2 and many others.

==========================================

Starting my dissertation on August 16th, 2016

Why plot lifespan datasets with intervals exceeding the cell cycle duration?

When I started my PhD program here in August 2016, I plotted all lifespan datasets for the yeast2 and yeast98 microarray chips that I could find on NCBI and ArrayExpress. I expected that if I plotted the genes from enough lifespan datasets, it would only be a question of time until I could see some trends that I could associate with aging. I kept plotting and plotting until November, yet I could not find any linear trend. Before Thanksgiving break I finally ran out of lifespan time series datasets to plot, even though I was supposed to construct gene co-expression networks based on correlation measures. Just by visual inspection, my plots looked as if a kindergarten child had taken lots of colorful pens, one for each lifespan dataset, and randomly drawn colorful lines in each XY coordinate system.

==========================================

Why plot high temporal resolution cell cycle time series datasets?

Since I did not know what else to do, but felt bad about doing nothing, I started plotting cell cycle time series datasets, despite not seeing any relationship to, or benefit for, better understanding aging. But when I finished plotting the first cell cycle dataset, I saw for the first time the same periodic features and oscillation patterns as in the plots of the original paper. I was happy to see the four cell cycle phases. I realized that I had basically lost all the time from August until Thanksgiving, because I had stubbornly held on to my old linear-trend concept of life, according to which certain genes get up- or down-regulated with age and these trends can be used as aging markers. The cell cycle dataset taught me that if, instead of linear measures, I used periodic measures, such as amplitude, period length, phase shift and oscillation pattern, I could achieve much better clustering results than with linear trend features alone.
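A minimal sketch of what extracting such periodic features could look like, on a synthetic, evenly sampled expression series (hypothetical gene, assumed 10-minute sampling interval): the discrete Fourier transform yields amplitude, period and phase, which can then feed a clustering step instead of a linear trend.

    # Minimal sketch: periodic features (amplitude, period, phase) from an
    # evenly sampled synthetic expression time series via the Fourier transform.
    import numpy as np

    def periodic_features(values, interval_minutes):
        """Return (amplitude, period in minutes, phase in radians) of the dominant oscillation."""
        values = np.asarray(values, dtype=float)
        spectrum = np.fft.rfft(values - values.mean())
        freqs = np.fft.rfftfreq(len(values), d=interval_minutes)
        k = int(np.argmax(np.abs(spectrum[1:]))) + 1        # skip the zero-frequency bin
        amplitude = 2.0 * np.abs(spectrum[k]) / len(values)
        period = 1.0 / freqs[k]
        phase = float(np.angle(spectrum[k]))
        return amplitude, period, phase

    # Synthetic "gene" sampled every 10 minutes over two ~120-minute cell cycles
    t = np.arange(0, 240, 10)
    expression = 1.5 * np.sin(2 * np.pi * t / 120.0) + 5.0
    amp, period, phase = periodic_features(expression, interval_minutes=10)
    print(f"amplitude ~ {amp:.2f}, period ~ {period:.0f} min, phase ~ {phase:.2f} rad")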

======================================

What to do when getting stuck in research?

Having lost more than 4 months before discovering this made me understand the importance of timely feature discovery and selection through almost random variation of conditions, methods and measurements, because my results improved so dramatically that I began to feel confident I could find a way to use plot similarities to infer the functions of still unknown genes from such trajectory correlations. When I was stuck with the lifespan time series datasets, I should have simply taken a chance and plotted a cell cycle dataset maybe 2 weeks into the program, and I would be in much better shape now. Instead, I am guilty of exactly what I accuse other scientists of: I was far too reluctant to change features when I got stuck. I was committed to keep plotting until I found a linear trend that I could associate with aging. In the end, I found only a single gene out of 5,216 that was down-regulated as aging advanced.

=======================================

What are the longest still acceptable intervals between measurements for high temporal resolution cell cycle time series data?

Only because I had the additional periodic features did high temporal resolution cell cycle time series data provide enough opportunities for trajectories to differ from one another within at least one cell cycle phase, so that they could be assigned to different functional units. For that, however, the sampling intervals could not exceed about 25 minutes, or else I would miss the short M phase.

=======================================

What were the shortcomings of the Janssens et al. (2015) lifespan dataset?

The intervals between the Janssens time points ranged from 3 to 11 hours, i.e. much longer than the average cell cycle, which is about 120 minutes for WT yeast on YEPD. The Janssens dataset had only 12 time points for 4,902 genes and about 1,500 proteins. But 12 time points are not enough to sort more than 4,900 genes into functionally distinct clusters based on correlation measures. This became evident when I used my master’s R script for enrichment analysis: I got almost no enrichment for the lifespan datasets, but very good enrichment for the functional clusters from my cell cycle datasets.
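A minimal numeric sketch of this sampling argument, with assumed (hypothetical) phase timings: if the M phase occupies roughly the last 25 minutes of a 120-minute cycle, sampling intervals longer than 25 minutes can miss it entirely, and intervals of several hours cannot resolve the cycle at all.

    # Minimal sketch with assumed phase timings: how many samples land in a
    # short cell cycle phase for different sampling intervals.
    def samples_landing_in_phase(phase_start, phase_len, cycle_len, interval, n_samples):
        """Count samples whose time (modulo the cycle length) falls inside the phase."""
        hits = 0
        for i in range(n_samples):
            t = (i * interval) % cycle_len
            if phase_start <= t < phase_start + phase_len:
                hits += 1
        return hits

    CYCLE, M_START, M_LEN = 120, 95, 25   # minutes; M phase assumed to be ~25 min long
    for interval in (10, 25, 40, 180):    # 180 min = 3 h, the shortest Janssens gap
        n = (10 * CYCLE) // interval      # samples collected over ten cycles
        hits = samples_landing_in_phase(M_START, M_LEN, CYCLE, interval, n)
        print(f"interval {interval:>3} min: {n:>3} samples, {hits} land in the M phase")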

==========================================

When do hidden feature discoveries tend to take place?

Although I consider myself a logically thinking scientist, I made most of my discoveries of previously hidden features when I least expected them. No data had pointed me to cell cycle time series datasets to discover the benefits of clustering by periodic features. That is why intuition is so important. Even numerically hungry scientists must be able to navigate and make decisions without numbers in order to progress rapidly in feature discovery.

========================================

What is the major aim of this dissertation?

I hope that my dissertation will be understood as a guide to random variation-driven feature discovery, because the lack of it is what is holding our scientific progress back. In hindsight it looks so easy and obvious to have started with high temporal resolution cell cycle time series data, because that gives me periodic measurements for clustering. So logical! Easy! A no-brainer! But it took me four months to figure this out. I did not discover this feature by thinking of it on my own; I discovered it by chance, when I plotted my first cell cycle dataset out of frustration about having run out of lifespan time series datasets to plot. You can see that there was absolutely no progress between August and November, because the needed missing features remained hidden to me. Only by randomly uncovering those features did the objective of my dissertation finally seem possible.

=======================================

How many still hidden concepts separate us from reversing aging?

But how many such key discoveries of new essential features are we still away from solving and reversing aging? The lack of progress in extending lifespan by more than 50% indicates serious problems with feature selection. What will it take to make our experimental life scientists understand this essential concept of feature selection through variation? When I published this concept online about 2 months ago, I suddenly became the most read author of the entire UALR.

=======================================

What is my dream job?

I actually want a job as a biological scenario writer, coming up with hypotheses that require variation of conditions, methods and measurements, and having the standing to tell wet-lab experimentalists which study design I want in order to either prove or disprove my biological regulation scenarios. I do not want any of my highly speculative scenarios to be rejected unless it has been properly and thoroughly tested.

=======================================

What kind of datasets would I need to reverse engineer aging?

If I had the choice, I would demand measurements of the transcriptome, proteome, metabolome, epigenome, lipidome, automated morphological microscope images, ribogenesis, ribosomal footprinting, ChIP-chip and ChIP-seq analyses, speed of translation, distribution of and ratios between charged tRNAs in the cytoplasm, length of the poly(A) tail, non-coding RNA binding, autonomously replicating sequences (ARS), vacuolar acidity, autophagy, endocytosis, proton pumping, chaperone folding, cofactors, suppressors, activators, etc., every 5 minutes over all cell cycles, because only in this way could these different data sources be properly integrated.

=======================================

What is the temporal alignment rejuvenation hypothesis?

I suspect that the temporal alignment between genes that must be co-expressed together and genes that must never be co-expressed together (e.g. sleep and wake genes) gets gradually lost with advancing age. For example, when I was 6, I remember falling asleep at 8:30 p.m. and getting up at 7 a.m. without feeling that any time had elapsed between falling asleep and waking up. This meant that during my sleep all my sleep genes were nicely turned on and all wake genes were fully turned off, so that they could not interfere with the regenerating processes, which require deep, uninterrupted and healthy REM sleep. Unfortunately, now, only forty years later, nights seem to take forever. I am not even sure whether I am awake or sleeping or something in between. This is because not all my sleep genes are properly turned on during the night and not all of my wake genes are properly turned off. The still partially active wake genes interfere with my sleep, which is why I cannot regenerate anymore and feel so tired all the time. If we could reset the temporal alignment to its youthful state, maybe I could regain my energy. It is such a simple and logical hypothesis, but I cannot find anyone willing to test it. We have the means! Yeast only lives a few days. I could have a very high resolution, full-omics, dimension-spanning, multi-dimensional dataset in less than a week. Why does almost nobody seem to understand? I feel all alone with my thoughts and ideas. Is it because I see something that is not there, or because others cannot see what I can? But if they cannot, they should at least listen to me, read my long emails, or talk to me.
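One way this hypothesis could be quantified, sketched here on synthetic series with hypothetical gene roles and an assumed 5-minute sampling interval: estimate how far two genes that should be co-expressed have drifted apart in time by finding the lag that maximizes their cross-correlation, and then compare that drift between young and aged cells.

    # Minimal sketch on synthetic data: estimating the temporal drift between
    # two expression series via the lag of maximum cross-correlation.
    import numpy as np

    def phase_drift(series_a, series_b, interval_minutes):
        """Minutes by which series_b lags behind series_a (positive = b is late)."""
        a = (series_a - series_a.mean()) / series_a.std()
        b = (series_b - series_b.mean()) / series_b.std()
        corr = np.correlate(b, a, mode="full")
        lag = int(np.argmax(corr)) - (len(a) - 1)
        return lag * interval_minutes

    t = np.arange(0, 240, 5)                        # sampled every 5 minutes
    reference = np.sin(2 * np.pi * t / 120.0)       # gene with youthful timing
    drifted = np.sin(2 * np.pi * (t - 30) / 120.0)  # partner gene running 30 minutes late
    print("estimated drift:", phase_drift(reference, drifted, 5), "minutes")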

Have I written the past 21 pages well enough for people to understand? Will anybody ever take the time to read 21 pages patiently? Even though I am much prouder of this than of past nights’ writings, it still does not read like a dissertation. Which conference or journal could be interested in publishing my work and promoting my concepts? I think we need workshops for wet-lab experimentalists to learn the tricks of feature discovery by actively applying and trying them out. If I post this writing on the web or on Biostars.org, it will only get deleted as spam. The people who need to understand it will never look at it; they are simply too busy to vary their conditions and working habits. I do not want to graduate unless I get a job that allows me to keep writing such important conceptual papers, since I feel I am the only one who can write them with ease, within a single night, non-stop, because apart from my own writings I have not seen any similar papers. So if I get pushed out of my office, I will no longer have the motivation to keep typing all night, trying my very best to make people understand.

======================================

Why is this university setting so important for my well-being, motivation and productivity?

Right now I am still at school, so I can still apply for conferences and travel grants. These things are very important for me right now. Once the university is taken away from me, I will no longer be able to work creatively, because the structure of a university setting, which allows me to work hard despite being autistic, dyslexic and visually impaired, will be gone. It will be as depressing as the year of my suspension, which I spent in Germany and during which I could not accomplish anything.

========================================

Why can nobody see the keys to immortality in this writing?

I still feel that this writing contains the key to the keys to immortality and to rejuvenating until we are forever young. But apart from me, nobody seems to see the keys to our collective survival in this and my other feature discovery and selection writings. People keep throwing them in the trash. Once I graduate, I will have no more motivation to write. All my intuitions, thoughts, insights, visions and ideas will get lost. The day I graduate will be the day the dream of immortality dies, because my access to the very limited resources that I must have in order to remain motivated to advocate immortality through varied feature discovery searches will be taken away.

===============================

Who will help me?

Whom could I email these 22 pages to for proofreading and feedback? No conference will accept a proposal that is 22 pages long. Now, after I have invested so much time and effort in writing to make people understand, I must delete much of my work to comply with the format requirements set by the very people whose attention I want to draw with my writings.

Best regards,

Thomas Hahn

Office Phone (landline): +1 (501) 682 1440
Smart Phone: +1 (501) 303 6595
Flip Phone: +1 (318) 243 3940
Google Voice Phone: +1 (501) 301-4890
Office Location: Engineering and Information Technology (EIT) Building, Room 535
Mailing Address: 2811 Fair Park Blvd., Little Rock, AR, 72204, USA
Skype ID: tfh002
Work Email: [email protected]
Private Email: [email protected]
Twitter: @Thomas_F_Hahn
Facebook: https://www.facebook.com/Thomas.F.Hahn
LinkedIn: https://www.linkedin.com/in/thomas-hahn-042b2942/
ResearchGate: https://www.researchgate.net/profile/Thomas_Hahn10
Academia.edu: https://ualr.academia.edu/ThomasHahn
