Would this method for uncovering hidden objects and features, followed by feature selection and supervised learning, work?
According to my understanding, perfect prediction through computational optimization must be impossible as long as essential input training data features for supervised learning are still lacking. Therefore, I emphasize not investing much effort into algorithmic and computational optimization before the feature discovery process, driven by variation, has allowed all needed features to be selected; only then can realistic predictions be achieved by enhancing computations and parameters. For me this simple progression from completed feature selection to optimized algorithm is essential for more rapid discoveries. That is why I am still wondering why I have seen so many papers focusing on improving computational predictions without any prior consideration of whether the absolutely essential, exhaustive feature selection process has been properly completed, without which no subsequent computation can lead to satisfactory predictions that are reasonably consistent with our experimental observations and measurements.
I am afraid that I am still the only one to whom the writing above makes sense. I had expected much more enthusiasm, excitement and optimism about very likely accelerating our discovery rate by first focusing on uncovering still hidden objects and features through diverse variations of conditions, procedures, methods, techniques and measurements, followed by the exhaustive selection of all relevant needed features, followed by designing, developing, combining and optimizing the computational steps of our machine learning algorithm until our predictions match our experimentally obtained observations. Once this major machine learning objective has been achieved, the algorithm has reached its final status, beyond which we cannot improve it unless we can generate conditions that cause our previously perfectly predicting machine learning algorithm to obviously fail, because such a failure is an absolute prerequisite for discovering more relevant training features, reflected by more dimensions.
According to my current understanding, any newly discovered essential input data feature inevitably raises the dimensionality of the input data, and its components must be considered together, never in isolation. For example, if I train with 100 input features, our input variable must consist of exactly 100 components or parts, which together form a new kind of single measurement point. This combined point tends to behave very differently, in how its manipulations can be controlled and in their overall effects, from anything that could possibly be anticipated by trying to add up the effects of its 100 parts to a new total. The combined whole tends to belong to much different dimensions and often refers to a completely unrelated kind of data: taking exactly the same input values for each of the 100 variables, but considering all 100 components as a single indivisible unit of measurement points, often results in seemingly unrelated kinds of properties/features, which do not even closely reflect the results we would obtain if we executed each dimension on its own, in isolation and in sequential order.
For example, stopping translation prematurely through three consecutive tryptophans has a much different impact, i.e. a stop, than translating each tryptophan in isolation, separated by other amino acids, which simply causes the nascent polypeptide chain to grow.
Each tryptophan-charged tRNA that complements its mRNA triplet causes the peptide to grow by a single tryptophan, which gets added to it. So with 2 tryptophans the polypeptide grows by two amino acids. But with 3 consecutive tryptophans, counter-intuitively, instead of the expected growth by 3 amino acids, translation stops prematurely. Stopping translation prematurely is of a much different dimension, level, effect and data kind than continuing to add more amino acids to the growing peptide chain. If we consider the effect of complementary binding of a tRNA to its mRNA codon, our peptide grows by one amino acid; every charged tRNA adds another amino acid taking one of 20 categorical values. Normally no single amino acid can cause translation to stop prematurely, not even two amino acids as a pair. But three tryptophans form an indivisible triplet, which must be considered as a new single value requiring all three tryptophans to be present sequentially, like a single indivisible data block that must not be divided, because only the triplet, but no pair or singlet, can stop translation prematurely.
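Below is a minimal sketch of this triplet idea in code, assuming the hypothetical rule above that three consecutive tryptophans (single-letter code W) stall translation; the example peptide and the exact stall position are made up for illustration only.

```python
# Minimal sketch: per-residue features vs. an indivisible triplet feature.
# Assumes the hypothetical rule from the text that three consecutive
# tryptophans ("WWW") stall translation prematurely; singles and pairs do not.

def predicted_length_per_residue(seq: str) -> int:
    """Treat every residue in isolation: each one just extends the chain."""
    return len(seq)  # no single residue can stop translation

def predicted_length_with_triplet_feature(seq: str) -> int:
    """Treat each window of three residues as one indivisible feature value."""
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "WWW":   # only the whole triplet carries the stop effect
            return i                # illustrative choice: chain length reached before the stall
    return len(seq)

peptide = "MKAWWWGLR"
print(predicted_length_per_residue(peptide))           # 9: summing the parts misses the stall
print(predicted_length_with_triplet_feature(peptide))  # 3: the triplet feature predicts the premature stop
```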
Another example is predicting overall cellular protein composition. It depends on how many mRNA strands coding for a particular protein are in the cytoplasm. There is proportionality between the number of cytoplasmic mRNA strands and total protein. Therefore, if the cell needs to double protein abundance, it could double transcription and keep everything else the same. But a much better, less step-intensive and more economical way of doubling protein concentration is to double the length of the poly(A) mRNA tail. Extending the poly(A) tail may require about 100 additional adenines, whereas doubling transcription requires at least about 500 new nucleotides, instead of only 100, in addition to all the needed transcriptional modification steps with their elaborate synthesis machinery.
If dividing yeast must raise its lipid synthesis by more than 10-fold during the short M-phase, it could increase transcription by a factor of 10, it could make the poly(A) mRNA tail 10 times longer, or it could synthesize 10 times more new ribosomes to increase ribosomal translation by a factor of 10, simply by reducing the stretch of free, uncovered mRNA nucleotides between adjacent ribosomes running down the same mRNA strand. If more than one ribosome is translating the same mRNA strand simultaneously, it is called a polyribosome. Hence, having 10 times more ribosomes binding to the same mRNA strand at the same time increases translation by a factor of 10 without needing any additional transcription.
Above I have given three easy examples of how to get 10 times more protein. Although all 3 methods have the same final result, i.e. 10 times more protein, their modes of action, their required essential features, their dimensions and their minimally required parts, which must be considered like a single value, are totally different.
If the cell employs all three options simultaneously, it can raise protein abundance by a factor of 1,000 during the short, only 20 minutes long, M-phase.
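As a minimal sketch of this arithmetic, assuming the three mechanisms act independently so that their fold changes multiply:

```python
# Minimal sketch of the arithmetic above, assuming the three mechanisms act
# independently so that their fold changes multiply rather than add.
transcription_fold = 10  # 10x more mRNA strands
poly_a_fold = 10         # 10x longer poly(A) tail
ribosome_fold = 10       # 10x more ribosomes per mRNA strand (polyribosome)

combined_fold = transcription_fold * poly_a_fold * ribosome_fold
print(combined_fold)     # 1000, the factor stated above
```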
The relevant essential input feature for the poly(A)-mRNA-tail-based input to predict protein increase is simply the number of adenines added to the tail. The only essential selected feature is a simple integer not requiring any further dimensional specification, since only adenines can be added. But we should note that the unit of the required feature is the number of adenines.
However, when increasing transcription, the input feature is the number of mRNA strands. Note that the number of mRNA strands cannot be directly converted into a number of added poly(A) adenines. Synthesizing an additional mRNA strand affects protein abundance by a different mechanism and amount than adding an extra adenine to the tail. There is probably a way to experimentally figure out how many more adenines must be added to the tail to increase protein abundance by the same factor as synthesizing an additional mRNA strand.
The input feature for ribosomal coverage is an integer with the unit ribosome. Adenine, mRNA strand and ribosome are different feature dimensions. We could now experimentally figure out how many additional mRNA strands need to be transcribed to increase protein abundance by the same amount as adding a single new ribosome. Then we could figure out how many adenines have the same effect as a ribosome, how many adenines have the same effect as an additional mRNA strand, and how many mRNA strands have the same effect as a ribosome on the overall increase in protein concentration. This will give us a nice conversion table with fixed adenine-to-mRNA-strand, adenine-to-ribosome and mRNA-strand-to-ribosome ratios, based on which we can make meaningful predictions in each of the different dimensions that contribute to protein abundance by completely different and unrelated modes of action, i.e. 1. adding an adenine, 2. transcribing an mRNA strand, 3. synthesizing a ribosome.
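A minimal sketch of such a conversion table is shown below; the two calibration ratios are placeholder assumptions standing in for the experiments just described, not measured values.

```python
# Minimal sketch of the conversion table described above. The two calibration
# ratios are placeholder assumptions, not measured values.
ADENINES_PER_MRNA_STRAND = 50.0   # adenines with the same effect as one extra mRNA strand
ADENINES_PER_RIBOSOME = 200.0     # adenines with the same effect as one extra ribosome

conversion_table = {
    "adenines_per_mrna_strand": ADENINES_PER_MRNA_STRAND,
    "adenines_per_ribosome": ADENINES_PER_RIBOSOME,
    "mrna_strands_per_ribosome": ADENINES_PER_RIBOSOME / ADENINES_PER_MRNA_STRAND,
}
print(conversion_table)
```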
To simplify, assuming that the translation rate can only be affected by varying the length of the poly(A) tail on the mRNA, the transcription rate and the ribosome synthesis rate, which essential features do we need to train our machine learning algorithm?
Answer: We need 3 input features, i.e. 1. the number of adenines (dimension adenine), 2. the number of mRNA strands (dimension mRNA strand) and 3. the number of ribosomes (dimension ribosome). For each of these three dimensions we will get an integer input value. Based on our previously experimentally determined, calibrated conversion table between the translation rate effects of the input features, i.e. adenine, mRNA strand and ribosome, we can predict total protein abundance.
The total protein abundance should not be affected by whether we consider adenines, mRNAs and ribosomes in isolation and sequentially, as a combined triplet, or as combinations of pairs, because each of these three dimensions and their modes of action can function totally independently of one another.
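A minimal sketch of this point, reusing the same made-up calibration ratios plus a hypothetical protein-per-adenine-equivalent constant: for these independent, mutually convertible features the prediction comes out the same whether the components are evaluated one by one or together as a single unit.

```python
# Minimal sketch with made-up calibration constants: for independent, mutually
# convertible features, separate and combined evaluation give the same prediction.
ADENINES_PER_MRNA_STRAND = 50.0
ADENINES_PER_RIBOSOME = 200.0
PROTEIN_PER_ADENINE_EQUIVALENT = 0.1  # hypothetical calibration constant

def protein_gain(adenines=0.0, mrna_strands=0.0, ribosomes=0.0) -> float:
    """Predicted increase in protein abundance (arbitrary units)."""
    adenine_equivalents = (adenines
                           + mrna_strands * ADENINES_PER_MRNA_STRAND
                           + ribosomes * ADENINES_PER_RIBOSOME)
    return adenine_equivalents * PROTEIN_PER_ADENINE_EQUIVALENT

# Evaluating the three components in isolation and summing their effects ...
separately = (protein_gain(adenines=100)
              + protein_gain(mrna_strands=2)
              + protein_gain(ribosomes=1))
# ... gives the same prediction as evaluating them together as one unit.
together = protein_gain(adenines=100, mrna_strands=2, ribosomes=1)
assert abs(separately - together) < 1e-9
print(separately, together)  # 40.0 40.0
```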
An example of the pair or triplet unit concept is given below.
For any wild type (WT), caloric restriction (CR) is a lifespan-extending intervention compared to YEPD (normal yeast growth medium). As long as this always holds true, CR is a reliable way to extend lifespan. We could train a machine learning algorithm that predicts lifespan only from the input values CR or YEPD, assuming we have WT. This is a very simple binary classifier. As soon as we get it to work, we want to cause it to fail. To accomplish this we vary lots of features, e.g. protein, genotype, temperature and whatever else we are capable of varying. We keep doing this until we find an instance where our algorithm predicts an extension whereas our observation shows a shortening of lifespan. If we find such a case, we must scrutinize the yeast cell for which CR truly extended lifespan and compare it with the other cell for which CR suddenly shortened lifespan. The fact that the same manipulation, i.e. CR, has opposite effects on two still very similar-looking phenotypes must make us understand that the two yeast cells must no longer be considered objects of the same kind. The fact that CR affected both instances of yeast in opposite ways makes it imperative that the two yeast cells differ in at least a single feature from one another.
Our next task is to compare both cells until we find the defining and distinguishing difference(s) between them, because without such a difference CR could not have opposite effects on two exactly identical instances of a yeast cell. After carefully checking we notice that both yeast cells are fully identical to one another, except for their Atg15 sequence, in which they differ. This was a very simple example of essential input feature discovery. Let’s assume that we are conducting this kind of feature discovery before having formed the concept of a gene. In this case, we have 2 Atg15 sequences. For the first one, CR extends lifespan, but for the second one exactly the same kind of CR shortens lifespan by at least 7 replications. This discovery causes our concept of lifespan-extending interventions to become obsolete because of a single example, where a difference in Atg15 nucleotide sequences causes CR to shorten lifespan. When we look at protein abundance we can easily see that the lifespan of the Atg15-less yeast gets shortened by CR whereas the lifespan of the yeast with the Atg15 protein (WT) gets extended by exactly the same kind of CR. We have succeeded in finding a reproducible causal relationship that causes our one-dimensional input feature, CR or YEPD, to fail every time the phenotype lacks the Atg15 protein. We have discovered a new feature. Whenever a difference or change causes our old machine learning algorithm to reproducibly fail in the same manner for the same distinct input parameters, the dimension in which the two objects differ from one another must be our newly discovered feature, which we must include in our feature selection before we can return to retraining the machine learning algorithm based on both features instead of only one.
As long as we only had a single genotype, i.e. WT, the input feature genotype was not needed because it was the same for all encountered instances of the yeast cell. Since the genotype was always the same, i.e. WT, the object or feature "genotype" remained hidden because it could not be used to distinguish between yeast instances as long as the genotype was WT in all cases. In this particular case the object "gene" itself may not be hidden, because it consists of physical matter, which we can measure; but as long as this feature was always WT, it could not show up as a feature, since a feature must be able to take at least two distinct values. As soon as we discovered that the visible object genome differed in the feature Atg15 protein present or absent, we had to recognize that we must provide our algorithm with a second data dimension, because CR shortens life in the Atg15 knockout while lengthening it in WT. We have discovered the first example in which the visible object, the Atg15-coding gene, could take two distinct values, either knockout or WT. This puts us into a good position to proceed with gene-based feature discovery until we succeed in knocking out Erg6. Again, for the Erg6 knockout CR shortens lifespan to less than 10 replications whereas it lengthens lifespan by about 4 replications for WT. CR can no longer be considered a lifespan-extending intervention per se; in order to train an algorithm with supervised learning to predict lifespan effects we must now provide 2 dimensions, i.e. 2 training input features, namely glucose concentration and genotype. In this example the 2 components, i.e. glucose and genotype, must be considered as an indivisible informational pair. When considering either of its 2 components in isolation, proper supervised learning and correct prediction are impossible. Only considering both components (dimensions) together, like a single indivisible measurement point, allows for proper input feature selection.
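A minimal sketch of this step, with made-up lifespan outcomes (illustrative, not measured data): the old one-feature rule fails for the knockout, while a lookup keyed on the indivisible (diet, genotype) pair predicts correctly.

```python
# Minimal sketch with made-up lifespan outcomes (illustrative, not measured data).
# The one-feature rule "CR always extends lifespan" fails for the knockout,
# while the (diet, genotype) pair, treated as one indivisible key, predicts correctly.
observations = {
    ("CR", "WT"): "extended",
    ("YEPD", "WT"): "baseline",
    ("CR", "atg15_knockout"): "shortened",  # the same CR reverses its effect here
    ("YEPD", "atg15_knockout"): "baseline",
}

def predict_one_feature(diet: str) -> str:
    """Old single-feature rule: diet alone decides the outcome."""
    return "extended" if diet == "CR" else "baseline"

def predict_pair(diet: str, genotype: str) -> str:
    """New rule: the (diet, genotype) pair is looked up as a single unit."""
    return observations[(diet, genotype)]

for (diet, genotype), observed in observations.items():
    print(diet, genotype,
          "| observed:", observed,
          "| one-feature:", predict_one_feature(diet),
          "| pair:", predict_pair(diet, genotype))
# The mismatch for ("CR", "atg15_knockout") is exactly the failure that reveals the hidden feature.
```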
Let’s assume all measurements described above were performed at 30 degrees Celsius. As long as we only have a single, always identical value for temperature, we can measure it, but it remains a hidden feature until we start to vary it. Let’s say heating up to 40 degrees Celsius generally shortens WT lifespan due to heat-shock-induced improper folding, but that for some knockouts raising the temperature from 30 to 40 degrees actually increases lifespan. This will result in a three-dimensional input vector.
Important: I think I just discovered the distinction. When the different input features can be converted into one another by a kind of conversion table, as we had for adenine, mRNA and ribosome, they yield the same results regardless of whether we consider the features separately and in sequence or together as a single indivisible unit, because the mechanisms of action of the individual dimensions do not depend on one another and can be carried out completely independently without having to share any scarce resources.
However, when we have selected input features/dimensions that can never be converted into one another by a simple experimentally obtained proportion ratio table, as is clearly the case for CR vs. YEPD, knockout vs. WT, or 30 vs. 40 degrees Celsius, then our three-dimensional input variable, which consists of a glucose, a genotype and a temperature component, all of which are Boolean variables in this example, must be considered like a single indivisible unit: its components must never be evaluated in isolation from one another, and together they must form a single input feature value, in order to make proper lifespan predictions.
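A minimal sketch of why such features cannot be split apart, using made-up lifespan changes roughly matching the numbers above (+4 replications for WT under CR, -7 for the knockout): when the effect of diet flips sign depending on genotype, no additive per-feature model can fit the data, so the tuple has to be modelled as one unit.

```python
# Minimal sketch with made-up lifespan changes (in replications): when the effect
# of diet flips sign depending on genotype, no additive model f(diet) + g(genotype)
# can reproduce the observations, so the tuple must be treated as one unit.
effects = {
    ("CR", "WT"): +4,                 # CR extends WT lifespan
    ("YEPD", "WT"): 0,
    ("CR", "atg15_knockout"): -7,     # the same CR shortens the knockout's lifespan
    ("YEPD", "atg15_knockout"): 0,
}

# In any additive model the genotype term cancels out of these two differences,
# so both differences would have to be equal. They are not.
diet_effect_in_wt = effects[("CR", "WT")] - effects[("YEPD", "WT")]                                # +4
diet_effect_in_knockout = effects[("CR", "atg15_knockout")] - effects[("YEPD", "atg15_knockout")]  # -7
print(diet_effect_in_wt, diet_effect_in_knockout)
assert diet_effect_in_wt != diet_effect_in_knockout  # no additive decomposition exists
```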
It is very similar to our numeric system. Let’s say I have the digits 1, 2 and 3. This is another example of the case above, where 123 is not the same as processing 1, 2 and 3 in isolation and sequentially. If I have a three-digit number, I must always consider all three digits at once to make good predictions, similar to always predicting from glucose, genotype and temperature together, without ever splitting the input apart into its single components.
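As a tiny illustration of the digit analogy:

```python
# Tiny illustration of the digit analogy: the digits get their meaning from
# their position within the whole number, not from being read one by one.
digits = [1, 2, 3]

as_isolated_parts = sum(digits)                                 # 6
as_one_indivisible_unit = int("".join(str(d) for d in digits))  # 123
print(as_isolated_parts, as_one_indivisible_unit)
```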
Now let’s say I have A, B and C. It does not matter in which order I consider these three letters, or whether I process them as a single unit or one after the other; the result should always be the same, unless they can form different words.
Feature discovery, feature selection, and machine learning algorithm training and tuning must be performed as three discrete, separate steps, like in the waterfall model, which requires that the previous step be fully completed before the subsequent step can be started. Improper feature selection has held us back unnecessarily long, because somehow the main influencers, who have the biggest impact on the way research is conducted and studies are designed, seem to have overlooked the fact that computational model prediction can only work if all of the needed input features have been selected properly. Yet this realization is so basic, fundamental, obvious and self-explanatory that it should be common sense to follow it implicitly. From my literature searches I remember many papers discussing the effects of variations in calculation on predictive outcome, but I cannot remember any paper, except for my NLP poster last summer, where feature selection was explicitly used as a method to improve predictive outcome.
The main problem is that I am almost blind and can only read one paper per day. Actually, I can type this much faster than I can read it. This could mean that feature selection papers are also plentiful, but that I have not yet had the chance to discover them. However, I am almost certain that nobody has ever used my concept of hidden objects and features and written out detailed examples of feature discovery, feature selection and algorithm training leading to almost perfect prediction.
Once perfect prediction has been accomplished, we actively search for conditions that cause our predictor to fail. Such a failure is caused by a still undiscovered difference between two objects that had been considered exactly the same, until a reproducible difference in their response to the same treatment makes these two instances two separate, imperatively distinct objects, which must no longer be substituted for one another because they must differ in at least one feature. We must compare all aspects of such objects until we discover the feature that allows us to distinguish between them. This new feature, by which they differ, is a new essential feature, which must be added to the set of selected features before any adequate predictions become possible by retraining our machine learning algorithm with the additional input feature and the additional dimension of the input training variable.
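Below is a minimal runnable sketch of this loop, reusing the made-up CR/genotype/temperature yeast example from earlier as the "world" to probe; all rules and numbers are illustrative assumptions, and the simple lookup-table "retraining" just stands in for whatever learning algorithm is actually used.

```python
# Minimal runnable sketch of the loop described above, using the made-up
# CR/genotype/temperature yeast example as the "world" to probe. All rules
# and numbers are illustrative assumptions; the lookup-table "retraining"
# stands in for whatever learning algorithm is actually used.
import itertools

def true_lifespan_effect(cond):
    """Hidden ground truth: the effect depends on the (diet, genotype) pair."""
    if cond["genotype"] == "atg15_knockout":
        return -7 if cond["diet"] == "CR" else 0
    return +4 if cond["diet"] == "CR" else 0

ALL_VALUES = {"diet": ["CR", "YEPD"],
              "genotype": ["WT", "atg15_knockout"],
              "temp": [30, 40]}

def train_lookup(features, conditions):
    """'Retrain': memorise the observed effect for every seen feature combination."""
    return {tuple(c[f] for f in features): true_lifespan_effect(c) for c in conditions}

def predict(table, features, cond):
    return table.get(tuple(cond[f] for f in features), 0)

selected_features = ["diet"]  # start with the single known feature
seen = [{"diet": d, "genotype": "WT", "temp": 30} for d in ALL_VALUES["diet"]]

while True:
    model = train_lookup(selected_features, seen)
    # Actively vary every condition we can, looking for a prediction failure.
    failure = None
    for diet, genotype, temp in itertools.product(*ALL_VALUES.values()):
        cond = {"diet": diet, "genotype": genotype, "temp": temp}
        if predict(model, selected_features, cond) != true_lifespan_effect(cond):
            failure = cond
            break
    if failure is None:
        break  # the predictor currently survives every variation we can generate
    # Feature discovery: the failing condition differs from a remembered one in at
    # least one still-unselected feature; select it and retrain with the new evidence.
    new_feature = next(f for f in ALL_VALUES
                       if f not in selected_features
                       and any(failure[f] != s[f] for s in seen))
    selected_features.append(new_feature)
    seen.append(failure)

print(selected_features)  # ['diet', 'genotype'] once the Atg15 failure has been found
```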
Does this explanation make sense?