As I read, it's an expected value, and the lower the value, the less likely it is that the alignment arose by chance. But I could not understand what the notation (for example, 1e-63) actually denotes.
Not exactly. Indeed, the math behind the calculation of e-values is defined by each program you use (BLAST, HMMer, etc.). But in all cases it can be interpreted as a measure of how likely it is to observe such results by chance. For instance, if you compare two identical sequences using BLAST, it will return a value near 0, because finding two identical sequences by chance (without knowing beforehand that they are the same) is quite improbable. It should be clear that e-values ARE NOT probabilities, because probabilities only range from 0 to 1, and e-values can take values beyond 1. What counts as a good value depends on what you are trying to find. But usually, a lower e-value indicates a better quality in the search/alignment/comparison. It is preferred over the score value because the e-value corrects for sequence length, while the score grows with it. For instance (again) in BLAST, if you compare two identical sequences, each with fewer than 100 residues, you'll never have a score beyond 200. In the opposite case, if you compare two large sequences (even when they are not very similar) you can easily get a score over 200. In such cases you should always check the e-value because it will give you more information on the quality of your alignment/search.
It's the chance that, if you took two random sequences, you would get an alignment this good. 1e-63 is the standard way to write very small numbers. It's the same as 1×10^(-63). For example, 1×10^(-5) = 1e-5 = 0.00001.
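To make the notation concrete, here is a minimal Python sketch. It only shows how "e" notation is read as a mantissa times a power of ten, and that the smallest of several E-values is the most significant one; the list of four values is just an example.

```python
import math

# "1e-63" is scientific notation: mantissa 1, power of ten -63.
e_value = float("1e-63")

# The same number written out as mantissa * 10**exponent.
written_out = 1 * 10.0 ** -63

same_small = math.isclose(e_value, written_out, rel_tol=1e-12)
same_simple = (1e-5 == 0.00001)

# Among several E-values, the smallest one marks the most significant hit.
best = min([5e-124, 3e-71, 6e-153, 1e-115])

print(same_small, same_simple, best)
```

The exponent is compared first: -153 is further below zero than -124, so 6e-153 is the smallest of the four numbers even though its mantissa is the largest.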
Actually it is not a chance, nor a probability value, but rather the estimate of how many times (this means "counts") you would expect a result (e.g. a score in a sequence comparison) at least as extreme as the one observed occurring by chance. A value close to zero means that you would practically expect no unrelated sequence to score as high to your query sequence. Apparently, no negative e-values may be observed ...
It can be useful to note that "when E < 0.01, P-values and E-value are nearly identical", as Blast authors say in http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
E-values are p-values multiplied by the sample (or database) size. That means a low E-value isn't always good if the sample space is small. It follows that higher E-values from extra-large databases are not always bad. In simple terms, an E-value of x says that, given the sample size used in the study (say, the number of sequences blasted against), x hits are expected to reach a score as high as or higher than your current hit by chance. So an E-value of 0.001 does not mean the same thing if the database had 10 proteins vs. 10,000 proteins.
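A small sketch of the simplified relation stated in this answer (E = p × database size) may help. The function name and the per-comparison p-value are hypothetical, and real programs compute the search space in a more involved way.

```python
def e_value_from_p(p_per_comparison: float, database_size: int) -> float:
    """Simplified relation from the answer above: E = p * database size."""
    return p_per_comparison * database_size

p = 1e-4  # hypothetical p-value for a single query-vs-sequence comparison

e_small_db = e_value_from_p(p, 10)       # database of 10 proteins
e_large_db = e_value_from_p(p, 10_000)   # database of 10,000 proteins

print(e_small_db, e_large_db)
```

The same per-comparison p-value gives an E-value around 0.001 against 10 sequences but around 1 against 10,000, which is why an E-value is only meaningful together with the size of the database searched.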
The E-value estimates the expected number of records in the database that will be returned with a score as good as or better than the score of the record under scrutiny. Hence, under the assumption of a Poisson distribution (which pertains in BLAST), "when E < 0.01, P-values and E-value are nearly identical".
The E-value threshold where real results transition into random noise varies from application to application (e.g., it is different for BLAST in proteins and BLAST in DNA), and it must be determined empirically in each application.
The estimate is often generated from a theoretical probability model that produces a "random database record". In BLASTP, e.g., the theoretical probability model produces a random protein sequence by stringing together independent letters chosen from the "Robinson-Robinson" frequency distribution, which approximates the amino acid composition within the protein database.
My experience was (in the far past; I am not sure whether the statistics have been improved since) that you MAY get low E-values with unrelated DNA sequences (not proteins!). But that was 1E-10, not 1e-63. So I think you can interpret such a low e-value to mean "getting even one sequence with such similarity as this sequence shows to my query is very unlikely". As mentioned above, the probability of getting even one can be estimated to be about 1E-63.
Nearly there, Spouge: it's the expected number of records returned *by chance*, or the expected number of *false* records.
An E-value is simply the expected number of errors per query. The emphasis being on PER QUERY.
A word of caution on how to use E-values: An E-value is similar to a false discovery rate (FDR), and is an estimate of the number of errors per query (see the biological sequence analysis book (Durbin et al., 1998) for further reading). What this means is that for a given query, if you choose a cut-off E-value of e.g. 0.01, then you would expect 0.01 errors with an E-value lower than your cut-off. Using a cut-off of 0.01 you will have an error in your results one time in every 100 queries, which is a tolerable rate of error for most research investigations. The crucial part of the E-value definition is the ‘per query’ part, and this is particularly true when using SUPERFAMILY and Gene3D for whole genome analysis. The E-values change depending on the query. If you search two databases you will get two times the errors; this will produce the same number of errors as searching one database that has twice as many sequences. Therefore the E-value calculation depends on the database size, and the same sequence hit will have a different E-value if it is the result of a larger or smaller total query. The E-values in databases such as SUPERFAMILY (supfam.org) are calculated as a single sequence searched against a model library, hence if you count the number of errors in a complete genome containing 1000 sequences, it will be 1000 times the E-value cut-off you choose. The cut-off displayed by SUPERFAMILY is < 0.0001, which means there will be one error in 10 genomes of size 1000, or e.g. 5 errors in a genome that hypothetically has 50,000 sequences. Approximately half of the potentially false hits will be ignored because they conflict with a stronger, true assignment.
However, the remaining errors may stand out because they do not make biological sense, so it is crucial to understand how to interpret the E-values when data mining; if you are looking for something unusual in thousands of genomes totalling millions of sequences you will find hundreds of errors using the default cut-off in the databases, several of which will appear biologically unusual.
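The "per query" arithmetic above can be checked with a few lines of Python. This is only a sketch of the multiplication described in the text; the function name is made up for illustration.

```python
def expected_errors(e_value_cutoff: float, n_queries: int) -> float:
    """Expected number of false hits when n_queries sequences are each
    searched with the same per-query E-value cut-off."""
    return e_value_cutoff * n_queries

cutoff = 0.0001  # the cut-off quoted above for SUPERFAMILY

per_genome_1000 = expected_errors(cutoff, 1000)       # one genome of 1000 sequences
ten_genomes_1000 = expected_errors(cutoff, 10 * 1000)  # ten such genomes
per_genome_50000 = expected_errors(cutoff, 50_000)     # hypothetical large genome

print(per_genome_1000, ten_genomes_1000, per_genome_50000)
```

With millions of query sequences the same multiplication gives hundreds of expected errors, which is the data-mining caveat raised above.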
The notation is simply that of ONE very small number, but not just any small number. For the type of problem in question, the expectation can be estimated as E = K m n exp(-lambda S), where m and n are the lengths of the sequences (target and query), K and lambda are descriptors for the mathematical distribution of high-scoring segments in a large set of random sequences aligned to the target sequence, and S is the score of the query sequence. This formula (and other equivalent ones, which depend on how the score S is obtained) states exactly how frequently a score of S (or higher) would be found if one aligned a really large number of random sequences to the target. Given the over-astronomical combinatorial possibilities of biological sequences (proteins or nucleic acids), a really good alignment is rather infrequently found by chance. Given defined values for K, m, n, lambda and S, the value is not arbitrary; it is exactly defined.
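As a rough illustration of the formula E = K m n exp(-lambda S), here is a minimal Python sketch. The values of K, lambda, m and n are placeholders chosen for illustration only; real values depend on the scoring system and on the database, and are reported by the search program.

```python
import math

def karlin_altschul_e_value(raw_score: float, m: float, n: float,
                            K: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return K * m * n * math.exp(-lam * raw_score)

# Placeholder parameters (assumptions, not taken from any real search).
K, lam = 0.13, 0.32
m = 250      # length of one sequence
n = 1.0e8    # total length of the other side of the comparison

e_score_50 = karlin_altschul_e_value(50, m, n, K, lam)
e_score_100 = karlin_altschul_e_value(100, m, n, K, lam)
e_double_m = karlin_altschul_e_value(50, 2 * m, n, K, lam)

print(e_score_50, e_score_100, e_double_m)
```

The sketch shows the two behaviours the formula implies: E falls exponentially as the score rises, and it grows in direct proportion to the sizes m and n.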
Sometimes we get E-values like 5e-124, 3e-71, 6e-153, 1e-115 in a BLAST search. Which would be considered the best hit? Could anybody explain, please?
Best answer yet! So the number before the e is the multiplier, and the number after the e is the power of ten (a negative power tells you how many decimal places to shift). Thank you!
I do not mean to be rude, but the answer by Mehdi Roshdi Maleki makes little sense to me. E-values in NCBI blast searches are estimations based on bit scores, and their statistical meaning is clearly explained in an article I mentioned in my first comment to this queue (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html).
The BLAST output notation is a computational notation; 1E-63 means literally:
1 × 10^(-63)
Now, bit scores in alignments measure the coincidence of two sequences, and are obtained using a "substitution matrix" over the aligned segments (for nucleic acids the substitution matrix is an identity matrix). A summation of the comparison made position by position, measuring the relatedness of the "symbols" at each pair, produces a "relative" score.
In turn, substitution matrices are artificial sets of numbers indicating how conservative it is to replace one residue with another. "Conservative" here can have many definitions (size, chemical features, number and likelihood of mutations required to make the change in the gene CDS, and so on); the two most common are the PAM scores (based on an artificial "mutational distance" scale) and the BLOSUM scores (based on substitution frequencies as observed in biology). In the simplest case, two symbols can be considered conserved if identical (scoring 1) and non-conserved otherwise (scoring 0).
The statistical grounding of alignment quality is based on ungapped alignments, but there is good evidence showing that gapped alignments can be judged with the same tools.
Now the E-value is an estimate of how likely it is to find two random sequences aligned with a "score" as good as the one we observed for the query and the hit, given the length of the sequence and the size of the database searched.
The score itself is an artificial estimate and will depend on the sizes of the query, the database, and the true (theoretical) search space. The sizes of our sequence and database are known, but the theoretical search space can only be estimated roughly.
To make the scores more meaningful, BLAST bit scores are scaled and normalized to have standardized units, no longer dependent on query and DB sizes.
The E-value is an exponential (base 2) unscaled form of the bit score:
E = m n 2^(-S'), where S' is a normalized bit score, and n, m are the lengths of the query sequence and the database.
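The bit-score form can be sketched in Python as follows. The normalization S' = (lambda S - ln K) / ln 2 is the one given in the NCBI statistics tutorial cited in this thread; the query length, bit score, database sizes, K and lambda below are hypothetical numbers used only for illustration.

```python
import math

def bit_score(raw_score: float, K: float, lam: float) -> float:
    """Normalize a raw score S to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2^(-S') for a normalized bit score S'."""
    return m * n * 2.0 ** (-bits)

# Hypothetical search: same query, same bit score, two database sizes.
query_len = 21
bits = 45.0
small_db = 2.0e8   # residues in a smaller database (assumed)
large_db = 1.0e9   # residues in a database five times larger (assumed)

e_small = e_value_from_bits(bits, query_len, small_db)
e_large = e_value_from_bits(bits, query_len, large_db)
e_one_more_bit = e_value_from_bits(bits + 1, query_len, small_db)

# Consistency with the raw-score form E = K m n exp(-lambda S),
# using placeholder values for K and lambda.
K, lam, raw = 0.13, 0.32, 60
e_via_bits = e_value_from_bits(bit_score(raw, K, lam), query_len, small_db)
e_via_raw = K * query_len * small_db * math.exp(-lam * raw)

print(e_small, e_large, e_one_more_bit, e_via_bits, e_via_raw)
```

Each extra bit halves the E-value, and the same bit score yields an E-value proportional to the database size, which is why bit scores, not E-values, are comparable across databases.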
The probability of finding a random sequence giving a high-scoring segment pair (HSP) with a score S' equal to, or higher than, the one we are observing follows the Poisson distribution; therefore, the probability of finding at least one HSP scoring as good as (or better than) the one we observed can be estimated with:
P = 1 - e^(-E).
Therefore, for very small E-values (less than 0.01), the P-value is (in practice) the same as the E-value (for E = 0.01, the P-value would be 0.00995). For larger numbers, such as E = 0.05 (P = 0.0488), the statistics are telling you: "If we make up random sequences (of the same size as your query) and repeat your search, we are likely to find roughly 5 cases, out of 100, where such a random sequence will hit a target and give us the same score you just found with your query sequence". In other words, you may have hit the target by chance.
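The relation P = 1 - e^(-E) is easy to evaluate directly. The sketch below uses Python's `math.expm1`, which computes e^x - 1 without losing precision for tiny x; the function name is only illustrative.

```python
import math

def p_value_from_e_value(e_value: float) -> float:
    """P = 1 - exp(-E): probability of at least one chance hit
    when the number of chance hits is Poisson with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

for e in (1e-63, 0.01, 0.05, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For E = 0.01 this gives P ≈ 0.00995 and for E = 0.05 it gives P ≈ 0.0488, so the two quantities are nearly identical for small E. As E grows past 1 the P-value saturates just below 1 while E keeps increasing.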
THEREFORE, E-VALUES:
A) Are estimates, not exact values, because the precise size of the "true" search space is not known.
B) Do not give you probabilities directly, but are closely related to the "null hypothesis" error probability usually found in many classic statistical tests.
C) These probabilities DO DEPEND on the query sequence length and the target database size.
D) So the size of the database and the sequence length may prevent you from using E-values to compare the results of two searches. But you may use the bit score instead.
For instance, if you make your search for a polypeptide sequence in the nr-database and then you repeat the search in the uniprot-sprot database, and find the same hit, your E-value should be considerably smaller (~5 times) in the second case. Do compare these two results for the same sequence (KRNKALKKIRKLQKRGLIQMT) finding the same top hit.
E-values are different, but the bit score is standardized and normalized and can be compared (it is identical for these two identical alignments).
Remember that science requires formality, because results must be repeatable, not just by me, but by others too. A formal and unambiguous definition of every parameter you employ is thus required, so others are provided with reliable tools to repeat and extend your observations and/or experiments. Formal definitions do imply a formal interpretation of values.
The answer from Mehdi Roshdi Maleki is entirely wrong, shockingly so.
This thread is a very good example of why we need moderators on ResearchGate: that kind of answer should be deleted, and the contributor warned for breaking community standards by giving such misinformation.
In the absence of an omniscient RG moderator, we will all learn from our mistakes when they are pointed out to us. And working through those mistakes is a helpful exercise for everyone.
I agree with Alastair. Unfortunately, moderators would probably spend a lot of their time moderating the lists, with little, if any, reward for their effort.
In this very same site a discussion about peer reviewing in journals reached one unavoidable conclusion: "with all of its limitations, having peer reviewing is better than going without it". Accordingly, participation in peer reviewing is recognized by the community (i.e. a reward). What would be the reward for list moderators on this site?
The number of different alignments with scores equivalent to or better than the query that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
Only the second statement you mention is correct. For details please refer to https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html under the section "P-values".
You can easily see why the second statement holds if you plot the function y = 1 - exp(-x) ... (see here https://drive.google.com/file/d/1DQFC7DVDgLvMi_XT6qaUsRpdJjUghgqk/view?usp=sharing)
There you can get a good idea of how this statistic is produced.
You may also read the contribution I made to this thread 2 years ago.
Summarizing:
Bit scores in alignments measure the coincidence of two sequences by the use of a "substitution matrix". A summation of the comparison made position by position, measuring the relatedness of the "symbols" at each pair, produces a "relative" score. The score is completed by subtracting gap penalties. Obviously, this score depends on the length of the alignment.
Then BLAST bit scores are scaled and normalized to have standardized units, no longer dependent on query and DB sizes. The E-value is an exponential (base 2) unscaled form of the bit score.
The lower the E-value, the less likely that the similarity you are detecting is just an accidental coincidence.
E-values are similar to the error probabilities we use in classical statistical tests, except that probabilities go from 1 to zero, while E-values go from +infinity to zero.
E-values equal to 1 or above indicate the lack of statistical significance.
E-value is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. As Rogelio Rodríguez-Sotres mentioned, the lower the E-value, or the closer it is to zero, the more "significant" the match is.
Irina St Louis, yes. The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E-values. This is because the calculation of the E-value takes into account the length of the query sequence. These high E-values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.
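The effect of query length can be sketched with the bit-score form E = m n 2^(-S') quoted earlier in this thread. The sketch assumes, purely for illustration, that a perfect match earns about 2 bits per aligned position and that the database holds 1e8 residues; both numbers are hypothetical, as is the function name.

```python
def best_case_e_value(query_length: int, db_residues: float,
                      bits_per_residue: float = 2.0) -> float:
    """E-value of a perfect full-length match under E = m * n * 2^(-S'),
    assuming a fixed number of bits earned per identical position."""
    max_bits = bits_per_residue * query_length
    return query_length * db_residues * 2.0 ** (-max_bits)

db_residues = 1.0e8  # hypothetical database size

e_short = best_case_e_value(10, db_residues)   # 10-residue query
e_longer = best_case_e_value(30, db_residues)  # 30-residue query

print(e_short, e_longer)
```

Under these assumptions even a perfect 10-residue match has an E-value far above 1, while a perfect 30-residue match is already highly significant: the maximum attainable score is capped by the query length.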