What does the E-value exactly mean and what does (1e-63) represent?

06 June 2012

As I have read, it is an expected value, and the lower the value, the less likely it is that the alignment arose by chance. But I could not understand what the notation (for example, 1e-63) actually denotes.

Victor Flores

Not exactly. Indeed, the math behind the calculation of e-values is defined by each program you use (BLAST, HMMER, etc.). But in all cases it can be interpreted as the probability of observing such results by chance. For instance, if you compare two identical sequences using BLAST, it will return a value near 0, because having two identical sequences (without knowing they are the same) is quite improbable. It should be clear that e-values ARE NOT probabilities because probabilities only range from 0 to 1, and e-values can take values beyond 0. It depends on what you are trying to find. But usually, a lower e-value indicates better quality in the search/alignment/comparison. It is preferred over the score value because the e-value is less sensitive to sequence length. For instance (again) in BLAST, if you compare two identical sequences, each with fewer than 100 residues, you'll never have a score beyond 200. In the opposite case, if you compare two large sequences (even when they are not very similar) you can easily get a score over 200. In such cases you should always check the e-value because it will give you more information on the quality of your alignment/search.

Hope it was helpful.

Best regards

Tomáš Hluska

It's the chance that, if you took two random sequences, you would get an alignment this good. 1e-63 is the standard way to write very small numbers. It's the same as 1×10^(-63). For example, 1×10^(-5) = 1e-5 = 0.00001.
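The notation can be checked directly in most programming languages. The following is only a minimal Python sketch added for illustration (the variable names are made up); it shows that the "e" form is read as a power of ten:

```python
import math

# "1e-63" means 1 x 10^(-63); Python parses this e-notation directly.
e_value = float("1e-63")

# Compare with an explicit power of ten (isclose tolerates floating-point rounding).
same_as_power_of_ten = math.isclose(e_value, 1 * 10.0 ** -63)

# The smaller example from the answer above: 1e-5 = 0.00001.
small_example_matches = float("1e-5") == 0.00001

print(e_value, same_as_power_of_ten, small_example_matches)
```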

Tomáš Hluska

I'd like to see negative e-value.

Nutan Chauhan

@Victor Flores very well explained

Shameer Pillarisetti

It tells you about the expectation value if you take one residue to match the whole sequence, and there are 63 identical residues in the subject sequence.

Among the E-values 1e-63, 2e-63, 4e-63, 8e-63 and 10e-63, 10e-63 indicates the most similar sequence.

Prem Prakash

Thank you so much to all of you...:)

Vasilis J Promponas

Actually it is not a chance, nor a probability value, but rather the estimate of how many times (this means "counts") you would expect a result (e.g. a score in a sequence comparison) at least as extreme as the one observed occurring by chance. A value close to zero means that you would practically expect no unrelated sequence to score as high to your query sequence. Apparently, no negative e-values may be observed ...

Andres Aravena

It can be useful to note that "when E < 0.01, P-values and E-value are nearly identical", as the BLAST authors say in http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Amit Kumar Yadav

E-values are p-values multiplied by the sample (or database) size. That means a low E-value isn't always good if the sample space is small. It follows that higher E-values from extra-large databases are not always bad. In simple terms: an E-value of x says that, given the sample size used in the study (say, the number of sequences blasted against), x hits can be expected to reach a score as high as or higher than your current hit. So an E-value of 0.001 does not mean the same thing if the database has 10 proteins as it does if the database has 10000 proteins.

John L Spouge

The E-value estimates the expected number of records in the database that will be returned with a score as good as or better than the score of the record under scrutiny. Hence, under the assumption of a Poisson distribution (which pertains in BLAST), "when E < 0.01, P-values and E-value are nearly identical".

The E-value threshold where real results transition into random noise varies from application to application (e.g., it is different for BLAST in proteins and BLAST in DNA), and it must be determined empirically in each application.

The estimate is often generated from a theoretical probability model that produces a "random database record". In BLASTP, e.g., the theoretical probability model produces a random protein sequence by stringing together independent letters chosen from the "Robinson-Robinson" frequency distribution, which approximates the amino acid composition within the protein database.

Rogelio Rodríguez-Sotres

The best account I have read, that is freely available is at the NCBI site:

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Best wishes

Rogelio

Eitan Rubin

My experience was (in the distant past; I am not sure whether the statistics have been improved since) that you MAY get low E-values with unrelated DNA sequences (not proteins!). But that was 1E-10, not 1e-63. So I think you can interpret such a low e-value to mean "getting even one sequence with such similarity as this sequence shows to my query is very unlikely". As mentioned above, the probability of getting even one can be estimated to be about 1E-63.

Julian Gough

Wrong: Flores, Hluska, Chauhan, Yadav, Pillarisetti

Correct: Promponas

Nearly there: Spouge (it's expected number of records returned *by chance*, or expected number of *false* records)

An E-value is simply the expected number of errors per query. The emphasis being on PER QUERY.

A word of caution on how to use E-values: an E-value is similar to a false discovery rate (FDR), and is an estimate of the number of errors per query (see the biological sequence analysis book (Durbin et al., 1998) for further reading). What this means is that for a given query, if you choose a cut-off E-value of e.g. 0.01, then you would expect 0.01 errors with an E-value lower than your cut-off. Using a cut-off of 0.01 you will have an error in your results one time in every 100 queries, which is a tolerable rate of error for most research investigations.

The crucial part of the E-value definition is the 'per query' part, and this is particularly true when using SUPERFAMILY and Gene3D for whole-genome analysis. The E-values change depending on the query. If you search two databases you will get two times the errors; this will produce the same number of errors as searching one database that has twice as many sequences. Therefore the E-value calculation depends on the database size, and the same sequence hit will have a different E-value if it is the result of a larger or smaller total query.

The E-values in databases such as SUPERFAMILY (supfam.org) are calculated as a single sequence searched against a model library; hence if you count the number of errors in a complete genome containing 1000 sequences, it will be 1000 times the E-value cut-off you choose. The cut-off displayed by SUPERFAMILY is < 0.0001, which means there will be one error in 10 genomes of size 1000, or e.g. 5 errors in a genome that hypothetically has 50,000 sequences. Approximately half of the potentially false hits will be ignored because they conflict with a stronger, true assignment. However, the remaining errors may stand out because they do not make biological sense, so it is crucial to understand how to interpret the E-values when data mining; if you are looking for something unusual in thousands of genomes totalling millions of sequences you will find hundreds of errors using the default cut-off in the databases, several of which will appear biologically unusual.
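The "per query" arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function name `expected_errors` is hypothetical, and it simply multiplies the cut-off by the number of independent queries, as described in the text.

```python
def expected_errors(e_value_cutoff: float, n_queries: int) -> float:
    """Expected number of false hits when n_queries sequences are each
    searched separately and every hit below e_value_cutoff is accepted."""
    return e_value_cutoff * n_queries

# Cut-off 0.01 over 100 queries: about one error.
per_100_queries = expected_errors(0.01, 100)

# Cut-off 0.0001 over a genome of 1000 sequences: about 0.1 errors,
# i.e. roughly one error per 10 such genomes.
per_genome_1000 = expected_errors(0.0001, 1000)

# Cut-off 0.0001 over a hypothetical genome of 50,000 sequences: about 5 errors.
per_genome_50000 = expected_errors(0.0001, 50_000)

print(per_100_queries, per_genome_1000, per_genome_50000)
```

The sketch ignores the remark that roughly half of the potential false hits are discarded because they conflict with a stronger assignment.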

Julian Gough

1e-63 is scientific notation for a small number, Hluska explains (that part) well.

Aras Rasul

Why does a smaller E-value mean a better score?

Raphael B Stricker

I think the question here is simply about notation. Why is it 1e-63 and not 4e-68 or 8e-12?

Keyvan Sobhani

I don't know.

Raphael B Stricker

Once again, the question that nobody has answered is simply about notation. Why is it 1e-63 and not 2e-23 or 8e-12?

Rogelio Rodríguez-Sotres

Answering to Raphael's comment...

The notation is simply that of ONE very small number, but not just any small number. For the type of problem in question, the expectation can be estimated as E = K m n exp(-lambda S), where m and n are the lengths of the sequences (target and query), K and lambda are parameters describing the mathematical distribution of high-scoring segments in a large set of random sequences aligned to the target sequence, and S is the score of the query sequence. This formula (and other equivalent ones, which depend on how the score S is obtained) tells you how frequently a score of S (or higher) would be found if you aligned a really large number of random sequences to the target. Given the over-astronomical combinatorial possibilities of biological sequences (proteins or nucleic acids), a really good alignment is rather infrequently found by chance. Given defined values for K, m, n, lambda and S, the value is not arbitrary; it is exactly defined.
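To make the formula concrete, here is a minimal Python sketch of E = K m n exp(-lambda S). The function name and all numeric values (K, lambda, lengths, scores) are illustrative placeholders chosen for this example, not parameters taken from any particular BLAST run.

```python
import math

def karlin_altschul_evalue(K: float, lam: float, m: int, n: int, S: float) -> float:
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return K * m * n * math.exp(-lam * S)

# Placeholder parameters (assumptions for illustration only).
K, lam = 0.13, 0.32
m, n = 250, 50_000_000   # sequence lengths, as in the formula above

e_modest_score = karlin_altschul_evalue(K, lam, m, n, S=100)
e_high_score = karlin_altschul_evalue(K, lam, m, n, S=200)
e_double_n = karlin_altschul_evalue(K, lam, m, 2 * n, S=100)

# E falls exponentially as the score rises, and grows linearly with m * n.
print(e_modest_score, e_high_score, e_double_n)
```

With fixed K, lambda, m and n, each score maps to exactly one E-value, which is why the reported number is not arbitrary.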

for more detail please go to the document:

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head2

which I quoted in my previous comment.

Deepak Jadhav

1*10^-63

Rajiv Kumar

Sometimes we get E-values like 5e-124, 3e-71, 6e-153 and 1e-115 in a BLAST search. Which would be considered the best hit? Could anybody please explain?
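Going by the statements elsewhere in this thread that a lower E-value means a more significant hit, the four values can be ranked by converting the strings to numbers. This is a small Python sketch for illustration; it assumes all four hits come from the same search, since E-values from different databases are not directly comparable.

```python
# E-values as they appear in the BLAST output, kept as text.
reported = ["5e-124", "3e-71", "6e-153", "1e-115"]

# Sort numerically, smallest (most significant) first.
ranked = sorted(reported, key=float)
best = ranked[0]

print(ranked)
print("most significant:", best)
```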

Carlos Lara-Romero

It is a good idea to take a look at the NCBI help page. There are some good definitions of E-values and other statistics such as identity or query cover.

http://www.ncbi.nlm.nih.gov/books/NBK62051/

Best,

Carlos

Nese Akis

Now it is clearer. Thanks.

Raphael B Stricker

Best answer yet! So the number before e is the number of decimal places in front of the negative number after e, or the multiplier of the positive number after e. Thank you!

Rogelio Rodríguez-Sotres

I do not mean to be rude, but the answer by Mehdi Roshdi Maleki makes little sense to me. E-values in NCBI blast searches are estimations based on bit scores, and their statistical meaning is clearly explained in an article I mentioned in my first comment to this queue (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html).

The BLAST output notation is a computational notation; 1E-63 means literally:

1 × 10^(-63)

Now, bit scores in alignments measure the coincidence of two sequences, and are obtained using a "substitution matrix" over those segments (for nucleic acids the substitution matrix is an identity matrix). A summation of the comparison made position by position, measuring the relatedness of the "symbols" at each pair, produces a "relative" score.

In turn, substitution matrices are artificial sets of numbers indicating how conservative it is to replace one residue by another. "Conservative" here can have many definitions (size, chemical features, number and likelihood of mutations required to make the change in the gene CDS, and so on); the two most common are the PAM scores (based on an artificial "mutational distance" scale) and BLOSUM (based on substitution frequencies as observed in biology). In the simplest case, two symbols can be considered conserved if identical (scoring 1) and non-conserved otherwise (scoring 0).

The statistical grounding of alignment quality is based on ungapped alignments, but there is good evidence showing that gapped alignments can be judged with the same tools.

Now the E-value is an estimate of how likely it is to find two random sequences aligned with a score as good as the one we observed for the query and the hit, given the length of the sequence and the size of the database searched.

The score itself is an artificial estimate and will depend on the sizes of the query and the database, and on the true (theoretical) search space. The sizes of our sequence and database are known, but the theoretical search space can only be estimated roughly.

To make the scores more meaningful, BLAST bit scores are scaled and normalized to have standardized units, no longer dependent on query and DB sizes.

The E-value is an exponential (base 2) unscaled form of the bit score:

E = m n 2^(-S'), where S' is a normalized bit score, and m and n are the lengths of the query sequence and the database, respectively.
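A minimal Python sketch of this relation follows. The function name and the lengths are hypothetical, and the sketch ignores any length corrections that a real BLAST implementation may apply; it only shows how E responds to the bit score and to the database size.

```python
import math

def evalue_from_bit_score(m: int, n: int, bit_score: float) -> float:
    """E = m * n * 2^(-S') for a normalized bit score S'."""
    return m * n * 2.0 ** (-bit_score)

# Hypothetical query length, database length and bit score.
m, n, bits = 300, 100_000_000, 69.8

e_base = evalue_from_bit_score(m, n, bits)
e_one_more_bit = evalue_from_bit_score(m, n, bits + 1)   # one extra bit halves E
e_double_db = evalue_from_bit_score(m, 2 * n, bits)      # doubling the database doubles E

print(e_base, e_one_more_bit, e_double_db)
```

The same bit score therefore yields different E-values in databases of different size.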

The probability of finding a random sequence giving a high-scoring segment pair (HSP) with a score S' equal to, or higher than, the one we are observing follows the Poisson distribution; therefore, the probability of finding at least one HSP scoring as well as (or better than) the one we observed can be estimated with:

P = 1 - e^(-E).

Therefore, for very small E-values (less than 0.01), the P-value is (in practice) the same as the E-value (for E = 0.01, the P-value would be 0.00995). For numbers bigger than 0.05 (P = 0.0488), the statistics are telling you "If we make up random sequences (of the same size as your query) and repeat your search, we are likely to find roughly 5 cases, out of 100, where such a random sequence will hit a target and give us the same score you just found with your query sequence". In other words, you may have hit the target by chance.
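The relation P = 1 - e^(-E), taken with the natural exponential, is easy to evaluate numerically. The following is a small Python sketch for illustration (the function name `p_from_e` is made up); it uses `math.expm1` so that the result stays accurate for extremely small E-values such as 1e-63.

```python
import math

def p_from_e(e_value: float) -> float:
    """P = 1 - exp(-E): probability of at least one chance hit, given E expected hits."""
    # -expm1(-E) equals 1 - exp(-E) but does not lose precision when E is tiny.
    return -math.expm1(-e_value)

for e in (1e-63, 0.01, 0.05, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_from_e(e):.6g}")
```

For small E the two numbers agree closely (E = 0.01 gives P of about 0.00995), while for large E the P-value levels off below 1 even though E keeps growing.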

THEREFORE, E-VALUES:

A) Are estimates, not exact values, for the precise size of the "true" search space is not known.

B) These do not give you probabilities directly, but are closely related to the "null hypothesis" error probability usually found in many classic statistical tests.

C) These probabilities DO DEPEND on the query sequence length and the target database size.

D) So the size of the database and the sequence length may prevent you from using E-values to compare the results of two searches. But you may use the bit score instead.

For instance, if you make your search for a polypeptide sequence in the nr-database and then you repeat the search in the uniprot-sprot database, and find the same hit, your E-value should be considerably smaller (~5 times) in the second case. Do compare these two results for the same sequence (KRNKALKKIRKLQKRGLIQMT) finding the same top hit.

Score: 69.8 bits, E-value: 2e-14 (DB: nr)

Score: 69.8 bits, E-value: 4e-15 (DB: uniprot-sprot)

The E-values are different, but the bit score is standardized and normalized and can be compared (it is identical for these two identical alignments).

Remember that science requires formality, because results must be repeatable, not just by me, but by others too. A formal and unambiguous definition of every parameter you employ is thus required, so others are provided with reliable tools to repeat and extend your observations and/or experiments. Formal definitions do imply a formal interpretation of values.

best wishes,

Rogelio

Alastair R Tanner

The answer from Mehdi Roshdi Maleki is entirely wrong, shockingly so.

This thread is a very good example of why we need moderators on ResearchGate: that kind of answer should be deleted, and the contributor warned for breaking community standards by giving such misinformation.

Raphael B Stricker

In the absence of an omniscient RG moderator, we will all learn from our mistakes when they are pointed out to us. And working through those mistakes is a helpful exercise for everyone.

Rogelio Rodríguez-Sotres

I agree with Alastair. Unfortunately, moderators would probably spend a lot of their time moderating the lists, with little, if any, reward for their effort.

On this very same site, a discussion about peer review in journals reached one unavoidable conclusion: "with all of its limitations, having peer reviewing is better than going without it". Accordingly, participation in peer review is recognized by the community (i.e. a reward). What would be the reward for list moderators on this site?

Rogelio Rodríguez-Sotres

Raphael has a point

Raphael B Stricker

Thank you, Rogelio. Perhaps we should move on to a discussion of open peer review.

Lutimba Stuart

The number of different alignments with scores equivalent to or better than the query that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

Pascal Bochet

How can E-values at the same time:

- be the product of the p-value and the size of the database (as stated by Amit Kumar Yadav) and therefore have a constant ratio to it?

- be related to it by P = 1 - exp(-E) and therefore be nearly identical to it if E is small enough (as stated by Rogelio Rodríguez-Sotres)?

Vasilis J Promponas

Hi Pascal, I just saw your question.

Only the second statement you mention is correct. For details please refer to https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html under the section "P-values".

You can easily see why the second statement holds if you plot the function y = 1 - exp(-x) ... (see here https://drive.google.com/file/d/1DQFC7DVDgLvMi_XT6qaUsRpdJjUghgqk/view?usp=sharing)

I hope this helps.

Best,

Vasilis

Shan Vinoth

Hi all, I want to know whether E-values are based on the similarity of the sequences or on the length of the sequences.

Rogelio Rodríguez-Sotres

The grounds of the e-value calculation are complex. You can read more about it at:

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

(cited in a previous comment)

There you can get a good idea of how these statistics are produced.

You may also read the contribution I made to this thread 2 years ago.

Summarizing:

Bit scores in alignments measure the coincidence of two sequences by the use of a "substitution matrix". A summation of the comparison made position by position, measuring the relatedness of the "symbols" at each pair, produces a "relative" score. The score is completed by subtracting gap penalties. Obviously, this score depends on the length of the alignment.

Then BLAST bit scores are scaled and normalized to have standardized units, no longer dependent on query and DB sizes. The E-value is an exponential (base 2) unscaled form of the bit score.

Best wishes,

Rogelio

Irina St Louis

So, the lowest E value means the highest similarity between sequences?

Raphael B Stricker

Yes. See link in previous response from Dr. Rodriguez-Sotres.

Rogelio Rodríguez-Sotres

The lower the E-value, the less likely that the similarity you are detecting is just an accidental coincidence.

E-values are similar to the error probabilities we use in classical statistical tests, except that probabilities range from 1 to zero, while E-values range from +infinity to zero.

E-values equal to 1 or above indicate the lack of statistical significance.

Audrey Vanya

E-value is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. As Rogelio Rodríguez-Sotres mentioned, the lower the E-value, or the closer it is to zero, the more "significant" the match is.

Audrey Vanya

Irina St Louis , yes. The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.

Shan Vinoth

Thank you Audrey

Subham Panda

I want to thank all of you brilliant people out there. These answers have made my life so easy and prosperous.

It feels so great to be on this planet.
