Since most of the answers cover the homology aspect of your question, I want to add some notes on identity and similarity, as those terms are very often used interchangeably.
Sequence identity is the number of characters that match exactly between two different sequences. Here, gaps are not counted and the measure is relative to the shorter of the two sequences. As a consequence, sequence identity is not transitive, i.e. if identity(A,B)=100% and identity(B,C)=100%, then A is not necessarily identical to C (in terms of the identity measure):
A: AAGGCTT
B: AAGGC
C: AAGGCAT
Here identity(A,B)=100% (5 identical nucleotides / min(length(A),length(B))).
Identity(B,C)=100%, but identity(A,C)=86% (6 identical nucleotides / 7). So 100% identity does not mean two sequences are the same.
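The three identity values above can be reproduced with a minimal Python sketch. The function name `identity_shorter` is hypothetical, and the sketch assumes the sequences are compared position by position from their first character, with no internal gaps, which is how the A/B/C example lines up:

```python
def identity_shorter(a: str, b: str) -> float:
    """Fraction of matching positions, normalised by the shorter sequence.

    Assumption: both sequences start at the same position and contain no
    internal gaps, so a plain position-by-position comparison is enough.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        raise ValueError("both sequences must be non-empty")
    # zip() stops at the end of the shorter sequence, so overhanging
    # characters of the longer one are simply not counted.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


A, B, C = "AAGGCTT", "AAGGC", "AAGGCAT"
print(round(identity_shorter(A, B) * 100))  # 100
print(round(identity_shorter(B, C) * 100))  # 100
print(round(identity_shorter(A, C) * 100))  # 86
```

A and B are both "100% identical" to each other and B to C, yet A and C differ, which is the non-transitivity described above.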
Sequence similarity is first of all a general description of a relationship, but it is nevertheless more or less common practice to define similarity as an optimal matching problem (for sequence alignments, unless defined otherwise). Here, the optimal matching algorithm finds the minimal number of edit operations (insertions, deletions, and substitutions) needed to transform one sequence into an exact copy of the other sequence being aligned (the edit distance). Using this, the percentage sequence similarities of the examples above are sim(A,B)=60%, sim(B,C)=60%, sim(A,C)=86% (semi-global, sim=1-(edit distance/unaligned length of the shorter sequence)). But there are other ways to define similarity between two objects (e.g. using the tertiary structure of proteins).
And then you might start to infer homology from similarity, but this has already been covered sufficiently.
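As a rough illustration of the edit-distance definition, here is a short Python sketch. It assumes a plain Levenshtein distance with unit cost for every insertion, deletion, and substitution; the function names are hypothetical, and real alignment tools use more elaborate scoring:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """sim = 1 - (edit distance / length of the shorter sequence)."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))


A, B, C = "AAGGCTT", "AAGGC", "AAGGCAT"
print(round(similarity(A, B) * 100))  # 60
print(round(similarity(B, C) * 100))  # 60
print(round(similarity(A, C) * 100))  # 86
```

Note that with this normalisation the value can drop below zero when the sequences differ greatly in length, so it is only meaningful for sequences of comparable size.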
Homology is an evolutionary concept; it has to do with common ancestry. In DNA or protein sequences it is frequently inferred from the identity/similarity of sequences among related organisms (but as far as I know, although you can express similarity/identity in %, you can't report a % of homology).
I'm sure this text will help you clarify the terms:
DNA makes RNA makes protein; protein sequence determines structure (secondary and tertiary), and structure determines function. Homologous proteins are homologous based on their function; proteins that are distantly homologous have substantial differences in their sequences (similarity and identity). The concept of homology modelling in protein modelling depends on sequence similarity and identity. Any protein template (PDB structure) has to have more than 60% similarity/identity, otherwise it is difficult to build a homology model. Hope this makes the concept clear.
I am sorry, Nilanjan Roy and Snijesh Vp, but I have to disagree with you both. Homology is not necessarily related to function. Take structures such as the human hand, the bat wing, and the whale flipper: they all have different functions, but they have similar structures and a common ancestor, so they are considered homologous.
The same applies to proteins. Some proteins, especially those coming from paralogous genes (homologs within the same genome), are copies of a gene that can eventually diverge in function through evolution (or lose function): http://www.nature.com/nrm/journal/v13/n8/full/nrm3392.html
However, even with a different biochemical function, they are considered homologs based on their common origin (inferred from sequence similarity/identity).
Homology is often manifested by significant similarity in nucleotide or amino acid sequence and almost always manifested in three-dimensional structure.
Hi Manish, This is a hot issue in evolution and a key assumption when dealing with ancestral relationships. In principle, homology is a qualitative (yes or no) statement about a given trait in the context of common, shared ancestry. To indulge in a metaphorical example: one is or isn't the biological son of his father. Likewise, in a broader sense, there either is or isn't an ancestral relationship (homology) among traits. Therefore it is meaningless and wrong to state "percent homology" (one is not 23% son of his mother). Moreover, as Gustavo said above, homology is an inferred contrastive attribute, in the sense that homology is the outcome of a comparison of traits, not a given. Since ancestrally unrelated traits may converge and become similar in form and function over evolutionary time, establishing homology based on the amount of similarity is always thorny and requires criteria and caution. The problem comes from making a qualitative statement (homology) based on quantitative extant information (similarity). A high level of similarity among traits, which may even share identical features, is usually taken as indicative of homology. In the case of proteins, sequence similarity above random expectation has been used to support homology. Nevertheless, as Nilanjan said, shared structural features may last longer than sequence similarity, but care must be taken here as well, since there are instances of structural convergence among non-homologous proteins. Identity should be self-explanatory, I would guess.
Homology is a simple concept that turns complicated in practice. Homology exists, i.e. objects are homologous, when they share an evolutionary ancestor and have evolved divergently from it. Thus, it's a binary concept - no % homology please! (PMID: 20696735)
As others say, protein structure is generally more conserved than protein sequence, e.g. homologues that are 30% identical or less may be structurally very similar.
Inferring homology is usually done by comparing sequences or structures. % sequence identity, % sequence similarity, or other metrics will result and need to be interpreted. Reported values may depend somewhat on the algorithm chosen (PMID: 15130466).
Thresholds for reliable inference of homology are tricky and, ultimately, it may not be possible to say definitively whether two similar structures with little sequence relationship, say, are distant homologs or the result of convergent evolution.
I just saw your question while searching for my project.
I found this link; it is very helpful for understanding the difference between the concepts of similarity and homology: http://genetics.wustl.edu/bio5488/files/2013/02/130204_TW_Homology_I.pdf . The post is more than a year old, but this will definitely help those looking for the answer.
As I see it, the percentage of sequence similarity and identity changes with the strain or lineage of the organism, and a homologous strain then becomes a heterologous strain...
Identity is having the same base or amino acid (exact match; no substitution/mutation) at an equivalent position obtained in an optimal alignment. It can be quantified and normalised. Identity is reported in percentage.
Similarity is also expressed in % and is computed by counting all identical positions and favourable substitutions, for example a K to R replacement (both are basic amino acids) or a purine-purine substitution. Most similarity scoring matrices give higher scores for such favourable replacements, where a property is conserved even though the amino acid/base has changed (non-identity). However, some substitutions may bring about a functional change and are hence not considered "similar", for example, replacement of amino acid K (basic) with D (acidic).
Both identity and similarity are used to deduce homology. Homology, however, has a specific definition: having a common evolutionary ancestor. Therefore, sequences are either homologous or not. Homology is a qualitative description of the relationship and should never be expressed as % homology. Homologs share more than 35% identity and similarity (except remote homologs). In the case of remote homologs, one needs additional experimental evidence to prove evolutionary relationships.
Homologs are of two types: orthologs (the same gene in different species) and paralogs (homologs within a species). Orthologs carry out the same function in different species. Homologs are similar in sequence and share structural and functional similarity. However, similarity alone doesn't necessarily indicate homology.
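To make the difference between the two percentages concrete, here is a small Python sketch that counts identical and "similar" columns in an already aligned pair of protein sequences. The property groups in `GROUPS` are a deliberately simplified assumption for illustration, not a real scoring matrix, and the function name and the example peptides are made up:

```python
# Hypothetical, simplified property groups (NOT a real substitution
# matrix): residues in the same group are treated as "similar".
GROUPS = ["KRH",    # basic
          "DE",     # acidic
          "STNQ",   # polar, uncharged
          "AVLIM",  # aliphatic / hydrophobic
          "FYW"]    # aromatic
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(aln_a: str, aln_b: str, gap: str = "-"):
    """Return (% identity, % similarity) over the gap-free columns
    of two aligned sequences of equal length."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = compared = 0
    for x, y in zip(aln_a, aln_b):
        if x == gap or y == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y):
            similar += 1  # favourable substitution, e.g. K <-> R
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100 * identical / compared, 100 * similar / compared


# K->R is counted as similar, D->K (acidic to basic) is not.
print(identity_and_similarity("MKTAD", "MRTAK"))  # (60.0, 80.0)
```

Real tools derive similarity from a substitution matrix and also differ in how they treat gaps and which length they normalise by, so their reported percentages will not match this sketch exactly.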