I want to eliminate redundant and incomplete sequences from the PSI-BLAST results for a protein before running a multiple sequence alignment and phylogenetic analysis. Kindly suggest how to do it.
I think you want to cluster your sequences and keep only one representative sequence per cluster. You will need to decide on an identity threshold for your case in advance. Tools such as CD-HIT or USEARCH can do this. For example, with CD-HIT:
cd-hit -i input.fasta -o input -c 0.9
The output file (here named input) will then contain the representative sequences at 90% identity.
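Clustering takes care of redundancy but not of incomplete sequences. One possible way to handle those beforehand is a simple length filter, sketched here in plain Python. The function names and the 0.8 cutoff (keep sequences at least 80% of the median length) are assumptions for illustration, not part of CD-HIT, and the right cutoff depends on your protein family.

```python
from statistics import median


def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_short(records, min_fraction=0.8):
    """Keep records whose length is at least min_fraction of the median length.

    min_fraction=0.8 is an arbitrary illustrative cutoff, not a recommendation.
    """
    if not records:
        return []
    cutoff = min_fraction * median(len(seq) for _, seq in records)
    return [(h, s) for h, s in records if len(s) >= cutoff]


# Toy input: "c" is much shorter than the other two sequences.
example = """>a
MKTAYIAKQR
>b
MKTAYIAKQR
>c
MKT
""".splitlines()

kept = drop_short(read_fasta(example))
print([header for header, _ in kept])
```

For a real file you would pass an open file handle to read_fasta instead of the toy lines, write the kept records back out as FASTA, and then run the clustering step on that filtered file.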
I can tell you how I personally handle this, but whether it works for you will depend on the number of sequences you have and their overall quality.
I do it manually in MEGA. I first align the sequences with MUSCLE using a large gap opening penalty (around -5 to -10); this makes shorter, incomplete, or too-distant sequences easy to spot. Remove them with Ctrl+X.
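If there are too many sequences to inspect by eye, the same idea can be roughly approximated in a script: after aligning, flag sequences that are mostly gaps. This is only a sketch in plain Python. The function names, the toy alignment, and the 0.3 gap-fraction threshold are all assumptions for illustration, and none of this is something MEGA does for you.

```python
def gap_fraction(aligned_seq, gap_chars="-."):
    """Return the fraction of alignment columns that are gaps in this sequence."""
    if not aligned_seq:
        return 0.0
    return sum(c in gap_chars for c in aligned_seq) / len(aligned_seq)


def flag_gappy(alignment, max_gap_fraction=0.3):
    """Return names of sequences whose gap fraction exceeds max_gap_fraction.

    max_gap_fraction=0.3 is an arbitrary illustrative threshold.
    """
    return [
        name
        for name, seq in alignment.items()
        if gap_fraction(seq) > max_gap_fraction
    ]


# Toy alignment (name -> aligned sequence, all the same length).
alignment = {
    "full1": "MKTAYIAKQR",
    "full2": "MKTAYLAKQR",
    "fragment": "----YIAK--",
}

print(flag_gappy(alignment))
```

Flagged sequences are candidates for removal, not automatic rejects: a short natural isoform would be flagged in the same way as a truncated entry.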
To remove redundant proteins, I build an NJ tree with p-distance as the model (chosen only because it gives a tree quickly). If two sequences are identical, they will branch together with no distance (no horizontal branch length) between them.
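The zero-distance check can also be done without drawing a tree. The p-distance is just the proportion of compared sites at which two aligned sequences differ, so identical pairs are those with a p-distance of 0. Below is a minimal sketch in plain Python; the function names and toy alignment are made up for illustration, and ignoring columns where either sequence has a gap is an assumption, so the values may not match MEGA's under its other gap-handling settings.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption
    of this sketch). Returns NaN if no columns can be compared.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")


def identical_pairs(alignment):
    """Return pairs of sequence names whose p-distance is exactly 0."""
    return [
        (n1, n2)
        for (n1, s1), (n2, s2) in combinations(alignment.items(), 2)
        if p_distance(s1, s2) == 0.0
    ]


# Toy alignment: s1 and s2 are identical, s3 differs at one site.
alignment = {
    "s1": "MKTAYIAKQR",
    "s2": "MKTAYIAKQR",
    "s3": "MKTAYLAKQR",
}

print(identical_pairs(alignment))
```

Because gapped columns are skipped, a fragment that is identical to a full-length sequence over the region they share will also come out with distance 0, so it is worth dealing with incomplete sequences before relying on this check.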
I usually try to use UniProtKB for this kind of work. It also has a BLAST option, with the difference that you can see whether each protein is reviewed or not. It may also be easier there to get information such as provenance (mRNA or protein) and then decide whether a sequence is a truncated, incomplete form or just a naturally short isoform. It all depends on your sample size and diversity...