I have downloaded 1000 protein sequences in FASTA format, and they have very long names. I want to shorten these names quickly. Is this possible from the command line or with some software, or will I have to do it manually, one by one?
The Sequence Name Annotation-based Designer (SNAD) at:
http://veb.lumc.nl/SNAD/index.cgi
is very good for this. It can take an alignment, a tree (Newick format for example), or a list of names, and convert the names based on GenBank or UniProt database annotation.
I am attaching an output for your data, where I asked for genus and species information for each sequence. Given that they are all from H. pylori, this was not a good choice.
The LANL HIV Databases has a tool for taking an alignment plus a spreadsheet of data to rename the sequences in the alignment:
Both previous answers are helpful. If, in the end, you no longer care much about the sequence names, you can also simply replace them with a short string plus a number or similar (https://github.com/Gaius-Augustus/Augustus/blob/master/scripts/simplifyFastaHeaders.pl). I do this when some software may be unhappy with the | character that would still be there after using the first answer, or when I need even shorter headers.
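If you would rather not depend on the Perl script, the same idea is easy to sketch in Python. This is only a rough illustration of the "short string and a number" approach, not a port of simplifyFastaHeaders.pl. The function name `simplify_fasta_headers`, the default prefix `seq`, and the example headers are all made up for this post.

```python
def simplify_fasta_headers(lines, prefix="seq"):
    """Replace every FASTA header with prefix + running number.

    Returns the rewritten lines and a list of (short_name, original_header)
    pairs so the original names can be looked up again later.
    """
    new_lines = []
    mapping = []
    count = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Header line: swap in the short name, remember the old one.
            count += 1
            short_name = f"{prefix}{count}"
            mapping.append((short_name, line[1:]))
            new_lines.append(">" + short_name)
        else:
            # Sequence line: keep unchanged.
            new_lines.append(line)
    return new_lines, mapping


# Hypothetical input, only to show the behaviour.
example = [
    ">gi|0000001|ref|XX_000001.1| hypothetical protein A [Helicobacter pylori]",
    "MKKLLAVSLA",
    "GGTTAA",
    ">gi|0000002|ref|XX_000002.1| hypothetical protein B [Helicobacter pylori]",
    "MSTNPKPQRK",
]
short_lines, name_table = simplify_fasta_headers(example)
```

For a real file you would pass the open file handle as `lines`, write `short_lines` to a new FASTA file, and save `name_table` as a tab-separated file. Keeping that table is the important part, because it is the only way back to the original names.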