I am stuck with a 64 MB file containing about 45,000 bacterial sequences in FASTA format from an NGS project run on a PacBio RS. The sequences are about 1,429 bp long and come from a hypervariable 16S rRNA region. I also know that they are not "rubbish" but belong to the expected bacterial genera: an alignment and a BLASTn analysis were done, but I received only the FASTA file and a summary report listing the genera and their counts, and I cannot ask for another analysis. So although the sequences were assigned, the sequence descriptors are just cryptic numbers and therefore useless for my planned analyses.

I need a genus name as the sequence descriptor (taken from a taxonomic assignment), and then a way to reduce the file to the few genera of interest, so that I can run a separate phylogenetic analysis for each of them.
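
To make the goal concrete, here is a minimal Python sketch of the relabel-and-filter step. It assumes a table mapping each sequence ID to a genus already exists (for example, parsed from a tabular result of some assignment tool); no such table is available yet, so `demo_map`, the function names, and the `genus|id` header layout are all hypothetical.

```python
from collections import defaultdict

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def relabel_by_genus(records, id_to_genus, wanted):
    """Group records by genus, keeping only genera listed in `wanted`.

    The new header is '<genus>|<original id>' (an arbitrary, hypothetical layout).
    """
    groups = defaultdict(list)
    for header, seq in records:
        # The ID is taken to be the first whitespace-separated token of the header.
        seq_id = header.split()[0] if header.strip() else ""
        genus = id_to_genus.get(seq_id)
        if genus in wanted:
            groups[genus].append((f"{genus}|{seq_id}", seq))
    return dict(groups)

# Tiny in-memory demo; a real run would pass an open file handle instead.
demo_fasta = [">1001 some text\n", "ACGT\n", "ACGT\n",
              ">1002\n", "GGCC\n",
              ">1003\n", "TTAA\n"]
demo_map = {"1001": "Bacillus", "1002": "Pseudomonas", "1003": "Bacillus"}  # hypothetical
groups = relabel_by_genus(read_fasta(demo_fasta), demo_map, {"Bacillus"})
```

Each entry of `groups` could then be written to its own FASTA file, giving one input file per genus for the phylogenetic analyses.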

So how can I do something like a BLASTn analysis? (I know that the website refuses files this large.) I tried the Kaiju web server, but the results were not as expected: completely different and mostly impossible genera, or no alignments at all. I'm not sure whether a standalone BLAST would be feasible, because of the database size and the rather cryptic command-line syntax. Or would it be better to split the file and upload the pieces one after another (and which software is there for efficient editing of FASTA files)? Or are there any other ideas? Thanks a lot in advance!
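
For the splitting option, a minimal Python sketch is given here, assuming that cutting the file into pieces with a fixed number of records each is acceptable. The function name `split_fasta` and the chunk size are hypothetical, and the size limit that any particular web server accepts is not known here.

```python
def split_fasta(lines, records_per_chunk):
    """Yield lists of FASTA lines, each holding at most `records_per_chunk` records.

    A record starts at a '>' line and runs until the next '>' line.
    """
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == records_per_chunk:
                # Current chunk is full: hand it out and start a new one.
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk

# Tiny in-memory demo with five records split into chunks of two.
demo_lines = [">a\n", "AC\n", ">b\n", "GT\n", ">c\n", "AA\n",
              ">d\n", "CC\n", ">e\n", "GG\n"]
chunks = list(split_fasta(demo_lines, 2))
```

With a real file, one would iterate over an open file handle and write each chunk to its own numbered output file, so the whole input never has to be held in memory at once.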
