SOLVED: I have a GenBank file containing a large set of complete genomes with many different CDS. There is one gene within each of the genomes of interest to me.

I want to extract that gene, along with the organism ID and save it as a separate file (as Fasta, genbank, whatever) so I can perform alignments and various other downstream processing.

I have managed to get as far as using BioPython to print all the CDS', but I can't find a way to tell python that I only want the CDS's with certain products (my protein of interest). Code below:

from Bio import SeqIO

for rec in SeqIO.parse("GenBank_of_Genomes.gb", "gb"):

if rec.features:

for feature in rec.features:

if feature.type == "CDS":

print (feature.location)

print (seq_record.id)

print (feature.location.extract(rec).seq)

Can anyone help with this please? I realise this might not be trivial... I'm painfully fresh to all things computational!

>>>>>

EDIT: Adding the solution as an attached file, for the next person to have this problem. Massive thanks to Sanjay for all his help (see below).

The attached script looks through a genbank file and outputs all the CDS containing the name of the gene of interest.

I commented all over the script with my (basic) understanding of the code.

Output is in FASTA format, and includes the full accession number, protein ID, and taxonomy as pulled from each genbank entry. Enjoy!

Similar questions and discussions