How can I parse a GenBank file to retrieve specific gene sequences with IDs?

Alexander Thomas Baker @Alexander_Baker10

06 June 2018

SOLVED: I have a GenBank file containing a large set of complete genomes with many different CDS. There is one gene within each of the genomes of interest to me.

I want to extract that gene, along with the organism ID, and save it to a separate file (FASTA, GenBank, whatever) so I can perform alignments and other downstream processing.

I have managed to get as far as using Biopython to print all the CDSs, but I can't find a way to tell Python that I only want the CDSs with certain products (my protein of interest). Code below:

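from Bio import SeqIO

# Loop over every genome record in the GenBank file
for rec in SeqIO.parse("GenBank_of_Genomes.gb", "gb"):
    if rec.features:
        for feature in rec.features:
            # Keep only coding sequences
            if feature.type == "CDS":
                print(feature.location)
                # The loop variable is rec (seq_record was never defined)
                print(rec.id)
                print(feature.location.extract(rec).seq)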

Can anyone help with this please? I realise this might not be trivial... I'm painfully fresh to all things computational!

>>>>>

EDIT: Adding the solution as an attached file, for the next person to have this problem. Massive thanks to Sanjay for all his help (see below).

The attached script looks through a GenBank file and outputs every CDS whose product contains the name of the gene of interest.

I commented all over the script with my (basic) understanding of the code.

Output is in FASTA format, and includes the full accession number, protein ID, and taxonomy as pulled from each genbank entry. Enjoy!

Sanjay Kumar Srikakulam

Can you please post a sample file? It will make it easier for anyone to help. Please also describe what your matching criteria for extracting the CDS should be.

Alexander Thomas Baker

Hi Sanjay, I've uploaded the file 'RG_Test.gb'

It contains two adenovirus genomes (the real thing has many more genomes). Both contain a CDS with /product="hexon protein"

I want to be able to extract all CDS which contain /product="hexon protein"

Thanks for any help!
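The matching criterion described here can be sketched as below. This is only a minimal illustration, not the script shared in this thread: the function names cds_with_product and extract_from_file are made up for the example, and it assumes Biopython's usual behaviour of storing each qualifier as a list of strings.

```python
def cds_with_product(features, product):
    """Return the CDS features whose /product qualifier equals `product`."""
    hits = []
    for feature in features:
        if feature.type != "CDS":
            continue
        # Each qualifier value is a list of strings; .get() avoids a
        # KeyError for CDS entries that carry no /product at all.
        if product in feature.qualifiers.get("product", []):
            hits.append(feature)
    return hits


def extract_from_file(path, product="hexon protein"):
    """Yield (record ID, nucleotide sequence) for each matching CDS."""
    from Bio import SeqIO  # requires Biopython to be installed

    for rec in SeqIO.parse(path, "genbank"):
        for feature in cds_with_product(rec.features, product):
            # extract() cuts the feature's location out of the genome sequence
            yield rec.id, feature.extract(rec.seq)
```

Looping over extract_from_file("RG_Test.gb") and printing each pair should then list one hexon CDS per genome, provided the product name matches exactly.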

Sanjay Kumar Srikakulam

Hi,

I believe that the below script will meet your needs, hope this helps.

Alexander Thomas Baker

You've been an enormous help Sanjay, thanks!

Just to complicate matters, some of my products are named slightly differently, e.g. /product="hexon" or /product="hexon protein"

I know I can sort it by using:

if "hexon" in val or "hexon protein" in val:

etc...

But this means I have to know the exact name of every instance. This is fine for a few files, but gets difficult with the massive sets I have (i.e. I might miss a name and never notice).

Do you know how I can change the if statement to search the /product value for the word "hexon" and output all those sequences? Regular expressions?

Thanks again!

Edit: Think I've solved it with:

if any("hexon" in s for s in val):
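For anyone wondering why the two if statements behave differently: assuming val is the list of product strings that Biopython returns for feature.qualifiers['product'], the first test checks whether a whole list element equals the name, while the any(...) version checks for a substring inside each element. A small sketch with made-up product names (the function names are hypothetical too):

```python
def exact_match(val, names=("hexon", "hexon protein")):
    # `name in val` on a list is true only if an element equals name exactly
    return any(name in val for name in names)


def substring_match(val, keyword="hexon"):
    # `keyword in s` on a string is true if keyword occurs anywhere in s
    return any(keyword in s for s in val)


def substring_match_ignore_case(val, keyword="hexon"):
    # Same idea, but also catches spellings such as "Hexon"
    return any(keyword.lower() in s.lower() for s in val)


for val in (["hexon"], ["hexon capsid protein"], ["Hexon"], ["penton base"]):
    print(val, exact_match(val), substring_match(val),
          substring_match_ignore_case(val))
```

The substring test will also pick up any product that merely contains the word (for example a hypothetical "hexon-associated protein"), so it is worth eyeballing the matched product names once.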

Sanjay Kumar Srikakulam

Hi,

Yeah, that should work. Glad you figured it out!

Alexander Thomas Baker

Cheers Sanjay,

Last thing (and I swear I'll stop) is getting it to save in a file format I can load for sequence alignment, FASTA or GenBank ideally.

Any experience of this?

Sanjay Kumar Srikakulam

Hi,

Cool, no problem!

Just change the corresponding lines in the above script to the following:

output_file_name = "rg_test.fasta"

ofile.write(">{0} {1}\n{2}\n\n".format(feature.qualifiers['protein_id'][0], val[0], feature.qualifiers['translation'][0]))
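As a rough sketch of what that write line produces, here is the same header layout (protein ID, then product) wrapped in a small helper. The name fasta_entry and the 60-character line width are assumptions for illustration, and the example values are invented. Unlike the line above, it does not put a blank line between records, since not every FASTA reader is guaranteed to accept one.

```python
def fasta_entry(protein_id, product, translation, width=60):
    """Format one protein as a FASTA record with wrapped sequence lines."""
    header = ">{0} {1}".format(protein_id, product)
    # Split the sequence into chunks of at most `width` characters
    lines = [translation[i:i + width]
             for i in range(0, len(translation), width)]
    return header + "\n" + "\n".join(lines) + "\n"


def write_fasta(path, entries):
    """Write (protein_id, product, translation) tuples to a FASTA file."""
    with open(path, "w") as ofile:
        for protein_id, product, translation in entries:
            ofile.write(fasta_entry(protein_id, product, translation))
```

Some CDS entries (pseudogenes, for instance) may lack a protein_id or translation qualifier, so feature.qualifiers.get(...) with a fallback is safer than indexing directly when collecting the tuples.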

Alexander Thomas Baker

Sorted. Thanks loads Sanjay, really is appreciated. I attached the final parser to the question.
