Bioinformatics: automatically edit fasta identifiers based on a text file


Dear all

I have a question about an issue that I often run into during my bioinformatic analyses and struggle to work around.

Let's say I have a fasta file like the one below

input.fasta

>ID1

ATGTGTCG.....

>ID2

CGCGTGTGATAT

>ID3

GCGCGCGCAAAA..

and then I have a tab-separated file where each ID is associated with a feature

input.txt

ID1 Nannochloropsis_oceanica

ID2 Nannochloropsis_gaditana

ID3 Nannochloropsis_oculata

and I'd like to edit the fasta identifiers by adding this feature, such that the desired output would be

output.fasta

>ID1_Nannochloropsis_oceanica

ATGTGTCG

>ID2_Nannochloropsis_gaditana

CGCGTGTGATAT

>ID3_Nannochloropsis_oculata

GCGCGCGCAAAA.

Does anyone know a simple Unix/Python/R script that does this automatically when working with thousands of sequences?

I used a python script from Tony Walters (https://gist.github.com/walterst/9147f9405cadf67a88471cc87b508333) to do that, but it is not working anymore in my own unix environment (some python error).

Thanks for your attention

Sergio

Mahmoud Ahmed Popular answer

Hi Sergio Balzano

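An easy way to do that is to use bash. You can do something like this:

$ ID=`awk '{print ">" $1 "_" $2}' input.txt`

$ FASTA=`grep -v '>' input.fasta`

$ paste -d '\n' <(echo "$ID") <(echo "$FASTA") > output.fasta

This assumes that input.txt lists the IDs in the same order as input.fasta and that each sequence occupies a single line.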
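
For files where the ID order differs between the two inputs, or where sequences wrap over several lines, a lookup by ID may be more robust than pairing lines by position. Below is a minimal sketch of that idea in plain Python (standard library only). The function names `load_mapping` and `rename_headers` are hypothetical, and the sketch assumes the ID is the first whitespace-delimited token of each header; anything after it on the header line is dropped.

```python
def load_mapping(lines):
    """Build {ID: feature} from lines of 'ID<whitespace>feature'."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or incomplete lines
            mapping[fields[0]] = fields[1]
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FASTA lines, appending '_feature' to every ID found in mapping.

    Headers whose ID is not in the mapping are passed through unchanged,
    and sequence lines are never modified.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            if seq_id in mapping:
                line = ">%s_%s" % (seq_id, mapping[seq_id])
        yield line


# Small demo using the example data from the question (ID3 has no mapping row)
mapping = load_mapping([
    "ID1\tNannochloropsis_oceanica\n",
    "ID2\tNannochloropsis_gaditana\n",
])
fasta = [">ID1\n", "ATGTGTCG\n", ">ID2\n", "CGCGTGTGATAT\n", ">ID3\n", "GCGCGCGCAAAA\n"]
renamed = list(rename_headers(fasta, mapping))
print("\n".join(renamed))
```

With real files, both functions accept open file handles, since iterating over a file yields its lines one at a time.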

Balig Panossian

Here are some resources that I recommend:

1- Python: https://gist.github.com/crazyhottommy/a59df92b48e6ad3f4630#file-change_fasta_header-py

2- Python: https://www.biostars.org/p/301242/#301255

3- Perl: https://gist.github.com/cabraham03/a7d3457919b08ac556d8


Ming Yan

import sys

from Bio import SeqIO

input_fasta = sys.argv[1]
input_ID = sys.argv[2]
output_fasta = sys.argv[3]

# Map each sequence ID to its species name (tab-separated columns 1 and 2)
Seq_ID = {}
with open(input_ID, "r") as fr:
    for line in fr:
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        Seq_ID[fields[0]] = fields[1]

# Write every record with the species name appended to its ID
with open(output_fasta, "w") as fw:
    for rec in SeqIO.parse(input_fasta, "fasta"):
        fw.write(">%s_%s\n%s\n" % (rec.id, Seq_ID[rec.id], str(rec.seq)))

# Usage: python Change_SeqID.py input.fasta input.txt output.fasta

Hope my simple code can help you!
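
One thing worth checking before running a dictionary-based script like the one above: a lookup such as `Seq_ID[rec.id]` raises a `KeyError` for any FASTA ID that has no row in input.txt. Here is a small sketch of a pre-check that lists such IDs, using only the standard library. The helper names `fasta_ids` and `missing_ids` are hypothetical, and the sketch assumes a tab-separated table with the ID in the first column.

```python
def fasta_ids(fasta_lines):
    """Return the first whitespace-delimited token of every header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def missing_ids(fasta_lines, mapping_lines):
    """List FASTA IDs (in file order) that have no row in the mapping table."""
    known = {line.split("\t")[0].strip() for line in mapping_lines if line.strip()}
    return [seq_id for seq_id in fasta_ids(fasta_lines) if seq_id not in known]


# Demo: the table below deliberately lacks a row for ID3
fasta = [">ID1\n", "ATGTGTCG\n", ">ID2\n", "CGCGTGTGATAT\n", ">ID3\n", "GCGCGCGCAAAA\n"]
table = ["ID1\tNannochloropsis_oceanica\n", "ID2\tNannochloropsis_gaditana\n"]
print(missing_ids(fasta, table))  # ['ID3']
```

An empty list means every header can be renamed; otherwise the listed IDs can be added to the table or handled separately before renaming.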

Hassan Ebrahimi

I have made simple tools like this in the past for a friend who knows nothing about programming. They are .jar files that run on both Windows and Linux: a window opens asking for the input files, and the tool does the rest. If you often need to prepare such data, I can modify one of them to serve your purpose for free.
