How to generate a new FASTA from an assembly-assembly mapping?

20 February 2018

Hi everyone,

As you may know, NCBI offers an "assembly-assembly remapping service", which maps the contigs of one genome (let's call it Genome 1) onto the assembly of another genome (let's call it GenomeRef).

My goal is to generate a new FASTA file that contains the sequences of all my mapped contigs, in the order given by the mapping, with every GenomeRef position that was not mapped replaced by N.

I downloaded the GFF file provided on the NCBI website.

Example of a line:

Contig1 RefSeq match 49000 49020 . + . ID=xxxxxxxxxxxx;Target=GenomeRefContig5 6467734 6467754 +;best_on_query=1;best_on_query_same_unit=1;best_on_subject=1;best_on_subject_same_unit=1;gap_count=0;genomic_to_genomic=1;num_ident=21;num_mismatch=0;pct_coverage=0.000282069;pct_coverage_hiqual=0.000282069;pct_ident_quantized=98;pct_identity_gap=100;pct_identity_gapopen_only=100;pct_identity_ungap=100;reciprocity=3;same_unit_reciprocity=3

The problems I am facing are:

- The GFF file is sorted by Genome 1 contig, not by GenomeRef sequence.

- I think I could use samtools faidx in a for loop to retrieve all my contig sequences from the Genome 1 FASTA file, but how do I fill the gaps between those contigs with N, according to the gap length on GenomeRef?

[There will be a lot of gaps, since Genome 1 is a very poorly sequenced genome while GenomeRef is a very well sequenced one (PacBio).]

I hope this is clear.

Thanks a lot!

Maxime Policarpo

Here is what I have already been able to do.

Remove the extra information I don't want:

> cut -f1,4,5,9 -d$'\t' file.gff | sed 's/ID.*Target=//g' | sed 's/-.*\|+.*//g' > simplified.gff

Sort the simplified file by GenomeRef scaffold (then numerically by start position on the scaffold) and remove the header lines:

> sort -k4,4 -k5,5n simplified.gff > final.gff

> grep -v '^#' final.gff > MyGFF.gff

> sed -i '/^$/d' MyGFF.gff
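As a possible alternative to the cut/sed/sort steps above, the same information could be pulled out in a single awk pass. This is only a sketch: the function name gff_to_table and the 7-column output layout (contig, contig start, contig end, GenomeRef scaffold, scaffold start, scaffold end, strand on the scaffold) are assumptions made for illustration, and it relies on the Target attribute looking like the one in the example line.

```shell
# Hypothetical helper: turn the remap GFF into a tab-separated table with columns
#   contig  contig_start  contig_end  ref_scaffold  ref_start  ref_end  ref_strand
# sorted by reference scaffold, then numerically by start position on it.
gff_to_table() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || NF < 9 { next }             # skip header lines and malformed lines
        {
            target = ""
            n = split($9, attrs, ";")       # column 9 holds key=value pairs
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Target=/) target = substr(attrs[i], 8)
            if (target == "") next          # no Target attribute: nothing to report
            split(target, t, " ")           # scaffold name, start, end, strand
            print $1, $4, $5, t[1], t[2], t[3], t[4]
        }
    ' "$1" | sort -t$'\t' -k4,4 -k5,5n
}
```

It would be called as gff_to_table file.gff > table.tsv. Keeping the strand column is a deliberate choice: contigs that map to the minus strand need to be reverse-complemented before they are placed on the scaffold.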

Then I split each piece of information into its own file:

> cut -f1 MyGFF.gff > line1 # Name of Genome 1 contig

> cut -f2 MyGFF.gff > line2 # Start pos of Genome 1 contig

> cut -f3 MyGFF.gff > line3 # End pos of Genome 1 contig

> cut -f4 MyGFF.gff > ligneINTER

> cut -f1 -d " " ligneINTER > line4 # Name of GenomeRef scaffold

> cut -f2 -d " " ligneINTER > line5 # Start pos of GenomeRef scaffold

> cut -f3 -d " " ligneINTER > line6 # End pos of GenomeRef scaffold

Then, for each line, I can run samtools faidx to extract the sequence from Genome 1 and append it to a file named after the GenomeRef scaffold:

> END=$(wc -l < MyGFF.gff) # Number of alignment lines

> for i in $(seq 1 "$END") ; do samtools faidx Genome1.fasta "$(sed "${i}q;d" line1):$(sed "${i}q;d" line2)-$(sed "${i}q;d" line3)" >> "$(sed "${i}q;d" line4).fasta" ; done

But I really don't see how I can add the Ns.

Abhijeet Singh

For such a complicated scripting task, I would recommend using the GUI-based analysis, which is well documented on the NCBI website: https://www.ncbi.nlm.nih.gov/genome/tools/remap/

Furthermore, try NCBI Genome Workbench, which is easy to use and can do the job.

Thanks for answering.

The GUI is indeed a good starting point for my analyses, but generating a FASTA file would be much better and simpler.
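For the N padding itself, one possible approach is to walk along each GenomeRef scaffold with a position counter: before each mapped piece, print as many Ns as there are unmapped positions since the previous piece, then print the piece, and finally pad up to the scaffold length. The sketch below is only an illustration under several assumptions. The function name pad_with_n is hypothetical. The input is assumed to be a tab-separated table (contig, contig start, contig end, GenomeRef scaffold, scaffold start, scaffold end, strand) sorted by scaffold and then numerically by scaffold start. Scaffold lengths are read from the first two columns of the GenomeRef.fasta.fai index written by samtools faidx GenomeRef.fasta. Only gap-free alignments (same length on both genomes, as in the example line) are placed; the others are skipped and stay as N. Genome 1 is loaded entirely into memory, which may be slow for large genomes.

```shell
# Hypothetical helper (not an NCBI or samtools tool).
# usage: pad_with_n Genome1.fasta GenomeRef.fasta.fai table.tsv > padded.fasta
pad_with_n() {
    awk -F'\t' '
        function revcomp(s,    i, out, c) {     # reverse complement, other letters kept
            out = ""
            for (i = length(s); i >= 1; i--) {
                c = substr(s, i, 1)
                out = out ((c in comp) ? comp[c] : c)
            }
            return out
        }
        function print_ns(n) {                  # print n copies of N (nothing if n <= 0)
            while (n >= 1000) { printf "%s", n1000; n -= 1000 }
            if (n > 0) printf "%s", substr(n1000, 1, n)
        }
        function finish() {                     # pad the current scaffold to its full length
            if (cur != "") { print_ns(reflen[cur] - pos + 1); printf "\n" }
        }
        BEGIN {
            for (i = 0; i < 1000; i++) n1000 = n1000 "N"
            split("A T C G a t c g", from, " "); split("T A G C t a g c", to, " ")
            for (i = 1; i <= 8; i++) comp[from[i]] = to[i]
        }
        mode == "fa" {                          # load Genome 1 sequences into memory
            if ($0 ~ /^>/) { split($0, h, /[ \t]/); name = substr(h[1], 2); next }
            seq[name] = seq[name] $0
            next
        }
        mode == "len" { reflen[$1] = $2 + 0; next }   # scaffold lengths from the .fai
        mode == "tab" {
            rs = $5 + 0; re = $6 + 0
            if ($4 != cur) { finish(); cur = $4; pos = 1; printf ">%s\n", cur }
            piece = substr(seq[$1], $2, $3 - $2 + 1)
            if (length(piece) != re - rs + 1) { skipped++; next }   # gapped or unknown contig
            if (re < pos) next                                       # region already covered
            if ($7 == "-") piece = revcomp(piece)
            if (rs < pos) { piece = substr(piece, pos - rs + 1); rs = pos }   # trim overlap
            print_ns(rs - pos)                  # unmapped positions before this piece
            printf "%s", piece
            pos = re + 1
        }
        END {
            finish()
            if (skipped) printf "skipped %d alignment(s)\n", skipped > "/dev/stderr"
        }
    ' mode=fa "$1" mode=len "$2" mode=tab "$3"
}
```

Each scaffold is written as a header line followed by one long sequence line, and scaffolds with no mapped contig at all are not written. Whether skipping gapped alignments is acceptable depends on the data; handling them properly would need the per-alignment gap information, which this sketch does not use.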
