Dear All,

I have RNA-seq data from 3 ginger samples, each from a different condition.

I assembled each sample separately with Trinity, so I now have three different transcript assemblies.

Since public databases don't have a reference genome for ginger, I merged and clustered all three Trinity assemblies to build my own custom reference transcriptome, following the workflow below.

Workflow:-

# Merging the Trinity assemblies of all samples

cat 0-trinity_out/Unigenes_S1.fasta 0-trinity_out/Unigenes_S3.fasta 0-trinity_out/Unigenes_S4.fasta > 0-trinity_out/Unigenes_all.fasta
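A possible sanity check for this step, as a minimal sketch rather than part of the workflow itself: separate Trinity runs can reuse the same contig names, so it may help to tag each header with its sample before concatenating, and to confirm that the merged file holds as many sequences as the three inputs combined. The helper names `count_seqs` and `tag_headers` and the `S1_` style prefix are made up for illustration.

```shell
# Count FASTA records in a file (one record per '>' header line).
count_seqs() {
    grep -c '^>' "$1"
}

# Print a FASTA file with "<label>_" inserted after '>' on every header,
# so contigs stay traceable to their sample after merging.
# Usage: tag_headers LABEL FILE
tag_headers() {
    awk -v tag="$1" '/^>/ { sub(/^>/, ">" tag "_") } { print }' "$2"
}

# Example use with the paths from this post (commented out on purpose):
# tag_headers S1 0-trinity_out/Unigenes_S1.fasta >  0-trinity_out/Unigenes_all.fasta
# tag_headers S3 0-trinity_out/Unigenes_S3.fasta >> 0-trinity_out/Unigenes_all.fasta
# tag_headers S4 0-trinity_out/Unigenes_S4.fasta >> 0-trinity_out/Unigenes_all.fasta
# count_seqs 0-trinity_out/Unigenes_all.fasta
```

The merged count should equal the sum of the three per-sample counts; if it does not, one of the inputs was probably truncated or missing.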

# Clustering the merged assemblies at 100% identity

cd-hit-est -i 0-trinity_out/Unigenes_all.fasta -o 1-cdhit_out/cdhit_clstr.fasta -c 1.0 -n 8 -g 1 -r 1 -B 1 -p 1 &> 1-cdhit_out/cdhit_clstr.log
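To see how much redundancy this step actually removed, the cluster report can be summarised. CD-HIT writes it next to the output FASTA with a `.clstr` suffix, so here it should be `1-cdhit_out/cdhit_clstr.fasta.clstr`. The sketch below only counts `>Cluster` headers and member lines; the function name `clstr_summary` is hypothetical.

```shell
# Summarise a CD-HIT .clstr file: number of clusters, number of input
# sequences assigned to them, and how many sequences were collapsed
# into another representative (members minus clusters).
clstr_summary() {
    awk '
        /^>Cluster/ { clusters++; next }   # one header line per cluster
        NF          { members++ }          # every other non-empty line is a member
        END {
            printf "clusters=%d members=%d collapsed=%d\n",
                   clusters, members, members - clusters
        }
    ' "$1"
}

# Example use (commented out on purpose):
# clstr_summary 1-cdhit_out/cdhit_clstr.fasta.clstr
```

With `-c 1.0` only identical sequences (or exact sub-sequences of a longer one) are merged, so a small `collapsed` value is not surprising at this stage.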

cat 1-cdhit_out/cdhit_clstr.fasta | python /home/blab/myscript/rename_multifasta.py Seq > 2-tgicl_out/cdhit_clstr.fasta

read_fasta -i 2-tgicl_out/cdhit_clstr.fasta | write_fasta -x -o 2-tgicl_out/cdhit_clstr_singleline.fasta

mv 2-tgicl_out/cdhit_clstr_singleline.fasta 2-tgicl_out/cdhit_clstr.fasta
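For readers who do not have `rename_multifasta.py` (it is a local script and is not shown here) or the `read_fasta` / `write_fasta` tools, the two steps above can be approximated with plain awk. This is a sketch under assumptions: I am guessing that the rename script replaces every header with the prefix plus a running number (`Seq1`, `Seq2`, ...), and the function names `rename_fasta` and `unwrap_fasta` are invented for this example.

```shell
# Replace every FASTA header with PREFIX followed by a running number.
# Usage: rename_fasta PREFIX FILE
rename_fasta() {
    awk -v p="$1" '/^>/ { print ">" p (++n); next } { print }' "$2"
}

# Join wrapped sequence lines so each record is one header line
# followed by exactly one sequence line.
# Usage: unwrap_fasta FILE
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}

# Example use (commented out on purpose):
# rename_fasta Seq 1-cdhit_out/cdhit_clstr.fasta > 2-tgicl_out/renamed.fasta
# unwrap_fasta 2-tgicl_out/renamed.fasta > 2-tgicl_out/cdhit_clstr.fasta
```

Note that renaming discards the original Trinity IDs, so keeping a two-column old-name/new-name table alongside may be useful if you later need to trace a contig back to its sample.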

# Further clustering using TGICL

$TGICL -F cdhit_clstr.fasta -p 99 -l 50 -v 100 -c 3 -L -b blab:cbt@soa:mysql:127.0.0.1:tgicldb -R

cat asm_1/contigs asm_2/contigs asm_3/contigs asm_1/singlets asm_2/singlets asm_3/singlets > final_contigs.fasta
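One point that may deserve a check here: the `asm_*/singlets` files only hold reads that were part of a cluster but were left unassembled. If I remember the TGICL output layout correctly, sequences that never joined any cluster are listed by ID in a separate file named after the input with a `.singletons` suffix; please treat that file name as an assumption and look in your run directory. If such a list exists, those sequences can be pulled from the input FASTA and appended. The helper name `extract_by_ids` is hypothetical.

```shell
# Print the FASTA records whose ID (first word after '>') appears in an
# ID list with one ID per line.
# Usage: extract_by_ids IDS_FILE FASTA_FILE
extract_by_ids() {
    # An empty ID list would confuse the NR==FNR trick below, so stop early.
    [ -s "$1" ] || return 0
    awk '
        NR == FNR { ids[$1] = 1; next }                  # first file: load IDs
        /^>/      { id = substr($1, 2); keep = (id in ids) }
        keep                                             # print kept records
    ' "$1" "$2"
}

# Example use (commented out on purpose; the .singletons name is assumed):
# cat asm_*/contigs asm_*/singlets > final_contigs.fasta
# extract_by_ids cdhit_clstr.fasta.singletons cdhit_clstr.fasta >> final_contigs.fasta
```

Using the `asm_*` glob instead of listing `asm_1` to `asm_3` by hand also avoids silently missing a directory if the run is repeated with a different `-c` value.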

# EvidentialGene::tr2aacds.pl pipeline script for processing final_contigs.fasta into the most biologically useful "best" set of mRNA, classified into "primary" and "alternate" transcripts.

/home/blab/Downloads/evigene/scripts/prot/tr2aacds_qsub.sh

# The output is a neat pile of "okay" and "drop" transcripts

# The "okay" set is annotated with Trinotate and used as the custom reference transcriptome.

# The tr2aacds_qsub.sh file is attached here.
