Dear All,
I have RNA-seq data from 3 ginger samples, each from a different condition.
I assembled each sample with Trinity, so I have three separate transcript assemblies.
Since no public database has a reference genome for ginger, I merged and clustered all three Trinity assemblies to build a custom reference transcriptome, following this workflow.
Workflow:
# Merging the Trinity assemblies from all samples
cat 0-trinity_out/Unigenes_S1.fasta 0-trinity_out/Unigenes_S3.fasta 0-trinity_out/Unigenes_S4.fasta > 0-trinity_out/Unigenes_all.fasta
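
A quick sanity check on the merge can be sketched as below. This is a minimal sketch and not part of the workflow I actually ran. It assumes that the per-sample assemblies may share header IDs (Trinity names contigs the same way in every run), which would matter for the later clustering steps. The function names count_seqs and dup_ids are hypothetical helpers, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking a merged FASTA file.

# Count FASTA records: every record starts with a '>' header line.
count_seqs() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Print each sequence ID (first word of the header, without '>')
# the second time it is seen, i.e. one line per duplicated ID.
dup_ids() {
    awk '/^>/ { id = substr($1, 2); if (seen[id]++ == 1) print id }' "$1"
}

# Example usage (paths taken from the merge step above):
#   count_seqs 0-trinity_out/Unigenes_all.fasta
#   dup_ids 0-trinity_out/Unigenes_all.fasta | head
```

The record count of Unigenes_all.fasta should equal the sum of the three input counts, and dup_ids should ideally print nothing.
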
# Clustering merged assemblies at 100% identity
cd-hit-est -i 0-trinity_out/Unigenes_all.fasta -o 1-cdhit_out/cdhit_clstr.fasta -c 1.0 -n 8 -g 1 -r 1 -B 1 -p 1 &> 1-cdhit_out/cdhit_clstr.log
cat 1-cdhit_out/cdhit_clstr.fasta | python /home/blab/myscript/rename_multifasta.py Seq > 2-tgicl_out/cdhit_clstr.fasta
read_fasta -i 2-tgicl_out/cdhit_clstr.fasta | write_fasta -x -o 2-tgicl_out/cdhit_clstr_singleline.fasta
mv 2-tgicl_out/cdhit_clstr_singleline.fasta 2-tgicl_out/cdhit_clstr.fasta
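
For anyone without read_fasta/write_fasta installed, the single-line conversion above can be approximated with plain awk. This is only a sketch, assuming the goal of that step is simply one sequence line per record; linearize_fasta is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join wrapped sequence lines so that each FASTA
# record becomes exactly one header line plus one sequence line.
linearize_fasta() {
    awk '
        /^>/ {                      # header line
            if (seq != "") print seq   # flush the previous sequence
            print                      # print the header unchanged
            seq = ""
            next
        }
        {
            gsub(/[ \t\r]/, "")     # drop stray whitespace and CRs
            seq = seq $0            # append to the current sequence
        }
        END { if (seq != "") print seq }
    ' "$1"
}

# Example usage (writes to a new file rather than overwriting the input):
#   linearize_fasta 2-tgicl_out/cdhit_clstr.fasta > 2-tgicl_out/cdhit_clstr_singleline.fasta
```
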
# Further clustering using TGICL
$TGICL -F cdhit_clstr.fasta -p 99 -l 50 -v 100 -c 3 -L -b blab:cbt@soa:mysql:127.0.0.1:tgicldb -R
cat asm_1/contigs asm_2/contigs asm_3/contigs asm_1/singlets asm_2/singlets asm_3/singlets > final_contigs.fasta
# EvidentialGene::tr2aacds.pl pipeline script for processing final_contigs.fasta into the most biologically useful "best" set of mRNA, classified into "primary" and "alternate" transcripts.
/home/blab/Downloads/evigene/scripts/prot/tr2aacds_qsub.sh
# The output is a neat pile of "okay" and "drop" transcripts
# The "okay" set is annotated with Trinotate and used as the custom reference transcriptome.
# The tr2aacds_qsub.sh file is attached here.