Comments regarding NGS (WES) analysis workflow?

18 April 2022

Hi everyone,

I need to analyze exome sequencing data (paired normal-tumor samples), both to learn and for a project. The analysis is simple, but I lack experience, so I need help.

To begin with, I followed a few tutorials, the GATK Best Practices guidelines, and other internet resources, and ended up with the sequence of commands below to run the entire workflow. My goal is to find mutated, altered, deleted, or truncated genes in tumor samples (gastric cancer) and then link them to tumor initiation. Can anyone comment on whether this sequence is correct or needs any modification? Thank you in advance.

My sequence of commands is as follows (I deleted comments to make it shorter):

#!/bin/bash

# Stop the script at the first error

set -e

dir=/mnt/gc_exom_analysis/resources_broad_gatk2/

#The reference and vcf directories

REF=/mnt/gc_exom_analysis/resources_broad_gatk2/hg38_v0_Homo_sapiens_assembly38_1.fasta

snpEff=/home/fsbserver/snpEff/snpEff.jar

SnpSift=/home/fsbserver/snpEff/SnpSift.jar

# Known-sites VCF for BQSR (the full list of resources is long)

VCF1=/mnt/gc_exom_analysis/resources_broad_gatk2/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf

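# Dummy read group info can be added to bypass errors.
# ID = read group identifier; PU = platform unit; SM = sample name
# PL = platform/technology used to produce the reads
# LB = DNA preparation library identifier
# The normal and tumor BAMs need different SM values, because Mutect2
# tells the two samples apart by SM.
RG_N="@RG\tID:normal_1\tLB:normal_1\tPL:ILLUMINA\tPM:HISEQ\tSM:normal"
RG_T="@RG\tID:tumor_1\tLB:tumor_1\tPL:ILLUMINA\tPM:HISEQ\tSM:tumor"

# bwa-mem2 alignment (paired-end, so both R1 and R2 are given for each sample;
# adjust the globs to the actual file names)
bwa-mem2 mem -t 12 -M -R "$RG_N" $REF *1_N.fastq.gz *2_N.fastq.gz > normal.sam

bwa-mem2 mem -t 12 -M -R "$RG_T" $REF *_1_T.fastq.gz *_2_T.fastq.gz > tumor.sam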

# sort the alignment by coordinates

picard SortSam INPUT=normal.sam OUTPUT=normal_sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=/mnt/temp_dir

picard SortSam INPUT=tumor.sam OUTPUT=tumor_sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=/mnt/temp_dir

# computing statistics to see how well the reads aligned to the reference genome.

samtools flagstat normal_sorted.bam > align_mat_normal.txt

samtools flagstat tumor_sorted.bam > align_mat_tumor.txt

# mark duplicates; the metrics go to their own files so that the flagstat output above is not overwritten
picard MarkDuplicates INPUT=normal_sorted.bam \
OUTPUT=normal_sort_marked.bam \
METRICS_FILE=dup_metrics_normal.txt \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT

picard MarkDuplicates INPUT=tumor_sorted.bam \
OUTPUT=tumor_sort_marked.bam \
METRICS_FILE=dup_metrics_tumor.txt \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT

#index generation

samtools index normal_sort_marked.bam

samtools index tumor_sort_marked.bam

# Base Quality Score Recalibration (BQSR) runs in two steps:
# (1) BaseRecalibrator builds a recalibration table from the reads and the known variant sites
# (2) ApplyBQSR recalibrates the base qualities of the input reads using that table
# and outputs a recalibrated BAM or CRAM file.
gatk BaseRecalibrator -R $REF --known-sites $VCF1 \
-I normal_sort_marked.bam -O normal_bqsr.table

gatk BaseRecalibrator -R $REF --known-sites $VCF1 \
-I tumor_sort_marked.bam -O tumor_bqsr.table

gatk ApplyBQSR -R $REF -I normal_sort_marked.bam --bqsr-recal-file normal_bqsr.table -O normal_bqsr.bam

gatk ApplyBQSR -R $REF -I tumor_sort_marked.bam --bqsr-recal-file tumor_bqsr.table -O tumor_bqsr.bam

# PrintReads writes out reads, optionally filtering them by various criteria.
# The default filter is WellformedReadFilter; additional filters can be applied on top of it.

gatk PrintReads -R $REF -I normal_bqsr.bam --read-filter NotDuplicateReadFilter -O normal_filter_PR.bam

gatk PrintReads -R $REF -I tumor_bqsr.bam --read-filter NotDuplicateReadFilter -O tumor_filter_PR.bam

# Call somatic short mutations via local assembly of haplotypes.
# Short mutations include single nucleotide (SNA) and insertion and deletion (indel) alterations.
# The caller uses a Bayesian somatic genotyping model that differs from the original
# MuTect by Cibulskis et al., 2013 and uses the assembly-based machinery of HaplotypeCaller.
# Of note, Mutect2 v4.1.0.0 onwards enables joint analysis of multiple samples.
# Mutect2 has to be told which sample is the normal: -normal takes the SM value
# of the normal BAM, read here from its @RG header line.
NORMAL_SM=$(samtools view -H normal_filter_PR.bam | awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) { print substr($i, 4); exit } }')

gatk Mutect2 -R $REF -I normal_filter_PR.bam \
-I tumor_filter_PR.bam \
-normal "$NORMAL_SM" \
--germline-resource $dir/somatic-hg38_af-only-gnomad.hg38.vcf.gz \
-pon $dir/somatic-hg38_1000g_pon.hg38.vcf.gz \
--f1r2-tar-gz NandT_f1r2.tar.gz -O NandT_raw.vcf

#Read orientation
gatk LearnReadOrientationModel -I NandT_f1r2.tar.gz -O NandT_read-orientation-model.tar.gz

gatk GetPileupSummaries -I tumor_filter_PR.bam \
-V $dir/somatic-hg38_small_exac_common_3.hg38.vcf.gz \
-L $dir/somatic-hg38_small_exac_common_3.hg38.vcf.gz \
-O tumor_pileup.table

gatk GetPileupSummaries -I normal_filter_PR.bam \
-V $dir/somatic-hg38_small_exac_common_3.hg38.vcf.gz \
-L $dir/somatic-hg38_small_exac_common_3.hg38.vcf.gz \
-O normal_pileup.table

#cross contamination or technical contamination
gatk CalculateContamination -I tumor_pileup.table -matched normal_pileup.table -O NandT_contamination.table

gatk FilterMutectCalls -R $REF -V NandT_raw.vcf --contamination-table NandT_contamination.table \
-ob-priors NandT_read-orientation-model.tar.gz -O NandT_no_contamination.vcf

# Separate variants into SNPs and indels
gatk SelectVariants -R $REF -V NandT_no_contamination.vcf --select-type-to-include SNP -O NandT_no_contamination_SNP.vcf

gatk SelectVariants -R $REF -V NandT_no_contamination.vcf --select-type-to-include INDEL -O NandT_no_contamination_INDEL.vcf

# Filter variants based on quality annotations; if a requested annotation is absent, GATK raises a warning.
# Note: the QD, FS, MQ and SOR thresholds come from the GATK germline hard-filtering recommendations;
# a Mutect2 VCF may not carry these annotations, in which case these filters do nothing.
gatk VariantFiltration -R $REF -V NandT_no_contamination_SNP.vcf -O NandT_filtered_SNP.vcf \
--filter-name "QD_filter" -filter "QD < 2.0" \
--filter-name "FS_filter" -filter "FS > 60.0" \
--filter-name "MQ_filter" -filter "MQ < 40.0" \
--filter-name "SOR_filter" -filter "SOR > 10.0"

gatk VariantFiltration -R $REF -V NandT_no_contamination_INDEL.vcf -O NandT_filtered_INDEL.vcf \
--filter-name "QD_filter" -filter "QD < 2.0" \
--filter-name "FS_filter" -filter "FS > 200.0" \
--filter-name "SOR_filter" -filter "SOR > 10.0"

# Add the SNP IDs from dbSNP

java -Xmx8g -jar $SnpSift annotate $VCF1 NandT_filtered_SNP.vcf > NandT_SNP_dbSNP.vcf

java -Xmx8g -jar $SnpSift annotate $VCF1 NandT_filtered_INDEL.vcf > NandT_INDEL_dbSNP.vcf

# Annotate the filtered variants with SnpEff

java -Xmx8g -jar $snpEff -s NandT_SNP_snpEFF.html -v GRCh38.86 NandT_SNP_dbSNP.vcf > NandT_SNP_ann_snpEff.vcf

java -Xmx8g -jar $snpEff -s NandT_INDEL_snpEFF.html -v GRCh38.86 NandT_INDEL_dbSNP.vcf > NandT_INDEL_ann_snpEff.vcf
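
A note on the filtering step: FilterMutectCalls only labels each record in the FILTER column and does not remove the calls that fail, so unless they are dropped they pass through SelectVariants and into annotation. Below is a minimal sketch of keeping only PASS records, assuming an uncompressed VCF. keep_pass is a hypothetical helper name (not a GATK tool), and the demo records are made up for illustration.

```shell
#!/bin/bash
# keep_pass: hypothetical helper (not part of GATK). Keeps VCF header lines
# and only those records whose FILTER column (7th field) is exactly PASS.
keep_pass() {
    awk -F'\t' '/^#/ || $7 == "PASS"' "$1"
}

# Tiny made-up VCF, for demonstration only.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo"
printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n' >> "$demo"
printf 'chr1\t200\t.\tG\tC\t.\tweak_evidence\t.\n' >> "$demo"

# Count the records that survive (header lines excluded); prints 1
keep_pass "$demo" | grep -vc '^#'
```

On the real data this could be run as keep_pass NandT_no_contamination.vcf > NandT_pass.vcf, where NandT_pass.vcf is a hypothetical output name. As far as I know, SelectVariants also has an --exclude-filtered option that achieves the same within GATK, which may be preferable for compressed or indexed VCFs.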
