Question: Regarding GATK for snp analysis
gravatar for DL
2.9 years ago by
DL30 wrote:

Hi everyone, This is the first time that i am going to use gatk. So i have some questions related to realignment targets. I have paired end illumina sequenicng data of 8 species from the 3 lanes.

lane 1 in flowcell contains 8 species that generate R1 and R2 for each species those had been loaded on the lane 2, lane 3 as well. Then i merged R1 and R2 of each species from lane 1, 2 and 3 separately. For snp analysis, i am going to use this steps:

#Index reference
bwa index reference.fasta
#Sort reference
samtools faidx reference.fasta
#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict
#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate
#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt
#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam
#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list
#Indel realignment
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam
#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

The resulting vcf file will contain your variant calls!

Then you can optionally filter the variants:

#Filter variants
~/bin/vcflib/bin/vcffilter -f ‘DP > 9’ -f ‘QUAL > 10’ raw.vcf > filtered.vcf

Or first split the raw.vcf file into SNPs and indels:

#Extract SNPs
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf
#Extract Indels
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf

When i come to RealignerTargetCreator then it show that SAM file doesn't have any read groups defined in the header.Then i found that i should run AddOrReplaceReadGroups but i did not get at which steps i should run this ?? and what does exactly it do. i read about that but did get too much.

So please anyone can tell me about that.

Thanks in Advance

snp next-gen genome • 1.5k views
ADD COMMENTlink modified 2.9 years ago by andrew.j.skelton735.9k • written 2.9 years ago by DL30

Is there a reason you set ploidy to 1? Is there a reason you changed the -stand_call_conf and -stand_emit_conf options?

BTW, you don't need to realign around indels (see the GATK best practices).

ADD REPLYlink written 2.9 years ago by Devon Ryan95k

Thanks For your reply; Can you please tell me that at which condition we should merge vcf file ?? because in many discussion i have read that ppl merged their bam files and also vcf file but i did not get when it will good. I have 8 different species after mapping, i have different 8 bam files. Now i want to check variants in all species so i will get 8 vcf files after running GATK. So, where will be the logic of merging bam and vcf files apply ??

ADD REPLYlink written 2.9 years ago by DL30

Merging BAM files was used for the UnifiedGenotyper, which is no longer the best practice. You can merge VCF files whenever you need all of the samples together or when you start with gVCF and want to utilize all samples for the final calls.

ADD REPLYlink written 2.9 years ago by Devon Ryan95k

Ohk, can you suggest me some post or blog or papers where they explain about these things ?? Thanks

ADD REPLYlink written 2.9 years ago by DL30

The GATK website and forum should be your go-to resources.

ADD REPLYlink written 2.9 years ago by Devon Ryan95k
gravatar for andrew.j.skelton73
2.9 years ago by
andrew.j.skelton735.9k wrote:

It seems that the GATK best practises for germline exome/ genome panels is still being added to, so I can't find the specific document you'd need. The best practises are available here. The ReadGroup is defined using -R in bwa mem (-R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01”), so one of two things has happened in your run....

1: You've used exactly the same read group string for each sample which will likely cause downstream errors.

2: Somehow, your readgroup isn't been preserved throughout your steps.

I'd suggest you do some research on readgroups and why they're important, here's a good starting point.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by andrew.j.skelton735.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1469 users visited in the last hour