Question

Regarding GATK for snp analysis

0

Entering edit mode

6.8 years ago

DL ▴ 50

Hi everyone, This is the first time that i am going to use gatk. So i have some questions related to realignment targets. I have paired end illumina sequenicng data of 8 species from the 3 lanes.

lane 1 in flowcell contains 8 species that generate R1 and R2 for each species those had been loaded on the lane 2, lane 3 as well. Then i merged R1 and R2 of each species from lane 1, 2 and 3 separately. For snp analysis, i am going to use this steps:

#Index reference
bwa index reference.fasta
#Sort reference
samtools faidx reference.fasta
#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict
#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate
#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt
#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam
#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list
#Indel realignment
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam
#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

The resulting vcf file will contain your variant calls!

Then you can optionally filter the variants:

#Filter variants
~/bin/vcflib/bin/vcffilter -f ‘DP > 9’ -f ‘QUAL > 10’ raw.vcf > filtered.vcf

Or first split the raw.vcf file into SNPs and indels:

#Extract SNPs
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf
#Extract Indels
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf

When i come to RealignerTargetCreator then it show that SAM file doesn't have any read groups defined in the header.Then i found that i should run AddOrReplaceReadGroups but i did not get at which steps i should run this ?? and what does exactly it do. i read about that but did get too much.

So please anyone can tell me about that.

Thanks in Advance

genome SNP next-gen • 2.6k views

ADD COMMENT • link updated 6.8 years ago by andrew.j.skelton73 6.5k • written 6.8 years ago by DL ▴ 50

2

Entering edit mode

Is there a reason you set ploidy to 1? Is there a reason you changed the -stand_call_conf and -stand_emit_conf options?

BTW, you don't need to realign around indels (see the GATK best practices).

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks For your reply; Can you please tell me that at which condition we should merge vcf file ?? because in many discussion i have read that ppl merged their bam files and also vcf file but i did not get when it will good. I have 8 different species after mapping, i have different 8 bam files. Now i want to check variants in all species so i will get 8 vcf files after running GATK. So, where will be the logic of merging bam and vcf files apply ??

ADD REPLY • link 6.8 years ago by DL ▴ 50

0

Entering edit mode

Merging BAM files was used for the UnifiedGenotyper, which is no longer the best practice. You can merge VCF files whenever you need all of the samples together or when you start with gVCF and want to utilize all samples for the final calls.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

Ohk, can you suggest me some post or blog or papers where they explain about these things ?? Thanks

ADD REPLY • link 6.8 years ago by DL ▴ 50

0

Entering edit mode

The GATK website and forum should be your go-to resources.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

score 0 · Answer 1 · 2017-07-04

It seems that the GATK best practises for germline exome/ genome panels is still being added to, so I can't find the specific document you'd need. The best practises are available here. The ReadGroup is defined using -R in bwa mem (-R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01”), so one of two things has happened in your run....

1: You've used exactly the same read group string for each sample which will likely cause downstream errors.

2: Somehow, your readgroup isn't been preserved throughout your steps.

I'd suggest you do some research on readgroups and why they're important, here's a good starting point.