Question: Why do I have more than 5000 variants when I used a 5K snp array?
gravatar for rgooley
16 months ago by
rgooley0 wrote:

Hi everyone,

I'm looking for some help with a pipeline I have run for a non-model species, and would really appreciate any advice you have to offer. This is a new area for me, so I am still learning all of the tricks of the trade!

The species in question (an antelope), has a de novo assembly genome, and a 5K snp array. I received paired end fastq files for 131 individuals and followed the pipeline below to create a merged vcf file to run in plink (and other downstream analyses). When I load the vcf file into plink however, I am getting more than 5000 variants. On closer examination, the vcf seems to contain a very large portion of "missing calls", evident as 00 00 00 00 00 in the .ped version of the vcf file in plink. When I filter the vcf in plink using --geno 0.05 and --maf 0.05 the variant call is still excessively high (13 143, or see below)... so I'm wondering if there is a very important filtering step that I have not performed, that could be resulting in the extra ~ 8000 SNPs?

The SNP array was designed using an older version of the reference genome, I'm not sure if that would impact the results, but I thought it might be worth mentioning. I aligned the fastq data to a newer version of the reference genome.

When I run the resulting .ped file in Admixture (for example), the populations resemble each other is a pattern we would expect, so I feel I am close.

These are the steps I have performed thus far:

bwa mem ref.fasta r1.fastq r2.fastq > sample.sam

samtools –b view sample.sam > sample.bam

samtools sort –n sample.bam > sample_1.bam

samtools fixmate –m –O bam sample_1.bam sample_12.bam

samtools sort sample_12.bam > sample_123.bam

runpicard AddOrReplaceReadGroups I=sample_123.bam O=sample_1234.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=”sample name here”

runpicard BuildBamIndex I=sample_1234  (generates a sample_1234.bai)

rungatk –T RealignerTargetCreator –R ref.fasta –I sample_1234.bam –o sample_1234.intervals

rungatk –T IndelRealigner –R ref.fasta –I sample_1234.bam –targetIntervals sample_1234.intervals –o sample_12345.bam

runpicard MarkDuplicates –I sample_12345.bam –O sample_123456.bam –M sample_123456.txt

runpicard BuildBamIndex I=sample_123456.bam  (generate a sample_123456.bai)

bcftools mpileup –Ou –f ref.fasta –b bam_list.text | bcftools call –vmO z –o cohort.vcf.gz

**loading into PLINK**

./plink –-vcf cohort.vcf.gz  –-recode –-out cohort –allow-extra-chr –-geno 0.05 –-maf 0.05

6176160 variants removed due to missing genotype data (--geno) 
80277 variants removed due to minor allele thresholds (--maf)
13143 variants and 131 people pass filters and QC
No phenotypes present

Ultimately, I am wondering if I still have duplicates present? Or if perhaps bcftools has called variants that are actually errors?And if there is a step I have left out in the pipeline or filtering the vcf file, what that step is and where I should place it in the workflow. Or is it normal to have more variants than the snp array?

Thank you for reading! Best wishes, RM

snp plink variant alignment vcf • 411 views
ADD COMMENTlink modified 16 months ago by genomax91k • written 16 months ago by rgooley0

Is it genotyping by sequencing? Where do the fastq files come from?

ADD REPLYlink written 16 months ago by Asaf8.4k

It is genotyping by sequencing, performed on an Illumina Hiseq - sorry for leaving that out!

ADD REPLYlink written 16 months ago by rgooley0

So the 5000 comes from the number of loci you amplified? I'm not an expert in GATK/Picard but to narrow down to just the 5000 you might want to intersect the results file with the 5000 SNPs you designed. The rest might be random variations picked up due to noise or unpredicted SNPs near the designed SNPs, mismatches and what not.

ADD REPLYlink written 16 months ago by Asaf8.4k

Thank you!! Adding a coordinate list will do the trick. I spoke to a collaborator today, and they recommend the same thing/have had to do this is previous analyses with this particular array.

Thanks for your help.

ADD REPLYlink written 16 months ago by rgooley0

Note that you can replace

bwa mem ref.fasta r1.fastq r2.fastq > sample.sam
samtools –b view sample.sam > sample.bam
samtools sort –n sample.bam > sample_1.bam


bwa mem ref.fasta r1.fastq r2.fastq | samtools sort -n -o sample_1.bam
ADD REPLYlink written 16 months ago by WouterDeCoster44k

Thank you - that will save me some time in the future!

ADD REPLYlink written 16 months ago by rgooley0

You can have multiple cases, but mainly in individual calls you are also considering "personal" variants (only present in one individual), if you want something more complete, you can take a look into the Population calls in GATK

ADD REPLYlink written 16 months ago by JC11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1169 users visited in the last hour