Question

Why do I have more than 5000 variants when I used a 5K snp array?

0

Entering edit mode

4.8 years ago

rgooley • 0

Hi everyone,

I'm looking for some help with a pipeline I have run for a non-model species, and would really appreciate any advice you have to offer. This is a new area for me, so I am still learning all of the tricks of the trade!

The species in question (an antelope), has a de novo assembly genome, and a 5K snp array. I received paired end fastq files for 131 individuals and followed the pipeline below to create a merged vcf file to run in plink (and other downstream analyses). When I load the vcf file into plink however, I am getting more than 5000 variants. On closer examination, the vcf seems to contain a very large portion of "missing calls", evident as 00 00 00 00 00 in the .ped version of the vcf file in plink. When I filter the vcf in plink using --geno 0.05 and --maf 0.05 the variant call is still excessively high (13 143, or see below)... so I'm wondering if there is a very important filtering step that I have not performed, that could be resulting in the extra ~ 8000 SNPs?

The SNP array was designed using an older version of the reference genome, I'm not sure if that would impact the results, but I thought it might be worth mentioning. I aligned the fastq data to a newer version of the reference genome.

When I run the resulting .ped file in Admixture (for example), the populations resemble each other is a pattern we would expect, so I feel I am close.

These are the steps I have performed thus far:

bwa mem ref.fasta r1.fastq r2.fastq > sample.sam

samtools –b view sample.sam > sample.bam

samtools sort –n sample.bam > sample_1.bam

samtools fixmate –m –O bam sample_1.bam sample_12.bam

samtools sort sample_12.bam > sample_123.bam

runpicard AddOrReplaceReadGroups I=sample_123.bam O=sample_1234.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=”sample name here”

runpicard BuildBamIndex I=sample_1234  (generates a sample_1234.bai)

rungatk –T RealignerTargetCreator –R ref.fasta –I sample_1234.bam –o sample_1234.intervals

rungatk –T IndelRealigner –R ref.fasta –I sample_1234.bam –targetIntervals sample_1234.intervals –o sample_12345.bam

runpicard MarkDuplicates –I sample_12345.bam –O sample_123456.bam –M sample_123456.txt

runpicard BuildBamIndex I=sample_123456.bam  (generate a sample_123456.bai)

bcftools mpileup –Ou –f ref.fasta –b bam_list.text | bcftools call –vmO z –o cohort.vcf.gz

**loading into PLINK**

./plink –-vcf cohort.vcf.gz  –-recode –-out cohort –allow-extra-chr –-geno 0.05 –-maf 0.05

6176160 variants removed due to missing genotype data (--geno) 
80277 variants removed due to minor allele thresholds (--maf)
13143 variants and 131 people pass filters and QC
No phenotypes present

Ultimately, I am wondering if I still have duplicates present? Or if perhaps bcftools has called variants that are actually errors?And if there is a step I have left out in the pipeline or filtering the vcf file, what that step is and where I should place it in the workflow. Or is it normal to have more variants than the snp array?

Thank you for reading! Best wishes, RM

snp plink vcf variant alignment • 1.1k views

ADD COMMENT • link updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by rgooley • 0

1

Entering edit mode

Is it genotyping by sequencing? Where do the fastq files come from?

ADD REPLY • link 4.8 years ago by Asaf 10k

0

Entering edit mode

It is genotyping by sequencing, performed on an Illumina Hiseq - sorry for leaving that out!

ADD REPLY • link 4.8 years ago by rgooley • 0

2

Entering edit mode

So the 5000 comes from the number of loci you amplified? I'm not an expert in GATK/Picard but to narrow down to just the 5000 you might want to intersect the results file with the 5000 SNPs you designed. The rest might be random variations picked up due to noise or unpredicted SNPs near the designed SNPs, mismatches and what not.

ADD REPLY • link 4.8 years ago by Asaf 10k

0

Entering edit mode

Thank you!! Adding a coordinate list will do the trick. I spoke to a collaborator today, and they recommend the same thing/have had to do this is previous analyses with this particular array.

Thanks for your help.

ADD REPLY • link 4.8 years ago by rgooley • 0

1

Entering edit mode

Note that you can replace

bwa mem ref.fasta r1.fastq r2.fastq > sample.sam
samtools –b view sample.sam > sample.bam
samtools sort –n sample.bam > sample_1.bam

By

bwa mem ref.fasta r1.fastq r2.fastq | samtools sort -n -o sample_1.bam

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Thank you - that will save me some time in the future!

ADD REPLY • link 4.8 years ago by rgooley • 0

0

Entering edit mode

You can have multiple cases, but mainly in individual calls you are also considering "personal" variants (only present in one individual), if you want something more complete, you can take a look into the Population calls in GATK https://software.broadinstitute.org/gatk/documentation/article.php?id=3893

ADD REPLY • link 4.8 years ago by JC 13k