Hi everyone,
I'm looking for some help with a pipeline I have run for a non-model species, and would really appreciate any advice you have to offer. This is a new area for me, so I am still learning all of the tricks of the trade!
The species in question (an antelope), has a de novo assembly genome, and a 5K snp array. I received paired end fastq files for 131 individuals and followed the pipeline below to create a merged vcf file to run in plink (and other downstream analyses). When I load the vcf file into plink however, I am getting more than 5000 variants. On closer examination, the vcf seems to contain a very large portion of "missing calls", evident as 00 00 00 00 00 in the .ped version of the vcf file in plink. When I filter the vcf in plink using --geno 0.05 and --maf 0.05 the variant call is still excessively high (13 143, or see below)... so I'm wondering if there is a very important filtering step that I have not performed, that could be resulting in the extra ~ 8000 SNPs?
The SNP array was designed using an older version of the reference genome, I'm not sure if that would impact the results, but I thought it might be worth mentioning. I aligned the fastq data to a newer version of the reference genome.
When I run the resulting .ped file in Admixture (for example), the populations resemble each other is a pattern we would expect, so I feel I am close.
These are the steps I have performed thus far:
bwa mem ref.fasta r1.fastq r2.fastq > sample.sam
samtools –b view sample.sam > sample.bam
samtools sort –n sample.bam > sample_1.bam
samtools fixmate –m –O bam sample_1.bam sample_12.bam
samtools sort sample_12.bam > sample_123.bam
runpicard AddOrReplaceReadGroups I=sample_123.bam O=sample_1234.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=”sample name here”
runpicard BuildBamIndex I=sample_1234 (generates a sample_1234.bai)
rungatk –T RealignerTargetCreator –R ref.fasta –I sample_1234.bam –o sample_1234.intervals
rungatk –T IndelRealigner –R ref.fasta –I sample_1234.bam –targetIntervals sample_1234.intervals –o sample_12345.bam
runpicard MarkDuplicates –I sample_12345.bam –O sample_123456.bam –M sample_123456.txt
runpicard BuildBamIndex I=sample_123456.bam (generate a sample_123456.bai)
bcftools mpileup –Ou –f ref.fasta –b bam_list.text | bcftools call –vmO z –o cohort.vcf.gz
**loading into PLINK**
./plink –-vcf cohort.vcf.gz –-recode –-out cohort –allow-extra-chr –-geno 0.05 –-maf 0.05
6176160 variants removed due to missing genotype data (--geno)
80277 variants removed due to minor allele thresholds (--maf)
13143 variants and 131 people pass filters and QC
No phenotypes present
Ultimately, I am wondering if I still have duplicates present? Or if perhaps bcftools has called variants that are actually errors?And if there is a step I have left out in the pipeline or filtering the vcf file, what that step is and where I should place it in the workflow. Or is it normal to have more variants than the snp array?
Thank you for reading! Best wishes, RM
Is it genotyping by sequencing? Where do the fastq files come from?
It is genotyping by sequencing, performed on an Illumina Hiseq - sorry for leaving that out!
So the 5000 comes from the number of loci you amplified? I'm not an expert in GATK/Picard but to narrow down to just the 5000 you might want to intersect the results file with the 5000 SNPs you designed. The rest might be random variations picked up due to noise or unpredicted SNPs near the designed SNPs, mismatches and what not.
Thank you!! Adding a coordinate list will do the trick. I spoke to a collaborator today, and they recommend the same thing/have had to do this is previous analyses with this particular array.
Thanks for your help.
Note that you can replace
By
Thank you - that will save me some time in the future!
You can have multiple cases, but mainly in individual calls you are also considering "personal" variants (only present in one individual), if you want something more complete, you can take a look into the Population calls in GATK https://software.broadinstitute.org/gatk/documentation/article.php?id=3893