I'm looking for some help with a pipeline I have run for a non-model species, and would really appreciate any advice you have to offer. This is a new area for me, so I am still learning all of the tricks of the trade!
The species in question (an antelope), has a de novo assembly genome, and a 5K snp array. I received paired end fastq files for 131 individuals and followed the pipeline below to create a merged vcf file to run in plink (and other downstream analyses). When I load the vcf file into plink however, I am getting more than 5000 variants. On closer examination, the vcf seems to contain a very large portion of "missing calls", evident as 00 00 00 00 00 in the .ped version of the vcf file in plink. When I filter the vcf in plink using --geno 0.05 and --maf 0.05 the variant call is still excessively high (13 143, or see below)... so I'm wondering if there is a very important filtering step that I have not performed, that could be resulting in the extra ~ 8000 SNPs?
The SNP array was designed using an older version of the reference genome, I'm not sure if that would impact the results, but I thought it might be worth mentioning. I aligned the fastq data to a newer version of the reference genome.
When I run the resulting .ped file in Admixture (for example), the populations resemble each other is a pattern we would expect, so I feel I am close.
These are the steps I have performed thus far:
bwa mem ref.fasta r1.fastq r2.fastq > sample.sam samtools –b view sample.sam > sample.bam samtools sort –n sample.bam > sample_1.bam samtools fixmate –m –O bam sample_1.bam sample_12.bam samtools sort sample_12.bam > sample_123.bam runpicard AddOrReplaceReadGroups I=sample_123.bam O=sample_1234.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=”sample name here” runpicard BuildBamIndex I=sample_1234 (generates a sample_1234.bai) rungatk –T RealignerTargetCreator –R ref.fasta –I sample_1234.bam –o sample_1234.intervals rungatk –T IndelRealigner –R ref.fasta –I sample_1234.bam –targetIntervals sample_1234.intervals –o sample_12345.bam runpicard MarkDuplicates –I sample_12345.bam –O sample_123456.bam –M sample_123456.txt runpicard BuildBamIndex I=sample_123456.bam (generate a sample_123456.bai) bcftools mpileup –Ou –f ref.fasta –b bam_list.text | bcftools call –vmO z –o cohort.vcf.gz **loading into PLINK** ./plink –-vcf cohort.vcf.gz –-recode –-out cohort –allow-extra-chr –-geno 0.05 –-maf 0.05 6176160 variants removed due to missing genotype data (--geno) 80277 variants removed due to minor allele thresholds (--maf) 13143 variants and 131 people pass filters and QC No phenotypes present
Ultimately, I am wondering if I still have duplicates present? Or if perhaps bcftools has called variants that are actually errors?And if there is a step I have left out in the pipeline or filtering the vcf file, what that step is and where I should place it in the workflow. Or is it normal to have more variants than the snp array?
Thank you for reading! Best wishes, RM