I am working with Illumina HiSeq 2000 DNA-Seq data from 2 human saliva samples.
Two runs were performed for each sample: cleaned (human DNA with the bacterial DNA removed) and non-cleaned.
My aim is to extract SNPs with rsID.
To process the fastq.gz files I used the bcbio-nextgen package (https://bcbio-nextgen.readthedocs.org).
I ran two variant calling analyses: the first with the cleaned runs only (both samples analyzed together), the second with the fastq files from both runs concatenated for each sample.
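A minimal sketch of the concatenation step for the second analysis, assuming the runs are plain gzip-compressed FASTQ files. The helper names and the file names in the usage comment are hypothetical. It relies on the fact that gzip files can be joined byte-for-byte and still decompress as one stream; the read count is a quick sanity check that nothing was lost in the merge.

```shell
# Hypothetical helper: merge the cleaned and non-cleaned runs of one sample.
# Concatenated gzip members still form a valid gzip file.
merge_runs() {
    local cleaned="$1" noncleaned="$2" out="$3"
    cat "$cleaned" "$noncleaned" > "$out"
}

# Count reads in a gzipped FASTQ file (assumes 4 lines per record).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Usage (hypothetical file names):
#   merge_runs s1_cleaned.fastq.gz s1_noncleaned.fastq.gz s1_both.fastq.gz
#   count_reads s1_both.fastq.gz
```

For paired-end data, the two mate files would need to be merged separately and in the same run order, so that read pairs stay aligned between the files.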
I used vcftools to extract non-indel polymorphisms and bash scripting to extract all SNPs with rsID.
Here I have a problem: the analysis of both runs yields about 5 times fewer polymorphisms than the analysis of the cleaned runs alone.
See the table:
|Step|Cleaned runs only|Both runs (concatenated)|
|---|---|---|
|vcftools: #SNPs before filtering|6167677|1229010|
|vcftools: #SNPs after removing indels|5213278|1024912|
|#SNPs with rs number|4917067|978823|
Do you have any idea what could cause this? The analysis parameters in bcbio were the same for both analyses; the only difference is in the input fastq.gz files.
P.S. Here are the commands that I used to extract SNPs with an rsID.
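For anyone who wants to check the numbers, here is a rough sketch of how the counts in the table can be reproduced from a VCF file, and how the fold difference follows from them. The function names are hypothetical, and the rsID count assumes the rs number is stored in the ID column (3rd tab-separated field), as in standard VCF.

```shell
# Count variant records (all non-header lines) in a VCF file.
count_records() {
    awk '!/^#/ { n++ } END { print n + 0 }' "$1"
}

# Count records whose ID column (3rd field) starts with "rs".
count_rsid() {
    awk -F'\t' '!/^#/ && $3 ~ /^rs/ { n++ } END { print n + 0 }' "$1"
}

# Fold difference between two counts, rounded to two decimals.
fold_difference() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Pre-filtering counts from the table above.
fold_difference 6167677 1229010   # prints 5.02
```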
#remove INDELS
vcftools --vcf ket-gatk-haplotype.vcf --out ket-gatk-haplotype_noindel --remove-indels --recode --recode-INFO-all
#extract header
grep -E '^#' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.head
#extract rows with rs ID
grep -E '[[:space:]]rs' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.rs
#concatenate two files
cat ket_ngs.head ket_ngs.rs > ket_ngs_rs.vcf
#create tped and tfam PLINK files
vcftools --vcf ket_ngs_rs.vcf --out ket_ngs_plink --plink-tped
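As a possible alternative to the header/rs/concatenate steps, the same extraction can be sketched as a single awk pass. This is not what was run for the numbers above. It differs from the grep approach in one respect: it only tests the ID column (3rd tab-separated field) for a leading "rs", whereas the grep pattern matches "rs" after any whitespace on the line. The function name is hypothetical.

```shell
# Keep header lines plus records whose ID column starts with "rs".
keep_rsid_records() {
    awk -F'\t' '/^#/ || $3 ~ /^rs/' "$1"
}

# Usage:
#   keep_rsid_records ket-gatk-haplotype_noindel.recode.vcf > ket_ngs_rs.vcf
```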