Question

SNP variant calling - decreased number of SNPs with increased coverage?

10.8 years ago

anastazie.d ▴ 40

Hi all,

I am working with Illumina Hiseq 2000 DNA-Seq data from 2 human saliva samples.

Two runs were performed per sample: cleaned (human DNA with the bacterial DNA removed) and non-cleaned.

My aim is to extract SNPs with rsID.

To process the fastq.gz files I used the bcbio-nextgen package.

I ran two variant-calling analyses: the first with only the cleaned runs (both samples analyzed together), the second with the concatenated fastq files from both runs for each sample.

I used vcftools to extract non-indel polymorphisms and bash scripting to extract all SNPs with rsID.

Here I have a problem: the analysis of both runs yields about 5 times fewer polymorphisms than the analysis of the cleaned DNA run alone.

See the table:

                                        Cleaned       All
Vcftools - #SNPs before filtering       6167677     1229010
Vcftools - #SNPs after removing indels  5213278     1024912
#SNPs with rs number                    4917067     978823

Do you have any idea what could cause it? Analysis parameters in bcbio were the same for both analyses, the only difference is in the input fastq.gz files.
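
To make the size of the drop concrete, the ratio between the two columns of the table can be computed for each row. This is a minimal sketch that uses only the counts quoted in the table; the helper name `ratio` is illustrative and not part of vcftools or bcbio-nextgen.

```shell
#!/usr/bin/env bash
# ratio A B: print A/B rounded to two decimals (hypothetical helper).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Counts copied from the table: Cleaned vs. All.
ratio 6167677 1229010   # variants before filtering
ratio 5213278 1024912   # after removing indels
ratio 4917067 978823    # SNPs with an rs number
```

All three ratios come out close to 5, so the loss is roughly uniform across the filtering steps rather than being introduced by the indel or rsID filter.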

Thanks,

Anastassiya

P.S. Here are the commands that I used to extract SNPs with rsID.

# remove indels
vcftools --vcf ket-gatk-haplotype.vcf --out ket-gatk-haplotype_noindel --remove-indels --recode --recode-INFO-all
# extract header lines
grep -E '^#' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.head
# extract rows with an rs ID (pattern quoted so the shell does not expand it)
grep -E '[[:space:]]rs' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.rs
# concatenate the two files
cat ket_ngs.head ket_ngs.rs > ket_ngs_rs.vcf
# create tped and tfam PLINK files
vcftools --vcf ket_ngs_rs.vcf --out ket_ngs_plink --plink-tped
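
The grep-based filter above matches `rs` after any whitespace on a line. As a hedged alternative, the header and the rsID records can be selected in one pass by testing the ID column (column 3 of a VCF) directly. The function name `extract_rs_records` is illustrative and not from the original post, and the sketch assumes an uncompressed, tab-delimited VCF.

```shell
#!/usr/bin/env bash
# extract_rs_records VCF: print header lines plus records whose ID column
# (column 3) starts with "rs". Hypothetical helper; assumes an uncompressed,
# tab-delimited VCF. Records whose ID list has a non-rs ID first are skipped.
extract_rs_records() {
    awk -F '\t' '/^#/ || $3 ~ /^rs/' "$1"
}

# Example usage (file names follow the commands above):
# extract_rs_records ket-gatk-haplotype_noindel.recode.vcf > ket_ngs_rs.vcf
```

This replaces the separate header, filter, and concatenate steps with a single command, and avoids matching `rs` in any column other than ID.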

coverage DNA-Seq bcbio-nextgen SNP vcftools • 3.2k views

updated 3.5 years ago by Ram 45k • written 10.8 years ago by anastazie.d ▴ 40

Could you tell us some more information about the library you sequenced, such as mean coverage, genome/exome?

written 10.8 years ago by Matt Shirley 10k

It was genome sequencing.

Here are values for both analyses for two samples:

                     Mapped reads    Mean coverage
Cleaned, Sample 1    890944806       29.6981602
Cleaned, Sample 2    659515763       21.983858767
All, Sample 1        1829469743      60.982324767
All, Sample 2        1325549177      44.184972567

updated 3.6 years ago by Ram 45k • written 10.8 years ago by anastazie.d ▴ 40

For each of your samples you are removing ~1/2 of the reads as "contaminants". Could you explain this procedure a bit more? I have a feeling that including this extra coverage is somehow adding noise to your samples. bcbio-nextgen has many configuration options; perhaps you could post the YAML configuration file for the run?

written 10.8 years ago by Matt Shirley 10k

Ram · Answer 1 · 2014-10-01

10.8 years ago

anastazie.d ▴ 40

Hi everyone,

In the end, it turned out to be a problem with folder removal when re-running bcbio-nextgen.

I have opened an issue about it here

Thanks for comments,

Anastassiya

updated 3.5 years ago by Ram 45k • written 10.8 years ago by anastazie.d ▴ 40