Question: SNP variant calling - decreased number of SNPs with increased coverage?
5.7 years ago by
anastazie.d30
Czech Republic/Prague/HPST
anastazie.d30 wrote:

Hi all,

I am working with Illumina HiSeq 2000 DNA-Seq data from 2 human saliva samples.

Two runs were performed for each sample: cleaned (human DNA cleaned of bacterial DNA) and non-cleaned.

My aim is to extract SNPs with rsID.

To process the fastq.gz files I used the bcbio-nextgen package (https://bcbio-nextgen.readthedocs.org).

I ran two variant calling analyses: the first with only the cleaned runs (both samples analyzed together), the second with concatenated fastq files from both runs for each sample.
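As a sanity check on the concatenation step, here is a minimal sketch that counts the reads in a gzipped FASTQ file, so the combined file can be compared against the sum of its inputs. It assumes standard 4-line FASTQ records; `count_reads` and the file names in the comments are hypothetical, not taken from the actual runs. Gzip files can be joined with plain `cat`, because concatenated gzip streams decompress to the concatenation of their contents.

```shell
#!/usr/bin/env bash

# Count reads in a gzipped FASTQ file, assuming 4 lines per read.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Hypothetical usage (file names are placeholders):
#   cat s1_cleaned.fastq.gz s1_noncleaned.fastq.gz > s1_all.fastq.gz
#   count_reads s1_cleaned.fastq.gz
#   count_reads s1_noncleaned.fastq.gz
#   count_reads s1_all.fastq.gz   # should equal the sum of the two counts above
```

If the combined count does not match the sum of the inputs, the problem lies in the input preparation rather than in the variant calling.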

I used vcftools to extract non-indel polymorphisms and bash scripting to extract all SNPs with rsID.

Here I have a problem: the analysis of both runs combined gives about 5 times fewer polymorphisms than the analysis of the cleaned runs alone.

See the table:

                                          Cleaned    All
vcftools - #SNPs before filtering         6167677    1229010
vcftools - #SNPs after removing indels    5213278    1024912
#SNPs with rs number                      4917067    978823


Do you have any idea what could cause this? The analysis parameters in bcbio were the same for both analyses; the only difference is the input fastq.gz files.
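To make the size of the drop concrete, the Cleaned/All ratio can be computed for each row of the table. This is only a small arithmetic sketch; `ratio` is a hypothetical helper and the numbers are copied from the table above.

```shell
#!/usr/bin/env bash

# Print a / b rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 6167677 1229010   # before filtering        -> 5.02
ratio 5213278 1024912   # after removing indels   -> 5.09
ratio 4917067 978823    # SNPs with an rs number  -> 5.02
```

The ratio is close to 5 at every step, so the loss is already present in the raw call set and is not introduced by the indel or rsID filtering.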

 

Thanks,

Anastassiya

 

P.S. Here are the commands that I used to extract SNPs with rsID.

# remove indels
vcftools --vcf ket-gatk-haplotype.vcf --out ket-gatk-haplotype_noindel --remove-indels --recode --recode-INFO-all
# extract the header (pattern quoted so the shell passes it to grep unchanged)
grep -E '^#' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.head
# extract rows with an rs ID (pattern quoted to avoid shell globbing)
grep -E '[[:space:]]rs' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.rs
# concatenate the two files
cat ket_ngs.head ket_ngs.rs > ket_ngs_rs.vcf
# create tped and tfam PLINK files
vcftools --vcf ket_ngs_rs.vcf --out ket_ngs_plink --plink-tped
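A slightly stricter variant of the rsID step, as a sketch rather than a replacement for the commands above: the grep pattern `[[:space:]]rs` matches "rs" after whitespace anywhere on a line, whereas the VCF ID is always the third tab-separated column. The helpers `filter_rs_vcf` and `count_records` are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash

# Keep header lines plus records whose ID column (column 3) starts with "rs".
filter_rs_vcf() {
    awk -F'\t' '/^#/ || $3 ~ /^rs/' "$1"
}

# Count the non-header records in a VCF file.
count_records() {
    awk '!/^#/ { n++ } END { print n + 0 }' "$1"
}

# Usage with the file names from the commands above:
#   filter_rs_vcf ket-gatk-haplotype_noindel.recode.vcf > ket_ngs_rs.vcf
#   count_records ket-gatk-haplotype_noindel.recode.vcf
#   count_records ket_ngs_rs.vcf
```

Counting records before and after each step gives the same kind of numbers as the table in the question, without relying on the vcftools log output.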

 


Could you give us some more information about the library you sequenced, such as the mean coverage and whether it was genome or exome sequencing?

written 5.7 years ago by Matt Shirley

It was genome sequencing.

Here are the values for both analyses for the two samples:

Analysis  Sample    Mapped reads  Mean coverage
Cleaned   Sample 1   890944806    29.6981602
Cleaned   Sample 2   659515763    21.983858767
All       Sample 1  1829469743    60.982324767
All       Sample 2  1325549177    44.184972567

modified 5.7 years ago • written 5.7 years ago by anastazie.d
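The mean coverage values reported above are consistent with the usual estimate, mapped reads × read length / genome size, if one assumes 100 bp reads and a 3 Gb genome. Both figures are assumptions inferred from the numbers; neither is stated in the thread. A minimal sketch with a hypothetical `mean_coverage` helper:

```shell
#!/usr/bin/env bash

# Estimate mean coverage.
# $1 = mapped reads, $2 = read length in bp, $3 = genome size in bp
mean_coverage() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

# Assumed: 100 bp reads, 3,000,000,000 bp genome.
mean_coverage  890944806 100 3000000000   # Cleaned, sample 1 -> 29.70
mean_coverage  659515763 100 3000000000   # Cleaned, sample 2 -> 21.98
mean_coverage 1829469743 100 3000000000   # All, sample 1     -> 60.98
mean_coverage 1325549177 100 3000000000   # All, sample 2     -> 44.18
```

The "All" analyses have roughly twice the mapped reads and coverage of the "Cleaned" ones, which is what makes the five-fold drop in called SNPs so unexpected.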

For each of your samples you are removing ~1/2 of the reads as "contaminants". Could you explain this procedure a bit more? I have a feeling that including this extra coverage is somehow adding more noise to your samples. bcbio-nextgen has many configuration options. Perhaps you could post the YAML configuration file for the run?

written 5.7 years ago by Matt Shirley
5.7 years ago by
anastazie.d30
Czech Republic/Prague/HPST
anastazie.d30 wrote:

Hi everyone,

In the end, it turned out to be a problem with folder removal when re-running bcbio-nextgen.

I opened an issue about it here: https://github.com/chapmanb/bcbio-nextgen/issues/593

 

Thanks for the comments,

Anastassiya
