Question

Average Snp Position For Reads With Predicted Snp

0

Entering edit mode

12.7 years ago

miaochao1987 • 0

Hi,

I have a human sequencing data, and reads have been mapped with bwa. Any suggestions on what is the best way to extract following statistics? Thanks

For each predicted SNP, I would like to extract information about 1) #the number of reads with predicted SNP on + and - strand respectively 2) #average SNP location for reads with predicted SNP

snp calling • 3.2k views

ADD COMMENT • link updated 12.7 years ago by ernfrid ▴ 400 • written 12.7 years ago by miaochao1987 • 0

0

Entering edit mode

What do you mean with predicted SNP? Maybe you mean called (single nucleotide) variant (also called de novo SNPs). People talk about SNPs when a base is polymorphic in a few percent of a population. When you have just aligned your reads to a reference genome you end up with an alignment, not "predicted SNPs". Maybe you are looking for a starting point for detecting de novo SNPs and INDELS with NGS data

ADD REPLY • link 12.7 years ago by Irsan ★ 7.8k

0

Entering edit mode

yes, "called variant" would be more accurate. thanks for noticing.

ADD REPLY • link 12.7 years ago by miaochao1987 • 0

1

Entering edit mode

Ok, so you are looking for snp detection. One of the commonly used approaches is to do variant-calling on the alignment-file (probably .bam, if you have sam convert it to bam for storage purposes) with samtools mpileup. Then convert mpileup output with bcftools to vcf. You can also useGATK for variant calling on, it will also give you vcf as output. Then use summarize_annovar.pl from the annovar package to annotate the detected variants to get the impact of the the variant on amino acid, estimation whether variant is disease-causing, filter variants for those who are in dbSNP, ESP, 1000g ... You might also want to start reading some literature about variant calling with NGS data variant calling

ADD REPLY • link 12.7 years ago by Irsan ★ 7.8k

score 0 · Answer 1 · 2012-11-02

0

Entering edit mode

12.7 years ago

Josh Herr 5.8k

I use Stacks for measuring SNP data across haplotypes using the pstacks script. I've only used Bowtie to map my sequencing data from within the Stacks pipeline, but I'm fairly certain Stacks can take BWA mapped reads.

ADD COMMENT • link 12.7 years ago by Josh Herr 5.8k

score 0 · Answer 2 · 2012-11-03

Depending on your variant caller, you may have this information available with your SNP calls. For samtools' mpileup + bcftools view commands, for instance, values close to what you want are reported in the I16 tag of the VCF output (http://samtools.sourceforge.net/mpileup.shtml) with some minor calculations. Specifically, the sum of the tail distances is reported and dividing by the number of reads would give you the average distance. The number of non-reference reads on each strand is also reported (I realize this is not exactly what you are looking for).

That said, we use the bam-readcount program (fair disclosure, I am one of the authors) to calculate these values for lists of SNPs and use the downstream results for filtering. For each position you request, it will report various summary metrics for each base as a colon delimited list. If you know your SNP base then you can grab the appropriate values. It is not well documented on the web yet, but if you clone and compile it (from https://github.com/genome/bam-readcount) you can get usage by running it with no options.

The output format is as so:

chrpositionreferencebasebase:count:avgmappingquality:avgbasequality:avgsemappingquality:numplusstrand:numminusstrand:avgposasfraction:avgnummismatchesasfraction:avgsummismatchqualitiest:numq2containingreads:avgdistancetoq2startinq2reads:avgclippedlength:avgdistancetoeffective3pend...

An example of some output (at a homozygous variant position):

$ bam-readcount -q 30 -b 7 -f reference.fa some.bam 19:301238-301238 19301238C25=:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00T:25:53.56:35.28:36.44:14:11:0.40:0.01:35.88:17:0.45:97.00:0.48N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00