How To Retrieve Reads Supporting A Snp/Indel
4
1
Entering edit mode
8.4 years ago
Frewise ▴ 20

I used samtools and varscan for variant calling like this:

samtools mpileup \
-f genome.fa \
tku.sorted.rmdup.bam \
| java -jar VarScan.v2.3.5.jar pileup2snp \
--min-coverage 8 \
--min-avg-qual 15 \
--min-var-freq 0.01 \
--p-value 0.001 \
>tku.snp.vcf

now I want to retrieve all the reads supporting a specific SNP/indel and calculate the distribution of SNP/indel on the sequenced reads, is there any tools or proper way to achieve this?

reads snp indel varscan • 6.5k views
0
Entering edit mode

I always first check what is actually happening with IGV (http://www.broadinstitute.org/igv/). Here you can nicely see the mismatches/SNPs with the reference genome and get a general ID if the SNP is real.

Later you can extract the reads with this command: samtools view BAMFILE.BAM chr1:1000000-2000000 > NEWFILE.bam

0
Entering edit mode

Thanks for your advice! But it is still difficult to acquire the coordinate of a SNP on reads, what's more, there are thousands of SNPs to process. So, an automatic workflow is prefered.

0
Entering edit mode

i use python and pysam to read and manipulate reads but it requires some programming skills http://www.cgat.org/~andreas/documentation/pysam/api.html

3
Entering edit mode
8.4 years ago

my tool SAM2Tsv https://github.com/lindenb/jvarkit/wiki/SAM2Tsv would print each position with the name of the read and the cigar operator, you can easily join the ouput to another file (a VCF...) to select the reads.

1
Entering edit mode

Thanks for your tool, Pierre. It works fine!

1
Entering edit mode
8.4 years ago
Andreas ★ 2.5k

This script might also help: it adds a special tag to reads supporting the reference or variant base: (requires a recent version of pysam). It doesn't support indels at the moment though.

Andreas

1
Entering edit mode
8.4 years ago

Hello! I have a script that solves a similar problem ( https://github.com/mikessh/mont/blob/master/src/molcount/MolCountSNP.groovy ) for SNPs. It calculates the number of reads that overlap with a given SNP and have a variant/reference or other nt in the position. It also estimates the number of "molecules" i.e. distinct reads according to offset. The results are in good agreement with e.g. TUMOR_F allele frequency estimated by MuTect. Anyways there are a parts of code that implement Picard Java API that could be of use. Hope you'll find it useful!

0
Entering edit mode
8.2 years ago
rohan ▴ 100

I tried all of above for similar purpose

Biostar59647.. made a output xml of ~3GB.. that is very difficult to parse.. (run out of memory)

SAM2Tsv is better in that one need not parse xmls. but still the the problem is - identifying the mismatched position.. (the output only contains cigar codes Ms @ both matches and mismatches)

My question is that is there any alternate tool or a tool to get only the mismatches out of SAM2Tsv's .txt output?

@mikhail.shugay Can I use a custom vcf file containing only the snps of my interest in MolCountSNP.groovy ? eg. A->T @25th postion ..

1
Entering edit mode

Biostar59647 should be parsed with a sax/stax parser, not a dom parser.

SAM2Tsv provides both REF and ALT columns.

0
Entering edit mode

i was actually using dom parser for Biostar59647. thanks for the suggestion.

i have a question.. may be i can ask here.. so..SAM2Tsv is very helpful in getting REF and ALT bases. but one more thing that would also be equally helpful is quality scores of the bases ...

is it possible to edit sam2tsv to get base qualities? there is a picard tool which gets base qualities.. may be we can get that output also through sam2tsv?

1
Entering edit mode

Yes, I can do this during the WE.

1
Entering edit mode
0
Entering edit mode

you r just gr8 ! may you hv a gr8 "WE" too.. thanks a lot..

0
Entering edit mode

@Pierre Lindenbaum.. i have a question @ get amino acid changes from snp calls .. may be you can answer it

0
Entering edit mode

i dont know what "during the WE" means? im not geting it

1
Entering edit mode

during the week-end :-)

0
Entering edit mode

reg current version of sam2tsv (with quality scores)..

is it there also difference in snp calling?

because if I compare the outputs of both, the total number of lines is different.

eg. test1.sam.sam2tsv1 test1.sam.sam2tsv2 are the outputs of current and previous versions resp.

kclabws1@kclabws1:~/Documents/ngs/test/sam$wc -l test1.sam.sam2tsv1 100401 test1.sam.sam2tsv1 kclabws1@kclabws1:~/Documents/ngs/test/sam$ wc -l test1.sam.sam2tsv2
99738 test1.sam.sam2tsv2