Question: How To Retrieve Reads Supporting A Snp/Indel
1
gravatar for Frewise
5.7 years ago by
Frewise20
Frewise20 wrote:

I used samtools and varscan for variant calling like this:

samtools mpileup \
  -f genome.fa \
  tku.sorted.rmdup.bam \
  | java -jar VarScan.v2.3.5.jar pileup2snp \
  --min-coverage 8 \
  --min-reads2 2 \
  --min-avg-qual 15 \
  --min-var-freq 0.01 \
  --p-value 0.001 \
  >tku.snp.vcf

now I want to retrieve all the reads supporting a specific SNP/indel and calculate the distribution of SNP/indel on the sequenced reads, is there any tools or proper way to achieve this?

reads indel snp varscan • 4.8k views
ADD COMMENTlink modified 5.6 years ago by rohan100 • written 5.7 years ago by Frewise20

I always first check what is actually happening with IGV (http://www.broadinstitute.org/igv/). Here you can nicely see the mismatches/SNPs with the reference genome and get a general ID if the SNP is real.

Later you can extract the reads with this command: samtools view BAMFILE.BAM chr1:1000000-2000000 > NEWFILE.bam

ADD REPLYlink written 5.7 years ago by Floris Brenk890

Thanks for your advice! But it is still difficult to acquire the coordinate of a SNP on reads, what's more, there are thousands of SNPs to process. So, an automatic workflow is prefered.

ADD REPLYlink written 5.7 years ago by Frewise20

i use python and pysam to read and manipulate reads but it requires some programming skills http://www.cgat.org/~andreas/documentation/pysam/api.html

ADD REPLYlink written 5.7 years ago by arno.guille400
3
gravatar for Pierre Lindenbaum
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

my tool SAM2Tsv https://github.com/lindenb/jvarkit/wiki/SAM2Tsv would print each position with the name of the read and the cigar operator, you can easily join the ouput to another file (a VCF...) to select the reads.

ADD COMMENTlink written 5.7 years ago by Pierre Lindenbaum123k
1

Thanks for your tool, Pierre. It works fine!

ADD REPLYlink written 5.7 years ago by Frewise20
1
gravatar for Andreas
5.7 years ago by
Andreas2.4k
Singapore
Andreas2.4k wrote:

This script might also help: it adds a special tag to reads supporting the reference or variant base: (requires a recent version of pysam). It doesn't support indels at the moment though.

Andreas

ADD COMMENTlink written 5.7 years ago by Andreas2.4k
1
gravatar for mikhail.shugay
5.7 years ago by
mikhail.shugay3.3k
Czech Republic, Brno, CEITEC
mikhail.shugay3.3k wrote:

Hello! I have a script that solves a similar problem ( https://github.com/mikessh/mont/blob/master/src/molcount/MolCountSNP.groovy ) for SNPs. It calculates the number of reads that overlap with a given SNP and have a variant/reference or other nt in the position. It also estimates the number of "molecules" i.e. distinct reads according to offset. The results are in good agreement with e.g. TUMOR_F allele frequency estimated by MuTect. Anyways there are a parts of code that implement Picard Java API that could be of use. Hope you'll find it useful!

ADD COMMENTlink written 5.7 years ago by mikhail.shugay3.3k
0
gravatar for rohan
5.6 years ago by
rohan100
United States
rohan100 wrote:

I tried all of above for similar purpose

Biostar59647.. made a output xml of ~3GB.. that is very difficult to parse.. (run out of memory)

SAM2Tsv is better in that one need not parse xmls. but still the the problem is - identifying the mismatched position.. (the output only contains cigar codes Ms @ both matches and mismatches)

My question is that is there any alternate tool or a tool to get only the mismatches out of SAM2Tsv's .txt output?

@mikhail.shugay Can I use a custom vcf file containing only the snps of my interest in MolCountSNP.groovy ? eg. A->T @25th postion ..

ADD COMMENTlink written 5.6 years ago by rohan100
1

Biostar59647 should be parsed with a sax/stax parser, not a dom parser.

SAM2Tsv provides both REF and ALT columns.

ADD REPLYlink written 5.6 years ago by Pierre Lindenbaum123k

i was actually using dom parser for Biostar59647. thanks for the suggestion.

i have a question.. may be i can ask here.. so..SAM2Tsv is very helpful in getting REF and ALT bases. but one more thing that would also be equally helpful is quality scores of the bases ...

is it possible to edit sam2tsv to get base qualities? there is a picard tool which gets base qualities.. may be we can get that output also through sam2tsv?

ADD REPLYlink written 5.5 years ago by rohan100
1

Yes, I can do this during the WE.

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum123k
1

Done: https://github.com/lindenb/jvarkit/wiki/SAM2Tsv

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum123k

you r just gr8 ! may you hv a gr8 "WE" too.. thanks a lot..

ADD REPLYlink written 5.5 years ago by rohan100

@Pierre Lindenbaum.. i have a question @ get amino acid changes from snp calls .. may be you can answer it

ADD REPLYlink written 5.5 years ago by rohan100

i dont know what "during the WE" means? im not geting it

ADD REPLYlink written 5.5 years ago by rohan100
1

during the week-end :-)

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum123k

reg current version of sam2tsv (with quality scores)..

is it there also difference in snp calling?

because if i compare the outputs of both, the total number of lines is different.

eg. test1.sam.sam2tsv1 test1.sam.sam2tsv2 are the outputs of current and previous versions resp. 

kclabws1@kclabws1:~/Documents/ngs/test/sam$ wc -l test1.sam.sam2tsv1
100401 test1.sam.sam2tsv1
kclabws1@kclabws1:~/Documents/ngs/test/sam$ wc -l test1.sam.sam2tsv2
99738 test1.sam.sam2tsv2

 

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by rohan100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1108 users visited in the last hour