How to receive mismatch statistic?
1
0
Entering edit mode
8.1 years ago

Hello everyone. I have a little question here. So I have NGS reads, appropriate reference genome and BED file with list of regions of interest. My final aim is to receive a list with information about mismatches. For example if we have data like this:

refgenseq: TCAATCAGGTGATGGAATTCATAGATTTCCATCCCGATAAATCCGGCACC

NGSread1: ___________GATGaAATTCtTAGATTTC

NGSread2: __________________TTCATAaATTTCCtTCCC

NGSread2: ____________________________TCCATCCCaATAAAT

I would like to receive data like this:

  • number of mismatches: 5
  • A-T subst: 1
  • A-G subst: 0
  • A-C subst: 0
  • A-G subst: 0
  • G-A subst: 2 etc.

I know that CIGAR parameter contain information about substitutions, but when I make sorting my SAM file with mapped reads with BED tool that contains coordinates of reads of interest by "Filter SAM or BAM, output SAM or BAM" option in the SAM-tool, all parameters, including CIGAR are lost. So how can I extract reads which comply with according regions of interest?

And I think SAMStat fits to me but may be is it possible to solve this problem with standard GALAXY sources? I am MAC OS not linux user, so it's a little bit difficult to use all this program out of GALAXY. But again - I have troubles with sorting reads.

Thank you for answering.

snp next-gen • 1.6k views
ADD COMMENT
1
Entering edit mode
8.1 years ago

using sam2tsv : https://github.com/lindenb/jvarkit/wiki/SAM2Tsv you can print the read bases and the ref bases in the genomic window.

samtools view your.bam "22:987-10981" | \
java -jar dist/sam2tsv.jar -r ref.fa | \
cut -f 5,8 | sort | uniq -c

(not tested)

ADD COMMENT

Login before adding your answer.

Traffic: 2921 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6