Question: How do we find the number of sequences with terminal mismatches?
0
gravatar for MAPK
19 months ago by
MAPK1.4k
United States
MAPK1.4k wrote:

I have created a bowtie aligned .sam file with this command: bowtie -S -p 18 /media/owner/ref.fa test1.fastq test1.sam

which I then converted to .bam file like this: samtools view -bS -o test1.bam test1.sam

From this test1.bam, I was able to get the total number of plus and minus strand of reads aligned to the reference loci using bedtools coverage maping (for positives reads and negative reads) like this: bedtools coverage -a my_region.bed -b test1.bam -bed -d -s | gzip > test1_region_pos.tsv.gz

Since I used default bowtie option which allows for three mismatches, I want to know how many reads are aligned with terminal mismatches (i.e,last three, last two and last one base mismatched at 3' region of the aligned reads at their loci position). How can I get this information? Thank for your help in advance.

bowtie bam • 690 views
ADD COMMENTlink modified 19 months ago by harold.smith.tarheel4.4k • written 19 months ago by MAPK1.4k
0
gravatar for harold.smith.tarheel
19 months ago by
United States
harold.smith.tarheel4.4k wrote:

You should be able to obtain that information by parsing the end of the MD tag for three, two, or one 'GATC' characters. The alternative would be to parse the end of the CIGAR string for X characters (mismatch), but I believe Bowtie predates that feature of the SAM standard.

ADD COMMENTlink written 19 months ago by harold.smith.tarheel4.4k

Could you please elaborate a bit? Do you mean MD tag within samtools? Thanks for your help.

ADD REPLYlink modified 19 months ago • written 19 months ago by MAPK1.4k
1

The MD tag is one of the predefined standards in SAM/BAM file specification, used to represent mismatches. More info can be found here.

EDIT for more information: The MD tag contains the REFERENCE nucleotide at a position if there is a mismatch, so any read that contains a terminal mismatch with contain G/A/T/C in the last position of this tag.

ADD REPLYlink modified 19 months ago • written 19 months ago by harold.smith.tarheel4.4k

I need positional mismatch. For example, I need total number of reads of length 18 bases where with total number of mismatch at 18th position. Same for 19bases long sequences position and so on.

ADD REPLYlink modified 19 months ago • written 19 months ago by MAPK1.4k

For that you could check the CIGAR flag ans check for reads with the last base as mismatch. For 18 bp reads it should be 17M1S

ADD REPLYlink written 19 months ago by Nicolas Rosewick8.1k

I believe 'S' represents soft clipping in the CIGAR. 'X' represents mismatch, but was not used in earlier versions of SAM.

ADD REPLYlink written 19 months ago by harold.smith.tarheel4.4k

As it's the last bp I guess it will be associated to soft clipping

ADD REPLYlink written 19 months ago by Nicolas Rosewick8.1k

Bowtie didn't support clipping, only end-to-end alignment (caveat: relying on memory of old software I haven't used in a long time).

ADD REPLYlink written 19 months ago by harold.smith.tarheel4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 841 users visited in the last hour