Question

How To Find Unique Reads?

1

Entering edit mode

11.7 years ago

alok.helix ▴ 120

i have am working on a transcriptomics project of a non-model plant and i want to know from you how to count the uniquely mapped reads in sam format or the bam format. I have done the indexing of my file with BWA-mem algo for the paired end illumina data and generated the sam and bam file using samtools. I used the command $ cut -f1 myfile.sam | sort | uniq -u | wc -l but i got zero for all the input file. After closely reading the BWA manual i noticed a TAG of XT which i guess will be used to distinguish unique/repeat/Mate -sw etc. but after using the grep in following command:

$ samtools view bwa.bam | grep "XT:A:U" it yielded zero...

Now i am seriously doubting my commands and getting caught in the vicious circle of various commands including this one:

$ samtools view accepted_hits.bam | awk '$5==255{print $0}' > uniq_mapped_reads.sam

The samtools manual says NO alignment should be assigned a mapping quality of 255. I request you to look into the above query and throw some light on uniquely mapped reads and its searching?? I request you to mail the explanation or links on alok.helix@gmail.com. Thanking you Alok

rnaseq reads • 17k views

ADD COMMENT • link updated 11.7 years ago by KCC ★ 4.1k • written 11.7 years ago by alok.helix ▴ 120

score 3 · Answer 1 · 2013-10-22

3

Entering edit mode

11.7 years ago

Istvan Albert 102k

Note that the SAM specification does not require that any of the tags be present.

bwa-mem is a preeliminary release, as far as I know will not fill in the XT tag (or perhaps it won't fill the XA tag, one or both I can't recall which)

use the bwa aln method or a different aligner

ADD COMMENT • link 11.7 years ago by Istvan Albert 102k

Ram · Answer 2 · 2013-10-22

I once looked at this problem as well. I recall three ways of identifying a read that was mapped to multiple positions with BWA:

the presence of the characters XT:A:R
the presence of the characters XA:Z:
mapping quality (MAPQ) of 0

Based on my notes, these results all gave similar, but slightly different answers. I have found that people in the bioinformatics community tend to favor option 3.

You can just do simple grep searches on the SAM file for lines with the XT:A:R or XA:Z: present for the first two. You can use samtools view -q1 my.bam to get the last one.

Ram · Answer 3 · 2013-10-22

1

Entering edit mode

11.7 years ago

ancient_learner ▴ 680

If your input is in sam format check XS:i:<N> it will only be present if the SAM record is for an aligned read and more than one alignment was found for the read. so grep -v XS filename would solve your problem

ADD COMMENT • link 11.7 years ago by ancient_learner ▴ 680

0

Entering edit mode

Hi I have used the following command

$ samtools view .bam | fgrep "XS:i:0" | wc -l
46,953,719

This definately cannot be uniquley mapped. then I used

$samtools view .bam |fgrep "NM:i:1" | wc -l
5,865,151
$ samtools view .bam | fgrep "XS:i:1" | wc -l
1,862,839

but all are in vain I guess....

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 11.7 years ago by alok.helix ▴ 120

0

Entering edit mode

kindly read my reply properly "it will only be present if the SAM record is for an aligned read and more than one alignment was found for the read".

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 11.7 years ago by ancient_learner ▴ 680