Sam Format Parsing / Stats
3
6
Entering edit mode
12.1 years ago
Darked89 4.2k

I will need to create summaries for a bunch of SAM files produced by Tophat (RNA-Seq spliced mapping). Are there any tools /scripts out there which will give me stats for unique vs non unique matches, spliced vs non_spliced mapping and finally split it into number of mismatches?

EDIT I collected all tags from my SAM file and feed it to explain_sam_flags.py (you can get it from Picard source: http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/scripts/

None of my tags from Tophat's accepted_hits.sam has "not primary alignment". Looks like Tophat reports only unique matches, which is OK for me. Can somebody confirm this?

EDIT 2 "Tophat reports only unique matches" can not be true. Sequence like below have "0" flag. "CAACAACAGCAACAACAACAGCAACAGCAACAGCAACAGCAACAGCAACAACAA". Puzzling.

EDIT 3 (SAM example)

8_96_444_1622   73      scaffold00005   155754  255     54M     *       0       0       ATGTAAAGTATTTCCATGGTACACAGCTTGGTCGTAATGTGATTGCTGAGCCAG  BC@B5)5CBBCCBCCCBC@@7C>CBCCBCCC;57)8(@B@B>ABBCBC7BCC=>  NM:i:0
8_80_1315_464   81      scaffold00005   155760  255     54M     =       154948  0       AGTACCTCCCTGGTACACAGCTTGGTAAAAATGTGATTGCTGAGCCAGACCTTC  B?@?BA=>@>>7;ABA?BB@BAA;@BBBBBBAABABBBCABAB?BABA?BBBAB  NM:i:0
8_17_1222_1577  73      scaffold00005   155783  255     40M1116N10M     *       0       0       GGTAAAAATGTGATTGCTGAGCCAGACCTTCATCATGCAGTGAGAGACGC      BB@BA??>CCBA2AAABBBBBBB8A3@BABA;@A:>B=,;@B=A:BAAAA      NM:i:0  XS:A:+  NS:i:0
8_43_1211_347   73      scaffold00005   155800  255     23M1116N27M     *       0       0       TGAGCCAGACCTTCATCATGCAGTGAGAGACGCAAACATGCTGGTATTTG      #>8<=<@6/:@9';@7A@@BAAA@BABBBABBB@=<A@BBBBBBBBCCBB      NM:i:2  XS:A:+  NS:i:0
8_32_1091_284   161     scaffold00005   156946  255     54M     =       157071  0       CGCAAACATGCTGGTAGCTGTGACACCACATCAACAGCTTGACTATGTTTGTAA  BBBBB@AABACBCA8BBBBBABBBB@BBBBBBA@BBBBBBBBBA@:B@AA@=@@  NM:i:0


Two reads: 8_17_1222_1577 and 8_43_1211_347 are spliced.

My second column tags are: 65 73 81 83 97 99 113 115 129 137 145 147 161 163 177

next-gen sequencing rna short aligner • 17k views
0
Entering edit mode

I just stumbled over the difference between BAM (binary) and SAM (text) formats. Can you give a shortened example of your file?

4
Entering edit mode
12.1 years ago

The is a C API for SAM/BAM as well as a Java API (see the simple example) reading the files and iterating over the records. I think that all the parameters you described are part of the SamRecord class.

3
Entering edit mode

Pierre, your 'everybody' excludes almost everyone on this planet ;)

0
Entering edit mode

I would need to learn Java/C first :-). The good thing: there is "explains SAM flags" @Picard and a python script doing the same. See Edit

0
Entering edit mode

but everybody knows Java and C ! :-)

2
Entering edit mode
12.1 years ago

I wouldn't assume everybody knows R/BioConductor, but if not its worth learning to know it. The Rsamtools package might be a little bit more high-level API. In connection with the GenomicRanges package you can read in a BAM file into a gapped alignment object. Haven't tried this myself, so I don't know if this gives you the statistics you need. If it does, it's likely that the function unique would work on a an alignment object. Worth giving it a look?

Edit:

I just became aware that I wasn't really answering your question, because despite it's name R>sam<tools:

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original.

Rsamtool does not read SAM files. However TAB-separated SAM files can be read in very efficiently using scan in R. The first entry of the list will then contain the names of the reads/paired reads. To find the unique/non-unique hits ratio shoul be therefore something like:

ncol <- 15 # adjust to number of cols in your file
mysam <- scan("myfile.sam", what = list(rep(character(), ncol)), fill=TRUE)
unique.ratio <- length(unique(mysam[[1]]))/ length(mysam[[1]])


To filter out spliced hits vs. non-spliced hits, one has to know how this would be encoded in the tophat output. I am not sure if and how this is marked, e.g. in the flags, in the SAM file.

Regarding the Spliced matches according to your example: looks as if it can be inferred from the CIGAR string in column 6?

40M1116N10M = 40 Matches, 1116 bases ignored, 10 Matches


That would then be an intron of length 1116. Therefor, we have to filter the CIGAR string, that must a least contain 1 N. That can be done with a regexp using grep:

cigar <- mysam[[6]]
intron.ind <- grep("\\d+M\\d+N\\d+M", cigar) # greps M N M patterns
n.intron.matches <- length(intron.ind)


Hope this works, and solves all of the questions (and shows that R rules on the way).

0
Entering edit mode

So far I tripped over installation of Biostrings_2.14.12.tar.gz on Fedora 8/R 2.10.1 (Error in isSingleNumber(length) : element 1 is empty;). Rsamtools require some libs, which in turn require other ones. Well, I will check if it installs on Ubuntu 9.10.

0
Entering edit mode

Looks like Rsamtools lib is a new part of BioC 2.6, which in turn requires R 2.11. Time to upgrade.

0
Entering edit mode

Could be, I read about in on the bio-sig-seq mailing list. Especially with Bioconductor packages it's often necessary to stay with the latest version of R.

1
Entering edit mode
12.1 years ago
Darked89 4.2k

I wrote this simple python script. It works for accepted_hits.sam output from Tophat.

#!/usr/bin/env python

import sys

sam_fn = sys.argv[1]

#init dictionaries:
spliced   = {}
unspliced = {}

for i in range(0,7):
spliced[i]    = 0
unspliced[i]  = 0

for line in open(sam_fn):
if line[0] == "@":
pass
else:
sl = line.split()
cigar, missmatch_str = sl[5], sl[11]
num_mismatches = int(missmatch_str[-1])

if len(cigar) != 3:
spliced[num_mismatches]   += 1
else:
unspliced[num_mismatches] += 1

print "#processed_file:", sam_fn
print "#unspliced, spliced"
for i in range(0,7):
print unspliced[i], spliced[i]

print "###########################################################"

2
Entering edit mode

I do not get how you extract the spliced vs. non spliced information, is it len(cigar) != 3 ? Is that reliable information?

0
Entering edit mode

There is no mapping of fragments shorter than 10 for sure. So anything mapped in unspliced mode shoudl "have 2 digits + M" CIGAR string, at least for my data (paired 54 and 76 nucleotides). It is a hack, meaning it would be better to dig more into CIGAR (other mappers may do mapping with indels but not with splicing). If there is anything like "M1N" in my CIGARs then script gives wrong output. I will recheck it in few mins.

0
Entering edit mode

OK, I checked again to be sure. There is no "M1N" in my CIGAR strings, so no mapping with indels. Minimum intron size used by Tophat was 50.

1
Entering edit mode

This is very dodgy and lazy.