Question: Sam Format Parsing / Stats
gravatar for Darked89
9.3 years ago by
Barcelona, Spain
Darked894.2k wrote:

I will need to create summaries for a bunch of SAM files produced by Tophat (RNA-Seq spliced mapping). Are there any tools /scripts out there which will give me stats for unique vs non unique matches, spliced vs non_spliced mapping and finally split it into number of mismatches?

EDIT I collected all tags from my SAM file and feed it to (you can get it from Picard source:

None of my tags from Tophat's accepted_hits.sam has "not primary alignment". Looks like Tophat reports only unique matches, which is OK for me. Can somebody confirm this?

EDIT 2 "Tophat reports only unique matches" can not be true. Sequence like below have "0" flag. "CAACAACAGCAACAACAACAGCAACAGCAACAGCAACAGCAACAGCAACAACAA". Puzzling.

EDIT 3 (SAM example)

8_96_444_1622   73      scaffold00005   155754  255     54M     *       0       0       ATGTAAAGTATTTCCATGGTACACAGCTTGGTCGTAATGTGATTGCTGAGCCAG  BC@B5)5CBBCCBCCCBC@@7C>CBCCBCCC;57)8(@B@B>ABBCBC7BCC=>  NM:i:0
8_80_1315_464   81      scaffold00005   155760  255     54M     =       154948  0       AGTACCTCCCTGGTACACAGCTTGGTAAAAATGTGATTGCTGAGCCAGACCTTC  B?@?BA=>@>>7;ABA?BB@BAA;@BBBBBBAABABBBCABAB?BABA?BBBAB  NM:i:0
8_17_1222_1577  73      scaffold00005   155783  255     40M1116N10M     *       0       0       GGTAAAAATGTGATTGCTGAGCCAGACCTTCATCATGCAGTGAGAGACGC      BB@BA??>CCBA2AAABBBBBBB8A3@BABA;@A:>B=,;@B=A:BAAAA      NM:i:0  XS:A:+  NS:i:0
8_43_1211_347   73      scaffold00005   155800  255     23M1116N27M     *       0       0       TGAGCCAGACCTTCATCATGCAGTGAGAGACGCAAACATGCTGGTATTTG      #>8<=<@6/:@9';@7A@@BAAA@BABBBABBB@=<A@BBBBBBBBCCBB      NM:i:2  XS:A:+  NS:i:0
8_32_1091_284   161     scaffold00005   156946  255     54M     =       157071  0       CGCAAACATGCTGGTAGCTGTGACACCACATCAACAGCTTGACTATGTTTGTAA  BBBBB@AABACBCA8BBBBBABBBB@BBBBBBA@BBBBBBBBBA@:B@AA@=@@  NM:i:0

Two reads: 8_17_1222_1577 and 8_43_1211_347 are spliced.

My second column tags are: 65 73 81 83 97 99 113 115 129 137 145 147 161 163 177

ADD COMMENTlink modified 9.3 years ago • written 9.3 years ago by Darked894.2k

I just stumbled over the difference between BAM (binary) and SAM (text) formats. Can you give a shortened example of your file?

ADD REPLYlink written 9.3 years ago by Michael Dondrup46k
gravatar for Pierre Lindenbaum
9.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

The is a C API for SAM/BAM as well as a Java API (see the simple example) reading the files and iterating over the records. I think that all the parameters you described are part of the SamRecord class:

ADD COMMENTlink written 9.3 years ago by Pierre Lindenbaum122k

Pierre, your 'everybody' excludes almost everyone on this planet ;)

ADD REPLYlink written 9.3 years ago by Eric Normandeau10k

I would need to learn Java/C first :-). The good thing: there is "explains SAM flags" @Picard: and a python script doing the same. See Edit

ADD REPLYlink written 9.3 years ago by Darked894.2k

but everybody knows Java and C ! :-)

ADD REPLYlink written 9.3 years ago by Pierre Lindenbaum122k
gravatar for Michael Dondrup
9.3 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I wouldn't assume everybody knows R/BioConductor, but if not its worth learning to know it. The Rsamtools package might be a little bit more high-level API. In connection with the GenomicRanges package you can read in a BAM file into a gapped alignment object. Haven't tried this myself, so I don't know if this gives you the statistics you need. If it does, it's likely that the function unique would work on a an alignment object. Worth giving it a look?


I just became aware that I wasn't really answering your question, because despite it's name R>sam<tools:

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original.

Rsamtool does not read SAM files. However TAB-separated SAM files can be read in very efficiently using scan in R. The first entry of the list will then contain the names of the reads/paired reads. To find the unique/non-unique hits ratio shoul be therefore something like:

ncol <- 15 # adjust to number of cols in your file
# adapt the what part to fit your file:
mysam <- scan("myfile.sam", what = list(rep(character(), ncol)), fill=TRUE)
unique.ratio <- length(unique(mysam[[1]]))/ length(mysam[[1]])

To filter out spliced hits vs. non-spliced hits, one has to know how this would be encoded in the tophat output. I am not sure if and how this is marked, e.g. in the flags, in the SAM file.

Regarding the Spliced matches according to your example: looks as if it can be inferred from the CIGAR string in column 6?

40M1116N10M = 40 Matches, 1116 bases ignored, 10 Matches

That would then be an intron of length 1116. Therefor, we have to filter the CIGAR string, that must a least contain 1 N. That can be done with a regexp using grep:

cigar <- mysam[[6]]
intron.ind <- grep("\\d+M\\d+N\\d+M", cigar) # greps M N M patterns
n.intron.matches <- length(intron.ind)

Hope this works, and solves all of the questions (and shows that R rules on the way).

ADD COMMENTlink modified 11 months ago by RamRS23k • written 9.3 years ago by Michael Dondrup46k

So far I tripped over installation of Biostrings_2.14.12.tar.gz on Fedora 8/R 2.10.1 (Error in isSingleNumber(length) : element 1 is empty;). Rsamtools require some libs, which in turn require other ones. Well, I will check if it installs on Ubuntu 9.10.

ADD REPLYlink written 9.3 years ago by Darked894.2k

Looks like Rsamtools lib is a new part of BioC 2.6, which in turn requires R 2.11. Time to upgrade.

ADD REPLYlink written 9.3 years ago by Darked894.2k

Could be, I read about in on the bio-sig-seq mailing list. Especially with Bioconductor packages it's often necessary to stay with the latest version of R.

ADD REPLYlink written 9.3 years ago by Michael Dondrup46k
gravatar for Darked89
9.3 years ago by
Barcelona, Spain
Darked894.2k wrote:

I wrote this simple python script. It works for accepted_hits.sam output from Tophat.

#!/usr/bin/env python

import sys

sam_fn = sys.argv[1]

#init dictionaries:
spliced   = {}
unspliced = {}

for i in range(0,7):
    spliced[i]    = 0
    unspliced[i]  = 0

for line in open(sam_fn):
    if line[0] == "@":
        sl = line.split()
        cigar, missmatch_str = sl[5], sl[11]
        num_mismatches = int(missmatch_str[-1])

        if len(cigar) != 3:
            spliced[num_mismatches]   += 1  
            unspliced[num_mismatches] += 1

print "#processed_file:", sam_fn
print "#unspliced, spliced"
for i in range(0,7):
    print unspliced[i], spliced[i]

print "###########################################################"
ADD COMMENTlink modified 11 months ago by RamRS23k • written 9.3 years ago by Darked894.2k

I do not get how you extract the spliced vs. non spliced information, is it len(cigar) != 3 ? Is that reliable information?

ADD REPLYlink written 9.3 years ago by Michael Dondrup46k

There is no mapping of fragments shorter than 10 for sure. So anything mapped in unspliced mode shoudl "have 2 digits + M" CIGAR string, at least for my data (paired 54 and 76 nucleotides). It is a hack, meaning it would be better to dig more into CIGAR (other mappers may do mapping with indels but not with splicing). If there is anything like "M1N" in my CIGARs then script gives wrong output. I will recheck it in few mins.

ADD REPLYlink written 9.3 years ago by Darked894.2k

OK, I checked again to be sure. There is no "M1N" in my CIGAR strings, so no mapping with indels. Minimum intron size used by Tophat was 50.

ADD REPLYlink written 9.3 years ago by Darked894.2k

This is very dodgy and lazy.

ADD REPLYlink written 4.4 years ago by SmallChess490
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 495 users visited in the last hour