Question: Is the Picard SortSam queryname or coordinate equivalent to samtools sort -n?
3
gravatar for jon.brate
5.5 years ago by
jon.brate250
Norway
jon.brate250 wrote:

I don't quite understand what is meant by "coordinate" and "queryname" in the Picard SortSam, and which of them are equivalent to the samtools sort -n option?

sort samtools picard • 9.2k views
ADD COMMENTlink modified 3.7 years ago by John12k • written 5.5 years ago by jon.brate250
8
gravatar for Devon Ryan
5.5 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Coordinate is the same as samtools sort

Queryname is equivalent to samtools sort -n

BTW, as pointed out by Dariober, queryname sorting just guarantees that alignments from the same reads (or read pairs) will be grouped together. Picard/samtool/sort have slightly different methods of sorting strings, so the exact ordering of the groups may differ between them.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Devon Ryan92k
5
gravatar for John
3.7 years ago by
John12k
Germany
John12k wrote:

Just as a bit of colour for this post, the three ways query name can be sorted as pointed out by dariober are:

samtools -n: Uses a natural sort. The check in python would be:

def nat(qname): return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', qname)]
it = iter(qnames); it.next()
all(nat(b) >= nat(a) for a, b in itertools.izip(qnames, it))

picard SortSam SORT_ORDER=queryname: Uses an ASCII sort:

it = iter(qnames); it.next()
all(b >= a for a, b in itertools.izip(qnames, it))     # returns True or False accordingly

unix sort: Uses a "dictionary sort" which is unix for "ASCII sort for only alphanumeric values":

def alphanumeric(qname): [ s for s in qnames[0] if s.isalnum() ]
it = iter(qnames); it.next() # always 1 ahead
all(alphanumeric(b) >= alphanumeric(a) for a, b in itertools.izip(qnames, it))

When sorting by position, fortunately samtools and Picard are exactly the same. This is because they don't try and sort the chromosomal order (using the same incompatible method as the qname), but instead use whatever is already in the head of the SAM/BAM file. Since neither tool lets you sort without a header, we get compatibility here :) Also, I think most mappers return the chromosomes in natural-sort order (chr1, chr2, ... chr10), which is nice. Maybe thats just how the genome.fa comes.

With unix's sort -k3,3 -k4,4n however, which you'll see scattered around the place for sorting SAM files, you'll get chromosomes in the same dictionary-sort order as unix sort gives the qnames (chr1, chr10, ... chr2). This of course is a really really bad idea.

ADD COMMENTlink modified 3.6 years ago • written 3.7 years ago by John12k
1

samtools -n: Uses an ASCII sort.

No, this is certainly not true, see e.g. https://github.com/samtools/hts-specs/issues/5

Samtools does a natural sort per subfield (which are separated by ':'), i.e. lexicographic sort for string-subfields, and numeric sort for integer sub-field. (Very annoying BTW, as it means you can't use Unix tools like comm and join).

I believe picardtools sort -n do lexigraphic sort, like Unix. But I'll check it first.

ADD REPLYlink written 3.6 years ago by plijnzaad40

Sorry, you're absolutely right - even though I wrote a tool to detect sort order, I got samtools and picard the wrong way around in the post.

unix sort being different from picard and samtools is still true though, as is everything about the headers, etc.

enter image description here

enter image description here

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by John12k
2

Hi, I got confused a bit myself regarding the difference between picard sort order and Unix sort order. It turns out to be slightly nasty. Picard sort uses ASCII sort order, whereas the Unix sort uses 'lexicographic' order, which depends on the locale. The nasty bit is that in many locales, including the very commonly used en_US (and derivatives), the ':' character sorts before the numbers 0-9, whereas in ASCII, the ':' sorts after the numbers 0-9. In the context of SAM read identifier fields, this is clearly a very significant difference (which you, incidentally, may not notice in the first few hundred lines when comparing visually). What to do? Well simply switch your locale to ASCII, by doing

export LC_ALL=C; # now sort/join/compare stuff here

or, temporarily, along the lines of

samtools view file.bam | env LC_ALL=C sort | samtools view -b > file-sorted.bam

If you do it like this (and you also use the LC_ALL=C setting when using join (1) or comm (1), things seem to work.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by plijnzaad40

Ay, you're right again - I thought unix sort does by default a dictionary sort (-d), because when i sort files with unix sort on linux I get back something that fails a ASCII sort test, but passes a dictionary sort test. But actually it seems the default is as you say, an ASCII sort that just so happened to fail a python ASCII sort (due to locale) and pass a dictionary sort, because it removes the ambiguous characters like ':'

So i should change my code to check for ASCII sorts of multiple locales. Thank you plijnzaad :) That would have been a very difficult one to debug without your help :)

EDIT: Also, I still think you should never-ever-ever do a unix sort :P (or even a sort on a SAM file at all).

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by John12k
2
gravatar for dariober
5.5 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

queryname seems to be kind of equivalent to samtools sort -n Or I'm missing something?

java -jar SortSam.jar INPUT=test.bam OUTPUT=test.qpic.bam SO=queryname
samtools sort -n test.bam test.nsam

samtools output

samtools view test.nsam.bam | cut -f1 | head
D00408:131:1:1101:1268:49755#TAAGGCGA#NNCTCGGC
D00408:131:1:1101:1270:20060#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1270:20060#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1291:59756#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1291:59756#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1314:9343#TAAGGCGA#NCCTCCGC
D00408:131:1:1101:1314:9343#TAAGGCGA#NCCTCCGC
D00408:131:1:1101:1323:88558#TAAGGCGA#NCGTCGGC
D00408:131:1:1101:1323:88558#TAAGGCGA#NCGTCGGC
D00408:131:1:1101:1338:23494#TAAGGCGA#NCGTCCGC

picard output

samtools view test.qpic.bam | cut -f1 | head
D00408:131:1:1101:10087:85495#TAAGGCGA#TCCTCGGC
D00408:131:1:1101:10087:85495#TAAGGCGA#TCCTCGGC
D00408:131:1:1101:10092:35208#TAAGGCGA#TCCTCCGC
D00408:131:1:1101:10092:35208#TAAGGCGA#TCCTCCGC
D00408:131:1:1101:10104:60166#TAAGGCGA#TCGTCGGC
D00408:131:1:1101:10104:60166#TAAGGCGA#TCGTCGGC
D00408:131:1:1101:10120:16442#TACGGCGA#TCCTCGGC
D00408:131:1:1101:10120:16442#TACGGCGA#TCCTCGGC
D00408:131:1:1101:10125:56648#TAAGGCGA#TCCTCCGC
D00408:131:1:1101:10169:64089#TAAGGCGA#TCCTCGGC

And neither passes the unix sort check, at least on my system settings:

samtools view test.qpic.bam | cut -f1 | sort -c
sort: -:398: disorder: D00408:131:1:1101:1414:61894#TAAGGCGA#NCGTCGGC
samtools view test.nsam.bam | cut -f1 | sort -c
sort: -:332: disorder: D00408:131:1:1101:5242:78722#TAAGGCGA#TCCTCCGC
ADD COMMENTlink modified 5 weeks ago by RamRS24k • written 5.5 years ago by dariober10k
1

The read is the query, so "queryname" literally means "read name". This is also why the read name field in a SAM/BAM file is labeled qname, as opposed to rname, which is the "reference name" aka the chromosome/contig name.

ADD REPLYlink written 5.5 years ago by Devon Ryan92k

Mmm... Sure... In any case SortSam.jar SO=queryname and samtools sort -n should sort by read name, but in my example above they do it in a different way, which is also different from unix sort

ADD REPLYlink written 5.5 years ago by dariober10k
1

I just updated my own answer to mention this, in fact. There are a lot of ways that one can think to order strings. Samtools uses strnum_cmp, while sort uses strcmp, and apparently picard uses yet another method. The point of sorting by queryname is just to group alignments from the same reads/read pairs, so any of these methods will work, even if they do give somewhat different results.

ADD REPLYlink written 5.5 years ago by Devon Ryan92k

If you're ever interested, have a look at this thread from the samtools email list.

ADD REPLYlink modified 5 weeks ago by RamRS24k • written 5.5 years ago by Devon Ryan92k

BTW, the difference in sorting is due to the many different ways of sorting strings of variable length.

ADD REPLYlink written 5.5 years ago by Devon Ryan92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1482 users visited in the last hour