Question: Help in interpreting result of magic-blast on nt database
0
gravatar for Fabio Marroni
5 months ago by
Fabio Marroni2.4k
Italy
Fabio Marroni2.4k wrote:

I want to map a small number of reads against a comprehensive database (I used nt) to understand to which organism they do belong (as a confirmation/refinement after using metagenomics software). I can do that in blast, but I though that magic-blast could be better. The results I got are reasonable as far as the reads provenance is concerned, but I have some issue in understanding fully the SAM output format.

Assuming nt is the name of the local nt database I issued this command.

magicblast -query read_1.fastq -query_mate read_2.fastq -infmt fastq -db nt -no_unaligned -num_threads 8 -outfmt sam -out myout.sam

The sam file for a selected read pair looks like this

D00352:469:CD2JDANXX:6:2211:12452:28036 73  gi|400173048    1322861 255 86M39S  *   0   0   TTATTCGGCGTGGTCGCACCGCTGGCGATCACCGGAGCATCAATCGCCAGCAGTGAGCCGGTCAGGGTGACGCTGGTGGAAACAATACGCTGCGGCTGGCTTTCCAGTGTATGTGTGCCACGGCT   *   NH:i:4  AS:i:56 NM:i:6  MD:Z:29A5C8T8G14C11G5

D00352:469:CD2JDANXX:6:2211:12452:28036 345 gi|723220061    3480222 255 39S86M  *   0   0   AGCCGTGGCACACATACACTGGAAAGCCAGCCGCAGCGTATTGTTTCCACCAGCGTCACCCTGACCGGCTCACTGCTGGCGATTGATGCTCCGGTGATCGCCAGCGGTGCGACCACGCCGAATAA   *   NH:i:4  AS:i:56 NM:i:6  MD:Z:5C11G14C8A8G5T29

D00352:469:CD2JDANXX:6:2211:12452:28036 329 gi|1020388573   2063846 255 86M39S  *   0   0   TTATTCGGCGTGGTCGCACCGCTGGCGATCACCGGAGCATCAATCGCCAGCAGTGAGCCGGTCAGGGTGACGCTGGTGGAAACAATACGCTGCGGCTGGCTTTCCAGTGTATGTGTGCCACGGCT   *   NH:i:4  AS:i:56 NM:i:6  MD:Z:29A5C8T8G14C11G5

D00352:469:CD2JDANXX:6:2211:12452:28036 329 gi|1067038464   1215244 255 86M39S  *   0   0   TTATTCGGCGTGGTCGCACCGCTGGCGATCACCGGAGCATCAATCGCCAGCAGTGAGCCGGTCAGGGTGACGCTGGTGGAAACAATACGCTGCGGCTGGCTTTCCAGTGTATGTGTGCCACGGCT   *   NH:i:4  AS:i:56 NM:i:6  MD:Z:29A5C8T8G14C11G5

D00352:469:CD2JDANXX:6:2211:12452:28036 153 gi|159032194    189508  255 5S20M100S   *   0   0   AAGCGCACAATCCCTCCCCTCGCCCCTTTGGGGAGAGGGTTAGGGTGAGGGGAACAGCCAGCACTGGTGCGAACATTAACCCTCACCCCAGCCCTCACCCTGGAAGGGAGAGGGGGCAGAACGGC   *   NH:i:2  AS:i:20 NM:i:0  MD:Z:20

D00352:469:CD2JDANXX:6:2211:12452:28036 393 gi|850489648    73106717    255 100S20M5S   *   0   0   GCCGTTCTGCCCCCTCTCCCTTCCAGGGTGAGGGCTGGGGTGAGGGTTAATGTTCGCACCAGTGCTGGCTGTTCCCCTCACCCTAACCCTCTCCCCAAAGGGGCGAGGGGAGGGATTGTGCGCTT   *   NH:i:2  AS:i:20 NM:i:0  MD:Z:20

The first four reads are read one (where the 2nd line is the reverse complement), and the last two refer to read 2. All of them - to some extent - are mapped. However, all the relevant fields of the SAM file suggest that no read has a mate mapping.

For example in column 7, we see an "*", and not the name of the contig on which the mate is mapping. Also, all the SAM flags in column 2 include the "mate not mapped" statement.

What I noticed is:

1) Reads that map on different contigs are returned as paired only if they are both unique, e.g. as the one posted below

D00352:469:CD2JDANXX:6:1109:5843:60327  81      gi|1216459868   318360  255     87S21M17S       gi|1273279048   3829668 0       GGCGGTGGCAACGCCGGAAGGCTTAACAACGGTAGCAAGGTAATAAACGTGCCCGCCGCCGCCAGCCCGTAGTTC

D00352:469:CD2JDANXX:6:1109:5843:60327  161     gi|1273279048   3829668 255     125M    gi|1216459868   318360  0       GTGGAGAGCAGCATCAATAGCCCTGGTCGCGCACTATGTGCCAGCTTCCCGCTGGTTAACGCACCAATAGCCGCGCCGAGCGG

2) Read mapping on the same contig (in this case it was because the reads were almost fully overlapped) are returned as mapping in pair also when they have multiple hits. See below.

D00352:469:CD2JDANXX:6:2215:4955:58969  99      gi|1250053806   2713762 255     100M25S =       4339545 1625908 TTATTGCCCGGGGATGGGGAGTAAGCTTAGTCCGCCGTTGTCTCCGTGTGTCGCGTGCGCGGTTGCACGTCAGTCTCAGACGAACCGATGA

D00352:469:CD2JDANXX:6:2215:4955:58969  147     gi|1250053806   4339545 255     125M    =       2713762 -1625908        GAAAGCAATCAGCGATGGTGCTCTGACGGGTTCGAGTTCTGCTGTGATAACGGAGAGAGACTGCGTGTCACGTTCGCGCTGGA

D00352:469:CD2JDANXX:6:2215:4955:58969  353     gi|1347826457   3025751 255     100M25S =       76965   -2948661        TTATTGCCCGGGGATGGGGAGTAAGCTTAGTCCGCCGTTGTCTCCGTGTGTCGCGTGCGCGGTTGCACGTCAGTCTCAGACGA

D00352:469:CD2JDANXX:6:2215:4955:58969  401     gi|1347826457   76965   255     125M    =       3025751 2948661 GAAAGCAATCAGCGATGGTGCTCTGACGGGTTCGAGTTCTGCTGTGATAACGGAGAGAGACTGCGTGTCACGTTCGCGCTGGACTGCTGTG

I already read the post by Pierre (http://plindenbaum.blogspot.com/2016/09/playing-with-magicblast-ncbi-short-read.html), and he had no problems with the number of paired reads (more than 90% were paired in his case). However, the reference that I used is very different, with a lot of small contigs (average length 3800bp) sometimes containing highly similar sequence, and the probability that the two reads of a pair map on the same contig is usually small.

Now, my questions:

1) Is this magic-blast behavior expected?

2) Am I doing something wrong?

3) Would you trust this results or go back to the good old blast? I am pretty confident that results are reliable, but maybe I am biased!

ADD COMMENTlink written 5 months ago by Fabio Marroni2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1260 users visited in the last hour