Question

MALT extract reads and BWA aln

3.5 years ago

v.shapovalova1 ▴ 10

Hello, this is my first time using MALT.

I ran malt-build on all RefSeq bacterial genomes with the mapping file --acc2taxonomy megan-nucl-Jan201.db

I then ran malt-run on ancient reads with the flags -id 85 -m BlastN -at SemiGlobal -top 1 -supp 0.01 -mq 100, opened the .rma6 file in MEGAN desktop, clicked on the species, and chose File --> Extract Reads.

This gave me a .fasta file with the reads that were assigned to the species.

Then I decided to validate these assignments. I took the ID of the reference from the .blast file (also downloaded after clicking on the species) and ran BWA:

bwa aln ref_index extracted_reads_from_megan.fasta > result.bwa

bwa samse ref_index result.bwa extracted_reads_from_megan.fasta > result.sam

samtools flagstat reports no hits. I then tested against the RefSeq reference genome of this species, and again there were no hits. Should I use stricter parameters for malt-build and malt-run? When I ran some of these reads through blastn on the website, it showed hits to [Eukaryotic synthetic construct chromosome 20]. The reads given to malt-run were the ones left unmapped after mapping to human. What could be the mistake?

Many thanks, Valery

ancient metagenomic megan malt • 1.1k views

updated 3.0 years ago by Lesley Sitter ▴ 600 • written 3.5 years ago by v.shapovalova1 ▴ 10

score 0 · Answer 1 · 2021-10-23

So MALT is a weird beast. It's not a taxonomic identifier in the broader sense; it's a reference-based identifier. If your reference has, for example, two species of a bacterial genus but not the third, reads from the third will most likely hit one of the other two and produce a false positive, because the true positive wasn't in the database. Partial hits are still hits.

With aDNA you of course want to allow for mismatches, but that doesn't mean you can set your identity threshold too low. Otherwise you cross from the range where it still accommodates C>T and G>A damage substitutions into the range where you are just matching mouse DNA to human (they are also about 85% identical in the broader sense). Additionally, you have the problem of short fragments, which are more prone to spurious matches: the fewer bases you have, the less unique a sequence becomes.

SO! Without any examples or input/output, I of course can't do much with the information you provided. I don't even know which version of MALT you are working with, whether your reads map to multiple organisms, what your species of interest is, etc., so you need to provide much more detail for people to help you here.

But in general I would say 85% is way too low. If you want confident calls, stick with 95%. This allows 1 mismatch in a 30 bp read and up to 5 in a 100 bp read; longer reads provide more confidence, so it makes sense to allow mismatches mainly in the longer reads and hardly any in the shorter ones. If your sample is UDG-treated, this should be enough; if not, you can maybe go down to 90-93%, but be warned that this will increase the background noise.
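The arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration, assuming percent identity is counted as matching bases over the full read length with no gaps; how MALT computes identity over the aligned region may differ in detail, and the function name `max_mismatches` is made up for this example.

```python
def max_mismatches(read_len: int, min_identity_pct: int) -> int:
    """Largest number of mismatches that still meets the identity cutoff.

    Assumption: identity = matches / read_len, ungapped (a simplification).
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    return read_len * (100 - min_identity_pct) // 100


# Compare a strict, an intermediate, and a loose cutoff across read lengths.
for pct in (95, 90, 85):
    for length in (30, 50, 100):
        print(f"{pct}% identity, {length} bp read: "
              f"up to {max_mismatches(length, pct)} mismatches")
```

Under this simplification, an 85% cutoff lets a 30 bp read carry 4 mismatches, which leaves very little of the read to make the hit specific.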

Secondly, if you took a tiny database as your reference, you are going to have false positives. The power of MALT lies in comparative taxonomic identification: if a read matches 20 organisms, the LCA algorithm can assign the read to a higher node in the taxonomic tree, essentially cleaning up your output. If you only have 5 genomes in there, you will get matches to repeats, repetitive motifs, transposons, integrases, insertion sequences, and elements shared among all organisms such as the 16S and 18S rRNA genes (remember, 85% allows for several mismatches, so in these ribosomal regions that can maybe be the difference between a hamster and a whale).
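To make the LCA idea concrete, here is a toy sketch of a naive lowest-common-ancestor assignment. It is not MALT's or MEGAN's actual implementation; the `parent` table is a hand-written, heavily simplified taxonomy (intermediate ranks are skipped) and the function names are hypothetical.

```python
def lineage(taxon, parent):
    """Return the path from the root down to `taxon` (root first)."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def naive_lca(taxa, parent):
    """Deepest node shared by the lineages of all taxa a read hit."""
    lca = None
    for level in zip(*(lineage(t, parent) for t in taxa)):
        if len(set(level)) != 1:
            break  # lineages diverge here, so stop at the previous node
        lca = level[0]
    return lca


# Toy taxonomy, child -> parent (assumption: simplified for illustration).
parent = {
    "Yersinia pestis": "Yersinia",
    "Yersinia pseudotuberculosis": "Yersinia",
    "Yersinia": "Enterobacterales",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacterales",
    "Enterobacterales": "Bacteria",
}

# A read hitting only one species stays at species level ...
print(naive_lca(["Yersinia pestis"], parent))
# ... while a read hitting a conserved region is pushed up the tree.
print(naive_lca(["Yersinia pestis", "Escherichia coli"], parent))
```

With only a handful of genomes in the database, the second situation cannot arise for most reads, so ambiguous hits stay pinned to whichever species happens to be present.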

Lastly, the reference could have issues. The famous case, I think, was the Cyprinus carpio genome: everyone gets carp in their data, so people assumed carp was just everywhere. But it turns out the carp genome was assembled without adapter removal, so some generic Illumina adapters were assembled into the genome, and if you didn't do adapter trimming beforehand, reads still carrying adapter sequence might match that part of the genome. Other genomes have similar issues: Ovis canadensis, I think it was, has a partial Pseudomonas contig in its genome, so if you have Pseudomonas in your sample (a common bacterium) you might get hits to this bighorn sheep, but it's just an informatics artefact :(
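One quick sanity check for the adapter scenario is to look for leftover adapter sequence in the reads extracted from MEGAN. The sketch below is a rough screen rather than a replacement for a proper trimming tool: it assumes the common Illumina adapter start `AGATCGGAAGAGC`, matches it exactly (no mismatches, no partial adapters at read ends), and the helper names are made up for this example.

```python
ADAPTER_PREFIX = "AGATCGGAAGAGC"  # assumed: start of the common Illumina adapter


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reads_with_adapter(records, adapter=ADAPTER_PREFIX):
    """Names of reads containing the adapter on either strand (exact match)."""
    rc = revcomp(adapter)
    return [name for name, seq in records
            if adapter in seq.upper() or rc in seq.upper()]
```

An open file handle can be passed straight to `parse_fasta`, for example the `extracted_reads_from_megan.fasta` file from the question. If many reads are flagged, the assignment is more likely to reflect untrimmed adapters than real organism content.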