Using MEGAN to Parse a SAM File
12.0 years ago
Shuixia100 ▴ 120

Dear all,

I have a question about using MEGAN4 to parse a SAM file.

What I want is a taxonomic and functional annotation of my raw reads against the nr database. The read set is too large (11 million reads in total, 100 bp each) to BLAST directly against nr. So I took a different approach: I first assembled the reads and predicted ORFs, which are easy to BLAST, and then aligned the reads to the ORFs. I now want to use MEGAN to parse the read-to-ORF alignment and thereby obtain annotations for the raw reads.

Here is what I did exactly:

  • first assembled the reads into contigs
  • then used MetaGeneMark to find open reading frames (ORFs) of a size suitable for BLAST against the nr database
  • ran BLAST on the ORFs against the nr database
  • imported the ORF BLAST result into MEGAN with default parameters and successfully obtained the RMA file
  • used the Export-Assignments To CSV function of MEGAN4 to generate a synonyms file with two tab-separated columns: the first is the ORF name and the second is the taxonomy ID
  • used bowtie to align my raw reads to the ORFs, producing the SAM file that I want to parse

My problem: according to the user manual, importing a SAM file together with the synonyms file should let MEGAN parse the SAM file. However, all my reads are assigned to just two large groups, "No hits" and "Low complexity".

I have tried several times and always get the same result. Does anyone know how to fix this? Or is there an alternative method to parse the SAM file?
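As one possible alternative outside MEGAN, the read-to-ORF alignments can be tallied directly. The following is only a minimal sketch, not MEGAN's method: it assumes the two-column tab-separated synonyms file described above and a plain-text SAM file, and it simply lets each primary alignment inherit the taxonomy ID of the ORF it hit (no LCA step). The function names and the labels `unaligned` and `unmapped_orf` are made up for illustration.

```python
import csv
from collections import Counter


def load_synonyms(path):
    """Read a two-column, tab-separated file: ORF name, taxonomy ID."""
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                mapping[row[0].strip()] = row[1].strip()
    return mapping


def count_reads_per_taxon(sam_path, orf_to_taxon):
    """Count primary alignments per taxonomy ID (hypothetical helper)."""
    counts = Counter()
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@"):
                continue  # skip SAM header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # not a valid alignment record
            flag = int(fields[1])
            rname = fields[2]
            if flag & 0x4 or rname == "*":
                counts["unaligned"] += 1  # read did not align to any ORF
            elif flag & 0x900:
                continue  # secondary or supplementary alignment, ignore
            else:
                # ORFs absent from the synonyms file get a placeholder label
                counts[orf_to_taxon.get(rname, "unmapped_orf")] += 1
    return counts
```

A large `unmapped_orf` count would suggest that the ORF names in the SAM file and in the synonyms file do not agree.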



Just make sure that the ORF-to-taxonomy mapping (synonyms) is actually used during the SAM file import.

From your description, it is not clear that you specified the synonyms file as a parameter during the import.
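A quick way to rule out a naming mismatch is to compare the reference names in the SAM file with the first column of the synonyms file. This is a rough sketch under the assumption that the synonyms file is tab-separated with the ORF name first; the function names are hypothetical. A mismatch could arise, for example, if the aligner keeps only the part of the FASTA header before the first space.

```python
def sam_reference_names(sam_path):
    """Collect the reference names (RNAME, column 3) used by alignments."""
    names = set()
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@"):
                continue  # skip SAM header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2 and fields[2] != "*":
                names.add(fields[2])
    return names


def synonym_names(synonyms_path):
    """Collect the ORF names from the first column of the synonyms file."""
    names = set()
    with open(synonyms_path) as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2 and parts[0]:
                names.add(parts[0])
    return names


def missing_from_synonyms(sam_path, synonyms_path):
    """Return sorted reference names that have no entry in the synonyms file."""
    return sorted(sam_reference_names(sam_path) - synonym_names(synonyms_path))
```

If `missing_from_synonyms("reads.sam", "synonyms.txt")` (placeholder file names) returns most of the ORFs, the names do not line up and no taxonomy can be attached to the reads.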


Thanks for your comment. Yes, I did use the synonyms file when parsing. I have also written to the developers of MEGAN; they told me that a bug was causing this problem and that they have released a new version of MEGAN which can now parse SAM files.

12.0 years ago
Michael 54k

There are errors in your workflow. First, MEGAN is meant for raw reads, so don't assemble them: this avoids chimeric assemblies and lets MEGAN count the number of reads. Second, I wouldn't bias your result towards predicted coding regions, because you will be losing a lot of information, and I cannot imagine gene prediction on fragments working well. Then, if you are comparing DNA against nr, you need to use blastx; that is probably the reason for not getting any taxa. However, if you are working with bacterial sequences, better use nt with blastn, otherwise you will not pick up intergenic and non-coding sequences. In my experience, blastx or tblastx, i.e. a comparison at the amino acid level, works best for viral metagenomes, where coverage of the natural variability is so low that the closest related genome is too distant at the nucleotide level to find anything.

Thus, try with the raw reads and BLAST against nt, then add the SAM file, and it should work much better. I just saw that you say your reads file is too big. Well, that is not exactly true: you just need enough compute power, split your files so that you can run BLAST in multiple processes, and wait... Alternatively, you may reduce the database to bacterial taxa only, since the full nt is not necessary, or even use a 16S-only database, e.g. SILVA. Even drawing a manageable random subset of the input reads and discarding the rest will give you a less skewed analysis than the non-standard workflow you are proposing.
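Both the splitting and the random subsetting can be done with a few lines of Python. The following is only a sketch, assuming the reads are available as a FASTA file (FASTQ would need converting first); the function names and the output file naming scheme are invented for illustration. Splitting is round-robin, and the subset is drawn by reservoir sampling so the whole file never has to be held in memory.

```python
import random


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif header is not None:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, n_chunks, prefix):
    """Distribute records round-robin over n_chunks files; return their paths."""
    paths = [f"{prefix}.{i:03d}.fasta" for i in range(n_chunks)]
    handles = [open(p, "w") for p in paths]
    try:
        for i, (header, seq) in enumerate(read_fasta(path)):
            handles[i % n_chunks].write(f"{header}\n{seq}\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


def subsample_fasta(path, k, seed=0):
    """Return k records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fasta(path)):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

Each chunk can then be given to a separate BLAST process and the results concatenated afterwards.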

Metagenomics using BLAST is a very resource-intensive analysis; you can try CARMA instead and see whether it uses fewer resources. Assembling the reads is not a viable workflow IMO, because of the problems mentioned above.


Thanks for your answer :) Here are some thoughts on your reply: 1) I did the assembly not only to reduce the computational cost but also to obtain a draft genome and putative genes from the metagenomic data. 2) Sorry for the misspelling; I did use blastx to search against nr. 3) Regarding my previous post, I have written to the developers of MEGAN; they told me that a bug was causing this problem and that they have released a new version of MEGAN which can now parse SAM files. 4) Again, thanks for suggesting CARMA. I understand that there are alternative approaches such as MG-RAST that analyse metagenomic reads with faster tools, but in my view these tools give up homology-search sensitivity in exchange for speed, and that is another reason I insist on using BLAST for my assembled ORFs.
