Question: Alignment Of Thousands Of Short Rnas To Local Database
3
gravatar for Mattrition
8.6 years ago by
Mattrition80
Mattrition80 wrote:

I have a fasta file of about 200,000 RNA sequences and a server holding a local copy of Rfam. For each sequence I want to know the type of RNA it is most related to and ultimately retrieve statistics on the proportion of sequences that are most probably rRNAs, tRNAs, real sRNAs etc.

I'm sure this kind of thing has been done before plenty of times but I can't find much information on it myself. I was going to use BLAST for the job, retrieving the top scoring result from each output report using perl. My question is, is BLAST really capable of handling the local alignment of many very short (16 - 50nt) sequences in a reasonable amount of time?

local blast • 3.4k views
ADD COMMENTlink written 8.6 years ago by Mattrition80

I'll let others answer your direct BLAST question, but as a personal suggestion I would consider using MAFFT, which in fact is capable of doing what you describe.

ADD REPLYlink written 8.6 years ago by Jorge Amigo11k

That looks like a multiple alignment tool. The terminology is confusing here, my apologies. To clarify, I want to align each sequence in my RNA collection one after the other, find out the sequence they are most related to in the Rfam database, and extract information on that RNA's family to use in my statistics. Once I have a candidate family for each individual RNA, I can add up the number of RNAs in my collection that are part of each family and report that as percentages.

ADD REPLYlink written 8.6 years ago by Mattrition80

Soem more important questions. Is this a single organism RNA? Do you know which organism(s) are in the sample? Do you have a reference genome? If the answers are yes, then blast (and MAFFT anyway, you don't want to do MSA on 200.000 sequences!) is the wrong tool.

ADD REPLYlink written 8.6 years ago by Michael Dondrup47k
2
gravatar for ALchEmiXt
8.6 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

You might want to consider BLAT (Blast Like Alignment Tool) which is very fast and produces similar really good to tweak results. We use it for many mapping questions where bowtie or bwa are not in the picture. The BLAT default output tables are easy to parse.

ADD COMMENTlink written 8.6 years ago by ALchEmiXt1.9k
1

I am confused by what you mean by "the best alignments"?! For perfect or near-perfect matches you could consider bowtie and bwa respectively. But my guess would still be that BLAT can do the job.

ADD REPLYlink written 8.6 years ago by ALchEmiXt1.9k

BLAT seems like the most simple solution to this. However, if its like BLAST I still worry that its not going to give me the best alignments.

ADD REPLYlink written 8.6 years ago by Mattrition80

By "best alignments" I mean exactly that BLAST possibly has very poor specificity when it comes to very small sequences. I've not heard of bwa and bowtie though so I will check those out too.

ADD REPLYlink written 8.6 years ago by Mattrition80

bowtie and bwa (as well as stampy and many others) are so-called short-read mappers and will work perfectly but considering stringinty required of the mapping it goes from stringent bowtie, to bwa (tolerates indels) till BLAT tolerates larger mods... you can fine-tune BLAT for tile size and sich. See their manual.

ADD REPLYlink written 8.6 years ago by ALchEmiXt1.9k
1
gravatar for Michael Dondrup
8.6 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Question: where is that data coming from? e.g. Meta-genomic short-reads of bacterial RNA?

There is the tool Rfam_scan to scan a local Rfam database. Could you use this software? You have a large number of sequences though, that might give you some trouble. You might then be able to parallelize by splitting up the input data. Rfam-scan uses Blast for a pre-screening. Could you at least use the same parameters?

Another suggestion: it might be good to use other tools. tRNA-scanSE is generally the best tool for detecting tRNA-genes in whole genomes. How well tRNA-scan works on short reads (especially given your read length is shorter than the tRNA!) you could have to evaluate yourself.

In any case I would run a chained work-flow where I would run blast on a specific database and probably tRNAscan to filter out easy hits. look at this answer for finding rRNA-genes, if this is RNA-seq data, then these reads might constitute the vast majority of the data, then remove the clear hits from further search to reduce the search size and run other more computational expensive tools.

ADD COMMENTlink modified 9 months ago by RamRS27k • written 8.6 years ago by Michael Dondrup47k

Its a short-read transcriptome library. Those are good suggestions, I'll take a look at those perl scripts. However, the other suggestions assume that I am particularly looking for tRNAs and rRNAs, which is not the case. i just need to quickly see what my library is composed of in terms of RNA families.

ADD REPLYlink written 8.6 years ago by Mattrition80

My suggestion to first screen for tRNA and rRNA was mainly inted to reduce the number of input sequences, especially to subtract ribosomal RNA (might be up to 90% of the sample) reads, in order to spead up the subsequent steps. The idea is mainly to run the most efficient reduction as a first step in each pipeline.

ADD REPLYlink written 8.6 years ago by Michael Dondrup47k
0
gravatar for Larry_Parnell
8.6 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

In my experience of trying to recreate published PCR experiments, I don't hold as reliable BLASTN results when the query is less than 16 nt against a (reference) genome database. BLASTN can do the job you propose, but may need to be run twice with different parameters optimal for the shorter and then again the longer length queries. Jorge's offer of MAFFT is good. Smith-Waterman can also do this.

ADD COMMENTlink written 8.6 years ago by Larry_Parnell16k
1

Blast doesn't guarantee the best alignment because it's designed for finding alignments against a large amount of sequence. It uses a heuristic where it looks for a exact match seed sequence and then extends off the seed sequence.

ADD REPLYlink written 8.6 years ago by Damian Kao15k

I'm a bit confused. Everybody seems to be suggesting MAFFT, but from what I've read its used primarily for multiple sequence alignment. What I want to do is a pairwise alignment with every RNA sequence to the database in question. Does MAFFT do this better than BLAST might?

ADD REPLYlink written 8.6 years ago by Mattrition80

That's very good to know. But why is MAFFT better for pairwise alignments?

ADD REPLYlink written 8.6 years ago by Mattrition80

I agree with DK about teh BLAST heuristic, but still, 200K seqs at an avg length of ~35 bp is still 6.5 - 7 MB of data. This is probably a lot larger a database than was used by BLAST for many years. It's not at all like a database of 100 seqs.

ADD REPLYlink written 8.6 years ago by Larry_Parnell16k

Smith-Waterman is the algorithm that yields optimal local alignments, and that Needleman-Wunsch algorithm would yield the optimal global alignments. Now, the problem is they are way too slow, no way you are going to succeed for 200k sequences in you lifetime (maybe exagerated). Then, recommending a MSA program is not helping I guess. The FASTA algorithm/heuristic has higher sensitivity than Blast, but is slower as a tradeof.

ADD REPLYlink written 8.6 years ago by Michael Dondrup47k

Smith-Waterman is the algorithm that yields optimal local alignments, and Needleman-Wunsch algorithm would yield the optimal global alignments. Now, the problem is they are way too slow, no way you are going to succeed for 200k sequences in your lifetime (maybe exagerated), you need a heuristic approach. Then, recommending a MSA program is not helping I guess. The FASTA algorithm/heuristic has higher sensitivity than Blast, but is slower as a tradeof.

ADD REPLYlink written 8.6 years ago by Michael Dondrup47k
0
gravatar for Damian Kao
8.6 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

I don't think blast is the best tool to use when you have short sequences. Best you can do is adjust word size to shortest possible (7 for blastn, 2 for blastx) and have a low e-value threshold (1E-100 or something).

I would use an aligner like the one Jorge suggested and do a bunch of pair wise alignments to find the optimal alignment.

ADD COMMENTlink written 8.6 years ago by Damian Kao15k

For BLASTN, window of 7 is right, but for this specialized database of 200K seqs, 1e-100 for e-value threshold is far too stringent, certainly for the smaller queries.

ADD REPLYlink written 8.6 years ago by Larry_Parnell16k

Yeah you are right. The search space is pretty small. Maybe look at the bitscore instead of the e-value.

ADD REPLYlink written 8.6 years ago by Damian Kao15k

Exactly as I would do.

ADD REPLYlink written 8.6 years ago by Larry_Parnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 856 users visited in the last hour