How to blastn a very short query sequence? Identifying telomeres in assembly
3
0
Entering edit mode
4.2 years ago
mrj ▴ 170

hello, I have a fungal genome assembled into 222 contigs (contig.fasta). I would like to detect characteristic telemeric repeats (TTAGGG/CCCTAA) in the contigs. My approach was to create an indexed databse using the fasta file containing 222 contigs and performing blastn using a very short quesry sequence (tel1.fasta) (6 nucleotides only).

>query sequence
TTAGGG

following is the command I used. It does not return matches surprisingly even though I know that there are prenty of matches. Can you explain why it is happening and how to prevent it?

blastn -query tel1.fasta -db contig.fasta -task "blastn-short" -outfmt 7 -max_target_seqs 10 -evalue 0.5 -perc_identity 90

Is there any other way to detect telomeres in a given sequence assembly?

Thanks

blastn blast shortquery telomere • 1.6k views
ADD COMMENT
1
Entering edit mode

I don't think you can use the ". Try just blastn-short.

That works only for sequences 10 bp and longer according to this post: A: Blast Settings For Short Sequences

You may want to tryfuzznuc from EMBOSS for real short sequences like that.

ADD REPLY
0
Entering edit mode

The minimal seed length is 7 for blast actually. Still too much for this, unless you blast on two or more tandem repeats. For instance:

>query sequence
TTAGGGTTAGGGTTAGGG
ADD REPLY
0
Entering edit mode

That's a good one. Yes, i noticed that there were tandem repeats.

ADD REPLY
0
Entering edit mode

Thanks genomax. I need to try fuzznuc as you suggested.

ADD REPLY
0
Entering edit mode

Do you actually need alignment for this, or are the matches (reasonably) well conserved? Fuzzy matching, as already mentioned, or a regex approach might be sufficient?

ADD REPLY
0
Entering edit mode

Thanks for the suggestion. This sequence is a characteristic repeat. regex worked!!!!!

ADD REPLY
1
Entering edit mode
4.2 years ago

If you are looking for exact repeats of at least n TTAGGG, you could just perhaps just use grep.

ADD COMMENT
0
Entering edit mode

I used grep and it worked like a charm. Thanks a lot

ADD REPLY
0
Entering edit mode

Hi mrj, if you have a moment, please consider upvoting and/or marking this answer as accepted. This helps others down the road, who may ask a similar question and find this answer useful!

ADD REPLY
1
Entering edit mode
4.2 years ago

Just spitballin':

Could you build a quasi-PWM MEME motif table file from your telomeric repeat motif, and then use FIMO to scan your contig FASTA for hits of various quality (q-value/p-value)? FIMO is kind enough to look for reverse-strand hits, as well. For instance, here's an example of how to search human assembly for TF motifs, which could be extrapolated to your situation with some alterations to inputs: https://bioinformatics.stackexchange.com/a/2491/776

The sequence is short enough, and your contigs likely short enough, that another option might be to programmatically build a set of off-target sequences (one mismatch, two mismatches, etc., along with their reverse complements) and search for those as regex patterns via a command-line script (using grep or re library in Python).

If you get same-strand hits back that are spaced every N bases (N being the length of the motif, e.g. N=6) then you may have located a telomeric repeat.

ADD COMMENT
1
Entering edit mode
4.2 years ago
mb314 ▴ 20

I think the program HOMER might be able to search for your motif from a fasta file. Check out this link!

http://homer.ucsd.edu/homer/motif/genomeWideMotifScan.html

ADD COMMENT

Login before adding your answer.

Traffic: 2010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6