Question: BLAT: how to select best hit at one genomic position? (queries are repeats)
goubert.clement10 wrote:

Hello everyone,

I have searched several forums and can't find how to solve my problem, although I'm sure this has been done before. Here is the problem: I'm mapping a library of repeat sequences (transposable elements, TEs) onto one genome, so one query can have multiple matches in the genome. However, at a given genomic position, several different TEs from my library can match, and I want to filter the output file to keep only the best hit at each location of the genome.

Below is an example seen in the genome browser: the darker long bar represents the best repeat matching one genomic position. I want to select those among all the hits.

[Image: "Blast out"]

Usually, RepeatMasker is used for this kind of analysis, but I'm not totally satisfied with the results I get, and the way it works is something of a black box. I'm considering a sliding-window approach to select, at each base, the best TE hit among the different possibilities, but I have no clue at all how to do that (and no expertise!).

Thanks a lot for your help and advice,
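For what it's worth, the "best hit at each base" idea can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not a tested pipeline: it assumes the BLAT output is in the standard tab-separated PSL format, it ranks hits with a simple score (matches + repMatches/2 - mismatches - number of query inserts - number of target inserts, my reading of how the UCSC browser scores DNA PSL lines), and the names `Hit`, `parse_psl_line` and `best_hit_per_base` are made up for illustration.

```python
from collections import defaultdict, namedtuple

# One BLAT hit reduced to what is needed to rank it on the genome.
Hit = namedtuple("Hit", "score q_name t_name t_start t_end")


def parse_psl_line(line):
    """Turn one PSL data line into a Hit; return None for header lines."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 21 or not f[0].isdigit():
        return None
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_num_insert, t_num_insert = int(f[4]), int(f[6])
    # Assumed scoring scheme (see lead-in); swap in your own if you prefer.
    score = matches + rep_matches // 2 - mismatches - q_num_insert - t_num_insert
    return Hit(score, f[9], f[13], int(f[15]), int(f[16]))


def best_hit_per_base(hits):
    """Return (t_name, start, end, q_name) segments, one winner per base.

    Each target sequence is cut at every hit boundary; within each piece
    the highest-scoring covering hit wins, and adjacent pieces won by the
    same query are merged back together.
    """
    by_target = defaultdict(list)
    for h in hits:
        by_target[h.t_name].append(h)

    segments = []
    for t_name in sorted(by_target):
        group = by_target[t_name]
        bounds = sorted({p for h in group for p in (h.t_start, h.t_end)})
        for left, right in zip(bounds, bounds[1:]):
            covering = [h for h in group if h.t_start <= left and h.t_end >= right]
            if not covering:
                continue  # no TE hit over this stretch of the genome
            best = max(covering, key=lambda h: h.score)
            last = segments[-1] if segments else None
            if last and last[0] == t_name and last[2] == left and last[3] == best.q_name:
                segments[-1] = (t_name, last[1], right, best.q_name)
            else:
                segments.append((t_name, left, right, best.q_name))
    return segments
```

Feeding it the parsed lines of a PSL file (skipping the `None` results from the header) gives a non-overlapping annotation that can be written out as BED. The loop over covering hits is quadratic in the number of hits per scaffold, so a very large output would need an interval tree or a sweep line instead.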

Best,

Clément

ADD COMMENTlink modified 3.1 years ago by Amitm1.6k • written 3.1 years ago by goubert.clement10

Can you tell us what you are using as query? Is it fasta derived from fastq reads or something else?

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by genomax67k

Hello, these are FASTA sequences (sizes range from 200 to 10,000 bp). It is a manually curated TE library.

ADD REPLYlink written 3.1 years ago by goubert.clement10

Are the query sequences full length? Are you trying to locate their positions in the scaffolds? Do the query sequences share long stretches of similarity (i.e. can some of the smaller sequences be fully contained in larger ones)?

ADD REPLYlink written 3.1 years ago by genomax67k

Are the query sequences full length? --> YES and NO: some are, others are only partial pieces of TEs.

Are you trying to locate their positions in the scaffolds? --> YES

Do the query sequences share long stretches of similarity (i.e. can some of the smaller sequences be fully contained in larger ones)? --> I clustered them beforehand to avoid that as much as possible. Sequences within one cluster share at least 80% identity, and a shorter sequence is assigned to a cluster's reference sequence if 80% of its length is in the alignment. Then I keep only the reference sequences and map them onto the genome. The problem is that sometimes pieces from different clusters match the same genomic position.

Thanks!

ADD REPLYlink written 3.1 years ago by goubert.clement10

Interesting and difficult problem. I take it that the reference used to cluster the TEs is different from the scaffolds being searched against.

If you stick to local alignments then look for hits that cover 100% of the query (or close to) in addition to increasing gap open/extension penalties to filter out some of the partial matches. I am also wondering if you need to start looking at a program that does global alignments instead of local.
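As an illustration of the coverage filter, here is a minimal Python sketch. It assumes tab-separated PSL output from BLAT, measures coverage as the aligned span on the query (qEnd - qStart, which includes any gaps inside the alignment) divided by qSize, and uses an arbitrary 0.9 cutoff. The function names `query_coverage` and `filter_by_coverage` are hypothetical.

```python
def query_coverage(fields):
    """Fraction of the query spanned by the alignment, from PSL columns.

    PSL columns 11-13 (0-based indices 10-12) are qSize, qStart and qEnd.
    """
    q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
    return (q_end - q_start) / q_size


def filter_by_coverage(psl_lines, min_coverage=0.9):
    """Keep only the PSL data lines whose query coverage reaches the cutoff."""
    kept = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header and anything that is not a full data line.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        if query_coverage(fields) >= min_coverage:
            kept.append(line)
    return kept
```

Bear in mind that a strict cutoff like this will also throw away genuine but truncated or fragmented copies, so the threshold probably needs tuning against what you see in the browser.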

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by genomax67k

It would help to know what your goal is with this analysis, even in general terms. For example, are you trying to identify TEs in another genome using a custom repeat library, and if so, how divergent are the species you are comparing?

Whatever your goal is, BLAT is probably not the right tool for this task. You can tile across regions to get a contiguous set of hits with this output, but the assumptions of the program aren't going to be appropriate for most applications involving TEs (mainly due to divergence). Most of these issues have been handled by programs like RepeatMasker, so you might want to elaborate on what you are not happy about with that approach.

ADD REPLYlink written 3.1 years ago by SES8.2k
Amitm1.6k wrote:

Hi, I am assuming that the TEs align in one contiguous stretch, or with small gaps, but not with gaps the size of introns. In that scenario, you could try aligning with Bowtie. Under default settings it should report only the best hit. Increasing the -D parameter widens the search and should make it more sensitive when you suspect the query has multiple hits.

ADD COMMENTlink written 3.1 years ago by Amitm1.6k

Hello Amitm,

Thanks for your answer. Actually, I have a really complex genome where TEs are highly fragmented and nested within one another. This is why I use BLAT, which can handle gaps within a single alignment. In addition, I want the mapping to be very sensitive, because the TE copies can be highly divergent from their consensus (< 80% identity), so I think Bowtie won't be sensitive enough, but I'll give it a try!

ADD REPLYlink written 3.1 years ago by goubert.clement10

Keep in mind that bowtie v1 has an upper query size limit of ~1000 bp.

ADD REPLYlink written 3.1 years ago by genomax67k

Yes, the hyperlink actually points to Bowtie 2, which states no limit on input length. But after reading the details of Clément's scenario, I think Bowtie 2 might be quite inadequate for resolving the mapping.

ADD REPLYlink written 3.1 years ago by Amitm1.6k

Yes, I think it is not really made for this, unfortunately... :(

ADD REPLYlink written 3.1 years ago by goubert.clement10