How To Use The Recon Package To Identify Repeat Families From Genomic Sequences
11.2 years ago
Nathan Fig • 0

I'm trying to learn RECON and am experimenting using chr22. My steps so far, roughly:

  1. Make blast database from chr22.fa
  2. BLAST chr22.fa against its own database
  3. Run MSPCollect.pl (RECON provided script) to create an MSP file
  4. Run recon.pl on the MSP file and a list of sequence IDs

However, BLASTing a sequence against its own database either takes a prohibitively long time or returns nothing but 100% self hits. If I remove the self hits, I'm left with a set of alignments that RECON is happy to work with, but I had to write my own Python script to filter them out.

By now this is all feeling very convoluted, so my question is: am I even close to doing the BLAST part correctly? The RECON home page says to avoid self hits for performance's sake, but I haven't been able to discover how. Are there any other glaring mistakes?

blastn blast denovo • 3.9k views

Maybe you need something more complete, like RepeatModeler (http://repeatmasker.org/RepeatModeler.html) or REPET (http://urgi.versailles.inra.fr/Tools/REPET).


I have sent an email to the maintainer of this package via the request help link above.


Ah neat feature; thanks.

9.7 years ago
SES 8.6k

You want to avoid self hits because they complicate the graph structure and don't add any useful information. As you know, comparing a sequence to itself will result in all self hits, and it will take a very long time for sequences that are megabases in length. My advice would be to split the input into overlapping sequences and run BLAST on subsets of those in parallel. Writing the BLAST output as tab-delimited text will also save space. Since the BLAST-to-MSP conversion script provided with RECON accepts only a standard BLAST report, I am providing a script below to convert tab-delimited BLAST output to the RECON MSP format.
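The splitting step suggested above could look something like the following. This is only a minimal sketch: the function names, the `<name>_<start>_<end>` record naming, and the choice of chunk size and overlap are all assumptions for illustration, not anything RECON or BLAST requires.

```python
def split_with_overlap(seq, chunk_size, overlap):
    """Yield (start, chunk) pairs covering seq.

    start is 0-based; consecutive chunks share `overlap` bases.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield start, seq[start:start + chunk_size]
        # Stop once the current chunk reaches the end of the sequence.
        if start + chunk_size >= len(seq):
            break
        start += step


def write_chunks(name, seq, chunk_size, overlap, handle):
    """Write each chunk as a FASTA record.

    Records are named <name>_<start>_<end> with 1-based, inclusive
    coordinates (a hypothetical naming scheme) so that hits can be
    mapped back to the original sequence later.
    """
    for start, chunk in split_with_overlap(seq, chunk_size, overlap):
        handle.write(f">{name}_{start + 1}_{start + len(chunk)}\n{chunk}\n")
```

Each chunk then becomes its own query and database entry, so the searches can be run in parallel on subsets. Note that neighbouring chunks will still hit each other across their shared overlap, so those trivial matches may need filtering as well.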

The conversion script assumes you have many input sequences, so comment out line 16 as indicated above if you are doing a self-comparison of a single sequence. As JC mentioned in the comments, RECON is likely not the best program for your task (it is buggy, segfaults for silly reasons, produces a very high percentage of artifacts, etc.), but I hope this helps someone.
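For readers who want to see the shape of such a conversion, here is a rough Python sketch. It is not the script referred to above. It assumes the default 12-column tabular BLAST output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and it assumes the MSP line layout is zero-padded score, percent identity, query start, query end, query name, subject start, subject end, subject name. Please check that layout against the output of RECON's own MSPCollect.pl before relying on it. The function name and the `skip_same_id` flag are hypothetical.

```python
def blast_tab_to_msp(lines, skip_same_id=True):
    """Convert tabular BLAST lines (-outfmt 6) to MSP-style lines.

    The MSP field order and zero padding used here are an assumption;
    verify them against MSPCollect.pl output.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        qseqid, sseqid = f[0], f[1]
        pident = float(f[2])
        qstart, qend, sstart, send = (int(x) for x in f[6:10])
        bitscore = float(f[11])

        same_id = qseqid == sseqid
        # Always drop the trivial alignment of a region to itself.
        if same_id and (qstart, qend) == (sstart, send):
            continue
        # With many input sequences, drop every hit of a sequence to itself.
        # Pass skip_same_id=False for a self-comparison of a single sequence,
        # where every hit shares the same ID.
        if skip_same_id and same_id:
            continue

        yield "%06d %03d %08d %08d %s %08d %08d %s" % (
            round(bitscore), round(pident),
            qstart, qend, qseqid,
            sstart, send, sseqid,
        )
```

Typical use would be to open the tabular BLAST file, pass the file handle to `blast_tab_to_msp`, and write each returned line to the MSP file given to recon.pl.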
