How to parse RepeatMasker output
2.0 years ago

I have .out files produced by a RepeatMasker run. They look like this:

  SW   perc perc perc  query     position in query              matching          repeat                position in repeat
score   div. del. ins.  sequence  begin    end          (left)   repeat            class/family      begin   end    (left)     ID

  5992   15.1  3.1  1.7  2L        42739057 42741982     (4540) + rnd-1_family-153  LTR/Pao                 1   1206     (1) 22037
 4135   13.5  2.5  1.1  2L        42742116 42743472     (3050) + rnd-1_family-122  LTR/Pao                 1    729   (157) 22038
 1796    0.0  3.2  0.0  2L        42743310 42743526     (2996) C rnd-1_family-334  Unknown              (38)    297      74 22039 *

The columns are separated by varying amounts of whitespace, and some lines end with an extra "*" symbol, so the number of fields differs between lines. I want to extract the query begin position, the query end position, and the repeat class/family columns so I can visualize them. Any suggestions?

bash python RepeatMasker • 1.5k views
2.0 years ago
Michael 54k

Yes, RepeatMasker .out files are a bit difficult to parse because they use a variable number of spaces to pad and visually align the columns. I recommend trying BioPerl's Bio::Tools::RepeatMasker class. If you do it manually, split each line on whitespace: split /\s+/ in Perl, or what a bare split() does in Python. Important: trim leading and trailing whitespace first, because some lines are padded with whitespace and others are not. In Python that is line.strip().split() (a bare split() already ignores leading and trailing whitespace there); in Perl, split /\s+/ returns an empty first field when the line starts with spaces, whereas split ' ' skips it. This should work in most cases. You could also construct a full regex for a line, but that might break as well. So my recommendation, if you can install it, is to use BioPerl.
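
For the manual route, here is a minimal Python sketch of the whitespace-split approach. The function name parse_rm_out is made up for illustration, and the column positions are an assumption based on the 15-column layout in the example from the question (an optional trailing "*" becomes a 16th field and is ignored). Header and blank lines are skipped by checking that the first field is an integer score.

```python
def parse_rm_out(lines):
    """Yield (query, begin, end, class_family) from RepeatMasker .out lines.

    Assumes the 15-column layout shown in the question; a trailing "*"
    simply becomes a 16th field and is ignored.
    """
    for line in lines:
        # Bare split() splits on runs of whitespace and ignores
        # leading/trailing padding, so no explicit strip() is needed.
        fields = line.split()
        # Data rows start with an integer SW score; the two header
        # lines ("SW ...", "score ...") and blank lines do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        query = fields[4]            # query sequence name
        begin = int(fields[5])       # position in query: begin
        end = int(fields[6])         # position in query: end
        class_family = fields[10]    # repeat class/family
        yield query, begin, end, class_family


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python parse_rm.py genome.fa.out
    with open(sys.argv[1]) as handle:
        for query, begin, end, class_family in parse_rm_out(handle):
            print(query, begin, end, class_family, sep="\t")
```

This prints one tab-separated row per annotated repeat, which most plotting tools can read directly.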


In addition, RepeatMasker can produce GFF files (edit: and .xn files). Both are easier to parse and can be converted to BED with a gff2bed script. In particular, the .xn files are simple tab-separated files that contain the family.
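
If no gff2bed script is at hand, the conversion can be sketched in a few lines of Python. This is a generic GFF-to-BED sketch, not a RepeatMasker-specific tool: the function name gff_to_bed is made up for illustration, and it assumes standard nine-column tab-separated GFF with 1-based inclusive coordinates. The whole attribute column is copied into the BED name field unchanged, since its exact format depends on the RepeatMasker version.

```python
def gff_to_bed(lines):
    """Yield BED6 lines (without newline) from nine-column GFF lines."""
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF feature line
        seqid = cols[0]
        start = int(cols[3])   # GFF start: 1-based, inclusive
        end = int(cols[4])     # GFF end: 1-based, inclusive
        strand = cols[6]
        name = cols[8]         # attribute column, copied as-is
        # BED is 0-based and half-open, so only the start shifts by one.
        # BED scores must be integers in 0-1000, so a placeholder 0 is used
        # instead of the GFF score column.
        yield "\t".join([seqid, str(start - 1), str(end), name, "0", strand])
```

Each yielded string is one BED line, so writing the result to a file is a matter of joining the lines with newlines.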


That's amazing! Thank you
