Question

How to parse RepeatMasker output

2.1 years ago

kirillkirilenko ▴ 40

I have the .out files that RepeatMasker produced after I ran it. They look like this:

  SW   perc perc perc  query     position in query              matching          repeat                position in repeat
score   div. del. ins.  sequence  begin    end          (left)   repeat            class/family      begin   end    (left)     ID

  5992   15.1  3.1  1.7  2L        42739057 42741982     (4540) + rnd-1_family-153  LTR/Pao                 1   1206     (1) 22037
 4135   13.5  2.5  1.1  2L        42742116 42743472     (3050) + rnd-1_family-122  LTR/Pao                 1    729   (157) 22038
 1796    0.0  3.2  0.0  2L        42743310 42743526     (2996) C rnd-1_family-334  Unknown              (38)    297      74 22039 *

The columns are separated by a varying number of spaces, and some lines end with an extra "*" symbol, so the lines differ in the number of fields. I want to extract the query position begin, query position end, and repeat class/family columns in order to visualize them. Any suggestions?

bash python RepeatMasker • 1.5k views

updated 2.0 years ago by Michael 54k • written 2.1 years ago by kirillkirilenko ▴ 40

score 0 · Answer 1 · 2022-04-13

2.1 years ago

Michael 54k

Yes, RepeatMasker .out files are a bit difficult to parse because they use a variable number of spaces to pad and visually align the columns. I recommend trying BioPerl's Bio::Tools::RepeatMasker class. If you do it manually, split each line on runs of whitespace: split /\s+/ in Perl, or simply what split() with no arguments does in Python. Important: some lines are padded with leading whitespace and others are not, so trim leading and trailing whitespace first, otherwise Perl's split /\s+/ returns an empty first field on the padded lines. In Python, line.strip().split() does this (Python strings have strip(), not trim()), and split() with no arguments already ignores leading and trailing whitespace anyway. This should work in most cases. You could instead construct a full regex for a line, but that might break as well. So my recommendation, if you can install it, is to use BioPerl.
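For the manual route in Python, a minimal sketch could look like the following. It assumes the 15-column layout shown in the question (plus the optional trailing "*"), and it assumes that data rows can be told apart from the header because they start with an integer SW score. The function name parse_repeatmasker_out and the SAMPLE string are made up for illustration.

```python
# Sample rows copied from the question; the header uses the same layout.
SAMPLE = """\
  SW   perc perc perc  query     position in query              matching          repeat                position in repeat
score   div. del. ins.  sequence  begin    end          (left)   repeat            class/family      begin   end    (left)     ID

  5992   15.1  3.1  1.7  2L        42739057 42741982     (4540) + rnd-1_family-153  LTR/Pao                 1   1206     (1) 22037
 4135   13.5  2.5  1.1  2L        42742116 42743472     (3050) + rnd-1_family-122  LTR/Pao                 1    729   (157) 22038
 1796    0.0  3.2  0.0  2L        42743310 42743526     (2996) C rnd-1_family-334  Unknown              (38)    297      74 22039 *
"""


def parse_repeatmasker_out(lines):
    """Yield (query, begin, end, class_family) for each annotation row."""
    for line in lines:
        # split() with no arguments splits on runs of whitespace and
        # ignores leading/trailing padding.
        fields = line.split()
        # Skip blank lines and the two header lines (assumption: only
        # data rows start with an integer score and have >= 15 fields).
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        # Indices: 4 = query sequence, 5 = begin, 6 = end, 10 = class/family.
        # A trailing "*" becomes field 15 and does not shift these columns.
        yield fields[4], int(fields[5]), int(fields[6]), fields[10]


if __name__ == "__main__":
    for query, begin, end, family in parse_repeatmasker_out(SAMPLE.splitlines()):
        print(query, begin, end, family, sep="\t")
```

For a real file, pass an open file handle instead of SAMPLE.splitlines(). Because the wanted columns all come before the optional "*", indexing from the left is safe for them.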

In addition, RepeatMasker can produce GFF files (edit: and .xn files). Both are easier to parse and can be converted to BED with a gff2bed script. In particular, the .xn files are simple tab-separated files that contain the family.
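If no gff2bed script is at hand, the GFF-to-BED step can be sketched in a few lines of Python. This assumes standard nine-column, tab-separated GFF and relies on the fact that GFF coordinates are 1-based and inclusive while BED is 0-based and half-open. The function name gff_line_to_bed and the example line are made up for illustration, and what the attributes column of a real RepeatMasker GFF contains is not checked here.

```python
def gff_line_to_bed(line):
    """Convert one GFF feature line to a BED6 tuple.

    Returns None for blank lines, comment lines, and lines that do not
    have the nine tab-separated GFF columns.
    """
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return None
    seqid, _source, _feature, start, end, score, strand = cols[:7]
    attributes = cols[8]
    # GFF start is 1-based inclusive; BED start is 0-based, so subtract 1.
    # The end coordinate is the same number in both conventions.
    return (seqid, int(start) - 1, int(end), attributes, score, strand)


if __name__ == "__main__":
    # Hypothetical GFF line, for illustration only.
    example = '2L\tRepeatMasker\tsimilarity\t42739057\t42741982\t15.1\t+\t.\tTarget "Motif:rnd-1_family-153" 1 1206\n'
    bed = gff_line_to_bed(example)
    print("\t".join(str(value) for value in bed))
```

The whole attributes column is used as the BED name here to keep the sketch short. In practice you would probably extract just the repeat name from it.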