Question

Perl Script To Summarise The Repeat Masker Out File

10.5 years ago

figo ▴ 220

Hi All

I need to summarize a RepeatMasker .out file, reporting the percentage for each repeat type. Does anybody know of a tool or a Perl script that can do this?

Best

updated 9.1 years ago by Biostar 20 • written 10.5 years ago by figo ▴ 220

I can help, but I don't understand your question. This is an example of a *.out file; which part do you want to summarize?

   SW   perc perc perc  query     position in query    matching  repeat            position in repeat
  score   div. del. ins.  sequence  begin end   (left)   repeat    class/family    begin  end    (left)   ID
   1078   12.3  1.8  0.3  10            GENE1  509    (0) + Repeat1 RepeataA      1    516    (0)    1   
   1099   13.8  0.4  0.0  1000          GENE2   342    (0) C Repeat2 RepeatB   (61)    341      1    2

It would help if you posted an example of your *.out file.

Percentage of what? Of the query covered by a particular repeat?

PS: Change the tag to repeatmasker.

10.5 years ago by PoGibas 5.1k

Thanks for your reply. RepeatMasker was run by someone else, and I only received the repeat-masked .out file. I want to summarise the entire .out file, but I don't have the RepeatMasker summary file. I know how much sequence (bp) was used for the analysis. I want the percentage of each type of element in the file.

10.5 years ago by figo ▴ 220

score 4 · Answer 1 · 2013-11-17

This is how I understood your question:

You want to summarize the repeats when all you have is the RepeatMasker .out file, and you want the output to resemble the RepeatMasker .tbl file.

Using this *.out file as an example:

 SW   perc perc perc  query     position in query    matching      repeat              position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat        class/family      begin  end    (left)   ID

225   10.0  0.0  0.0  100016        1    30   (12) + L1P3          LINE/L1               28     57 (6404)     1  
795   15.3  1.5  0.0  100071        1   131    (0) C LTR12_        LTR/ERV1            (83)    605    473     2  
402   13.1  0.0  1.6  100087        1    62    (2) + HERV3-int     LTR/ERV1            6068   6128 (2298)     3  
276   22.5  1.4  0.0  100152       50   120    (0) + L1MDa         LINE/L1               74    145 (6488)     4  
257   13.9  0.0  0.0  100163        5    40    (0) C 7SLRNA        srpRNA             (247)     73     38     5  
274   11.1  0.0  0.0  100164        5    40    (0) C 7SLRNA        srpRNA             (247)     73     38     6  
419   15.2  2.5  1.2  100197       36   114    (0) C AluSc5        SINE/Alu           (118)    191    112     7

And since you "know how much (bp) was used for the analysis", let's say it was 123456 bp.

This is a quick and ugly way to get output similar to the .tbl file:

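# Drop the two header lines and blank lines, then sum the query span
# (end - begin + 1) per "class/family-repeat" and print it as a percentage
# of the total number of bp analysed (123456 here).
grep -v 'SW   perc perc perc\|score   div. del. ins\|^$' EXAMPLE.out |
   awk -v total=123456 '{count[$11"-"$10] += $7-$6+1}
        END {for (i in count) print i, count[i]*100/total" %"}' |
   sort |
   column -t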

 LINE/L1-L1MDa       0.0575104 %
 LINE/L1-L1P3        0.0243002 %
 LTR/ERV1-HERV3-int  0.0502203 %
 LTR/ERV1-LTR12_     0.106111 %
 SINE/Alu-AluSc5     0.0639904 %
 srpRNA-7SLRNA       0.0583204 %
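
If a per-class summary is wanted (one line per class/family such as LINE/L1, rather than per individual repeat), the same idea can be sketched as a small shell function. This is only a sketch under assumptions: the function name summarise_rm_out is made up for illustration, the total number of bp is passed in by hand, data rows are recognised by a numeric first column instead of by matching the header text, and overlapping annotations are counted more than once, so the figures can differ slightly from a real .tbl file.

```shell
# Hypothetical helper (name chosen for illustration only).
# Usage: summarise_rm_out FILE.out TOTAL_BP
# Prints tab-separated lines: class/family, masked bp, percent of TOTAL_BP.
summarise_rm_out() {
    awk -v total="$2" '
        # Data rows start with a numeric SW score; header and blank
        # lines do not, so they are skipped without relying on spacing.
        $1 ~ /^[0-9]+$/ {
            # $6 = query begin, $7 = query end, $11 = class/family
            bp[$11] += $7 - $6 + 1
        }
        END {
            for (c in bp)
                printf "%s\t%d\t%.4f\n", c, bp[c], bp[c] * 100 / total
        }
    ' "$1" | sort
}
```

With the example above it would be called as summarise_rm_out EXAMPLE.out 123456. Grouping on column 11 alone merges all repeats of one class/family; grouping on $11"-"$10 instead would reproduce the per-repeat breakdown.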