Perl Script To Summarise The Repeat Masker Out File
10.5 years ago
figo ▴ 220

Hi All

I need to summarize a RepeatMasker .out file by repeat type, with percentages. Does anybody know of a tool or a Perl script that can do this?

Best


I can help you, but I can't understand your question. This is an example of an *.out file; which part do you want to summarize?

   SW   perc perc perc  query     position in query    matching  repeat            position in repeat
  score   div. del. ins.  sequence  begin end   (left)   repeat    class/family    begin  end    (left)   ID
   1078   12.3  1.8  0.3  10            GENE1  509    (0) + Repeat1 RepeataA      1    516    (0)    1   
   1099   13.8  0.4  0.0  1000          GENE2   342    (0) C Repeat2 RepeatB   (61)    341      1    2

It would be better if you gave an example of your *.out file.

Percentage of what? Query covered with a particular repeat?

PS: Change the tag to repeatmasker.


Thanks for your reply. RepeatMasker was run by someone else and I only got the repeat-masked .out file. I want to summarise the entire .out file. I don't have the RepeatMasker summary file. I know how much sequence (bp) was used for the analysis. I want the percentage of each of the different types of elements in the file.

10.5 years ago
PoGibas 5.1k

I understood your question like this:

You want to summarize repeats with only the RepeatMasker .out file, and you want the output to be similar to the RepeatMasker .tbl file.

Using this *.out file as an example:

 SW   perc perc perc  query     position in query    matching      repeat              position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat        class/family      begin  end    (left)   ID

225   10.0  0.0  0.0  100016        1    30   (12) + L1P3          LINE/L1               28     57 (6404)     1  
795   15.3  1.5  0.0  100071        1   131    (0) C LTR12_        LTR/ERV1            (83)    605    473     2  
402   13.1  0.0  1.6  100087        1    62    (2) + HERV3-int     LTR/ERV1            6068   6128 (2298)     3  
276   22.5  1.4  0.0  100152       50   120    (0) + L1MDa         LINE/L1               74    145 (6488)     4  
257   13.9  0.0  0.0  100163        5    40    (0) C 7SLRNA        srpRNA             (247)     73     38     5  
274   11.1  0.0  0.0  100164        5    40    (0) C 7SLRNA        srpRNA             (247)     73     38     6  
419   15.2  2.5  1.2  100197       36   114    (0) C AluSc5        SINE/Alu           (118)    191    112     7

And taking "I know how much (bp) was used for the analysis" to be, let's say, 123456 bp.

This is a quick and ugly way to get output similar to the .tbl file:

grep -v 'SW   perc perc perc\|score   div. del. ins\|^$' EXAMPLE.out |
   awk '{print $7-$6+1,$11"-"$10}' |
   awk '{group[$2]}; {count[$2]+=$1}; END {for (i in group) print i, (count[i]*100)/123456" %"}' |
   sort | 
   column -t

 LINE/L1-L1MDa       0.0575104 %
 LINE/L1-L1P3        0.0243002 %
 LTR/ERV1-HERV3-int  0.0502203 %
 LTR/ERV1-LTR12_     0.106111 %
 SINE/Alu-AluSc5     0.0639904 %
 srpRNA-7SLRNA       0.0583204 %
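
If you want totals per class/family (closer to what the .tbl file reports) rather than per repeat name, the same idea can be written as one awk call. This is only a sketch under a few assumptions: the function name summarize_rm_out is made up for illustration, the total bp is passed in by you, data rows are recognised by a numeric SW score in column 1, and overlapping hits are counted twice, just as in the pipeline above.

```shell
# Sketch: percentage of the analysed sequence covered by each repeat class/family.
# summarize_rm_out is a hypothetical helper name, not part of RepeatMasker.
summarize_rm_out() {
    total_bp="$1"   # total bases analysed (you must supply this), e.g. 123456
    out_file="$2"   # RepeatMasker .out file
    awk -v total="$total_bp" '
        # Data rows start with a numeric SW score, so both header lines
        # and blank lines are skipped by this pattern.
        $1 ~ /^[0-9]+$/ {
            bp[$11] += $7 - $6 + 1   # inclusive length of the hit in the query
        }
        END {
            # Columns: class/family, masked bp, percent of total
            for (fam in bp)
                printf "%s\t%d\t%.4f\n", fam, bp[fam], bp[fam] * 100 / total
        }
    ' "$out_file" | LC_ALL=C sort
}
```

Call it as `summarize_rm_out 123456 EXAMPLE.out | column -t`. Swap `$11` for `$11"-"$10` if you prefer the per-repeat breakdown shown above.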

Thanks for your time and the fantastic reply. Just one addition: in this part of the script, "awk '{print $7-$6,$11"-"$10}'", the repeat element length should be ($7-$6)+1. The values are the start and end positions of the repeat in the query, both inclusive, so simple subtraction undercounts the length by 1. Adding 1 gives the correct length. Thanks
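
To make the off-by-one concrete, here is a tiny sketch using the L1MDa hit from the example (query positions 50 to 120). The helper name interval_len is made up for illustration.

```shell
# Length of an inclusive interval [begin, end].
# interval_len is a hypothetical helper name.
interval_len() {
    echo $(( $2 - $1 + 1 ))
}

interval_len 50 120   # prints 71 (plain 120 - 50 would give 70)
```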


Thanks, fixed it.


If this solution works for you, accept the answer. Also rename the question to "coverage from RepeatMasker out file".
