How to split the Repetitive DNA file (rmsk) for human according to class and family using R
9.2 years ago
M K ▴ 660

Hi,

I downloaded the repetitive DNA file (rmsk) for human from the UCSC website, and I want to split it by class and family in R to get some basic statistics.

sequence genome next-gen • 3.6k views

What have you tried, and are you using the .out files or something from the Table Browser?


Yes. I downloaded this file from the following link


What kind of statistics? Do you need the DNA sequences?


I want to get some basic statistics, such as the frequency of each family and class, and compare them with the mouse file to see whether there is any relation between the two genomes at the family and class level.
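
A minimal sketch of how such per-class and per-family frequencies could be computed and compared in R. It assumes each species' table is already loaded as a data frame with `repClass` and `repFamily` columns (the column names used in the UCSC rmsk table); the helper `count_families()` and the toy rows are hypothetical, for illustration only.

```r
# Count repeats per (class, family) pair in one rmsk-style data frame.
# count_families() is a hypothetical helper, not part of any package.
count_families <- function(rmsk) {
  counts <- as.data.frame(
    table(repClass = rmsk$repClass, repFamily = rmsk$repFamily),
    stringsAsFactors = FALSE
  )
  # table() also lists class/family combinations that never occur; drop them
  counts[counts$Freq > 0, ]
}

# Toy data standing in for the real human and mouse tables (made-up rows)
human <- data.frame(
  repClass  = c("SINE", "SINE", "SINE", "LINE", "LINE", "SINE"),
  repFamily = c("Alu", "Alu", "Alu", "L1", "L1", "MIR"),
  stringsAsFactors = FALSE
)
mouse <- data.frame(
  repClass  = c("SINE", "SINE", "LINE", "LINE", "LINE", "SINE"),
  repFamily = c("B2", "B2", "L1", "L1", "L1", "Alu"),
  stringsAsFactors = FALSE
)

# Put both species side by side; all = TRUE keeps families seen in only one
cmp <- merge(count_families(human), count_families(mouse),
             by = c("repClass", "repFamily"), all = TRUE,
             suffixes = c(".human", ".mouse"))

# A family missing from one species comes back as NA; treat it as zero
cmp$Freq.human[is.na(cmp$Freq.human)] <- 0
cmp$Freq.mouse[is.na(cmp$Freq.mouse)] <- 0

print(cmp)
```

Raw counts depend on genome size, so for a cross-species comparison it may be more informative to look at proportions, for example `cmp$Freq.human / sum(cmp$Freq.human)`.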

9.2 years ago

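Using the UCSC public MySQL server (I used mouse here; for human, switch the database to hg19 and query its rmsk table the same way. hg19.simpleRepeat and hg19.nestedRepeat are related repeat tables):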

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D mm10 -e 'select repClass,repFamily,count(*) from rmsk group by 1,2'
+----------------+----------------+----------+
| repClass       | repFamily      | count(*) |
+----------------+----------------+----------+
| DNA            | DNA            |     1001 |
| DNA            | hAT            |     1949 |
| DNA            | hAT-Blackjack  |     4564 |
| DNA            | hAT-Charlie    |   105698 |
| DNA            | hAT-Tip100     |     9331 |
| DNA            | hAT-Tip100?    |      105 |
| DNA            | hAT?           |      585 |
| DNA            | MuDR           |      153 |
| DNA            | MULE-MuDR      |      583 |
| DNA            | PiggyBac       |      209 |
| DNA            | PiggyBac?      |      141 |
| DNA            | TcMar          |       52 |
| DNA            | TcMar-Mariner  |     1079 |
| DNA            | TcMar-Pogo     |       21 |
| DNA            | TcMar-Tc2      |     1786 |
| DNA            | TcMar-Tigger   |    35118 |
| DNA            | TcMar?         |      702 |
| DNA?           | DNA?           |     1027 |
| LINE           | CR1            |    14155 |
| LINE           | Dong-R4        |      138 |
| LINE           | L1             |   905176 |
| LINE           | L1?            |       52 |
| LINE           | L2             |    67909 |
| LINE           | RTE-BovB       |      260 |
| LINE           | RTE-X          |     1703 |
| LINE?          | Penelope?      |       42 |
| Low_complexity | Low_complexity |   386539 |
| LTR            | ERV1           |    71980 |
| LTR            | ERV1?          |      115 |
| LTR            | ERVK           |   319317 |
| LTR            | ERVK?          |     4185 |
| LTR            | ERVL           |   118061 |
| LTR            | ERVL-MaLR      |   454918 |
| LTR            | ERVL?          |      520 |
| LTR            | Gypsy          |     1859 |
| LTR            | Gypsy?         |      819 |
| LTR            | LTR            |      819 |
| LTR?           | LTR?           |      941 |
| Other          | Other          |    19450 |
| RC             | Helitron       |      345 |
| RC?            | Helitron?      |       74 |
| RNA            | RNA            |      691 |
| rRNA           | rRNA           |     1564 |
| Satellite      | centr          |        4 |
| Satellite      | Satellite      |    36865 |
| scRNA          | scRNA          |     8332 |
| Simple_repeat  | Simple_repeat  |  1015643 |
| SINE           | Alu            |   574557 |
| SINE           | B2             |   372923 |
| SINE           | B4             |   397726 |
| SINE           | Deu            |     1702 |
| SINE           | ID             |    64047 |
| SINE           | MIR            |   120436 |
| SINE           | tRNA           |     1618 |
| SINE?          | SINE?          |      274 |
| snRNA          | snRNA          |     3007 |
| srpRNA         | srpRNA         |      437 |
| tRNA           | tRNA           |     4769 |
| Unknown        | Unknown        |     6791 |
| Unknown        | Y-chromosome   |     2869 |
+----------------+----------------+----------+

Thanks, Pierre. I did that, but how can I split the rmsk file in R? I need the information for each class and each family separately.

9.2 years ago

If you really want to use R (using mysql is actually faster), then you just want the split() command.
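
As a rough illustration of the split() approach, here is a sketch on a toy data frame. The column names follow the UCSC rmsk table (`genoName`, `strand`, `repClass`, `repFamily`, ...), but the rows are made up, and a real file would first have to be read in with something like read.delim().

```r
# Toy stand-in for a handful of rmsk rows (values are invented)
rmsk <- data.frame(
  genoName  = c("chr1", "chr1", "chr2", "chr2", "chrX"),
  genoStart = c(100L, 500L, 40L, 900L, 10L),
  genoEnd   = c(400L, 800L, 340L, 1200L, 60L),
  strand    = c("+", "-", "+", "+", "-"),
  repName   = c("AluY", "L1PA2", "AluSx", "MIRb", "L2a"),
  repClass  = c("SINE", "LINE", "SINE", "SINE", "LINE"),
  repFamily = c("Alu", "L1", "Alu", "MIR", "L2"),
  stringsAsFactors = FALSE
)

# One data frame per class; every original column is kept in each piece
by_class <- split(rmsk, rmsk$repClass)

# One data frame per class/family pair; drop = TRUE skips pairs that
# never occur together (e.g. LINE with Alu)
by_class_family <- split(rmsk, list(rmsk$repClass, rmsk$repFamily),
                         drop = TRUE)

# List element names are the class, or "class.family" for the second split
print(names(by_class))
print(names(by_class_family))

# Each element is an ordinary data frame, so the usual summaries apply
print(sapply(by_class_family, nrow))
print(table(by_class$SINE$strand))
```

Elements can be pulled out by name, e.g. `by_class_family[["SINE.Alu"]]`, and still carry chromosome, coordinates and strand for every repeat in that group.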


I don't have any experience with MySQL. Also, I want all the information for each class and family, such as chromosome name, strand, etc.
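
One possible way to do this without MySQL, sketched under assumptions: the download is taken to be the headerless, tab-separated rmsk.txt dump from the UCSC downloads area with the 17 columns of the rmsk table, and the helper `read_rmsk()`, the output folder and the toy rows are hypothetical. The sketch reads the file, splits it by class and family while keeping every column, and writes one file per group.

```r
# Column names of the UCSC rmsk table (the dump itself has no header line)
rmsk_cols <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
               "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
               "repName", "repClass", "repFamily", "repStart", "repEnd",
               "repLeft", "id")

# Hypothetical helper: read a tab-separated rmsk dump into a data frame
read_rmsk <- function(path) {
  read.delim(path, header = FALSE, col.names = rmsk_cols,
             quote = "", stringsAsFactors = FALSE)
}

# Made-up rows written to a temporary file so the sketch runs on its own;
# with real data, point read_rmsk() at the downloaded file instead
toy <- c(
  paste(c(585, 300, 10, 0, 0, "chr1", 100, 400, -900, "+",
          "AluY", "SINE", "Alu", 1, 300, -11, 1), collapse = "\t"),
  paste(c(585, 250, 20, 5, 0, "chr2", 40, 340, -800, "-",
          "AluSx", "SINE", "Alu", 1, 300, -12, 2), collapse = "\t"),
  paste(c(586, 900, 30, 0, 2, "chrX", 10, 60, -700, "+",
          "Charlie1", "DNA", "hAT?", 1, 50, -5, 3), collapse = "\t")
)
tmp <- tempfile(fileext = ".txt")
writeLines(toy, tmp)

rmsk <- read_rmsk(tmp)

# Split by class and family; every column (chromosome, strand, ...) is kept
groups <- split(rmsk, list(rmsk$repClass, rmsk$repFamily), drop = TRUE)

# Write one tab-separated file per group into a scratch folder
out_dir <- file.path(tempdir(), "rmsk_by_family")
dir.create(out_dir, showWarnings = FALSE)
for (g in names(groups)) {
  # Family names such as "hAT?" contain characters that are awkward in
  # file names, so anything unusual is replaced by an underscore
  safe <- gsub("[^A-Za-z0-9_.-]", "_", g)
  write.table(groups[[g]], file.path(out_dir, paste0(safe, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

print(list.files(out_dir))
```

The full human table has several million rows, so reading it with read.delim() can take a while; the grouping logic stays the same whichever reader is used.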
