Question: How to split the Repetitive DNA file (rmsk) for human according to class and family using R
M K (United States) wrote, 5.1 years ago:

Hi,

I downloaded the repetitive DNA file (rmsk) for human from the UCSC website, and I want to split this file by class and family in R to get some basic statistics.

 


What have you tried and are you using the .out files or something from the table browser?

written 5.1 years ago by Devon Ryan

Yes. I downloaded the file from the following link:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

written 5.1 years ago by M K

What kind of statistics? Do you need the DNA sequences?

written 5.1 years ago by Pierre Lindenbaum

I want to get some basic statistics, such as the frequency of each family and class, so that I can compare this file with the mouse one and see whether there is any relationship between them at the family and class level.

written 5.1 years ago by M K
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 5.1 years ago:

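Using the UCSC public MySQL server (the mouse database mm10 is queried here; for human, switch the database, e.g. -D hg38 to match the rmsk file you downloaded, or look at hg19.simpleRepeat or hg19.nestedRepeat):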

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D mm10 -e 'select repClass,repFamily,count(*) from rmsk group by 1,2'
+----------------+----------------+----------+
| repClass       | repFamily      | count(*) |
+----------------+----------------+----------+
| DNA            | DNA            |     1001 |
| DNA            | hAT            |     1949 |
| DNA            | hAT-Blackjack  |     4564 |
| DNA            | hAT-Charlie    |   105698 |
| DNA            | hAT-Tip100     |     9331 |
| DNA            | hAT-Tip100?    |      105 |
| DNA            | hAT?           |      585 |
| DNA            | MuDR           |      153 |
| DNA            | MULE-MuDR      |      583 |
| DNA            | PiggyBac       |      209 |
| DNA            | PiggyBac?      |      141 |
| DNA            | TcMar          |       52 |
| DNA            | TcMar-Mariner  |     1079 |
| DNA            | TcMar-Pogo     |       21 |
| DNA            | TcMar-Tc2      |     1786 |
| DNA            | TcMar-Tigger   |    35118 |
| DNA            | TcMar?         |      702 |
| DNA?           | DNA?           |     1027 |
| LINE           | CR1            |    14155 |
| LINE           | Dong-R4        |      138 |
| LINE           | L1             |   905176 |
| LINE           | L1?            |       52 |
| LINE           | L2             |    67909 |
| LINE           | RTE-BovB       |      260 |
| LINE           | RTE-X          |     1703 |
| LINE?          | Penelope?      |       42 |
| Low_complexity | Low_complexity |   386539 |
| LTR            | ERV1           |    71980 |
| LTR            | ERV1?          |      115 |
| LTR            | ERVK           |   319317 |
| LTR            | ERVK?          |     4185 |
| LTR            | ERVL           |   118061 |
| LTR            | ERVL-MaLR      |   454918 |
| LTR            | ERVL?          |      520 |
| LTR            | Gypsy          |     1859 |
| LTR            | Gypsy?         |      819 |
| LTR            | LTR            |      819 |
| LTR?           | LTR?           |      941 |
| Other          | Other          |    19450 |
| RC             | Helitron       |      345 |
| RC?            | Helitron?      |       74 |
| RNA            | RNA            |      691 |
| rRNA           | rRNA           |     1564 |
| Satellite      | centr          |        4 |
| Satellite      | Satellite      |    36865 |
| scRNA          | scRNA          |     8332 |
| Simple_repeat  | Simple_repeat  |  1015643 |
| SINE           | Alu            |   574557 |
| SINE           | B2             |   372923 |
| SINE           | B4             |   397726 |
| SINE           | Deu            |     1702 |
| SINE           | ID             |    64047 |
| SINE           | MIR            |   120436 |
| SINE           | tRNA           |     1618 |
| SINE?          | SINE?          |      274 |
| snRNA          | snRNA          |     3007 |
| srpRNA         | srpRNA         |      437 |
| tRNA           | tRNA           |     4769 |
| Unknown        | Unknown        |     6791 |
| Unknown        | Y-chromosome   |     2869 |
+----------------+----------------+----------+

Thanks, Pierre. I did that, but how can I split the rmsk file in R? I need the information for each class and each family separately.

written 5.1 years ago by M K
Devon Ryan (Freiburg, Germany) wrote, 5.1 years ago:

If you really want to use R (using MySQL is actually faster), then you just want the split() function.
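
As a rough sketch of what split() does here, assuming the table has already been read into a data frame: the toy data frame below stands in for rmsk, and its rows are made up for illustration. The column names (genoName, genoStart, genoEnd, strand, repClass, repFamily) follow the UCSC rmsk schema; the downloaded rmsk.txt.gz has no header row, so those names would have to be assigned after reading the file.

```r
# Toy stand-in for the rmsk table; the values are invented for illustration.
# For the real file, read it with read.table() (tab-separated, no header)
# and assign the column names listed in the accompanying rmsk.sql.
rmsk <- data.frame(
  genoName  = c("chr1", "chr1", "chr2", "chr2", "chrX"),
  genoStart = c(100L, 500L, 200L, 900L, 50L),
  genoEnd   = c(400L, 800L, 350L, 1200L, 300L),
  strand    = c("+", "-", "+", "+", "-"),
  repClass  = c("SINE", "LINE", "SINE", "LTR", "LINE"),
  repFamily = c("Alu", "L1", "MIR", "ERVK", "L1"),
  stringsAsFactors = FALSE
)

# One data frame per class; every column (chromosome, strand, ...) is kept.
by_class <- split(rmsk, rmsk$repClass)

# One data frame per class/family pair. drop = TRUE omits pairs that
# never occur together, so no empty data frames are returned.
by_family <- split(rmsk, list(rmsk$repClass, rmsk$repFamily), drop = TRUE)

# Frequencies per class/family pair, comparable to the SQL GROUP BY query.
counts <- as.data.frame(table(repClass = rmsk$repClass,
                              repFamily = rmsk$repFamily))
counts <- counts[counts$Freq > 0, ]
```

Each element of by_class or by_family is an ordinary data frame, so by_family[["LINE.L1"]] still carries the chromosome, coordinates and strand of every L1 element in the toy table.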


I don't have any experience with MySQL. Also, I want all the information for each class and family, such as chromosome name, strand, etc.

written 5.1 years ago by M K