Question: Help using data.frame in R
2.2 years ago, Noob wrote:

I have a .fa file produced by extracting gene sequences from the genome I am working with, using a RepeatMasker coordinate file. However, each entry's title line looks like this:


I would like to change the title lines of all the entries to look something like:


This includes the original name and then the location from the RepeatMasker BED file. I was told that a data.frame in R would be useful for making this change, but I am not sure how to go about it. Any help?

sequencing gene sequence R genome • 566 views

Do you work in Linux? It seems that your problem is more easily solved using sed or another command-line tool than with an R data.frame. Sure, you can import those files into R as a data.frame, but I can't see how that would make things easier. Could you provide more information about your data structure? For example, do you have two files with identifiers you'd like to combine, or something like that?

written 2.2 years ago by Solowars

I have been asked to write a script for this, so to be honest I am not really sure. I just need a script that converts the bedtools output shown at the top into the new, modified version.

written 2.2 years ago by Noob

Okay, can you provide a brief example of your data (i.e. 3-4 entries you'd like to combine) so I can help you build the script? I'm sure there's a way to do it in R, but as I said before, it's probably easier using Unix tools, if you have a computer with Linux/Mac at hand.

written 2.2 years ago by Solowars

This is the bedtools output:


These are the RepeatMasker coordinates that the above were derived from:

KB824701.1 417 478 rnd-5_family-5445_Unspecified . -

KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -

KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -

KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -

Ideally, I would like to convert each title line as follows, using the first entry as an example:

">rnd-5_family-5445_Unspecified(-)" to ">rnd-5_family-5445_KB824701.1_417_478"
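
As a rough illustration of how a data.frame could carry that mapping, here is a minimal base R sketch. It reads the four coordinate rows shown above and pastes together one title line per row. The column names and the `sub()` pattern are assumptions made for this example; they are not defined by bedtools or RepeatMasker.

```r
# BED rows copied from this thread; the column names are assumed labels.
bed <- read.table(
  text = "KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -
KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -
KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -",
  col.names = c("chrom", "start", "end", "name", "score", "strand"),
  stringsAsFactors = FALSE
)

# Drop the trailing "_Unspecified" from the repeat name
# (assumes every name ends this way, as in the rows above).
family <- sub("_Unspecified$", "", bed$name)

# Build one FASTA title line per BED row: >family_chrom_start_end
bed$header <- paste0(">", family, "_", bed$chrom, "_", bed$start, "_", bed$end)

print(bed$header)
```

Each element of `bed$header` would then replace the title line of the FASTA entry extracted from the same row, which only works if the FASTA entries are in the same order as the BED rows.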

modified 2.2 years ago • written 2.2 years ago by Noob
2.2 years ago, phytobio (San Diego CA, Scripps Institution of Oceanography) wrote:

I agree with Solowars's comment, but if you are interested in solving this problem in R:

I haven't found an easy way to get a FASTA file into a data frame, but the seqinr package in R is really useful for working with FASTA files. After reading in your FASTA file with read.fasta(), you can get a vector of the sequence IDs using names(). gsub() can then be used to modify that vector, and you can save a new version of the FASTA file with the modified names using write.fasta().
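
Those steps might look roughly like the sketch below. The helper names `strip_suffix()` and `rename_fasta()`, the regular expression, and the `coords` argument are all hypothetical and exist only for this example; the pattern is based on the single title line quoted in this thread. The wrapper needs the seqinr package installed and assumes `coords` holds one string per FASTA entry, in the same order as the entries.

```r
# Remove a trailing "_Unspecified(+)" or "_Unspecified(-)" from sequence IDs.
# The pattern is an assumption based on the example title line in this thread.
strip_suffix <- function(ids) {
  gsub("_Unspecified\\([+-]\\)$", "", ids)
}

# Hypothetical wrapper: read a FASTA file, rename its entries, write it back out.
# coords: character vector such as "KB824701.1_417_478", one per entry,
#         assumed to be in the same order as the entries in fasta_in.
rename_fasta <- function(fasta_in, fasta_out, coords) {
  # forceDNAtolower = FALSE keeps the original letter case of the sequences
  seqs <- seqinr::read.fasta(fasta_in, seqtype = "DNA", forceDNAtolower = FALSE)
  stopifnot(length(seqs) == length(coords))

  # names() returns the IDs without the leading ">"
  new_ids <- paste(strip_suffix(names(seqs)), coords, sep = "_")

  # write.fasta() adds the ">" back in front of each name
  seqinr::write.fasta(sequences = seqs, names = new_ids, file.out = fasta_out)
  invisible(new_ids)
}

# Quick check of the renaming step on the ID from this thread
print(strip_suffix("rnd-5_family-5445_Unspecified(-)"))
```

A call would take the form `rename_fasta("in.fa", "out.fa", coords)`, where the file names are placeholders and `coords` is built from the chromosome, start, and end columns of the BED file.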


