Question

Help using data.frame in R

0

7.2 years ago

Noob • 0

I have a .fa file that is the result of extracting sequences from the genome I am working with, using a RepeatMasker coordinate file. Each entry currently looks like this:

>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG

I would like to change the header lines of all the entries to look something like:

>rnd-5_family-5445_KB824701.1_417_478

This includes the original name followed by the location from the RepeatMasker BED file. I was told that a data.frame in R would be useful for making this change, but I am not sure how to go about it. Any help?

R genome sequence gene sequencing • 1.9k views

ADD COMMENT • link updated 7.2 years ago by GenoMax 152k • written 7.2 years ago by Noob • 0

2

Do you work in Linux? It seems that your problem is more easily solved using sed or another command-line tool than with an R data.frame. Sure, you can import those files into R as a data.frame, but I can't see how that would make things easier. Could you provide more information about your data structure? For example, do you have two files with identifiers you'd like to combine, or something like that?

ADD REPLY • link 7.2 years ago by Solowars ▴ 70

0

I am being asked to write a script for this, so to be honest I am not really sure. I just need a script that serves as the middle step between the bedtools output shown at the top and the new, modified version.

ADD REPLY • link 7.2 years ago by Noob • 0

0

Okay, can you provide a brief example of your data (i.e. 3-4 entries you'd like to combine) so I can help you build the script? I'm sure there's a way to do it in R, but as I said before, it's probably easier using Unix tools, if you have a computer with Linux/Mac at hand.

ADD REPLY • link 7.2 years ago by Solowars ▴ 70

0

This is the bedtools output:

>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-5_family-2614_Unspecified(-)
AATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTTGAAGGCAACACGAATCCAGAGTGATGCTTTACATAAATCTGCTTCTACCGATGCCAAAAATTGACGATATTCTATTATTTAATCTAAATGTTAGTCTTTACATACCCTCCCCTAATTGTTAGAATTTTATGAAATTTGATTTCAGGGGTCAGTTTAGCATGCTAAATCTAATTCAATAGATTGATATTTTTCTTCAGGTAAAGAAAATTTTTGCGTCAAAGTAATCATATTCCTCCACGATTGCATATAACTATGGTATATAATTTAAAAGATTACACTTTACGTAATGAAAAATCGGCCAATCATTCAAAAGTTATGAAAGTGATC
>rnd-5_family-2614_Unspecified(-)
TTTCTAGTTTGTATTTGAGGAAAGAAAATCCTACTATCACTTTTTAAAAAATTCACCAATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTT
>rnd-4_family-798_Unspecified(-)
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAAATAAACCTTATACTGCTAT

These are the RepeatMasker coordinates that the above were derived from:

KB824701.1 417 478 rnd-5_family-5445_Unspecified . -

KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -

KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -

KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -

Ideally, I would like the headers to be changed as follows, using the first one as an example:

">rnd-5_family-5445_Unspecified(-)" to ">rnd-5_family-5445_KB824701.1_417_478"

ADD REPLY • link 7.2 years ago by Noob • 0

score 2 · Answer 1 · 2018-04-18

I would agree with Solowars' comment, but if you are interested in solving this problem in R...

I haven't found an easy way to get a FASTA file into a data frame, but the seqinr package in R is really useful for working with FASTA files. After reading in your FASTA file with read.fasta(), you can get a vector of the sequence IDs using the names() function. gsub() can then be used to modify that vector, and you can save a new version of the FASTA file with the modified names using write.fasta().
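
The seqinr steps described above could be sketched roughly as follows. One thing to watch for: the example data contains the same name twice (rnd-5_family-2614), so gsub() on the names alone cannot tell which coordinates belong to which record. This sketch therefore reads the BED file into a data.frame and pairs records by position, assuming the FASTA records are in the same order as the BED rows. The function name rename_fasta_headers, the column names, and the file names are all made up for illustration, and stripping the "_Unspecified" suffix is a guess based on the single desired header shown in the thread.

```r
library(seqinr)

# Hypothetical helper (not part of seqinr): rename FASTA headers using the
# BED rows the sequences were extracted from.
rename_fasta_headers <- function(fasta_in, bed_in, fasta_out) {
  # BED columns as shown in the thread: chrom, start, end, name, score, strand.
  # Everything is read as character so coordinates are pasted back unchanged.
  bed <- read.table(bed_in, header = FALSE, colClasses = "character",
                    col.names = c("chrom", "start", "end",
                                  "name", "score", "strand"))

  # as.string = TRUE keeps each sequence as one string;
  # forceDNAtolower = FALSE preserves the original upper-case bases.
  fa <- read.fasta(fasta_in, seqtype = "DNA", as.string = TRUE,
                   forceDNAtolower = FALSE)

  # Assumption: one FASTA record per BED row, in the same order, with
  # headers of the form name(strand) as in the thread's example.
  stopifnot(length(fa) == nrow(bed))
  stopifnot(identical(names(fa), paste0(bed$name, "(", bed$strand, ")")))

  # Assumption: drop the trailing "_Unspecified", then append the location.
  family <- sub("_Unspecified$", "", bed$name)
  new_names <- paste(family, bed$chrom, bed$start, bed$end, sep = "_")

  write.fasta(sequences = fa, names = new_names,
              file.out = fasta_out, as.string = TRUE)
  invisible(new_names)
}

# Example call (file names are placeholders):
# rename_fasta_headers("repeats.fa", "repeats.bed", "repeats_renamed.fa")
```

If the two order checks fail on your data, the records and BED rows are not aligned one-to-one, and matching by position would assign the wrong coordinates.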