Help using data.frame in R
6.0 years ago
Noob • 0

I have a .fa file produced by extracting sequences from the genome I am working with, using a RepeatMasker coordinate file. However, each entry looks like this:

>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG

I would like to change the header lines of all the entries to look something like:

>rnd-5_family-5445_KB824701.1_417_478

This includes the original name followed by the location from the RepeatMasker BED file. I was told that a data.frame in R would be useful for making this change, but I don't really know how to go about it. Any help?

R genome sequence gene sequencing • 1.4k views

Do you work in Linux? Your problem seems easier to solve with sed or another command-line tool than with an R data.frame. Sure, you can import those files into R as a data.frame, but I can't see how that would make things easier. Could you provide more information about your data structure? For example, do you have two files with identifiers you'd like to combine, or something like that?


I have been asked to write a script for this, so to be honest I am not really sure. I just need a script that converts the bedtools output shown at the top into the new, modified version.


Okay, can you provide a brief example of your data (e.g. 3-4 entries you'd like to combine) so I can help you build the script? I'm sure there's a way to do it in R, but as I said before, it's probably easier with Unix tools if you have a Linux or Mac computer at hand.


This is the bedtools output:

>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-5_family-2614_Unspecified(-)
AATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTTGAAGGCAACACGAATCCAGAGTGATGCTTTACATAAATCTGCTTCTACCGATGCCAAAAATTGACGATATTCTATTATTTAATCTAAATGTTAGTCTTTACATACCCTCCCCTAATTGTTAGAATTTTATGAAATTTGATTTCAGGGGTCAGTTTAGCATGCTAAATCTAATTCAATAGATTGATATTTTTCTTCAGGTAAAGAAAATTTTTGCGTCAAAGTAATCATATTCCTCCACGATTGCATATAACTATGGTATATAATTTAAAAGATTACACTTTACGTAATGAAAAATCGGCCAATCATTCAAAAGTTATGAAAGTGATC
>rnd-5_family-2614_Unspecified(-)
TTTCTAGTTTGTATTTGAGGAAAGAAAATCCTACTATCACTTTTTAAAAAATTCACCAATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTT
>rnd-4_family-798_Unspecified(-)
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAAATAAACCTTATACTGCTAT

These are the RepeatMasker coordinates that the sequences above were derived from:

KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -
KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -
KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -

Ideally, I would like the headers to be in the following format. For example, the first one would change from:

">rnd-5_family-5445_Unspecified(-)" to ">rnd-5_family-5445_KB824701.1_417_478"
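
One way the renaming described above might be scripted, sketched here in Python rather than R. It assumes the FASTA records are in the same order as the BED rows (which is how bedtools getfasta normally writes them) and pairs them by position rather than by name, because rnd-5_family-2614_Unspecified occurs twice in the BED file. The function names, the handling of the `_Unspecified` suffix, and the example constants are illustrative assumptions, not part of any library.

```python
# Two of the records from the thread (first and last), with their BED rows.
EXAMPLE_FASTA = """>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-4_family-798_Unspecified(-)
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAAATAAACCTTATACTGCTAT
"""

EXAMPLE_BED = """KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples, in file order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def bed_headers(text, suffix="_Unspecified"):
    """Build one new header per BED row: name (minus suffix)_chrom_start_end."""
    headers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, start, end, name = fields[:4]
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        headers.append(f"{name}_{chrom}_{start}_{end}")
    return headers


def rename_records(fasta_text, bed_text):
    """Pair the i-th FASTA record with the i-th BED row and rewrite its header."""
    records = parse_fasta(fasta_text)
    headers = bed_headers(bed_text)
    if len(records) != len(headers):
        raise ValueError("FASTA and BED files have different numbers of entries")
    return "".join(f">{h}\n{seq}\n" for h, (_, seq) in zip(headers, records))


if __name__ == "__main__":
    print(rename_records(EXAMPLE_FASTA, EXAMPLE_BED), end="")
```

With real files, the two text arguments would come from reading the .fa and .bed files, and the returned string would be written to a new .fa file. The length check is there because positional pairing silently mislabels sequences if the two files ever fall out of step.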

6.0 years ago
phytobio ▴ 30

I agree with Solowars' comment, but if you are interested in solving this problem in R...

I haven't found an easy way to get a FASTA file into a data frame, but the seqinr package in R is really useful for working with FASTA files. After reading in your FASTA file with read.fasta(), you can get a vector of the sequence IDs using the names() function. You can then modify that vector with gsub() and save a new version of the FASTA file with the modified names using write.fasta().
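
To illustrate the name-editing step only, here is a minimal sketch of the pattern substitution that gsub would perform, written in Python rather than R (it does not use seqinr). The function name and the regular expression are assumptions based on the headers shown in the question. Note that the substitution on its own leaves duplicate names, so the coordinates still have to be appended element by element from the BED file.

```python
import re

# Matches a trailing "_Unspecified(-)" or "_Unspecified(+)" on a sequence ID.
SUFFIX_PATTERN = re.compile(r"_Unspecified\([+-]\)$")


def strip_strand_suffix(names):
    """Remove the '_Unspecified(strand)' suffix from each sequence ID."""
    return [SUFFIX_PATTERN.sub("", name) for name in names]


if __name__ == "__main__":
    ids = [
        "rnd-5_family-2614_Unspecified(-)",
        "rnd-5_family-2614_Unspecified(-)",
    ]
    # Coordinates taken from the matching BED rows, in the same order.
    coords = ["KB824701.1_587_1072", "KB824701.1_914_1129"]
    stripped = strip_strand_suffix(ids)
    for name, coord in zip(stripped, coords):
        print(f"{name}_{coord}")
```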
