Question: R - How to convert a sequence in a fasta file into a matrix
0
gravatar for kbaitsi
4 weeks ago by
kbaitsi0
kbaitsi0 wrote:

I have a sequence stored in a fasta file, let's say ATCGATCGGGTTTAC and I want to create a matrix of six columns and 10 rows which include a sequnce of six, each time proceeding with a window of 1, meaning I want a matrix like this:

ATCGAT

TCGATC

CGATCG

GATCGG

ATCGGG

TCGGGT

CGGGTT

GGGTTT

GGTTTA

GTTTAC

Any thoughts on how can I do this using R?

Thanks in advance.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by kbaitsi0
2
gravatar for lars.snipen
4 weeks ago by
lars.snipen30
lars.snipen30 wrote:

You want to chop a sequence into 6-mers? Install the microseq-package,, and read the FASTA file with readFasta(). This produces a table of 2 text-columns (Header and Sequence). Then use the str_sub() function from the stringr-package (dependency of microseq) to collect the 6-mers. If your sequence above is seq, then str_sub(seq, 1:10, 5+(1:10)) should produce your 6-mers.

ADD COMMENTlink written 4 weeks ago by lars.snipen30
2
gravatar for jared.andrews07
4 weeks ago by
jared.andrews077.9k
Memphis, TN
jared.andrews077.9k wrote:

This can be easily done via substring().

yy <- "ATCGATCGGGTTTAC"
strs <- character(nchar(yy) - 5)
for (i in seq_along(strs)) {
    strs[i] <- substring(yy, i, i + 5)
}

strs
 [1] "ATCGAT" "TCGATC" "CGATCG" "GATCGG" "ATCGGG" "TCGGGT" "CGGGTT" "GGGTTT" "GGTTTA" "GTTTAC"
ADD COMMENTlink written 4 weeks ago by jared.andrews077.9k
1

Today I learned that R has a substring function.

ADD REPLYlink written 4 weeks ago by bioinformatics2020520

Thank you so much for your answer, I tried it and it worked. Can I use this if I have a really long sequence stored in a fasta file, let's say the entire genome of E.coli (more than 2500000 characters)? Also I need the results to be stored in a matrix with dimensions10X6 so I can further process them, if you have any suggestions :)

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by kbaitsi0

There are more efficient programs to generate k-mers from long sequences. Do you have to do this with R? Jellyfish (LINK) is one.

ADD REPLYlink written 4 weeks ago by genomax92k

Yep unfortunately I have to use R, it's part of an exercise...

ADD REPLYlink written 4 weeks ago by kbaitsi0

? If it's an exercise, are you not supposed to work on it, by your self? At least you should have tagged it as exercise.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by cpad011214k

I didn't know that, I will tag it now, thank you

ADD REPLYlink written 4 weeks ago by kbaitsi0
1
gravatar for cpad0112
4 weeks ago by
cpad011214k
Hyderabad India
cpad011214k wrote:
> stringr::str_match_all("ATCGATCGGGTTTAC", "(?=(.{6}))")[[1]][,2]

[1] "ATCGAT" "TCGATC" "CGATCG" "GATCGG" "ATCGGG" "TCGGGT" "CGGGTT" "GGGTTT" "GGTTTA" "GTTTAC"

Outside R, with fasta as input and OP output with seqkit:

$ seqkit sliding -s 1 -W 6 test.fa| seqkit fx2tab | cut -f2
ATCGAT
TCGATC
CGATCG
GATCGG
ATCGGG
TCGGGT
CGGGTT
GGGTTT
GGTTTA
GTTTAC

input:

$ cat test.fa 
>seq1
ATCGATCGGGTTTAC
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by cpad011214k

Thank you very much for your answer, I tried the first one and it works. How can I store the results in a matrix 10x6 so that I can use them for further analysis? Can I use this for really long sequences like the entire genome of E.coli (more than 2500000 characters)?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by kbaitsi0
1

Converting a vector or list to a matrix in R is fairly trivial. You have thus far shown zero effort. Biostars is not a homework/code-writing service. Users might be willing to give hints on an exercise if significant effort is shown and if it's clear that the OP isn't just trying to get an answer. Had I known this was an assignment, I likely would not have answered given that you appear to have not even made an effort thus far.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by jared.andrews077.9k

I appreciate your comment but you seem to be judging me without knowing how much effort I have put into this for a lot of days now. I am new to R, I have spent hours trying to learn how to code with zero help. The course started a few weeks ago and I am supposed to learn how to code and complete a very difficult exercise in just a few days. I don't want anyone to do the exercise for me, plus what I am asking is part of the exercise, not the exercise itself. I am stuck for days and just reached for help. I am just trying to understand the basics. A link or an example would have been enough so that I can learn. Thanks for your time.

ADD REPLYlink written 4 weeks ago by kbaitsi0
1

Look into the matrix function: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/matrix

ADD REPLYlink written 4 weeks ago by bioinformatics2020520

That's great! But we need to see that effort. Show us what you've tried and the specific issue(s) you're running into. This also has the benefit of forcing you to document what you've done and generate a reproducible example (which you've already done in this case, so kudos). In this process, you'll often realize your error or another approach to try.

I'm sorry if this came off harsh, but generating a matrix is very easy to find via google. Learning how to search and find the documentation you need is a critically important skill for programming, and it definitely requires experience and practice.

ADD REPLYlink written 4 weeks ago by jared.andrews077.9k

The exercise is to produce a pssm matrix for a GATA1 sequence(hexamer) and then use it to score the entire genome of E.coli. I have written the code needed to produce the pssm matrix (first the pwm matrix, then qa, then pssm, also checked it by hand to make sure) and I have also written with help from the book the function to calculate the score of a hexamer and how to apply it on a matrix of hexamers. The part I am stuck on isn't how to convert a list or a vector into a matrix, but how to create a matrix where each cell has one nucleotide and not a hexamer (that’s why I asked for a 10x6 matrix in my example, not any matrix). I know this might seem pretty easy to do but I don't know how to approach it. Thanks for your time once again.

ADD REPLYlink written 4 weeks ago by kbaitsi0

Do you have to use a matrix of individual characters? Seems simpler to just use a list or vector of hexamers - it likely wouldn't require much of a change to your scoring function.

If you're dead set on splitting the hexamers, take a look at the strsplit function. Something like strsplit(x, "") applied to a vector of hexamers will get you a list for each. From there, you'll need the unlist, matrix, and t functions to get things how you want them. You can do it all in one line, but I think you should try to figure it out yourself.

ADD REPLYlink written 4 weeks ago by jared.andrews077.9k

You should have tried lars.snipen 's solution. It's easy to understand

ADD REPLYlink written 4 weeks ago by cpad011214k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2122 users visited in the last hour