R - How to convert a sequence in a fasta file into a matrix
3
0
Entering edit mode
19 months ago
kbaitsi • 0

I have a sequence stored in a fasta file, let's say ATCGATCGGGTTTAC and I want to create a matrix of six columns and 10 rows which include a sequnce of six, each time proceeding with a window of 1, meaning I want a matrix like this:

ATCGAT

TCGATC

CGATCG

GATCGG

ATCGGG

TCGGGT

CGGGTT

GGGTTT

GGTTTA

GTTTAC

Any thoughts on how can I do this using R?

R fasta matrix sequence part_of_an_exercise • 1.1k views
2
Entering edit mode
19 months ago
lars.snipen ▴ 30

You want to chop a sequence into 6-mers? Install the microseq-package,, and read the FASTA file with readFasta(). This produces a table of 2 text-columns (Header and Sequence). Then use the str_sub() function from the stringr-package (dependency of microseq) to collect the 6-mers. If your sequence above is seq, then str_sub(seq, 1:10, 5+(1:10)) should produce your 6-mers.

2
Entering edit mode
19 months ago

This can be easily done via substring().

yy <- "ATCGATCGGGTTTAC"
strs <- character(nchar(yy) - 5)
for (i in seq_along(strs)) {
strs[i] <- substring(yy, i, i + 5)
}

strs
[1] "ATCGAT" "TCGATC" "CGATCG" "GATCGG" "ATCGGG" "TCGGGT" "CGGGTT" "GGGTTT" "GGTTTA" "GTTTAC"

1
Entering edit mode

Today I learned that R has a substring function.

0
Entering edit mode

Thank you so much for your answer, I tried it and it worked. Can I use this if I have a really long sequence stored in a fasta file, let's say the entire genome of E.coli (more than 2500000 characters)? Also I need the results to be stored in a matrix with dimensions10X6 so I can further process them, if you have any suggestions :)

0
Entering edit mode

There are more efficient programs to generate k-mers from long sequences. Do you have to do this with R? Jellyfish (LINK) is one.

0
Entering edit mode

Yep unfortunately I have to use R, it's part of an exercise...

0
Entering edit mode

? If it's an exercise, are you not supposed to work on it, by your self? At least you should have tagged it as exercise.

0
Entering edit mode

I didn't know that, I will tag it now, thank you

1
Entering edit mode
19 months ago
> stringr::str_match_all("ATCGATCGGGTTTAC", "(?=(.{6}))")[[1]][,2]

[1] "ATCGAT" "TCGATC" "CGATCG" "GATCGG" "ATCGGG" "TCGGGT" "CGGGTT" "GGGTTT" "GGTTTA" "GTTTAC"


Outside R, with fasta as input and OP output with seqkit:

$seqkit sliding -s 1 -W 6 test.fa| seqkit fx2tab | cut -f2 ATCGAT TCGATC CGATCG GATCGG ATCGGG TCGGGT CGGGTT GGGTTT GGTTTA GTTTAC  input: $ cat test.fa
>seq1
ATCGATCGGGTTTAC

0
Entering edit mode

Thank you very much for your answer, I tried the first one and it works. How can I store the results in a matrix 10x6 so that I can use them for further analysis? Can I use this for really long sequences like the entire genome of E.coli (more than 2500000 characters)?

1
Entering edit mode

Converting a vector or list to a matrix in R is fairly trivial. You have thus far shown zero effort. Biostars is not a homework/code-writing service. Users might be willing to give hints on an exercise if significant effort is shown and if it's clear that the OP isn't just trying to get an answer. Had I known this was an assignment, I likely would not have answered given that you appear to have not even made an effort thus far.

0
Entering edit mode

I appreciate your comment but you seem to be judging me without knowing how much effort I have put into this for a lot of days now. I am new to R, I have spent hours trying to learn how to code with zero help. The course started a few weeks ago and I am supposed to learn how to code and complete a very difficult exercise in just a few days. I don't want anyone to do the exercise for me, plus what I am asking is part of the exercise, not the exercise itself. I am stuck for days and just reached for help. I am just trying to understand the basics. A link or an example would have been enough so that I can learn. Thanks for your time.

1
Entering edit mode
0
Entering edit mode

That's great! But we need to see that effort. Show us what you've tried and the specific issue(s) you're running into. This also has the benefit of forcing you to document what you've done and generate a reproducible example (which you've already done in this case, so kudos). In this process, you'll often realize your error or another approach to try.

I'm sorry if this came off harsh, but generating a matrix is very easy to find via google. Learning how to search and find the documentation you need is a critically important skill for programming, and it definitely requires experience and practice.

0
Entering edit mode

The exercise is to produce a pssm matrix for a GATA1 sequence(hexamer) and then use it to score the entire genome of E.coli. I have written the code needed to produce the pssm matrix (first the pwm matrix, then qa, then pssm, also checked it by hand to make sure) and I have also written with help from the book the function to calculate the score of a hexamer and how to apply it on a matrix of hexamers. The part I am stuck on isn't how to convert a list or a vector into a matrix, but how to create a matrix where each cell has one nucleotide and not a hexamer (that’s why I asked for a 10x6 matrix in my example, not any matrix). I know this might seem pretty easy to do but I don't know how to approach it. Thanks for your time once again.

0
Entering edit mode

Do you have to use a matrix of individual characters? Seems simpler to just use a list or vector of hexamers - it likely wouldn't require much of a change to your scoring function.

If you're dead set on splitting the hexamers, take a look at the strsplit function. Something like strsplit(x, "") applied to a vector of hexamers will get you a list for each. From there, you'll need the unlist, matrix, and t functions to get things how you want them. You can do it all in one line, but I think you should try to figure it out yourself.

0
Entering edit mode

You should have tried lars.snipen 's solution. It's easy to understand