Question: split fasta of a protein sequence into consecutive n amino acids
0
gravatar for arraychip
5 months ago by
arraychip10
arraychip10 wrote:

let's say I have a fasta of a protein sequence

> albumin

MKWVTFISLL FLFSSAYSRG ... ... ...

I want to split the sequence into all possible consecutive 8 amino acids (only in 1 direction, amino -> carboxyl) (And no looping(I don't know if it is the right expression), say, GMKWVTFIS)

I need

> fasta.albumin1

MKWVTFIS

>fasta.albumin2

KWVTFISL

> fasta.albumin3

WVTFISLLF

...


> fasta.albumin13

FSSAYSRG

And, I want to do this for all known human protein sequences.

How would I do it??? I need the result as a fasta or fasta files. And IDs for resulting 8-mer seuqeunces need to be unique.

fasta • 279 views
ADD COMMENTlink modified 5 months ago by Pierre Lindenbaum110k • written 5 months ago by arraychip10
1

Have you tried to do this yourself? Please post your code you have already tried.

ADD REPLYlink written 5 months ago by Sej Modha3.1k
1

I am a biologist. I can google to search for news, but can't, and don't know any scripting. I was even struggling with my posting questions. Sorry

ADD REPLYlink written 5 months ago by arraychip10

That is fine. This forum is there to help with that. I have formatted your post yet again. Please do not change the formatting back to plain text.

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax51k
1

You say 5 AA in title but 8 in the body of post. Which is it?

There are ~20K+ validated human proteins at UniProt so results file is going to be gigantic.

Note: Why do you keep changing back the formatting of the data in your post to plain text?

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax51k

Yes, it's going to be a 'big data' I really want try.

ADD REPLYlink written 5 months ago by arraychip10

Hey @genomax

There is some issue with the code formatter below. A '(' is missing in my code print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer)) below but looks okay when I check in the edit mode.

ADD REPLYlink written 5 months ago by Sej Modha3.1k

There is a known issue with python code formatting and biostars display of that code. I think the fix is to put a comment there to add the ( at proper place. Not elegant but will work for now.

Edit: Best solution is to put your code in a gist and then link that in your post. Biostars then looks the code up via gist and formats it correctly.

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax51k

I see! I will opt for the gist option next time. Thanks

ADD REPLYlink written 5 months ago by Sej Modha3.1k

If you know python/perl then this should be easy to do. Someone is bound to put a program up as answer shortly. @Pierre will likely have a one liner in awk.

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax51k
2
gravatar for Sej Modha
5 months ago by
Sej Modha3.1k
Glasgow, UK
Sej Modha3.1k wrote:

You requirement is to generate simple kmers from sequences. BioPython solution. Please change last print statement to: print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))

#!/usr/bin/env python3

from Bio import SeqIO

    myfile=SeqIO.parse('test.fa','fasta')
    for record in myfile:
        sequence=record.seq
        seq_len=len(sequence)
        #define the kmer len
        kmer=8
        count=0
        for seq in list(range(seq_len-(kmer-1))):
            count=count+1
            my_kmer=(sequence[seq:seq+kmer])
            print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))
ADD COMMENTlink modified 5 months ago • written 5 months ago by Sej Modha3.1k

I know a little bit of R. But maybe this is a good chance to study (Bio)Python. I will definitely try, once I study a little bit of BioPython. Thanks a lot

ADD REPLYlink written 5 months ago by arraychip10

Please accept the answers to indicate the question is resolved.

ADD REPLYlink written 5 months ago by Sej Modha3.1k
1
gravatar for shenwei356
5 months ago by
shenwei3564.0k
China
shenwei3564.0k wrote:

There are many methods. For long terms, some programming skill is worth learning.

Here's a simple way by using seqkit.

$ seqkit sliding -s 1 -W 8 seqs.fa
>albumin_sliding:1-8
MKWVTFIS
>albumin_sliding:2-9
KWVTFISL
...
>albumin_sliding:13-20
FSSAYSRG
ADD COMMENTlink modified 5 months ago • written 5 months ago by shenwei3564.0k

OK, I will definitely try. Thanks a lot !!!!!

ADD REPLYlink written 5 months ago by arraychip10
1
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum110k wrote:

Using bioalcidaejdk:

$ java -jar dist/bioalcidaejdk.jar -e 'final int N=8;stream().filter(F->F.length()>=N).forEach(F->IntStream.range(0,F.length()-N).forEach(L->{out.println(">"+F.getName()+"."+(L+1)+"\n"+F.subSequence(L,L+N));}));' input.fasta
ADD COMMENTlink written 5 months ago by Pierre Lindenbaum110k
2

I was hoping to see a awk one liner :-(

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1422 users visited in the last hour