Question: split fasta of a protein sequence into consecutive n amino acids
2
gravatar for arraychip
10 months ago by
arraychip30
arraychip30 wrote:

let's say I have a fasta of a protein sequence

> albumin

MKWVTFISLL FLFSSAYSRG ... ... ...

I want to split the sequence into all possible consecutive 8 amino acids (only in 1 direction, amino -> carboxyl) (And no looping(I don't know if it is the right expression), say, GMKWVTFIS)

I need

> fasta.albumin1

MKWVTFIS

>fasta.albumin2

KWVTFISL

> fasta.albumin3

WVTFISLLF

...


> fasta.albumin13

FSSAYSRG

And, I want to do this for all known human protein sequences.

How would I do it??? I need the result as a fasta or fasta files. And IDs for resulting 8-mer seuqeunces need to be unique.

fasta • 574 views
ADD COMMENTlink modified 10 months ago by Pierre Lindenbaum115k • written 10 months ago by arraychip30
1

Have you tried to do this yourself? Please post your code you have already tried.

ADD REPLYlink written 10 months ago by Sej Modha3.9k
1

I am a biologist. I can google to search for news, but can't, and don't know any scripting. I was even struggling with my posting questions. Sorry

ADD REPLYlink written 10 months ago by arraychip30

That is fine. This forum is there to help with that. I have formatted your post yet again. Please do not change the formatting back to plain text.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax59k
1

You say 5 AA in title but 8 in the body of post. Which is it?

There are ~20K+ validated human proteins at UniProt so results file is going to be gigantic.

Note: Why do you keep changing back the formatting of the data in your post to plain text?

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax59k

Yes, it's going to be a 'big data' I really want try.

ADD REPLYlink written 10 months ago by arraychip30

Hey @genomax

There is some issue with the code formatter below. A '(' is missing in my code print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer)) below but looks okay when I check in the edit mode.

ADD REPLYlink written 10 months ago by Sej Modha3.9k

There is a known issue with python code formatting and biostars display of that code. I think the fix is to put a comment there to add the ( at proper place. Not elegant but will work for now.

Edit: Best solution is to put your code in a gist and then link that in your post. Biostars then looks the code up via gist and formats it correctly.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax59k

I see! I will opt for the gist option next time. Thanks

ADD REPLYlink written 10 months ago by Sej Modha3.9k

If you know python/perl then this should be easy to do. Someone is bound to put a program up as answer shortly. @Pierre will likely have a one liner in awk.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax59k
3
gravatar for shenwei356
10 months ago by
shenwei3564.3k
China
shenwei3564.3k wrote:

There are many methods. For long terms, some programming skill is worth learning.

Here's a simple way by using seqkit.

$ seqkit sliding -s 1 -W 8 seqs.fa
>albumin_sliding:1-8
MKWVTFIS
>albumin_sliding:2-9
KWVTFISL
...
>albumin_sliding:13-20
FSSAYSRG
ADD COMMENTlink modified 10 months ago • written 10 months ago by shenwei3564.3k

OK, I will definitely try. Thanks a lot !!!!!

ADD REPLYlink written 10 months ago by arraychip30

This is the best solution +1

ADD REPLYlink written 4 months ago by SmallChess460
3
gravatar for Sej Modha
10 months ago by
Sej Modha3.9k
Glasgow, UK
Sej Modha3.9k wrote:

You requirement is to generate simple kmers from sequences. BioPython solution. Please change last print statement to: print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))

#!/usr/bin/env python3

from Bio import SeqIO

    myfile=SeqIO.parse('test.fa','fasta')
    for record in myfile:
        sequence=record.seq
        seq_len=len(sequence)
        #define the kmer len
        kmer=8
        count=0
        for seq in list(range(seq_len-(kmer-1))):
            count=count+1
            my_kmer=(sequence[seq:seq+kmer])
            print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))
ADD COMMENTlink modified 10 months ago • written 10 months ago by Sej Modha3.9k

I know a little bit of R. But maybe this is a good chance to study (Bio)Python. I will definitely try, once I study a little bit of BioPython. Thanks a lot

ADD REPLYlink written 10 months ago by arraychip30

Please accept the answers to indicate the question is resolved.

ADD REPLYlink written 10 months ago by Sej Modha3.9k
2
gravatar for Pierre Lindenbaum
10 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

Using bioalcidaejdk:

$ java -jar dist/bioalcidaejdk.jar -e 'final int N=8;stream().filter(F->F.length()>=N).forEach(F->IntStream.range(0,F.length()-N).forEach(L->{out.println(">"+F.getName()+"."+(L+1)+"\n"+F.subSequence(L,L+N));}));' input.fasta
ADD COMMENTlink written 10 months ago by Pierre Lindenbaum115k
2

I was hoping to see a awk one liner :-(

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1471 users visited in the last hour