Divide each fasta sequence into fragments of 200 nucleotides
7.1 years ago

I have a multi-FASTA file:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA

>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTG
AATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGG
GCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGC
ATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAA

How can I divide each FASTA sequence into fragments of 200 nucleotides?

gene fasta • 1.5k views

It is not clear what kind of fragments you want.

For example, for fragment length = 3,

$ echo -e ">seq\nacgtnACGTN"                           
>seq
acgtnACGTN

Do you want this:

$ echo -e ">seq\nacgtnACGTN" | seqkit sliding -W 3 -s 3 -g
>seq_sliding:1-3
acg
>seq_sliding:4-6
tnA
>seq_sliding:7-9
CGT
>seq_sliding:10-12
N

Or this:

$ echo -e ">seq\nacgtnACGTN" | seqkit sliding -W 3 -s 1   
>seq_sliding:1-3
acg
>seq_sliding:2-4
cgt
>seq_sliding:3-5
gtn
>seq_sliding:4-6
tnA
>seq_sliding:5-7
nAC
>seq_sliding:6-8
ACG
>seq_sliding:7-9
CGT
>seq_sliding:8-10
GTN
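
To make the difference between the two outputs explicit, here is a minimal Python sketch of the window/step idea. It is not seqkit's implementation; the function name `windows` and its `greedy` parameter are hypothetical and only loosely mirror the `-W`, `-s` and `-g` options used above. Coordinates are 1-based and report the actual end of each fragment.

```python
def windows(seq, size, step, greedy=False):
    """Return (start, end, fragment) tuples with 1-based inclusive coordinates.

    size   -- window length (analogous to -W above)
    step   -- distance between window starts (analogous to -s above)
    greedy -- if True, also keep trailing fragments shorter than `size`
    """
    result = []
    for start in range(0, len(seq), step):
        fragment = seq[start:start + size]
        if len(fragment) < size and not greedy:
            break  # every later window would be short as well
        result.append((start + 1, start + len(fragment), fragment))
    return result


seq = "acgtnACGTN"

# step == size: consecutive, non-overlapping fragments
for start, end, fragment in windows(seq, size=3, step=3, greedy=True):
    print(f"{start}-{end}\t{fragment}")

# step == 1: overlapping sliding windows, full-length only
for start, end, fragment in windows(seq, size=3, step=1):
    print(f"{start}-{end}\t{fragment}")
```

With step equal to the window size the fragments tile the sequence; with a smaller step they overlap.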
7.1 years ago
James Ashmore ★ 3.4k

The seqkit package provides an answer:

seqkit sliding -s 200 -W 200 input.fasta > output.fasta

Actually, option -s/--step is needed.


Ah yes, thanks for correcting me.

7.1 years ago

Using awk:

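 # Step 1: linearize each record to "header<TAB>sequence".
 # Step 2: cut each sequence into consecutive 200-nt pieces; the last piece may be shorter.
 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
 awk -F '\t' '{x=200;S=1;L=length($2);while(S<=L) {E=S+x-1;if(E>L) E=L;printf("%s (%d-%d)\n%s\n",$1,S,E,substr($2,S,x));S+=x;}}'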
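
For anyone who prefers Python, the same two steps (join each record's sequence lines, then cut it into consecutive pieces) can be sketched as below. This is only an illustration using the standard library; the function names `read_fasta` and `split_fasta` are hypothetical, and the header suffix format simply imitates the awk command.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_fasta(lines, size=200):
    """Yield (new_header, fragment) pairs of consecutive fragments.

    The last fragment of a record may be shorter than `size`.
    Coordinates in the new header are 1-based and inclusive.
    """
    for header, seq in read_fasta(lines):
        for start in range(0, len(seq), size):
            fragment = seq[start:start + size]
            end = start + len(fragment)
            yield f">{header} ({start + 1}-{end})", fragment


# Small in-memory demo with a fragment size of 4
demo = [">seq", "acgtn", "ACGTN"]
for new_header, fragment in split_fasta(demo, size=4):
    print(new_header)
    print(fragment)
# prints:
# >seq (1-4)
# acgt
# >seq (5-8)
# nACG
# >seq (9-10)
# TN
```

To run it on a file, pass an open file handle instead of the list, for example `split_fasta(open("input.fa"), size=200)`.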