sequence splitting
2.1 years ago
zhichusun ▴ 10

I have a FASTA file that contains multiple contigs:

>DEHFGCMO_00205
>MDDGIGEH_00111
>FLCICGHF_00226
>FLCICGHF_00253
>DEHFGCMO_01539
>MDDGIGEH_00625

I want to split the contigs by the first few letters of their names (the part before the underscore) and collect them into separate FASTA files, e.g.

1.fasta

>DEHFGCMO_00205 
>DEHFGCMO_01539

2.fasta

>MDDGIGEH_00111 
>MDDGIGEH_00625 

3.fasta

>FLCICGHF_00226 
>FLCICGHF_00253

What should I do? I would be very grateful for your help.


Assuming that each sequence is on a single line and that all sequence names/IDs follow the same pattern (prefix, underscore, number):

$ awk -F '[>_]' '/^>/ {getline seq; print $0 "\n" seq > ($2 ".fa")}' test.fa
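If the sequences are wrapped over several lines, the one-liner above would only keep the first sequence line of each record. A minimal sketch of a variant that tolerates multi-line records is below. It assumes every header has the form `>PREFIX_number`; the file name `demo.fa` and its sequences are made up purely so the example can be run as is.

```shell
# Made-up records (one of them wrapped over two lines), only to exercise the script.
# Replace demo.fa with your own file.
printf '>DEHFGCMO_00205\nACGT\nTTGA\n>MDDGIGEH_00111\nGGCC\n>DEHFGCMO_01539\nCCAA\n' > demo.fa

# On each header line, take the text between ">" and the first "_" as the output name.
# Every line (header or sequence) is then appended to the current output file.
awk '
  /^>/ {
    if (out != "") close(out)          # avoid running out of open file handles
    split(substr($0, 2), parts, "_")   # parts[1] is the prefix, e.g. DEHFGCMO
    out = parts[1] ".fa"
  }
  out != "" { print >> out }
' demo.fa
```

Because the script appends (`>>`), delete any output files left from a previous run before running it again. A header without an underscore would be used whole as the file name, so check your IDs first.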
2.1 years ago

seqkit split

$ seqkit split --by-id  --id-regexp "^(.+?)_" test.fasta -O result
[INFO] split by ID. idRegexp: ^(.+?)_
[INFO] read sequences ...
[INFO] read 6 sequences
[INFO] write 2 sequences to file: result/test.id_DEHFGCMO.fasta
[INFO] write 2 sequences to file: result/test.id_MDDGIGEH.fasta
[INFO] write 2 sequences to file: result/test.id_FLCICGHF.fasta