Split ordered fasta file by alphabetical group using awk
2
1
Entering edit mode
5.8 years ago
Josh Herr 5.8k

I know there are a ton of awk one-liners on here for splitting a fasta file, but here's one I have not been able to get to work or find an answer for.

A simple tool would be helpful, but I tried to use pyfasta and seqtk to no avail.

Please excuse me if someone has answered this one before, but I have googled and biostared for a while with no awk solution in sight.

A collaborator passed me a fasta file with the output from OrthoMCL - clustered genes in a single fasta file. A clustered group of genes in the file is listed alphabetically by the organism it was found in. Genes for some organisms are not present, so I can't split on the number of total organisms represented across all the fasta headers.

Any advice how to split a fasta file when the first two characters of the header is >A so that I have many fasta files where each clustered gene has it's own fasta file? There are multiple organisms with A as the first letter so I don't want to split just on A - I want to split before the first A in a series only.

awk fasta one-liner • 1.7k views
ADD COMMENT
1
Entering edit mode

@OP: Good description of data. An example fasta and expected output would be helpful.

ADD REPLY
0
Entering edit mode

Absolutely...

The original file is over 3 million sequences ordered like this:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA
>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA
>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA
>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and I am looking to parse the file like this:

file 1:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA

file 2:

>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA

file 3:

>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA

file 4:

>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and so on...

Note that the sequences are similar but not identical so I can't separate by sequence. I can recluster, but with millions of sequences I am trying to avoid this and quickly parse the file.

I'm fairly certain from the gene clustering that each section of each gene begins with an A taxa, but I am not 100% certain.

ADD REPLY
0
Entering edit mode

Thanks. In short you would like to break every before A_sp and send it to a new file. @OP

ADD REPLY
2
Entering edit mode
5.8 years ago
awk '($1==">A_sp") {n++;close(out);out=sprintf("chunk%d.fa",n);} {print >> out;}' input.fa
ADD COMMENT
0
Entering edit mode
5.8 years ago

Try this: fa files would be numbered 1.fa,2.fa,3.fa,4 so on, It would create many files. Do it a different folder.

awk '/>A_sp / {i++}{print > i".fa"}' test.fa

For the example post above, it would create 4 files. or you can use csplit in following way:

$ csplit  -z -b "%01d.fa"  -f split test.fasta /\>A_sp\ / '{*}'

test.fasta is from OP. This would create 4 files with names: split0.fa, split1.fa and so on so forth.

ADD COMMENT

Login before adding your answer.

Traffic: 2695 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6