Question: Split ordered fasta file by alphabetical group using awk
1
gravatar for Josh Herr
10 weeks ago by
Josh Herr5.6k
University of Nebraska
Josh Herr5.6k wrote:

I know there are a ton of awk one-liners on here for splitting a fasta file, but here's one I have not been able to get to work or find an answer for.

A simple tool would be helpful, but I tried to use pyfasta and seqtk to no avail.

Please excuse me if someone has answered this one before, but I have googled and biostared for a while with no awk solution in sight.

A collaborator passed me a fasta file with the output from OrthoMCL - clustered genes in a single fasta file. A clustered group of genes in the file is listed alphabetically by the organism it was found in. Genes for some organisms are not present, so I can't split on the number of total organisms represented across all the fasta headers.

Any advice how to split a fasta file when the first two characters of the header is >A so that I have many fasta files where each clustered gene has it's own fasta file? There are multiple organisms with A as the first letter so I don't want to split just on A - I want to split before the first A in a series only.

awk one-liner fasta • 199 views
ADD COMMENTlink modified 10 weeks ago by cpad01128.8k • written 10 weeks ago by Josh Herr5.6k
1

@OP: Good description of data. An example fasta and expected output would be helpful.

ADD REPLYlink written 10 weeks ago by cpad01128.8k

Absolutely...

The original file is over 3 million sequences ordered like this:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA
>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA
>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA
>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and I am looking to parse the file like this:

file 1:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA

file 2:

>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA

file 3:

>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA

file 4:

>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and so on...

Note that the sequences are similar but not identical so I can't separate by sequence. I can recluster, but with millions of sequences I am trying to avoid this and quickly parse the file.

I'm fairly certain from the gene clustering that each section of each gene begins with an A taxa, but I am not 100% certain.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Josh Herr5.6k

Thanks. In short you would like to break every before A_sp and send it to a new file. @OP

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by cpad01128.8k
2
gravatar for Pierre Lindenbaum
10 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:
awk '($1==">A_sp") {n++;close(out);out=sprintf("chunk%d.fa",n);} {print >> out;}' input.fa
ADD COMMENTlink written 10 weeks ago by Pierre Lindenbaum112k
0
gravatar for cpad0112
10 weeks ago by
cpad01128.8k
India
cpad01128.8k wrote:

Try this: fa files would be numbered 1.fa,2.fa,3.fa,4 so on, It would create many files. Do it a different folder.

awk '/>A_sp / {i++}{print > i".fa"}' test.fa

For the example post above, it would create 4 files. or you can use csplit in following way:

$ csplit  -z -b "%01d.fa"  -f split test.fasta /\>A_sp\ / '{*}'

test.fasta is from OP. This would create 4 files with names: split0.fa, split1.fa and so on so forth.

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by cpad01128.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1296 users visited in the last hour