Question: Split ordered fasta file by alphabetical group using awk
1
gravatar for Josh Herr
4 months ago by
Josh Herr5.6k
University of Nebraska
Josh Herr5.6k wrote:

I know there are a ton of awk one-liners on here for splitting a fasta file, but here's one I have not been able to get to work or find an answer for.

A simple tool would be helpful, but I tried to use pyfasta and seqtk to no avail.

Please excuse me if someone has answered this one before, but I have googled and biostared for a while with no awk solution in sight.

A collaborator passed me a fasta file with the output from OrthoMCL - clustered genes in a single fasta file. A clustered group of genes in the file is listed alphabetically by the organism it was found in. Genes for some organisms are not present, so I can't split on the number of total organisms represented across all the fasta headers.

Any advice how to split a fasta file when the first two characters of the header is >A so that I have many fasta files where each clustered gene has it's own fasta file? There are multiple organisms with A as the first letter so I don't want to split just on A - I want to split before the first A in a series only.

awk one-liner fasta • 272 views
ADD COMMENTlink modified 4 months ago by cpad011210k • written 4 months ago by Josh Herr5.6k
1

@OP: Good description of data. An example fasta and expected output would be helpful.

ADD REPLYlink written 4 months ago by cpad011210k

Absolutely...

The original file is over 3 million sequences ordered like this:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA
>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA
>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA
>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and I am looking to parse the file like this:

file 1:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA

file 2:

>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA

file 3:

>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA

file 4:

>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and so on...

Note that the sequences are similar but not identical so I can't separate by sequence. I can recluster, but with millions of sequences I am trying to avoid this and quickly parse the file.

I'm fairly certain from the gene clustering that each section of each gene begins with an A taxa, but I am not 100% certain.

ADD REPLYlink modified 4 months ago • written 4 months ago by Josh Herr5.6k

Thanks. In short you would like to break every before A_sp and send it to a new file. @OP

ADD REPLYlink modified 4 months ago • written 4 months ago by cpad011210k
2
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:
awk '($1==">A_sp") {n++;close(out);out=sprintf("chunk%d.fa",n);} {print >> out;}' input.fa
ADD COMMENTlink written 4 months ago by Pierre Lindenbaum114k
0
gravatar for cpad0112
4 months ago by
cpad011210k
India
cpad011210k wrote:

Try this: fa files would be numbered 1.fa,2.fa,3.fa,4 so on, It would create many files. Do it a different folder.

awk '/>A_sp / {i++}{print > i".fa"}' test.fa

For the example post above, it would create 4 files. or you can use csplit in following way:

$ csplit  -z -b "%01d.fa"  -f split test.fasta /\>A_sp\ / '{*}'

test.fasta is from OP. This would create 4 files with names: split0.fa, split1.fa and so on so forth.

ADD COMMENTlink modified 4 months ago • written 4 months ago by cpad011210k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1745 users visited in the last hour