Tutorial: Split a 'linearised' (flattened) FASTA sequence into multi-line using AWK
3.4 years ago

We have a FASTA file in which each record is a header line followed by its entire sequence on a single line:

cat fasta.fasta
> 1
AGTACGATCTACGTACGCAACTGAGCTACTACAGTCATGCTGACACTGACTGACACTGACTGACTGTGACACTGACTGCATGCTGCTGGCCCCGCAGTATCGACTGCGTACGTCGCGCGATTACGCGTACTGCGTCTGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGTGCACGTACTGATGCACATGCACTGA
> 2
TGACAGCTACTGACGTACGTACGTACGTCAGTACGTACGTACGTCAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGCACTGCATGACTGACGTACGTACGTACGTACGT


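We can use AWK to wrap each sequence into lines of a fixed maximum length. Setting the field separator to the empty string (-F "") makes every character its own field, so the loop can print a newline after every len characters: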

awk -v len=40 -F "" '/^>/ {print}; !/^>/ {for (i=1; i<=NF; i++) {printf $(i); if (i % len == 0 || i == NF) printf "\n"}}' fasta.fasta

> 1
AGTACGATCTACGTACGCAACTGAGCTACTACAGTCATGC
TGACACTGACTGACACTGACTGACTGTGACACTGACTGCA
TGCTGCTGGCCCCGCAGTATCGACTGCGTACGTCGCGCGA
TTACGCGTACTGCGTCTGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGTGCACGTACTGATGCACATGCA
CTGA
> 2
TGACAGCTACTGACGTACGTACGTACGTCAGTACGTACGT
ACGTCAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTAGCACTGCATGACTGACGTACGT
ACGTACGTACGT


awk -v len=10 -F "" '/^>/ {print}; !/^>/ {for (i=1; i<=NF; i++) {printf $(i); if (i % len == 0 || i == NF) printf "\n"}}' fasta.fasta

> 1
AGTACGATCT
ACGTACGCAA
CTGAGCTACT
ACAGTCATGC
TGACACTGAC
TGACACTGAC
TGACTGTGAC
ACTGACTGCA
TGCTGCTGGC
CCCGCAGTAT
CGACTGCGTA
CGTCGCGCGA
TTACGCGTAC
TGCGTCTGCA
TGCATGCATG
CATGCATGCA
TGCATGCATG
CATGCATGTG
CACGTACTGA
TGCACATGCA
CTGA
> 2
TGACAGCTAC
TGACGTACGT
ACGTACGTCA
GTACGTACGT
ACGTCAGTAC
GTTTTTTTTT
TTTTTTTTTT
TTTTTTTTTT
TTTTTTTTTT
TTTTTTTAGC
ACTGCATGAC
TGACGTACGT
ACGTACGTAC
GT
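
Splitting on an empty field separator (-F "") works in gawk and mawk, but POSIX leaves that behaviour unspecified, so other AWK implementations may differ. As a hedged alternative, here is a minimal sketch that slices each sequence line with substr() instead. The helper name wrap_fasta, the file demo.fasta and its contents are made up for illustration and are not part of the original post:

```shell
# wrap_fasta LEN FILE: hypothetical helper name, not from the original post.
wrap_fasta() {
    awk -v len="$1" '
    /^>/ { print; next }    # header lines pass through unchanged
    {
        # emit the sequence in chunks of at most len characters
        for (i = 1; i <= length($0); i += len)
            print substr($0, i, len)
    }' "$2"
}

# Made-up sample data for illustration.
printf '>seq1\nACGTACGTACGT\n>seq2\nTTTTGGGGCC\n' > demo.fasta
wrap_fasta 5 demo.fasta
```

Because this version prints whole chunks rather than one character at a time, it also avoids passing sequence data to printf as a format string.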


AWK doesn't have to be a one-liner, either:

awk -v len=80 -F "" '/^>/ {print};
!/^>/ {
    for (i=1; i<=NF; i++) {
        printf $(i);
        if (i % len == 0 || i == NF)
            printf "\n"
    }
}' fasta.fasta

> 1
AGTACGATCTACGTACGCAACTGAGCTACTACAGTCATGCTGACACTGACTGACACTGACTGACTGTGACACTGACTGCA
TGCTGCTGGCCCCGCAGTATCGACTGCGTACGTCGCGCGATTACGCGTACTGCGTCTGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGTGCACGTACTGATGCACATGCACTGA
> 2
TGACAGCTACTGACGTACGTACGTACGTCAGTACGTACGTACGTCAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTAGCACTGCATGACTGACGTACGTACGTACGTACGT


Kevin


To linearize a multi-line FASTA file first, use @Pierre's code; you can then use @Kevin's code to re-wrap it.
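
@Pierre's code is not reproduced in this thread, so the following is only a minimal sketch of what linearising could look like in AWK, not necessarily the code referred to above. The helper name linearise_fasta and the sample file wrapped.fasta are assumptions made for illustration:

```shell
# linearise_fasta FILE: hypothetical helper name, not from the original post.
linearise_fasta() {
    awk '
    /^>/ {
        if (seq != "") print seq      # flush the previous record
        print                         # print the header line as is
        seq = ""
        next
    }
    { seq = seq $0 }                  # append each sequence line to the buffer
    END { if (seq != "") print seq }  # flush the last record
    ' "$1"
}

# Made-up multi-line sample data for illustration.
printf '>seq1\nACGTA\nCGTAC\nGT\n>seq2\nTTTTG\nGGGCC\n' > wrapped.fasta
linearise_fasta wrapped.fasta
```

The output has one header line and one sequence line per record, which is the input format the tutorial's wrapping command expects.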