Remove duplicates in FASTA files based on a specific value with awk
Asked 2.6 years ago by Mgs3

I have a FASTA file organized as such:

>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAAGCAAAGAGTCCCTCTACGCACGTTCGTCCTATCTGCCGTATTAATTCTTATTACTGGTTGCTCGAAACCGGAAACCCAACCAGCCGCCGACGCCCCGGCGGAGAT
>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAATATCTCGAAATTCTTCATCGACCGGCCGATCTTCGCCGGCGTGCTTTCGATCCTGGTGTTGCTGGCGGGCATACTGGCCATGTTCCAGCTGCCCATTTCCGAGTACCCGGAAGTGGTGCCGCCGTCGGTGGTGGTGCGCGCGCAGTATCCGGGCGCCAACCCCAAGGTCATCGCCGAAACCGTGGCCTCGCCGCTGGAGGAG

I need to remove sequences that share the same ARO code (such as the two above), keeping only one. Is there a simple solution to this problem using awk? Alternatively, I can use Python.

Tags: sort, fasta, sed, awk
Answer (2.6 years ago):

awk -F '|' '/^>/ {printf("%s%s,%s\t",(N>0?"\n":""),$3,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | sort -t, -k1,1 -u | cut -d, -f2- | tr "\t" "\n"
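
Since the question also allows Python, here is a minimal sketch in plain Python that keeps the first record per ARO code. It assumes the ARO tag is always the third |-separated field of the header; unlike the one-liner above, it needs no flattening step, since every sequence line simply follows its header's keep/drop decision.

    import sys

    def dedup_fasta(path):
        """Print each FASTA record whose ARO code has not been seen yet."""
        seen = set()
        keep = False
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    # e.g. ">...|ARO_Name:AxyX|ARO:3004143|..." -> "ARO:3004143"
                    aro = line.rstrip("\n").split("|")[2]
                    keep = aro not in seen
                    seen.add(aro)
                if keep:
                    sys.stdout.write(line)

    if __name__ == "__main__":
        dedup_fasta(sys.argv[1])  # usage: python dedup.py input.fa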
Comment:

The awk-magician at it again :)

Comment:

This solution is perfect; I would really appreciate a simple explanation of it.

Reply:
1. awk -F '|' '/^>/ {printf("%s%s,%s\t",(N>0?"\n":""),$3,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa

   Prints three fields per record: the string to be compared (the ARO tag), the sequence header, and the sequence itself. In the output, the first delimiter is a comma and the second is a tab. While consuming input, the field delimiter is the pipe (|), and header lines are recognized because they start with > (the FASTA header marker).

2. sort -t, -k1,1 -u

   Uses the comma as delimiter, sorts on the first field, and keeps only one line per unique value of that field.

3. cut -d, -f2-

   Keeps field 2 onwards, with the comma as the cutting delimiter (this retains the sequence header and the sequence).

4. tr "\t" "\n"

   Replaces the tab with a newline, restoring the two-line FASTA record.
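
For the two example records from the question, the stream between steps 1 and 2 looks like this (one line per record, sequences truncated, the tab shown as <TAB>):

ARO:3004143,>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model<TAB>ATGAAGCAAAGA...
ARO:3004143,>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model<TAB>ATGAATATCTCG...

Both lines share the key ARO:3004143 before the first comma, so sort -t, -k1,1 -u keeps only one of them.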

It can be further tightened:

$ awk -F '|' '/^>/ {printf("%s%s\t%s\t",(N>0?"\n":""),$3,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' test.fa | sort -k1,1 -u | awk -F'\t' -v OFS="\n" '{print $2,$3}'

(All three fields are now tab-separated, so the default sort key -k1,1 is just the ARO tag, and the final awk can split on tabs.)

However, this can be tricky, as you have no control over which sequence gets kept (e.g., the short vs. the long one). In those cases, I would suggest datamash.

Say you would like to keep the longer sequence between/among duplicate records; use this (assuming the FASTA sequences are single-line):

$ awk -F '|' -v OFS="\t" '/^>/ {getline seq} {print $3,$0,seq,length(seq)}' test.fa | datamash -fs -g1 max 4 | awk -F "\t" -v OFS="\n" '{print $2,$3}'

And to keep the shorter sequence between/among duplicate records, use this:

$ awk -F '|' -v OFS="\t" '/^>/ {getline seq} {print $3,$0,seq,length(seq)}' test.fa | datamash -fs -g1 min 4 | awk -F "\t" -v OFS="\n" '{print $2,$3}'

However, the code above (from my post) works only if sequences are single-line. To convert multi-line FASTA records into single-line (flattened) records, there are awk scripts, or you can use seqkit; see the sketch below.
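
For example, a flattening step in the same awk style as above, or seqkit's seq command with line wrapping disabled (file names are illustrative):

$ awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' multi.fa > single.fa
$ seqkit seq -w 0 multi.fa > single.fa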

Answer (2.6 years ago):

With seqkit, using its --id-regexp option so that the captured ARO number becomes the record ID that rmdup deduplicates on (both regexes below are equivalent):

$ seqkit rmdup  --id-regexp "ARO:([0-9]+)" test.fa
$ seqkit rmdup  --id-regexp "ARO:(\d+)" test.fa
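
As a quick sanity check, compare header counts before and after deduplication:

$ grep -c '^>' test.fa
$ seqkit rmdup --id-regexp "ARO:([0-9]+)" test.fa | grep -c '^>'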
