Question

How to remove all-N sequence entries from fasta file.

0

Entering edit mode

3.0 years ago

Info.shi ▴ 30

I want to remove entries of a fasta file where all nucleotides are N, but not entries that contain ATGC and N nucleotides.

from an example-

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC

Output file content to be:

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC

Thank you!

awk perl • 2.6k views

ADD COMMENT • link updated 3.0 years ago by cpad0112 21k • written 3.0 years ago by Info.shi ▴ 30

1

Entering edit mode

Two ways with grep (for flat fasta files) and cutadapt:

$ grep -i -B 1 --no-group-separator  '[ATGC]' test.fa

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC



$ cutadapt --quiet --max-n 0.99 test.fa

>Seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>Seq_3
ATGGCCTTAGTACCAGCTACAAACAACACGNNNAGCNGCCGAC

ADD REPLY • link 3.0 years ago by cpad0112 21k

score 1 · Answer 1 · 2021-04-21

1

Entering edit mode

3.0 years ago

GenoMax 141k

Use reformat.sh from BBMap suite. Set NN to the a high enough number.

reformat.sh -Xmx2g in=your.fa out=clean.fa maxns=NN

ADD COMMENT • link 3.0 years ago by GenoMax 141k

score 1 · Answer 2 · 2021-04-21

cat foo.fa
>chrA
NNNNNNNN
>chrB
TAGCTACGNNATCGNNN
>chrC
TAGCTGACATCG

samtools faidx foo.fa
seqtk comp foo.fa \
| awk '{c=0;for(i=3;i<=6;++i){c+=$i}; gsub(">",""); if(c>0) print $1}' \
| xargs samtools faidx foo.fa

>chrB
TAGCTACGNNATCGNNN
>chrC
TAGCTGACATCG

Explanation: seqtk comp determines the number of A/T/C/G in each sequence (that is the columns 3-6 of its output). awk then will print those chromosome names where the sum of these are zero (so where only non A/T/C/G where in the sequence), and samtools faidx then fetches from the fasta those entires that awk returned. The initial faidx command is used to index the fasta file.

score 1 · Answer 3 · 2021-04-21

1

Entering edit mode

3.0 years ago

Pierre Lindenbaum 161k

linearize, filter and reformat with awk

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '!($2 ~ /^[N]*$/) {printf("%s\n%s\n",$1,$2);}'

ADD COMMENT • link 3.0 years ago by Pierre Lindenbaum 161k

score 1 · Answer 4 · 2021-04-21

You can run gaas_fasta_purify.pl from GAAS

It will

Remove pure Ns sequences
lowercase nucleotides: There are considered as repeat by annotation tool, while most assemblers consider them as region with low coverage. Fixed by the tool.
Leading and trailing Ns sequences are forbidden for submission to ENA. Trimmed by the tool.
Long internal N regions (> 10000) is problematic in order to run Genemark. The tool inform the user.
IUPAC code might be problematic for some tools. e.g, an extra option is mandatory for MAKER. The tool inform the user.
Short sequences can be useless and time consuming for annotation tools. The tool remove them. Default 1000 bp.
Long sequence in single line fasta might raise issues some tools. The tool fold them (80 characters).