Question

Help finding error in fasta file

9 months ago

nreynolds • 0

Hello all,

I am building a reference database in which every sequence must have the following format:

>MG8515941;tax=k:SAR,p:Alveolata,c:Dinophyceae,o:Suessiales,f:Borghiellaceae,g:Borghiella,s:Borghiella sp
TGGCGAATGAACAGGGACAAGCTCGGCATGGAAATTGGGGCCTCTGGCCTTGAATTGTAGCCTCGAGAAG

Somewhere in the file there is at least one error that prevents the amplicon sequencing pipeline I'm using (AMPtk) from recognizing it as a FASTA file. I have searched for every error motif I can think of (using TextEdit), but I can't find the problem. I think the most likely issue is that one or more of the sequences is missing a hard return after the taxonomy string.

The file (RDPSILVA_LSUdatabase_error.fasta) is here: https://osf.io/cz3mh/

Any suggestions for how I can find the error(s) without going through the file line by line?

fasta • 649 views

updated 8 months ago by Ram 45k • written 9 months ago by nreynolds • 0

score 1 · Answer 1 · 2024-12-21

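The following command prints every line that breaks the expected layout (a header starting with > on odd lines, a sequence of IUPAC nucleotide codes on even lines):

awk '!((NR%2==1 && $0 ~ /^>/) || (NR%2==0 && $0 ~ /^[ACGTURYSWKMBDHVN]+$/)) {print NR,$0;}' in.fa

It shows that the record at line 5363 is empty.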
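To test the "missing hard return" idea from the question directly, here is a minimal sketch, assuming a two-line-per-record file like in.fa above. The function names and the 30-base threshold are my own illustrative choices, not anything from AMPtk. It lists blank lines, and header lines that end in a long run of nucleotide codes, which is what a header with its sequence glued on would look like.

```shell
# Print the line numbers of blank or whitespace-only lines.
# A trailing carriage return is ignored so CRLF files are handled too.
find_blank_lines() {
    awk '{ sub(/\r$/, ""); if ($0 ~ /^[ \t]*$/) print NR }' "$1"
}

# Print "line number: header" for header lines that end in at least
# `min` IUPAC nucleotide codes (second argument, default 30).
# This is a heuristic: a taxonomy string normally ends in mixed-case
# text, so a long uppercase run at the end suggests a missing newline.
find_glued_headers() {
    awk -v min="${2:-30}" '
        { sub(/\r$/, "") }
        /^>/ {
            if (match($0, /[ACGTURYSWKMBDHVN]+$/) && RLENGTH >= min)
                print NR ": " $0
        }' "$1"
}
```

For example, `find_blank_lines in.fa` and `find_glued_headers in.fa` each print nothing when no such lines exist.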

Furthermore, some of your lines end with CRLF:

 cat RDPSILVA_LSUdatabase_error.fasta  | grep -v '^>' | grep -vE  '^[ACGTURYSWKMBDHVN]+$' | file -
/dev/stdin: ASCII text, with very long lines (1376), with CRLF line terminators
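One way to clean up both problems, as a sketch only: strip the carriage returns and drop the blank lines. The function name `clean_fasta` and the output name cleaned.fasta are hypothetical. This does not repair a header that is missing its hard return, so rerun the awk check above on the result.

```shell
# Write a cleaned copy of a FASTA file to stdout:
#   tr  removes every carriage return (CRLF becomes LF)
#   awk keeps only lines that contain at least one non-blank field
clean_fasta() {
    tr -d '\r' < "$1" | awk 'NF > 0'
}
```

Usage would look like `clean_fasta RDPSILVA_LSUdatabase_error.fasta > cleaned.fasta`.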