Help finding error in fasta file
9 months ago
nreynolds • 0

Hello all,

I am building a reference database in which every sequence must have the following format:

>MG8515941;tax=k:SAR,p:Alveolata,c:Dinophyceae,o:Suessiales,f:Borghiellaceae,g:Borghiella,s:Borghiella sp
TGGCGAATGAACAGGGACAAGCTCGGCATGGAAATTGGGGCCTCTGGCCTTGAATTGTAGCCTCGAGAAG

Somewhere there is at least one error that prevents the file from being recognized as a FASTA file by the amplicon sequencing pipeline I'm using (AMPtk). I have tried searching for every error motif I can think of (using TextEdit), but I can't find the problem. I think the most likely issue is that one (or more) of the sequences is missing a hard return after the taxonomy string.

The file (RDPSILVA_LSUdatabase_error.fasta) is here https://osf.io/cz3mh/

Any suggestions for how I can find the error(s) without going through the file line by line?

fasta • 649 views
9 months ago

Using

awk '!((NR%2==1 && $0 ~ /^>/) || (NR%2==0 && $0 ~ /^[ACGTURYSWKMBDHVN]+$/)) {print NR,$0;}' in.fa

shows that the record at line 5363 is empty.

Furthermore, some of your lines end with CRLF:

grep -v '^>' RDPSILVA_LSUdatabase_error.fasta | grep -vE '^[ACGTURYSWKMBDHVN]+$' | file -
/dev/stdin: ASCII text, with very long lines (1376), with CRLF line terminators
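
If it helps, here is a rough sketch of how both problems could be cleaned up and re-checked from the shell. The function names `clean_fasta` and `check_fasta` and the output name `cleaned.fasta` are made up for illustration (they are not part of AMPtk), and the sketch assumes a strict two-line layout: header on odd lines, a single IUPAC sequence line on even lines.

```shell
# Hypothetical helper (not part of AMPtk): normalize line endings and drop blank lines.
clean_fasta() {
  # tr deletes every carriage return (CRLF -> LF);
  # awk 'NF' keeps only lines that contain at least one non-blank field.
  tr -d '\r' < "$1" | awk 'NF'
}

# Hypothetical helper: print "line_number<TAB>line" for every line that breaks the
# assumed layout (header on odd lines, IUPAC nucleotide codes on even lines).
check_fasta() {
  awk '!((NR%2==1 && /^>/) || (NR%2==0 && /^[ACGTURYSWKMBDHVN]+$/)) {print NR "\t" $0}' "$1"
}
```

For example, `clean_fasta RDPSILVA_LSUdatabase_error.fasta > cleaned.fasta` followed by `check_fasta cleaned.fasta` should print nothing if those were the only issues. Anything still printed (for instance a header whose sequence line is missing, or a sequence stuck onto the end of a taxonomy string) would need to be fixed by hand.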