bedtools getfasta chromosome not found
1
1
Entering edit mode
3.3 years ago

I am getting a curious error when using bedtools to extract sequences from a reference fasta and a gff3 file.

I enter: /gpfs/gwngs/tools/bedtools2/bin/bedtools getfasta -s -fullHeader -fi ref.fa -bed ref.gff3

And I get a million lines saying, for example:

WARNING. chromosome (CP014594) was not found in the FASTA file. Skipping.

The thing is I grepped the fasta header for this chromosome and got:

>CP014594

There are no white spaces in the line. what's going on? i have tried inserting the prefix 'chr' into both files, and inserting a space between > and the chromosome name for the fasta file.

software error sequencing bedtools • 2.0k views
ADD COMMENT
0
Entering edit mode

what is the output of

file ref.fa

?

ADD REPLY
0
Entering edit mode

ref.fa: ASCII text, with CRLF line terminators

ADD REPLY
0
Entering edit mode

00000000 54 47 47 54 47 47 41 54 47 43 54 47 0a 3e 43 50 |TGGTGGATGCTG.>CP| 00000010 30 31 34 35 39 34 0a 41 43 41 47 43 43 47 41 43 |014594.ACAGCCGAC| 00000020 41 41 43 43 43 41 41 43 41 54 47 43 43 41 41 41 |AACCCAACATGCCAAA| 00000030 43 54 43 43 41 47 41 43 54 43 47 41 41 43 43 54 |CTCCAGACTCGAACCT| 00000040 47 47 47 41 43 54 43 43 41 41 47 41 41 54 43 41 |GGGACTCCAAGAATCA| 00000050 41 41 43 0a |AAC.| 00000054

and

00000000 43 50 30 31 34 35 39 34 09 2e 09 67 65 6e 65 09 |CP014594...gene.| 00000010 31 30 34 36 31 09 31 34 36 35 38 09 2e 09 2b 09 |10461.14658...+.| 00000020 2e 09 4e 61 6d 65 3d 41 54 59 34 30 5f 72 52 4e |..Name=ATY40_rRN| 00000030 41 63 53 43 37 73 31 30 34 36 31 65 31 34 36 35 |AcSC7s10461e1465| 00000040 38 3b 6c 6f 63 75 73 5f 74 61 67 3d 22 41 54 59 |8;locus_tag="ATY| 00000050 34 30 5f 72 52 4e 41 63 53 43 37 73 31 30 34 36 |40_rRNAcSC7s1046| 00000060 31 65 31 34 36 35 38 22 0a |1e14658".| 00000069

ADD REPLY
0
Entering edit mode
3.3 years ago

your fasta is a windows file . https://en.wikipedia.org/wiki/Newline#Frequent_issues

remove the invisible '\r' with

tr -d '\r' < ref.fa > tmp.fa
mv tmp.fa ref.fa

and try again.

ADD COMMENT
1
Entering edit mode

alternative command line fix: use the dos2unix command (dos2unix yourfile.fa)

ADD REPLY
0
Entering edit mode

good catch! unfortunately, it still doesn't work, in spite of both files now being ASCII text with no \r .

ADD REPLY
0
Entering edit mode

hum... output of

grep CP014594 -B 1 -A 1 ref.fa | hexdump -C ?

and

grep -w CP014594 ref.gff3 | tail -1  | hexdump -C ?
ADD REPLY
0
Entering edit mode

for some reason my answer posted above, i'm not sure how to move it:

00000000 54 47 47 54 47 47 41 54 47 43 54 47 0a 3e 43 50 |TGGTGGATGCTG.>CP| 00000010 30 31 34 35 39 34 0a 41 43 41 47 43 43 47 41 43 |014594.ACAGCCGAC| 00000020 41 41 43 43 43 41 41 43 41 54 47 43 43 41 41 41 |AACCCAACATGCCAAA| 00000030 43 54 43 43 41 47 41 43 54 43 47 41 41 43 43 54 |CTCCAGACTCGAACCT| 00000040 47 47 47 41 43 54 43 43 41 41 47 41 41 54 43 41 |GGGACTCCAAGAATCA| 00000050 41 41 43 0a

and

00000000 43 50 30 31 34 35 39 34 09 2e 09 67 65 6e 65 09 |CP014594...gene.| 00000010 31 30 34 36 31 09 31 34 36 35 38 09 2e 09 2b 09 |10461.14658...+.| 00000020 2e 09 4e 61 6d 65 3d 41 54 59 34 30 5f 72 52 4e |..Name=ATY40_rRN| 00000030 41 63 53 43 37 73 31 30 34 36 31 65 31 34 36 35 |AcSC7s10461e1465| 00000040 38 3b 6c 6f 63 75 73 5f 74 61 67 3d 22 41 54 59 |8;locus_tag="ATY| 00000050 34 30 5f 72 52 4e 41 63 53 43 37 73 31 30 34 36 |40_rRNAcSC7s1046| 00000060 31 65 31 34 36 35 38 22 0a |1e14658".| 00000069

it looks like the characters are the same. i tried downloading another copy of bedtools (didn't work), but new copies of the ref.fa and ref.gff have worked. not sure why...maybe the error is misleading? thank you for your help.

ADD REPLY

Login before adding your answer.

Traffic: 2295 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6