Question: bedtools getfasta chromosome not found
1
gravatar for from the mountains
2.4 years ago by
United States
from the mountains110 wrote:

I am getting a curious error when using bedtools to extract sequences from a reference fasta and a gff3 file.

I enter: /gpfs/gwngs/tools/bedtools2/bin/bedtools getfasta -s -fullHeader -fi ref.fa -bed ref.gff3

And I get a million lines saying, for example:

WARNING. chromosome (CP014594) was not found in the FASTA file. Skipping.

The thing is I grepped the fasta header for this chromosome and got:

>CP014594

There are no white spaces in the line. what's going on? i have tried inserting the prefix 'chr' into both files, and inserting a space between > and the chromosome name for the fasta file.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by from the mountains110

what is the output of

file ref.fa

?

ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum129k

ref.fa: ASCII text, with CRLF line terminators

ADD REPLYlink written 2.4 years ago by from the mountains110

00000000 54 47 47 54 47 47 41 54 47 43 54 47 0a 3e 43 50 |TGGTGGATGCTG.>CP| 00000010 30 31 34 35 39 34 0a 41 43 41 47 43 43 47 41 43 |014594.ACAGCCGAC| 00000020 41 41 43 43 43 41 41 43 41 54 47 43 43 41 41 41 |AACCCAACATGCCAAA| 00000030 43 54 43 43 41 47 41 43 54 43 47 41 41 43 43 54 |CTCCAGACTCGAACCT| 00000040 47 47 47 41 43 54 43 43 41 41 47 41 41 54 43 41 |GGGACTCCAAGAATCA| 00000050 41 41 43 0a |AAC.| 00000054

and

00000000 43 50 30 31 34 35 39 34 09 2e 09 67 65 6e 65 09 |CP014594...gene.| 00000010 31 30 34 36 31 09 31 34 36 35 38 09 2e 09 2b 09 |10461.14658...+.| 00000020 2e 09 4e 61 6d 65 3d 41 54 59 34 30 5f 72 52 4e |..Name=ATY40_rRN| 00000030 41 63 53 43 37 73 31 30 34 36 31 65 31 34 36 35 |AcSC7s10461e1465| 00000040 38 3b 6c 6f 63 75 73 5f 74 61 67 3d 22 41 54 59 |8;locus_tag="ATY| 00000050 34 30 5f 72 52 4e 41 63 53 43 37 73 31 30 34 36 |40_rRNAcSC7s1046| 00000060 31 65 31 34 36 35 38 22 0a |1e14658".| 00000069

ADD REPLYlink written 2.4 years ago by from the mountains110
0
gravatar for Pierre Lindenbaum
2.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

your fasta is a windows file . https://en.wikipedia.org/wiki/Newline#Frequent_issues

remove the invisible '\r' with

tr -d '\r' < ref.fa > tmp.fa
mv tmp.fa ref.fa

and try again.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Pierre Lindenbaum129k
1

alternative command line fix: use the dos2unix command (dos2unix yourfile.fa)

ADD REPLYlink written 2.4 years ago by cmdcolin1.4k

good catch! unfortunately, it still doesn't work, in spite of both files now being ASCII text with no \r .

ADD REPLYlink written 2.4 years ago by from the mountains110

hum... output of

grep CP014594 -B 1 -A 1 ref.fa | hexdump -C ?

and

grep -w CP014594 ref.gff3 | tail -1  | hexdump -C ?
ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum129k

for some reason my answer posted above, i'm not sure how to move it:

00000000 54 47 47 54 47 47 41 54 47 43 54 47 0a 3e 43 50 |TGGTGGATGCTG.>CP| 00000010 30 31 34 35 39 34 0a 41 43 41 47 43 43 47 41 43 |014594.ACAGCCGAC| 00000020 41 41 43 43 43 41 41 43 41 54 47 43 43 41 41 41 |AACCCAACATGCCAAA| 00000030 43 54 43 43 41 47 41 43 54 43 47 41 41 43 43 54 |CTCCAGACTCGAACCT| 00000040 47 47 47 41 43 54 43 43 41 41 47 41 41 54 43 41 |GGGACTCCAAGAATCA| 00000050 41 41 43 0a

and

00000000 43 50 30 31 34 35 39 34 09 2e 09 67 65 6e 65 09 |CP014594...gene.| 00000010 31 30 34 36 31 09 31 34 36 35 38 09 2e 09 2b 09 |10461.14658...+.| 00000020 2e 09 4e 61 6d 65 3d 41 54 59 34 30 5f 72 52 4e |..Name=ATY40_rRN| 00000030 41 63 53 43 37 73 31 30 34 36 31 65 31 34 36 35 |AcSC7s10461e1465| 00000040 38 3b 6c 6f 63 75 73 5f 74 61 67 3d 22 41 54 59 |8;locus_tag="ATY| 00000050 34 30 5f 72 52 4e 41 63 53 43 37 73 31 30 34 36 |40_rRNAcSC7s1046| 00000060 31 65 31 34 36 35 38 22 0a |1e14658".| 00000069

it looks like the characters are the same. i tried downloading another copy of bedtools (didn't work), but new copies of the ref.fa and ref.gff have worked. not sure why...maybe the error is misleading? thank you for your help.

ADD REPLYlink written 2.4 years ago by from the mountains110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 798 users visited in the last hour