bedtools getfasta does not give a proper fasta output
2
0
Entering edit mode
9.0 years ago
Rohit ★ 1.5k

I have a problem with the output from bedtools getfasta, it used to work before. My bed-file looks like this -

C1_64090263 131 164 C1_64090263-1:33 V1_87574936:0 - 33 0

And I am using the command -

bedtools getfasta -fi in.fa -bed file.bed -s -name -fo out.fasta

This reference sequence is 200 bases long. But after extraction I get the sequences as follows

>C1_64090263-1:33
TCAAATTCATATGTTGAAATCTTAA>1_

Some invalid characters are formed at the end of sequence, and at the end of the file too many sequences have invalid fasta sequences such as -

>C1_70260404-1:51
\88\F4\BFH\F5\BF?\F6\BF\C8\F6\BF\88\F7\BFH\F8\BF?\F9\BF

There is no problem with the reference fasta as seqtk works on it, but the reverse-strand is also crucial for my sequences. This happens to most of the sequences on the reverse-strand,some on the plus strand too.

I tried without the name and strandedness options, and also replaced the 5th column with 0, but still it produces no proper fasta files.

Is there a solution to this problem?

fasta bedtools • 5.6k views
ADD COMMENT
1
Entering edit mode

How did you generate your FASTA file? It seems like you might have Unicode characters and/or missing line breaks between sequences and deflines.

ADD REPLY
0
Entering edit mode

The fasta file are contigs from a genome assembly. I extracted them according to their IDs with grep and processed to remove unwanted lines

ADD REPLY
1
Entering edit mode

What output do you get when you run samtools faidx [genome].fa C1_64090263-1:33?

ADD REPLY
0
Entering edit mode

samtools faidx or pyfaidx will be more sensitive to formatting issues but it's a good suggestion. In fact, I've built some good error messages into pyfaidx that might help you diagnose the issue.

ADD REPLY
0
Entering edit mode
samtools faidx [genome].fa C1_64090263-1:33

I get just the sequence header, no fasta sequence

When I run with just the sequence name, without the cordinates, I can extract the sequence.

But if it's a problem with the genome.fasta reference, why does Seqtk work?

ADD REPLY
1
Entering edit mode
2.5 years ago
Amanda ▴ 10

I know I am very late to the party here, but something a lot like this happens when the fasta index is older than the fasta, in case anyone is here looking for answers.

ADD COMMENT
0
Entering edit mode
9.0 years ago
mark.ziemann ★ 1.9k

I think this has to do with the end-of-line character (EOL). These are different for different operating systems and a constant source of frustration for programmers and analysts alike. Check out this blog post and this blog post for a description of the problem. Just because seqtk works on it doesn't mean everything will.

Can you tell us in which OS and program the fasta file was written?

There is some relevant solutions in the sed1line page.

ADD COMMENT
0
Entering edit mode

Hi Mark,

I'm using fedora 17. And the fasta file was from denovo assembly which was converted into single line fasta with a perl script.

I was trying a work around by working with the negative strand separate from + strand, extracting sequences with seqtk and then reverse complementing - strand ones.

A for-the-time-being solution but not the best.

Thanks for the links. I'll check them out and give an update.

ADD REPLY

Login before adding your answer.

Traffic: 2665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6