Question

Extracting interval from fasta header

0

6.2 years ago

rbronste ▴ 420

Hi,

I'm wondering what the most straightforward way is to extract the interval information contained in a FASTA header such as the one below. Ideally, I would also like to pipe the result into a newly created BED file. Thanks!

>Mouse|chr12:112380949-112381824 | element 3 | positive  | neural tube[4/4] | hindbrain (rhombencephalon)[4/4] 
aAaaGAATAAGGCTCTGACATGTTATCCttaagagatttactgttatctgcctgtttcatgtttgctttttctttgacacagaatgtaatgcagcccagactggcctcaacttagtatgt

fasta header bed interval • 1.6k views

updated 6.2 years ago by Alex Reynolds 35k • written 6.2 years ago by rbronste ▴ 420

0

Hello,

We are currently developing SEDA (http://www.sing-group.org/seda/), an application for processing FASTA files.

Among its functions, the "Rename header" option (see section 3.7 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) was created precisely for what you are looking for. Please have a look at it, and do not hesitate to contact us if you need further help.

With best regards,

Hugo.

updated 6.2 years ago by GenoMax 141k • written 6.2 years ago by Hugo ▴ 380

score 3 · Answer 1 · 2018-01-21

3

6.2 years ago

venu 7.1k

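You can use the following one-liner:

grep '^>' fasta_file.fa | cut -d '|' -f2 | tr -d ' ' | tr ':-' '\t\t' > fasta_to_bed.bed

This keeps only the header lines, takes the second |-delimited field, strips the spaces around it, and converts the ':' and '-' separators to tabs. The result is a file of tab-separated BED intervals: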

chr12   112380949   112381824

P.S. You may also want to sort the BED file.
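For the sorting step, here is a minimal sketch using plain sort rather than a BED-specific tool. The file names are placeholders, and two of the three intervals are made up so that the reordering is visible; only the chr12:112380949-112381824 interval comes from the question.

```shell
# Hypothetical unsorted BED file; only the long chr12 interval is from the question.
printf 'chr2\t500\t900\nchr12\t112380949\t112381824\nchr12\t100\t200\n' > example.bed

# Sort by chromosome name (byte order), then by start coordinate (numeric).
LC_ALL=C sort -k1,1 -k2,2n example.bed > example.sorted.bed

cat example.sorted.bed
```

Note that byte-order sorting places chr12 before chr2. That is the usual convention for sorted BED input, but it is not "natural" chromosome order.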

score 1 · Answer 2 · 2018-01-21

You can use awk and BEDOPS sort-bed to generate a sorted BED file from the intervals in the headers:

$ awk -v OFS="\t" -v FS="|" '/^>/ { gsub(/ /, "", $2); split($2, a, /[:-]/); print a[1], a[2], a[3]; }' in.fa | sort-bed - > out.bed

If you use OS X, use Homebrew to install GNU awk via: brew install gawk, and replace awk with gawk.
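As a worked example, here is a rough sketch of the awk approach run on the header from the question. The scratch file names are placeholders, and the fallback to plain sort when sort-bed is not on the PATH is an assumption of this sketch, not part of BEDOPS. Spaces are stripped from the second field first, because the header has a space before the next "|".

```shell
# Sample header from the question (shortened), written to a scratch file.
printf '%s\n' \
  '>Mouse|chr12:112380949-112381824 | element 3 | positive' \
  'aAaaGAATAAGG' > in.fa

# Split header lines on "|", clean field 2, then split it on ":" or "-".
awk -F'|' -v OFS='\t' '/^>/ {
    gsub(/ /, "", $2)        # drop the space that follows the interval
    split($2, a, /[:-]/)     # a[1]=chromosome, a[2]=start, a[3]=end
    print a[1], a[2], a[3]
}' in.fa > unsorted.bed

# Prefer sort-bed when available; otherwise fall back to plain sort (assumption).
if command -v sort-bed >/dev/null 2>&1; then
    sort-bed unsorted.bed > out.bed
else
    LC_ALL=C sort -k1,1 -k2,2n unsorted.bed > out.bed
fi

cat out.bed
```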