Extracting interval from fasta header
6.2 years ago
rbronste ▴ 420

Hi,

I'm wondering what the most straightforward way is to extract the interval information contained in a FASTA header such as the one below. Ideally, the result would be piped into a newly created BED file. Thanks!

>Mouse|chr12:112380949-112381824 | element 3 | positive  | neural tube[4/4] | hindbrain (rhombencephalon)[4/4] 
aAaaGAATAAGGCTCTGACATGTTATCCttaagagatttactgttatctgcctgtttcatgtttgctttttctttgacacagaatgtaatgcagcccagactggcctcaacttagtatgt
fasta header bed interval • 1.6k views

Hello,

we are currently developing SEDA (http://www.sing-group.org/seda/), an application for easily processing FASTA files.

Among its functions, the "Rename header" option (see section 3.7 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) was created for exactly this kind of task. Please have a look at it, and do not hesitate to contact us if you need further help.

With best regards,

Hugo.

6.2 years ago
venu 7.1k

You can use the following one-liner:

grep '^>' fasta_file.fa | cut -d "|" -f2 | sed -e 's/[[:space:]]//g' -e 's/:/\t/' -e 's/-/\t/' > fasta_to_bed.bed

This generates a file of tab-separated BED intervals (the \t escape in sed requires GNU sed):

chr12   112380949   112381824

P.S. You may also want to sort the BED file.
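
As a minimal sketch of that sorting step, assuming plain `sort` is acceptable and that lexicographic chromosome order (chr12 before chr2) is what downstream tools expect, something like the following should work. The file names `example.bed` and `example.sorted.bed` and the extra intervals are made up for illustration:

```shell
# Build a small unsorted BED file (hypothetical intervals, for illustration only).
printf 'chr2\t500\t600\nchr12\t112380949\t112381824\nchr12\t100\t200\n' > example.bed

# Sort by chromosome name (byte order), then by start position (numeric).
LC_ALL=C sort -k1,1 -k2,2n example.bed > example.sorted.bed

cat example.sorted.bed
```

Setting `LC_ALL=C` keeps the chromosome ordering independent of the local locale settings.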

6.2 years ago

You can use awk and BEDOPS sort-bed to generate a sorted BED file from the intervals in the headers:

$ awk -v OFS="\t" -v FS="|" '/^>/ { gsub(/[ \t]/, "", $2); split($2, a, /[:-]/); print a[1], a[2], a[3]; }' in.fa | sort-bed - > out.bed

If you use OS X, you can install GNU awk with Homebrew (brew install gawk) and replace awk with gawk in the command above.
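
To try the approach without installing BEDOPS, here is a rough, self-contained sketch that runs on the header from the question. It substitutes plain `sort` for `sort-bed`, strips the space that precedes the next `|` in the header, and assumes chromosome names contain no `:` or `-`. The file names `header_example.fa` and `header_example.bed` are made up, and the sequence is truncated:

```shell
# Write the example record from the question to a test file (sequence truncated).
cat > header_example.fa <<'EOF'
>Mouse|chr12:112380949-112381824 | element 3 | positive  | neural tube[4/4] | hindbrain (rhombencephalon)[4/4]
aAaaGAATAAGGCTCTGACATGTTATCC
EOF

# For each header line: take the second "|"-delimited field, remove any
# whitespace from it, split it on ":" and "-", and print three BED columns.
awk -v OFS='\t' -F'|' '/^>/ {
    gsub(/[ \t]/, "", $2)
    split($2, a, /[:-]/)
    print a[1], a[2], a[3]
}' header_example.fa | LC_ALL=C sort -k1,1 -k2,2n > header_example.bed

cat header_example.bed
```

Writing `-v OFS=...` with a space after `-v` follows the POSIX form of the option, so the same command should also run under non-GNU awk implementations.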
