fasta to bed format with more information that just chr locations
3
1
Entering edit mode
6.5 years ago
varsha619 ▴ 90

Hello, I have a fasta file with lines of format -

>FBti0019256 type=transposable_element; loc=2L:22300300..22304444; name=invader2{}555; dbxref=FlyBase_Annotation_IDs:TE19256,FlyBase:FBti0019256; MD5=d9259a0e33aad699215e64916bd47a5b; length=4145; release=r6.19; species=Dmel;

I would like to convert these lines into a bed file of format -

chr2L /t 22300300 /t 22304444 /t invader2

Is there a program that can directly perform this conversion or an awk command that can do this easier? Please let me know, thank you for your help.

fasta to bed • 2.3k views
ADD COMMENT
0
Entering edit mode

varsha619 : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.
Upvote|Bookmark|Accept

ADD REPLY
4
Entering edit mode
6.5 years ago

use an associative array in awk to store each component.

I'm just too lazy to extract the chrom/start/end, you get the idea.

awk -F ' ' '/^>/ {delete map;for(i=1;i<=NF;i++) {eq=index($i,"=");if(eq==0) continue;key=substr($i,1,eq-1);val=substr($i,eq+1); gsub(/;$/,"",val); map[key]=val;} OFS="\t"; print map["loc"],map["type"],map["name"],map["dbxref"],map["MD5"];}'
ADD COMMENT
4
Entering edit mode
6.5 years ago
Daniel ★ 4.0k

My approach is lazier... Just split on the punctuation, pull out the columns you want, and separate by "\t".

awk -F'[ *:.]' -v OFS="\t" '{print "chr"$5,$6,$8,$13}' tmp.txt

# Output
chr2L   22300300    22304444    invader2
ADD COMMENT
0
Entering edit mode

OP wants the output file to be in BED format with columns separated by tabs \t.

ADD REPLY
0
Entering edit mode

@Daniel, would you mind explaining how the columns $5 and $10 are defined, when I run the script it does not output the right values. And yes I would like the output values to be tab separated. Thanks!

ADD REPLY
1
Entering edit mode

The $n parameters are the columns when you divide the file up by what is inside the -F'[ ]' box, so if you split the line every time you saw a space, asterix, colon, or period then you'd want the 5th, 6th, 8th and 13th columns to be printed.

ADD REPLY
0
Entering edit mode

Huh, I was certain that bed format was colon and hyphen but the internet doesn't support that... Maybe I'm going mad. Ok, updating.

Edit: I was thinking of UCSC browser formatting apparently.

ADD REPLY
2
Entering edit mode
6.5 years ago

Assumption is that chromosomes are numbered, no x and y chromosomes. For fasta headers only, user can use grep

$ sed 's/.*loc.\(\b.*\b\):\(\b.*\b\)\.\.\(\b.*\b\);.*=\(\b.*\b\){.*/chr\1\t\2\t\3\t\4/g' test.txt 
chr2L   22300300    22304444    invader2

$ cut -f2,3 -d";" test.txt| cut -d= -f2,3 | awk -v OFS="\t" -F':|=|;|\\..|{' '{print "chr"$1,$2,$3,$5}'
chr2L   22300300    22304444    invader2

$ cat test.txt 
>FBti0019256 type=transposable_element; loc=2L:22300300..22304444; name=invader2{}555; dbxref=FlyBase_Annotation_IDs:TE19256,FlyBase:FBti0019256; MD5=d9259a0e33aad699215e64916bd47a5b; length=4145; release=r6.19; species=Dmel;

1000 bash cuts:

$ cut -f2,3 -d";" test.txt| cut -d= -f2,3 | cut -f1,2 -d: --output-delimiter=$'\t'| cut -f1,2,3 -d'.' --output-delimiter=$'\t' | cut -f1,2 -d';' --output-delimiter=$'\t'| cut -f1,2 -d'=' --output-delimiter=$'\t' |  cut -f1,2 -d'{' --output-delimiter=$'\t'| cut -f1,2,3,4,6

2L  22300300        22304444    invader2
ADD COMMENT
0
Entering edit mode

Thank you! I liked the 2nd method the best and seemed to be the easiest to use - cut -f2,3 -d";" test.txt| cut -d= -f2,3 | awk -v OFS="\t" -F':|=|;|\..|{' '{print "chr"$1,$2,$3,$5}'

ADD REPLY
1
Entering edit mode

Actually, I think @ Daniel's one is best awk solution for OP requirement. In line with @Daniel's solution below, following awk solution is better than one using two cuts above :).

$ awk -v OFS="\t"  -F';|=|:|\\..|{' '/^>/ {print "chr"$4,$5,$6,$8}' test.txt
chr2L   22300300    22304444    invader2

input:

>FBti0019256 type=transposable_element; loc=2L:22300300..22304444; name=invader2{}555; dbxref=FlyBase_Annotation_IDs:TE19256,FlyBase:FBti0019256; MD5=d9259a0e33aad699215e64916bd47a5b; length=4145; release=r6.19; species=Dmel;
ADD REPLY

Login before adding your answer.

Traffic: 874 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6