lowercase fasta coordinates to GFF
2
1
Entering edit mode
9.1 years ago

I have fasta file which has been soft-masked to lowercase. Now, I would like to get GFF (BED) file with coordinates of all soft-masked sequences. I was wondering if such a tool is already available. Thank you.

lowercase fasta GFF • 4.1k views
ADD COMMENT
5
Entering edit mode
9.1 years ago

Using gnu flex? (not fully tested, should work)

ADD COMMENT
0
Entering edit mode

Thanks for an answer! I haven't used flex before, could you please explain in greater detail how to install/run this code? (flex: can't open jeter.l)

ADD REPLY
1
Entering edit mode

in the current context GNU flex runs like a awk for C.

for example

^>.*\n

is a pattern for the a fasta header.

ADD REPLY
0
Entering edit mode

Perfect! Thank you very much. I will just add note for the others that code reports are 0-based.

ADD REPLY
0
Entering edit mode

sorry I renamed the file when copy+paste, that should be flex code.l not flex jeter.l

ADD REPLY
0
Entering edit mode

Hi,

I'm trying to do the same but while testing the above script, I figured that I'm not getting the coordinates of the softmasked region at the end of the sequence.

For example, testing the script on the following three cases,

>QWEQ
qweqertADEADASDAWDW
>ASAD
ASQWEQQqweqeASAasa
>adad
asasasdaa

I get back,

QWEQ    0       7
ASAD    7       12
ASAD    15      18

But how would I change the flex script to give back the coordinates of the sequence "adad"?

Thanks,

Cheers,

ADD REPLY
0
Entering edit mode

ah yes, I forgot the trailing lowercase letters. I've added a <<EOF>> condition.

ADD REPLY
0
Entering edit mode

This is very useful, but how do I run it with a locally stored .fasta file? I tried the following and I got an empty co-ordinate file and it doesn't finish running either. Also, is there a way to use multiple threads to run this?

flex code.l && \

gcc lex.yy.c && \

./a.out genome_softmasked.fasta > co-ords.gff

ADD REPLY
0
Entering edit mode
3.9 years ago
bcontreras ▴ 10

You can do this in Perl in two steps:

1) convert your FASTA file to 1 sequence per line:

perl -lne 'if(/^(>.*)/){ $head=$1 } else { $fa{$head} .= $_ } END{ foreach $s (sort(keys(%fa))){ print "$s\n$fa{$s}\n" }}'  infile.fasta > infile.1line.fasta

2) extract lower-case segments from that FASTA file:

perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]+)/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' infile.1line.fasta > infile.bed

In case you only want sements of some min length, 60 in the example:

perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]{60,})/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' infile.1line.fasta > infile.60.bed
ADD COMMENT

Login before adding your answer.

Traffic: 1737 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6