lowercase fasta coordinates to GFF
2
1
Entering edit mode
10.2 years ago

I have fasta file which has been soft-masked to lowercase. Now, I would like to get GFF (BED) file with coordinates of all soft-masked sequences. I was wondering if such a tool is already available. Thank you.

lowercase fasta GFF • 4.7k views
ADD COMMENT
5
Entering edit mode
10.2 years ago

Using gnu flex? (not fully tested, should work)

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static char* seqname=NULL;
static long beg=0L;
static long end=0L;
static int in_lower=0;
static void print_bed()
{
printf("%s\t%ld\t%ld\n",seqname,beg,end);
}
%}
%option noyywrap
%%
<<EOF>> {if(in_lower) print_bed(); return 0;}
^>.*\n {if(in_lower) print_bed(); free(seqname);seqname=strndup(&yytext[1],yyleng-2);beg=end=0;in_lower=0;}
[A-Z]+ {if(in_lower) {print_bed(); in_lower=0;} end+=yyleng;}
[a-z]+ {if(!in_lower) {beg=end;in_lower=1;} end+=yyleng;}
[ \t\n\r]+ ;
. {fprintf(stderr,"ERROR\n"); exit(-1);}
%%
int main(int argc,char** argv)
{
yylex();
return 0;
}
view raw code.l hosted with ❤ by GitHub
$ flex code.l && \
gcc lex.yy.c && \
curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chrUn_gl000249.fa.gz" |\
gunzip -c | ./a.out
chrUn_gl000249 583 646
chrUn_gl000249 873 1720
chrUn_gl000249 1964 2885
chrUn_gl000249 2912 3126
chrUn_gl000249 3186 3894
chrUn_gl000249 4209 4845
chrUn_gl000249 4847 4956
chrUn_gl000249 4970 5361
chrUn_gl000249 5583 5889
chrUn_gl000249 5935 6233
chrUn_gl000249 6349 6543
chrUn_gl000249 6752 6780
chrUn_gl000249 7075 7097
chrUn_gl000249 7457 7570
chrUn_gl000249 7639 7728
chrUn_gl000249 7768 8151
chrUn_gl000249 8378 9020
chrUn_gl000249 9345 9637
chrUn_gl000249 9683 9953
chrUn_gl000249 9962 10278
chrUn_gl000249 10534 10599
chrUn_gl000249 10614 10680
chrUn_gl000249 10878 10980
chrUn_gl000249 11250 11358
chrUn_gl000249 11915 12293
chrUn_gl000249 12323 12486
chrUn_gl000249 12610 12914
chrUn_gl000249 13032 13090
chrUn_gl000249 13149 13371
chrUn_gl000249 13678 14014
chrUn_gl000249 14312 14599
chrUn_gl000249 14632 14796
chrUn_gl000249 14850 14939
chrUn_gl000249 15504 15595
chrUn_gl000249 15596 16393
chrUn_gl000249 16483 16583
chrUn_gl000249 17190 17395
chrUn_gl000249 17416 17697
chrUn_gl000249 17857 17915
chrUn_gl000249 18640 19662
chrUn_gl000249 20652 20739
chrUn_gl000249 20884 20965
chrUn_gl000249 20984 21148
chrUn_gl000249 22387 22441
chrUn_gl000249 22865 23159
chrUn_gl000249 23538 23649
chrUn_gl000249 25105 25321
chrUn_gl000249 25380 25598
chrUn_gl000249 26375 26675
chrUn_gl000249 27024 27227
chrUn_gl000249 27572 27709
chrUn_gl000249 27768 28017
chrUn_gl000249 28103 28418
chrUn_gl000249 29790 29859
chrUn_gl000249 30971 31080
chrUn_gl000249 31277 31309
chrUn_gl000249 32295 32431
chrUn_gl000249 32807 32828
chrUn_gl000249 33588 33632
chrUn_gl000249 34252 34323
chrUn_gl000249 36020 36169
chrUn_gl000249 37894 37915
view raw run.txt hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode

Thanks for an answer! I haven't used flex before, could you please explain in greater detail how to install/run this code? (flex: can't open jeter.l)

ADD REPLY
1
Entering edit mode

in the current context GNU flex runs like a awk for C.

for example

^>.*\n

is a pattern for the a fasta header.

ADD REPLY
0
Entering edit mode

Perfect! Thank you very much. I will just add note for the others that code reports are 0-based.

ADD REPLY
0
Entering edit mode

sorry I renamed the file when copy+paste, that should be flex code.l not flex jeter.l

ADD REPLY
0
Entering edit mode

Hi,

I'm trying to do the same but while testing the above script, I figured that I'm not getting the coordinates of the softmasked region at the end of the sequence.

For example, testing the script on the following three cases,

>QWEQ
qweqertADEADASDAWDW
>ASAD
ASQWEQQqweqeASAasa
>adad
asasasdaa

I get back,

QWEQ    0       7
ASAD    7       12
ASAD    15      18

But how would I change the flex script to give back the coordinates of the sequence "adad"?

Thanks,

Cheers,

ADD REPLY
0
Entering edit mode

ah yes, I forgot the trailing lowercase letters. I've added a <<EOF>> condition.

ADD REPLY
0
Entering edit mode

This is very useful, but how do I run it with a locally stored .fasta file? I tried the following and I got an empty co-ordinate file and it doesn't finish running either. Also, is there a way to use multiple threads to run this?

flex code.l && \

gcc lex.yy.c && \

./a.out genome_softmasked.fasta > co-ords.gff

ADD REPLY
0
Entering edit mode
5.0 years ago
bcontreras ▴ 10

You can do this in Perl in two steps:

1) convert your FASTA file to 1 sequence per line:

perl -lne 'if(/^(>.*)/){ $head=$1 } else { $fa{$head} .= $_ } END{ foreach $s (sort(keys(%fa))){ print "$s\n$fa{$s}\n" }}'  infile.fasta > infile.1line.fasta

2) extract lower-case segments from that FASTA file:

perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]+)/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' infile.1line.fasta > infile.bed

In case you only want sements of some min length, 60 in the example:

perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]{60,})/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' infile.1line.fasta > infile.60.bed
ADD COMMENT

Login before adding your answer.

Traffic: 5566 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6