From a genome FASTA file I would like to create a BED file containing the start and end coordinates of all runs of >= 50 Ns, plus 1 kb of flanking sequence on either side.
So for example, if there were 100 consecutive Ns at position:
Chr1 1050 1150
Then the output region in the BED file would be:
Chr1 50 2150
This region contains the 100 Ns plus the 1 kb flanks on either side. I'd like to output a list of all such regions in the genome.
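To make the goal concrete, here is a minimal Python sketch of the logic I have in mind; it is an illustration under stated assumptions, not an existing tool. The function names (`read_fasta`, `n_run_regions`, `fasta_to_bed`) are hypothetical. It assumes BED-style 0-based, half-open coordinates, treats lowercase `n` as N, clamps flanks at the sequence ends, loads each sequence fully into memory, and does not merge regions that overlap.

```python
import re
import sys


def read_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Use the first whitespace-delimited token as the sequence name.
            name = parts[0] if parts else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_run_regions(seq, min_run=50, flank=1000):
    """Yield (start, end) for each run of >= min_run Ns, padded by flank.

    Coordinates are 0-based, half-open, and clamped to [0, len(seq)].
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    for match in pattern.finditer(seq):
        start = max(0, match.start() - flank)
        end = min(len(seq), match.end() + flank)
        yield start, end


def fasta_to_bed(in_handle, out_handle, min_run=50, flank=1000):
    """Write one tab-separated BED line per padded N run."""
    for name, seq in read_fasta(in_handle):
        for start, end in n_run_regions(seq, min_run, flank):
            out_handle.write(f"{name}\t{start}\t{end}\n")


# Hypothetical usage: python n_runs.py genome.fa > n_runs.bed
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fasta:
        fasta_to_bed(fasta, sys.stdout)
```

With the example above (100 Ns at Chr1 1050 to 1150) this would give Chr1 50 2150. Runs that sit close together would produce overlapping lines, which would presumably need merging afterwards.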
Could anyone suggest an approach or software for this? I've seen a few packages for identifying sequence strings in a genome, such as WordCluster, but I suspect there might be an existing tool specifically for this task, as I imagine it's quite a common filtering approach.
Thanks in advance for any suggestions!