I am trying to annotate some genomic features that are found within 20kb of host genes. I have a bed file containing the genes, and another with the features. So far, I have been able to identify the overlap between features and genes using bedtools window:
bedtools window -w 20000 -a genes.bed -b features.bed
However, I am now exploring how these features have altered the landscape around genes. To do this, I want to identify all the features found within 20KB of HOST sequence. I.e if there was 5000bp of features within the 20kb window, I would like the window to be 25kb, where it then needs to keep expanding until 20kb of host sequence has been considered, excluding any sequence contributed by features.
I had thought about doing this iteratively using bedtools window, where I would start with 20kb, count the features bases in each flank, then add that many to each locus. But I have hit issues with downstream flanks of one gene hitting upstream of another, and also hitting gene loci. I have been told that there might be solutions in Pybedtools or Pyranges, but I am not too great at python (currently learning!) and so am a bit stuck.
I am not really sure where to start, or what the best solution for this might be. I would be really grateful for any pointers or solutions that anyone might come up with!
Update: I think there might be a better way to do this than with bedtools. The datasets I have are as follows:
genome.fa
>chr_1
ACCCTTTGAGAGAGACTTTGAGAGTGCGATGACGTTTTGCAAAAAGTGCA
repeats.bed
chr_1 5 10
chr_1 18 24
genes.bed
chr_1 30 38
For example with the gene spanning 30 to 38, I would like to find all the repeats within 20000 bases, where the 20000 is excluding bases marked as repeats, and also where the counter for flanking sequence stops if another gene is in the way.
Therefore, if I was looking for the flanking region, for example, 20 bases either side of gene spanning 30 to 38, the desired output would be the coordinates of the flanks containing 20 bases of host sequence. If I look just upstream of this gene, there is a repeat at 18 to 24 (6 bases), so up to this repeat, 6 bases are comprised of repeat (counter for flank is at 6). Skipping the repeat, the counter can start again at 18 until I hit the repeat finishing at base 10 (another 8 bases of host for a total of 14). Skipping the repeat, the counter again starts at base 5 and the host sequence terminates at the start of the fasta entry, (5 bases) for a total of 19 in that flank. Going the other way, there are no repeats, so the flank can stay as 20 bp of host sequence. If there was another gene, this flank would have to terminate at the next gene.
Desired output:
chr_1 0 30 upper_flank_chr_1_30_38
chr_1 38 58 lower_flank_chr_1_30_38