I would like to know which of my genes are enriched with repeats of LINE/SINE/ERV etc. elements.
I have a bam file and the repeats in bed format.
As far as I know BAM files contains aligned data for each short read sequence from the fastq file. I am trying to figure out what is the best approach to know which genes (+- 1000 bp) have more repeats elements.
I am thinking about two approaches to implement but not sure which one is the best. here are the approaches i was thinking to use
a) Shall I convert the bam file into bed file and then use bedtools merge. So that I can overlap with the repeats file using bedtools window -c -l -r option. And I know how many of the repeats are overlapping or near by the short reads. Then count this number for each gene.
chr start end gene number_of_repeats chr1 100 200 gene1 70 chr1 190 240 gene1 40 chr1 250 400 gene1 100 chr2 500 600 gene2 150
if i sort and merge them i will get
chr1 100 240 gene1 90 chr1 250 400 gene1 100 chr2 500 600 gene2 150
So gene1 will have 190 (90 + 100) and gene 2 will have 150 number of repeats.
b) shall I count the number of repeats which for each short sequence without any merging? so i will also get some insight into the gene counts vs .number of repeats?
For example using the same example above, i will get
for gene1 210 (70 + 40 +100) and for gene2 150 number of repeats.
Am i on the completely wrong track and should think a better solution?