Which Of The Genes Are Enriched With Repeat Elements
0
2
8.5 years ago
roll ▴ 330

I would like to know which of my genes are enriched for repeat elements such as LINE, SINE and ERV.

I have a BAM file and the repeats in BED format.

As far as I know, a BAM file contains the alignment of each short read from the FASTQ file. I am trying to figure out the best approach for finding which genes (±1000 bp) have more repeat elements.

I am thinking about two approaches but am not sure which one is better. Here they are:

a) Should I convert the BAM file to BED and run bedtools merge on the reads, then overlap the merged intervals with the repeats file using bedtools window with the -c, -l and -r options? That would tell me how many repeats overlap or lie near the short reads, and I could then sum this number for each gene.

For example,

chr   start  end gene number_of_repeats
chr1 100  200  gene1 70
chr1 190  240  gene1 40
chr1 250  400  gene1 100
chr2 500  600  gene2 150


If I sort and merge them, I get:

chr1 100  240  gene1 90
chr1 250  400  gene1 100
chr2 500  600  gene2 150


So gene1 will have 190 repeats (90 + 100) and gene2 will have 150.

Or

b) Should I count the number of repeats for each short read without any merging? That would also give me some insight into read counts per gene vs. number of repeats.

Using the same example as above, I get:

210 repeats for gene1 (70 + 40 + 100) and 150 for gene2.
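To make the difference between (a) and (b) concrete, here is a minimal Python sketch on toy data. The read and repeat coordinates are made up for illustration (they are not the numbers from the table above), and `merge_intervals` only mimics the default behaviour of bedtools merge (overlapping or book-ended intervals are joined, strand ignored); it is not a replacement for it.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_overlaps(region, repeats):
    """Number of repeats sharing at least one base with the region."""
    start, end = region
    return sum(1 for r_start, r_end in repeats if r_start < end and r_end > start)


# Hypothetical toy data on a single chromosome.
reads = [(100, 200), (190, 240), (250, 400)]                    # reads assigned to gene1
repeats = [(150, 160), (195, 198), (230, 235), (300, 310)]      # repeat annotations

# Approach (b): count per read, no merging. A repeat hit by two reads is counted twice.
unmerged_total = sum(count_overlaps(read, repeats) for read in reads)

# Approach (a): merge the reads first, then count. Each repeat is counted once per merged block.
merged_reads = merge_intervals(reads)
merged_total = sum(count_overlaps(block, repeats) for block in merged_reads)

print("merged blocks:", merged_reads)
print("unmerged count:", unmerged_total)
print("merged count:", merged_total)
```

In this toy case the repeat at 195-198 is covered by two overlapping reads, so the unmerged count is higher than the merged one. The unmerged number therefore mixes read depth with repeat content, while the merged number reflects only how many repeats fall in the covered regions.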

Or

Am I on completely the wrong track, and should I be thinking of a better solution?

bam repeats gene bedtools • 3.1k views
3

That's a tricky question (see my question "Statistics: Tandem repeat enrichment between two sets of sequences"). There are different ways to approach such a problem: for example, you can count the number of repeats per gene, or compute the percentage of the gene sequence covered by repeats. Gene length also matters (in your example gene1 = 290 bp and gene2 = 100 bp).
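The two measures mentioned above (repeats per gene, and percentage of the gene covered by repeats) can be sketched in Python as follows. This is only an illustration under assumptions: coordinates are half-open BED-style, the gene and repeat records are invented, and normalising the count to repeats per kb is one possible way to account for gene length, not something prescribed here.

```python
def covered_bases(start, end, repeats):
    """Bases of [start, end) covered by at least one repeat (overlapping repeats counted once)."""
    # Clip each overlapping repeat to the gene boundaries.
    clipped = sorted(
        (max(r_start, start), min(r_end, end))
        for r_start, r_end in repeats
        if r_start < end and r_end > start
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def repeat_stats(genes, repeats):
    """Per gene: repeat count, repeats per kb, and fraction of the gene covered by repeats."""
    stats = {}
    for name, (chrom, start, end) in genes.items():
        same_chrom = [(s, e) for c, s, e in repeats if c == chrom]
        hits = [(s, e) for s, e in same_chrom if s < end and e > start]
        length = end - start
        stats[name] = {
            "count": len(hits),
            "per_kb": 1000 * len(hits) / length,
            "fraction_covered": covered_bases(start, end, same_chrom) / length,
        }
    return stats


# Hypothetical toy annotation.
genes = {"gene1": ("chr1", 100, 400), "gene2": ("chr2", 500, 600)}
repeats = [
    ("chr1", 150, 200),
    ("chr1", 180, 250),   # overlaps the previous repeat
    ("chr1", 390, 450),   # extends past the gene end, so it is clipped
    ("chr2", 520, 540),
]

stats = repeat_stats(genes, repeats)
for name, values in stats.items():
    print(name, values)
```

With this toy data both genes have the same repeat density per kb, but a different fraction of their sequence is covered, which is why the choice of measure matters.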

0

Would you merge the sequences or use them as they are? Considering sequence length is a good point. My first thought is to normalize the number of repeats by sequence length. Could I then count these and correlate them with gene expression, for example? Did you find out how they obtained the figures in the paper that you shared in your post?

0

When merging, does it matter whether the reads are on the same strand?

1

The repeats are a property of the gene. Why do you bring short read sequence data in?
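In other words, the count can come straight from the gene annotation and the repeat annotation, with no reads involved. A rough Python sketch, assuming BED-style half-open coordinates, the ±1000 bp flank from the question, and invented gene and repeat records (the tuple layout is hypothetical, not a real file format parser):

```python
from collections import Counter

FLANK = 1000  # bp added on each side of the gene, as in the question


def repeats_near_genes(genes, repeats, flank=FLANK):
    """Count repeats per family that overlap each gene extended by `flank` bp."""
    counts = {}
    for name, (chrom, start, end) in genes.items():
        # Extend the gene, clamping at the start of the chromosome.
        w_start, w_end = max(0, start - flank), end + flank
        counts[name] = Counter(
            family
            for r_chrom, r_start, r_end, family in repeats
            if r_chrom == chrom and r_start < w_end and r_end > w_start
        )
    return counts


# Hypothetical toy annotation: (chrom, start, end) per gene,
# (chrom, start, end, family) per repeat.
genes = {"gene1": ("chr1", 5000, 8000), "gene2": ("chr2", 500, 600)}
repeats = [
    ("chr1", 4200, 4500, "LINE"),  # in the upstream flank of gene1
    ("chr1", 6000, 6300, "SINE"),  # inside gene1
    ("chr1", 9500, 9800, "ERV"),   # beyond the downstream flank of gene1
    ("chr2", 100, 300, "SINE"),    # in the upstream flank of gene2
    ("chr1", 100, 300, "SINE"),    # far from gene1, wrong chromosome for gene2
]

near = repeats_near_genes(genes, repeats)
for name, by_family in near.items():
    print(name, dict(by_family))
```

For real data the same thing is usually done with an interval tool rather than a nested loop; the sketch only shows the logic of one window per gene.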

0

If I understand correctly, the BAM file holds the position of each short read (i.e. I do not have one entry per gene). How can I then count the total number of repeats within or near a particular gene?

0

Why don't you first calculate some repeat statistics for all the genes (e.g. repeat coverage stratified by conservation)? Then, for the genes where you find enrichment of short reads (e.g. by peak finding), for instance at the promoter, test the repeat statistic against the rest of the genes where you could in theory find reads given the mappability of the promoter region, or compare condition 1 short reads against condition 2 short reads.
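One way such a comparison could look is a simple permutation test on a per-gene repeat statistic. This is only a sketch: the choice of a permutation test on the difference in means is an assumption (no specific test is named above), the coverage values are invented, and the background set is assumed to have already been restricted to genes with mappable promoters.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in means between two groups."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance so floating-point rounding does not hide ties.
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical repeat coverage (fraction of promoter covered by repeats).
genes_with_peaks = [0.42, 0.38, 0.51, 0.47, 0.40, 0.45]
background_genes = [0.20, 0.25, 0.18, 0.30, 0.22, 0.27]

observed_diff, p_value = permutation_test(genes_with_peaks, background_genes)
print("difference in mean coverage:", round(observed_diff, 4))
print("permutation p-value:", p_value)
```

The same function could be applied to condition 1 vs. condition 2 by swapping in the two gene sets; with many genes tested across several repeat families, a multiple-testing correction would also be needed.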

0

Should I first merge the overlapping short reads, for example with bedtools merge? Do they have to be on the same strand, or should I leave them as they are?