I have some paired-end reads which are messing up my assembly process. Some read pairs span two distinct genes, which confuses my transcript assembly software. I end up with a very long transcript merging both genes.
I have a BED file which contains the positions of all the transcripts I should reasonably get (i.e., a list of separate intervals).
I can easily identify the erroneous transcripts with bedtools intersect (a single transcript intersects two different genes).
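To make the check above concrete, here is a minimal Python sketch of the same logic, not the actual bedtools command I ran. The function name, the tuple layout (chrom, start, end, name) and the example coordinates are made up for illustration; intervals are assumed to be BED-style (0-based, half-open).

```python
def transcripts_spanning_multiple_genes(transcripts, genes):
    """Return {transcript_name: [gene_names]} for transcripts overlapping >= 2 genes.

    Both inputs are lists of (chrom, start, end, name) tuples with
    BED-style half-open coordinates. This is a simple O(n*m) scan,
    which is fine for a small set of candidate loci.
    """
    fused = {}
    for t_chrom, t_start, t_end, t_name in transcripts:
        hits = set()
        for g_chrom, g_start, g_end, g_name in genes:
            # Two half-open intervals overlap if each starts before the other ends.
            if g_chrom == t_chrom and g_start < t_end and t_start < g_end:
                hits.add(g_name)
        if len(hits) >= 2:
            fused[t_name] = sorted(hits)
    return fused


# Hypothetical example: two neighbouring genes and two assembled transcripts.
genes = [("chr2", 100, 200, "geneA"), ("chr2", 300, 400, "geneB")]
transcripts = [("chr2", 90, 410, "T1"), ("chr2", 290, 390, "T2")]
fused = transcripts_spanning_multiple_genes(transcripts, genes)
```

Here only T1 would be reported, since it overlaps both geneA and geneB.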
Now, is there a way to find which pairs of reads are causing this issue, and how can I delete them?
If I consider only the fused genes, is there a way to delete a pair of aligned reads whose inner mate distance is greater than a given value?
I would prefer to delete them directly from my current sorted BAM files rather than from the FASTQ files I used (the alignment and sorting process is a bit long in my case).
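The kind of filter I have in mind could be sketched like this in Python, working on SAM text streamed out of the BAM file. Note the assumptions: it uses the TLEN column (field 9 of the SAM record) as a stand-in for the inner mate distance, which is not exactly the same thing (TLEN is the outer distance and includes any spliced introns), and the function name and threshold are made up. Pairs whose mates map to different chromosomes have TLEN 0 and are kept by this sketch.

```python
import sys


def filter_sam_by_tlen(lines, max_tlen):
    """Yield SAM lines, dropping alignments whose |TLEN| exceeds max_tlen.

    Both mates of a pair carry the same TLEN magnitude (opposite signs),
    so filtering line by line removes the two mates together.
    """
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        tlen = abs(int(fields[8]))  # TLEN is the 9th SAM column
        if tlen <= max_tlen:
            yield line


# Intended use as a stream filter (file names and threshold are placeholders):
#   samtools view -h sorted.bam | python filter_tlen.py 5000 | samtools view -b -o filtered.bam -
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.writelines(filter_sam_by_tlen(sys.stdin, int(sys.argv[1])))
```

Because the records are only dropped and never reordered, the output should stay coordinate-sorted, so the alignment and sorting steps would not need to be repeated.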
Thanks a lot
An example: the top annotation is mm10, the bottom one is obtained de novo. You are looking at olfactory receptor genes, which are close together and similar in sequence.