Question: Remove paired end reads joining two distinct genes
gravatar for cyril-cros
5.7 years ago by
cyril-cros910 wrote:

I have some paired end reads which are messing up my assembly process. Some reads span across two distinct genes, which confuses my transcript assembly software. I end up with a very long transcript merging both genes.

I have a bed file which contains the positions of all the transcripts I should reasonnably get (ie, a list of separate intervals).

I can easily identify the erroneous transcripts with bedtools intersect (a single transcript intersect two different genes).
Now, is there a way to find which pairs of reads are causing this issue, and how can I delete them?
If I consider only the fused genes, is there a way to delete a pair of aligned reads whose inner mate distance is greater than a given value?

I would rather delete them directly from my current sorted bam files rather than from the FastQ file I used (the alignment and sorting process is a bit long in my case).
Thanks a lot

An example: top annotation is mm10, bottom is obtained de novo. You are looking at olfactory receptor genes which are close and similar. 

rna-seq paired-end reads • 1.4k views
ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by cyril-cros910

What program are you using to generate the erroneous transcripts?

ADD REPLYlink written 5.7 years ago by doliv0710

In this case it is IsoSCM However, a previous assembly was realized using Cufflinks in an article. According to their materials section: ad hoc perl scripts were used to further refine the gene models produced for VR and OR genes, deleting those predictions that fuse adjacent receptor genes or that are antisense to the annotated gene.

I have created various alignments with more or less stringent limits on maximal intron size (7,10 and 25kb respectively), using STAR. I can choose to only consider models derived from a specific alignment, BUT here I would rather like to exclude a few faulty reads.
I have to pick between fused genes and a limit on intron size which might affect valid transcripts.

It is an effective approach, but in IGV it is very easy to see information on reads (like the position of a mate if it has been aligned). I have just no idea how to access these informations (apart from using a parser on the sam file which would be ugly).

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by cyril-cros910

Ok, Bamutils has a filter option which allows me to exclude reads not in a region. I can try to consider only one BAM file with 25kb introns, and for each fused genes, take a window of say 10kb. It is still the same idea as before...

ADD REPLYlink written 5.7 years ago by cyril-cros910
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1647 users visited in the last hour