Remove paired end reads joining two distinct genes
0
0
Entering edit mode
8.9 years ago
cyril-cros ▴ 950

Hi,

I have some paired end reads which are messing up my assembly process. Some reads span across two distinct genes, which confuses my transcript assembly software. I end up with a very long transcript merging both genes.

I have a bed file which contains the positions of all the transcripts I should reasonably get (i.e., a list of separate intervals).

I can easily identify the erroneous transcripts with bedtools intersect (a single transcript intersect two different genes). Now, is there a way to find which pairs of reads are causing this issue, and how can I delete them? If I consider only the fused genes, is there a way to delete a pair of aligned reads whose inner mate distance is greater than a given value?

I would rather delete them directly from my current sorted bam files rather than from the FastQ file I used (the alignment and sorting process is a bit long in my case).

Thanks a lot

image: http://imgur.com/o49vAzb

An example: top annotation is mm10, bottom is obtained de novo. You are looking at olfactory receptor genes which are close and similar.

RNA-Seq paired-end • 1.9k views
ADD COMMENT
0
Entering edit mode

What program are you using to generate the erroneous transcripts?

ADD REPLY
0
Entering edit mode

In this case it is IsoSCM. However, a previous assembly was realized using Cufflinks in an article. According to their materials section: ad hoc perl scripts were used to further refine the gene models produced for VR and OR genes, deleting those predictions that fuse adjacent receptor genes or that are antisense to the annotated gene.

I have created various alignments with more or less stringent limits on maximal intron size (7,10 and 25kb respectively), using STAR. I can choose to only consider models derived from a specific alignment, BUT here I would rather like to exclude a few faulty reads.

I have to pick between fused genes and a limit on intron size which might affect valid transcripts.

It is an effective approach, but in IGV it is very easy to see information on reads (like the position of a mate if it has been aligned). I have just no idea how to access this information (apart from using a parser on the sam file which would be ugly).

ADD REPLY
0
Entering edit mode

Ok, Bamutils has a filter option which allows me to exclude reads not in a region. I can try to consider only one BAM file with 25kb introns, and for each fused genes, take a window of say 10kb. It is still the same idea as before...

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6