Why does cufflinks split this transcript?
0
1
Entering edit mode
4.7 years ago
corend ▴ 70

I have a RNA-seq dataset. I used Tophat and Cufflinks to align to my reference genome and make an assembly of my transcripts driven by a GTF of my genome.

I have an interest gene (9 exons), I am looking if it is expressed in my data. When I look at it in the assembly, I found it split in 2 parts. The 3 first exons in a transcript. And the 6 last exons in another transcript. I was thinking that the expression level was too low to make a good assembly of this transcript. But when I look at the sashimiplot from the .bam file used for the assembly, here is what I have:

I have enough reads supporting the junction, but cufflinks reports 2 transcripts.

Any idea about why is this happening?

EDIT: Could it be comming from my parameters?

-u
--min-frags-per-transfrag 10
--max-multiread-fraction 0.99
--trim-3-avgcov-thresh 5
--trim-3-dropoff-frac=0.1
--overlap-radius 50

RNA-Seq cufflinks assembly • 2.0k views
ADD COMMENT
1
Entering edit mode

There supposingly are too few reads to support the junction between exon 3 and 4. Try to reduce the threshold (I guess it's --min-frags-per-transfrag) to a lower value, see what happens!

ADD REPLY
1
Entering edit mode

that's the junction with the most reads in the whole transcript

do you know if these reads are uniquely mapped?

ADD REPLY
0
Entering edit mode

I just filtered out multimapped reads, and I have 67 reads supporting the junction instead of 77. It seems that this is not a multimapping problem :/

ADD REPLY
0
Entering edit mode

did you check how do they overlap the junction? maybe a majority of them only has few bases on one exon...

ADD REPLY
0
Entering edit mode

Nope, they overlap very well (not worse than any other junctions). I changed all parameters one by one, I always get this transcript split in2. The only way to have it full is to use faux-reads from a guide annotation (and I don't want to use a guide annotation).

ADD REPLY
1
Entering edit mode

I'm sorry I don't have any other ideas as I'm not an expert. For transcript quantification cufflinks has long been known for being sub-optimal, maybe some other software that is specifically built for assembly like SPAdes or Trinity would be a better choice?

ADD REPLY
0
Entering edit mode

Sorry, I thought you asked the next questions about N-stretches. I answered about why I use cufflinks and not other programs in the next comment. Thank you for the help!

ADD REPLY
0
Entering edit mode

How is the reference in that region? Is there an N-stretch in between?

ADD REPLY
0
Entering edit mode

Nope, there is not a single N in that region. I could use another aligner, but cufflinks is part of a pipeline I am using (not mine).

I retrieved the command line that runs cufflinks, the bam used, other files, etc...(in the pipeline code). I reran cufflinks separately from the pipeline changing parameters and I get the same result, with this transcript unassembled.

Anyway, I will keep in mind that I have to check when a transcripts is badly assembled if there are reads supporting the junction.

If you have any other ideas, you are welcome! Thanks for the help :)

ADD REPLY
0
Entering edit mode

The reads you're using... how did you filter them? Did you keep only primary alignments (-F 0x0100) and proper pairs (-f 0x2)? Maybe it's not connecting those two exons because not all the reads that map there are not primary/properly paired.

ADD REPLY
0
Entering edit mode

I did not filter them. When am I supposed to use -F 0x0100 and -f 0x2?

ADD REPLY
1
Entering edit mode

samtools view options. See: http://www.htslib.org/doc/samtools.html

ADD REPLY
0
Entering edit mode

I just filtered my bam and checked manually, they are still there. I also checked manually on igv, those junction reads are not secondary and are proper pair.

ADD REPLY
0
Entering edit mode

Do they terminate before a splicing start site (f.e. "GT") and begin again after a splicing end site (f.e. "AG")? If not, maybe that's the reason.

ADD REPLY
0
Entering edit mode

It is :

XXXXGA GTXXXXXXXXAG CGXXXXX

Exon | intron | Exon

ADD REPLY
1
Entering edit mode

I think it could be because the two splicing sites have a couple of point mutations with respect to the U2 and U12 known splice sites, and even with that, they seem to look at each other and not to be consecutive:

XXXXGA TGXXXXXXXXGA CGXXXXX
-->            <--


Bottom line being:

The reads map in split mode across the intron, on the two flanking exons, so clearly the transcript belongs together, but the prediction of the gene model based on splice sites cannot do so, because the sites face each other and that's not a likely thing to observe wihtin a gene.

ADD REPLY
1
Entering edit mode

sorry but are you sure the splice site isn't correct? GU at the start of the intron and AG at its end is the standard thing, isn't it?

ADD REPLY
0
Entering edit mode

Actually, you're right. I thought they were 3 exons with the splice sites shown. If the middle one is an intron, then my answer is not correct!

ADD REPLY
1
Entering edit mode

sorry for being a killjoy ;)

ADD REPLY
0
Entering edit mode

I guess you are right, if you want to post this as an answer I could validate it. Anyway, it means that on other transcripts, this problem might not be always present so it is ok for me. Thanks a lot :)

ADD REPLY

Login before adding your answer.

Traffic: 1213 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6