Question: Why does cufflinks split this transcript?
1
gravatar for corend
5 months ago by
corend30
corend30 wrote:

I have a RNA-seq dataset. I used Tophat and Cufflinks to align to my reference genome and make an assembly of my transcripts driven by a GTF of my genome.

I have an interest gene (9 exons), I am looking if it is expressed in my data. When I look at it in the assembly, I found it split in 2 parts. The 3 first exons in a transcript. And the 6 last exons in another transcript. I was thinking that the expression level was too low to make a good assembly of this transcript. But when I look at the sashimiplot from the .bam file used for the assembly, here is what I have:

Sashimi Plot of my gene

I have enough reads supporting the junction, but cufflinks reports 2 transcripts.

Any idea about why is this happening?

EDIT: Could it be comming from my parameters?

-u 
--min-frags-per-transfrag 10 
--max-multiread-fraction 0.99
--trim-3-avgcov-thresh 5 
--trim-3-dropoff-frac=0.1 
--overlap-radius 50
rna-seq cufflinks assembly • 371 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by corend30
1

There supposingly are too few reads to support the junction between exon 3 and 4. Try to reduce the threshold (I guess it's --min-frags-per-transfrag) to a lower value, see what happens!

ADD REPLYlink written 5 months ago by Macspider2.4k
1

that's the junction with the most reads in the whole transcript

do you know if these reads are uniquely mapped?

ADD REPLYlink modified 5 months ago • written 5 months ago by Martombo2.1k

I just filtered out multimapped reads, and I have 67 reads supporting the junction instead of 77. It seems that this is not a multimapping problem :/

ADD REPLYlink written 5 months ago by corend30

did you check how do they overlap the junction? maybe a majority of them only has few bases on one exon...

ADD REPLYlink written 5 months ago by Martombo2.1k

Nope, they overlap very well (not worse than any other junctions). I changed all parameters one by one, I always get this transcript split in2. The only way to have it full is to use faux-reads from a guide annotation (and I don't want to use a guide annotation).

ADD REPLYlink written 5 months ago by corend30
1

I'm sorry I don't have any other ideas as I'm not an expert. For transcript quantification cufflinks has long been known for being sub-optimal, maybe some other software that is specifically built for assembly like SPAdes or Trinity would be a better choice?

ADD REPLYlink written 5 months ago by Martombo2.1k

Sorry, I thought you asked the next questions about N-stretches. I answered about why I use cufflinks and not other programs in the next comment. Thank you for the help!

ADD REPLYlink written 5 months ago by corend30

How is the reference in that region? Is there an N-stretch in between?

ADD REPLYlink written 5 months ago by Macspider2.4k

Nope, there is not a single N in that region. I could use another aligner, but cufflinks is part of a pipeline I am using (not mine).

I retrieved the command line that runs cufflinks, the bam used, other files, etc...(in the pipeline code). I reran cufflinks separately from the pipeline changing parameters and I get the same result, with this transcript unassembled.

Anyway, I will keep in mind that I have to check when a transcripts is badly assembled if there are reads supporting the junction.

If you have any other ideas, you are welcome! Thanks for the help :)

ADD REPLYlink modified 5 months ago • written 5 months ago by corend30

The reads you're using... how did you filter them? Did you keep only primary alignments (-F 0x0100) and proper pairs (-f 0x2)? Maybe it's not connecting those two exons because not all the reads that map there are not primary/properly paired.

ADD REPLYlink written 5 months ago by Macspider2.4k

I did not filter them. When am I supposed to use -F 0x0100 and -f 0x2?

ADD REPLYlink written 5 months ago by corend30
1

samtools view options. See: http://www.htslib.org/doc/samtools.html

ADD REPLYlink written 5 months ago by Macspider2.4k

I just filtered my bam and checked manually, they are still there. I also checked manually on igv, those junction reads are not secondary and are proper pair.

ADD REPLYlink modified 5 months ago • written 5 months ago by corend30

Do they terminate before a splicing start site (f.e. "GT") and begin again after a splicing end site (f.e. "AG")? If not, maybe that's the reason.

ADD REPLYlink written 5 months ago by Macspider2.4k

It is :

XXXXGA GTXXXXXXXXAG CGXXXXX

Exon | intron | Exon

ADD REPLYlink modified 5 months ago • written 5 months ago by corend30
1

I think it could be because the two splicing sites have a couple of point mutations with respect to the U2 and U12 known splice sites, and even with that, they seem to look at each other and not to be consecutive:

XXXXGA TGXXXXXXXXGA CGXXXXX
    -->            <--

Bottom line being:

The reads map in split mode across the intron, on the two flanking exons, so clearly the transcript belongs together, but the prediction of the gene model based on splice sites cannot do so, because the sites face each other and that's not a likely thing to observe wihtin a gene.

ADD REPLYlink modified 5 months ago • written 5 months ago by Macspider2.4k
1

sorry but are you sure the splice site isn't correct? GU at the start of the intron and AG at its end is the standard thing, isn't it?

ADD REPLYlink written 5 months ago by Martombo2.1k

Actually, you're right. I thought they were 3 exons with the splice sites shown. If the middle one is an intron, then my answer is not correct!

ADD REPLYlink written 5 months ago by Macspider2.4k
1

sorry for being a killjoy ;)

ADD REPLYlink written 5 months ago by Martombo2.1k

I guess you are right, if you want to post this as an answer I could validate it. Anyway, it means that on other transcripts, this problem might not be always present so it is ok for me. Thanks a lot :)

ADD REPLYlink written 5 months ago by corend30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 596 users visited in the last hour