Question

Converting unstranded samples to strand specific

0

6.5 years ago

lirongrossmann ▴ 40

Hi all, I have two datasets of RNA-seq samples: one was prepared with a strand-specific protocol (TruSeq) and the other with an unstranded protocol (Clontech’s SMART). I would like to use both datasets (to increase the power of my study) and tried batch effect correction, but it did not go well (I still see two clear groups separated on the PCA according to the protocol used). Is there a way to account for the difference between the protocols at the mapping/counting level? My understanding is that the principal difference between the two sequencing techniques is that the unstranded protocol will generate reads from both strands, even if only one strand was actually expressed. Is there a way to get rid of the strands that were not expressed by using my stranded dataset (assuming that strands that are not expressed in the strand dataset should not be expressed in the unstranded dataset as well)? Thanks a lot!

RNA-Seq strand • 2.1k views

ADD COMMENT • link updated 6.5 years ago by Friederike 8.9k • written 6.5 years ago by lirongrossmann ▴ 40

score 1 · Answer 1 · 2017-10-24

1

6.5 years ago

Friederike 8.9k

I think your title is a bit misleading: you're not actually trying to convert the sample type (which would be impossible, since strandedness is determined at the time of library preparation). If I understand you correctly, what you want is to filter reads from the unstranded dataset based on information from the stranded dataset.

There are so many issues with that, it's hard to even get started. I am pretty sure you would introduce far more bias than you would remove by accounting for the fact that you used two different library preps.

First of all, I don't see how you can justify the assumption that "strands that are not expressed in the strand dataset should not be expressed in the unstranded dataset as well". There are many reasons why you may not detect a transcript (e.g., you never captured it for the cDNA, it got degraded, etc.), and the lack of expression is just one of them.

Secondly, you're dealing with randomly fragmented pieces! Just try to envision how you would match the different pieces from the different library preps. I'm not saying it's absolutely impossible, but it does not seem worth pursuing.

I'm sure there are many more details that make this task a rather undesirable one, but I hope these two points already illustrate the magnitude of the problem.

ADD COMMENT • link 6.5 years ago by Friederike 8.9k

0

Thank you for the detailed answer. I agree with your comments. I may not have been clear about what I would like to achieve with the conversion. I built a model to predict groups based on their gene expression using the strand-specific samples. I want to verify my model using the unstranded samples and some of the remaining stranded samples (I don't have many to begin with). I was hoping there is a way to compare expression levels between the strand-specific samples and the unstranded samples. Also, it's worth noting that my alignment algorithm was based on splice site orientation, so I was able to infer the strand for the unstranded reads. I know I may be losing a lot of information (such as novel genes, etc.), but I am not trying to detect genes, just to compare levels of expression for selected genes. Thanks

ADD REPLY • link 6.5 years ago by lirongrossmann ▴ 40

0

At least for non-overlapping genes the TPM values should be comparable if the experimental conditions were the same. If you see great differences there, the issue is most likely not just due to the different library prep types.

ADD REPLY • link 6.5 years ago by Friederike 8.9k
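To make the TPM comparison suggested in the reply above concrete, here is a minimal Python sketch of how TPM can be computed from per-gene read counts and gene lengths, so that the same genes can be compared side by side across the two library types. The gene names, counts, and lengths are made-up toy values, and the `tpm` helper is a hypothetical illustration, not part of any particular tool; in practice the counts would come from your own quantification, ideally restricted to genes that do not overlap other genes.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in base pairs."""
    # Step 1: reads per kilobase of gene length (length normalization).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Step 2: scale so that all values in the sample sum to one million.
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical toy data: the same three non-overlapping genes in both datasets.
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 500}
stranded_counts = {"geneA": 100, "geneB": 400, "geneC": 50}
unstranded_counts = {"geneA": 210, "geneB": 790, "geneC": 100}

stranded_tpm = tpm(stranded_counts, lengths)
unstranded_tpm = tpm(unstranded_counts, lengths)

# Print the two TPM values per gene next to each other for comparison.
for gene in lengths:
    print(f"{gene}\t{stranded_tpm[gene]:.1f}\t{unstranded_tpm[gene]:.1f}")
```

Because TPM is normalized within each sample, a difference in sequencing depth between the two datasets does not by itself change the values; large per-gene discrepancies would point to something other than depth.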

score 0 · Answer 2 · 2017-10-24

0

6.5 years ago

Devon Ryan 104k

What you want to do is fundamentally impossible.

ADD COMMENT • link 6.5 years ago by Devon Ryan 104k

0

Could you please briefly explain why?

ADD REPLY • link 6.5 years ago by lirongrossmann ▴ 40

0

The only way to know which strand an unstranded fragment arose from would be to align it and, if it happens to align to a single gene, assume it arose from a given gene and not from the opposite strand. Since unstranded reads are slightly more prone to multimapping as is, you'll already be biased by that. That combined with the bias of assuming that antisense transcription never occurs will further compound the incorrectness of the results.

ADD REPLY • link 6.5 years ago by Devon Ryan 104k
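As a rough illustration of where the strand ambiguity described above is unavoidable, the following Python sketch flags genes that overlap another gene on the opposite strand; an unstranded read falling in such a region cannot be assigned to one gene from its alignment alone. The `Gene` record, the `antisense_overlapping` helper, and the coordinates are all hypothetical toy constructs (0-based, half-open intervals assumed), not taken from any annotation or library.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def antisense_overlapping(genes):
    """Return names of genes that overlap a gene on the opposite strand."""
    flagged = set()
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            same_chrom = a.chrom == b.chrom
            opposite_strand = a.strand != b.strand
            # Two half-open intervals overlap if each starts before the other ends.
            overlaps = a.start < b.end and b.start < a.end
            if same_chrom and opposite_strand and overlaps:
                flagged.update((a.name, b.name))
    return flagged


# Hypothetical toy annotation.
genes = [
    Gene("geneA", "chr1", 100, 500, "+"),
    Gene("geneB", "chr1", 400, 900, "-"),   # overlaps geneA on the opposite strand
    Gene("geneC", "chr1", 1000, 1500, "+"),
    Gene("geneD", "chr2", 100, 500, "-"),   # different chromosome, no overlap
]

ambiguous = antisense_overlapping(genes)
print(sorted(ambiguous))
```

The pairwise loop is quadratic and only meant for a toy list. Note also that this only catches annotated overlaps; unannotated antisense transcription, which the reply above also mentions, would not be detected this way.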