Question

Stringtie assembling and quantifying using multiple datasets with various read lengths

2.1 years ago

dadrasarmin ▴ 20

Hi,

I want to perform an RNA-Seq analysis with multiple datasets from separate, publicly available studies, and the read lengths differ between these groups.

I use HISAT2 to map the reads to the genome and StringTie for assembly and quantification. I want to use the limma and edgeR packages for downstream analyses, so I need raw counts. I use the command below to produce the expression files (with the default read length of 75):

stringtie -p {threads} -G {input.gtf} -B -e -o {output.output_path_assembled_transcripts_GTF} {input.sorted_bam}

Then I use the tximport package to import the gene expression data into R like this (again, with the default read length of 75):

Txi_gene <- tximport(path, type = "stringtie", countsFromAbundance = "lengthScaledTPM", tx2gene = Tx, txOut = F, ignoreTxVersion = F)

I prefer not to use the prepDE.py script provided with StringTie and to use tximport instead; however, the logic behind the question is the same.

My guess is that if the read length is used in a calculation somewhere in StringTie, tximport (or prepDE.py) will use it to estimate the count numbers. My question is whether what I am doing is correct, or whether I have to run the analysis with the "real" read length for each group of samples and combine the results at the end.
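To make the guess above concrete, here is a minimal sketch of how a read length can enter a coverage-to-count conversion. It assumes the conversion is count = coverage × transcript length / read length, which is my understanding of what prepDE.py does; the function name `estimated_count` and the example transcript are hypothetical and only serve as illustration.

```python
from math import ceil


def estimated_count(coverage: float, transcript_len: int, read_len: int) -> int:
    """Approximate a read count from mean per-base coverage.

    Assumed formula: coverage * transcript length / read length, rounded up.
    """
    return int(ceil(coverage * transcript_len / read_len))


# Hypothetical transcript: 1500 bp long with 20x mean per-base coverage.
coverage, transcript_len = 20.0, 1500

# Compare the count obtained with the default read length (75)
# against the count obtained with each sample's real read length.
for true_read_len in (50, 100, 200):
    with_default = estimated_count(coverage, transcript_len, 75)
    with_true = estimated_count(coverage, transcript_len, true_read_len)
    print(f"read length {true_read_len}: default -> {with_default}, real -> {with_true}")
```

Under this assumption, a wrong read length rescales every count in a sample by the same constant factor (assumed length versus real length), so samples with different real read lengths would be scaled differently from one another.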

Tximport Stringtie • 865 views


These are different studies; are you aware of batch effects here? In such a setting, the read length is probably the least of your problems. Can you even meaningfully compare the different studies?

2.1 years ago by ATpoint 81k

Hi @ATpoint, I will check for batch effects and try to handle them with the available tools. I am going to try a few tools to see what I can do there. Before that, I need to be sure that what I am doing is correct.

2.1 years ago by dadrasarmin ▴ 20

I can most likely tell you right now that you cannot compare them. Can you describe the data you have? It is a very common misconception that combining many unrelated datasets will yield anything meaningful. As I said, in the presence of batch effects due to data coming from different studies, the read length is by far the least of your worries. If you are afraid of that, then trim them all to the same length.

2.1 years ago by ATpoint 81k

Let's consider another scenario. I sequenced the same conditions and controls three times, at the same time, with the same person handling everything, etc. The only variation is that I sequenced them with three different read lengths (50, 100, 200). Since they are all biological replicates produced in the same batch by the same person, I am not concerned about batch effects.

In this scenario, is using the default read length correct in your opinion, or not?

2.1 years ago by dadrasarmin ▴ 20

I would probably just trim them all to 50; then you don't have to worry about anything. On the global scale it will probably not matter, but for difficult-to-map genes one might introduce a mapping bias.
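As a rough illustration of what trimming to a common length means, here is a minimal Python sketch that hard-trims FASTQ records from the 3' end to a fixed length. The helper names `read_fastq` and `trim_to_length` are hypothetical, and the sketch assumes plain four-line FASTQ records; in practice a dedicated trimming tool would normally be used instead.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:  # end of input (or an incomplete trailing record)
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def trim_to_length(records: Iterable[FastqRecord], max_len: int) -> Iterator[FastqRecord]:
    """Keep only the first max_len bases and quality values of each read."""
    for header, seq, qual in records:
        yield header, seq[:max_len], qual[:max_len]


# Tiny in-memory example: one 10 bp read trimmed to 4 bp.
example = ["@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n"]
trimmed = list(trim_to_length(read_fastq(example), 4))
print(trimmed)
```

Reads that are already shorter than the target length pass through unchanged, so after trimming to 50 every sample has reads of at most 50 bp.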

2.1 years ago by ATpoint 81k

Thanks for your suggestion. However, as you said, by trimming I will introduce a mapping bias.

2.1 years ago by dadrasarmin ▴ 20