Question: Aligning RNA-seq data against just a small set of transcripts of interest (and counting)
jpascualanaya (Japan) wrote, 3.4 years ago:

Would it be possible to align a whole RNA-seq dataset against just a particular small set of transcripts, rather than a whole transcriptome? For example, just Hox genes, or just Wnt genes.

I am asking because I am working with non-model organisms, with no genome or decent transcriptome available. After several attempts at de novo transcriptome assembly, there was no way to get complete genes for most of those I am interested in, and there were way too many chimeric genes. But after manual curation and tons of PCR, I now have very reliable sequences for my set of transcripts. I would like to get expression-level measures for this set of 43 genes in 8 developmental stages, and although qPCR is possible, I would rather first try to use the RNA-seq data I have.

I thought of doing something similar to this post: "Create GFF from de novo assembly to input on htseq-counts"

Align the RNA-seq datasets against the 43 genes (really low % of alignment expected), count the tags and calculate TPM myself. I just need the TPMs to then standardize (z-score) the data by gene.

Would that make sense?

Edit: edited title. I want to align and count.

WouterDeCoster (Belgium) wrote, 3.4 years ago:

I have the impression that you are mixing up two things. Do you want to align only against a small set of transcripts (as your title says) or do you want to perform counting only for a certain set (as indicated by the custom gff)? To what will you align if you don't have a reference genome available?


Sorry if I didn't explain well. I want to align and count.

So, I have a multi-FASTA of 43 genes whose sequences I have manually curated, and now I want some measure of their expression levels at different developmental time points. What I plan to do is align the RNA-seq data against those 43 genes, let's say using bowtie, then count the aligned reads, using for example samtools, and then calculate TPMs.

bowtie --> samtools --> TPM --> z-scores
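For the last two steps of that pipeline, here is a minimal Python sketch of how TPMs and per-gene z-scores could be computed from read counts. The function names `tpm` and `zscore` and the shape of the inputs (a dict of counts and a dict of transcript lengths per gene) are assumptions made for illustration, not part of any tool mentioned in this thread. Note that TPM is scaled so that all genes in the reference sum to one million, so with a reference of only 43 genes the values are relative to that set and will look far higher than whole-transcriptome TPMs.

```python
from statistics import mean, pstdev


def tpm(counts, lengths):
    """Convert read counts to TPM, given transcript lengths in bases.

    counts and lengths are dicts keyed by gene ID (hypothetical input shape).
    """
    # Reads per kilobase of transcript for each gene
    rpk = {gene: counts[gene] / (lengths[gene] / 1000) for gene in counts}
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that the values of one sample sum to one million
    return {gene: value / total * 1e6 for gene, value in rpk.items()}


def zscore(values):
    """Standardize one gene's values across samples (population SD)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A gene with constant expression has no variation to standardize
        return [0.0 for _ in values]
    return [(value - mu) / sd for value in values]
```

In use, `tpm` would be called once per developmental stage, and `zscore` once per gene on the list of that gene's TPMs across the 8 stages.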

The post I cite is just similar to my question, but I don't need the GTF. I cited it only because the answer led to something similar to my problem; but while there the whole transcriptome assembly is used, I wonder whether using just a small set would be OK, since all the methods I've seen align against the whole transcriptome. As a matter of fact, I ran RSEM against just this set of 43 genes and obtained insanely high expression levels (which are not true), so I was wondering whether what I intend to do is somehow flawed.

As for your last question, you can align directly against a transcriptome.

Reply written 3.4 years ago by jpascualanaya

Your proposed method will lead to incorrectly high counts: the reads come from the whole transcriptome, but bowtie only has a few genes to align them against, so it will produce more false-positive alignments. Use salmon or kallisto to quantify against the entire transcriptome, then subset the results to whatever you need.
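The subsetting step could look roughly like the Python sketch below. It assumes salmon's tab-separated `quant.sf` output with its `Name` and `TPM` columns; the function name `subset_quant`, the file path, and the transcript IDs are hypothetical. The TPMs are taken as reported by the whole-transcriptome run and are not rescaled after subsetting.

```python
import csv


def subset_quant(quant_path, wanted_ids):
    """Return {transcript_id: TPM} for the wanted IDs in a quant.sf file."""
    wanted = set(wanted_ids)
    selected = {}
    with open(quant_path, newline="") as handle:
        # quant.sf is tab-separated with a header row
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if row["Name"] in wanted:
                selected[row["Name"]] = float(row["TPM"])
    return selected
```

For example, `subset_quant("stage1/quant.sf", ["hoxA1", "wnt3"])` (hypothetical path and IDs) would return the TPMs for just those two transcripts from one sample.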

Reply written 3.4 years ago by Devon Ryan