kallisto pseudoalignment rate is lower than in the literature
2.8 years ago

Hi guys,

Recently, I began a new project on meta-analysis of RNA-seq. The pipeline that I chose is kallisto-tximport-WGCNA, but I just found my pseudoalignment rate is so low contrast to my reference (the same pipeline, but I have more SRA files).

I checked the possible reasons on Biostars: 1) different pipelines generate different alignment rates; 2) Trimmomatic may lower the alignment rate.

Therefore, I tried the Bowtie-Samtools pipeline, and I also ran kallisto on the raw data without Trimmomatic. However, neither helped in my project.

Since I have the reference literature, I re-checked it and thought the difference between us is the transcript index. But I can't understand this: if kallisto ignores the position of the reads compared to the genes, then the differences in transcripts should mainly come down to the number of annotated genes. But the use of transcripts with a slightly lower number of genes should result in an acceptable alignment rate, instead of the difference between 1% and 50%.

pseudoalignment kallisto RNA-seq • 2.5k views
ADD COMMENT
0

I'm sorry, but it is very difficult to parse your question.

What does "transcripts with a slightly lower number of genes" mean? What does "position of the reads compared to the genes" mean? What does "pseudoalignment rate is so low contrast to my reference" mean?

Those sentences don't make any sense.

ADD REPLY
0

Thanks for your reply.

I'm sorry my question was hard to read. I'm a newcomer to bioinformatics and also a non-native English speaker. Maybe that makes this question harder to understand.

For the first question, I constructed a reference transcript index by CDS.fq from NCBI. The number of CDS from it is 3598 (that means this strain has 3598 genes). In my reference literature, they constructed a reference transcript index by the genome and additional ncRNAs from the other publication. Like I mentioned, my Kallisto pseudo alignment rate is lower than that in the reference literature. I'm wondering if the number of annotated genes will affect the pseudo alignment rate.

For the second question, I have read about how kallisto works and know that it focuses on whether a read can be mapped to a reference transcript, rather than on the position within the gene. So I think that if a read can be mapped to a reference transcript, it counts toward the mapping rate. That should be independent of the total number of annotated genes.

For the third question, my pseudo alignment rate is so low in contrast to that in my reference literature.

Thanks again for your reply.

So I just want to solve my low pseudo alignment rate.

ADD REPLY
2

For the first question, I constructed a reference transcript index by CDS.fq from NCBI. The number of CDS from it is 3598 (that means this [bacterial] strain has 3598 [protein-coding] genes). In my reference literature, they constructed a [different] reference transcript index by the genome [likely similar to all CDS] and additional ncRNAs [which might conform with a considerable proportion of the reads] from the other publication [resulting in a more comprehensive transcriptome]. Like I mentioned, my Kallisto pseudo alignment rate is lower than that in the reference literature. I'm wondering if the number [and in particular comprehensiveness and correctness] of annotated genes will affect the pseudo alignment rate.

[my thoughts and interpretations]

I think given these considerations, the answer to the question is: Yes.

In other words you need to reproduce the (exact) same input data in order to be able to reproduce the results.
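
To illustrate why a less comprehensive index lowers the rate, here is a toy sketch. It is not kallisto's actual algorithm, only a simplified exact k-mer lookup, and all sequences and names (`gene_a`, `gene_b`, `rrna_toy`) are made up for the example. Reads that come from an RNA missing from the index have nothing to match, so they count against the rate.

```python
# Toy illustration only: a strict k-mer lookup, NOT kallisto's real algorithm.
# All sequences and names below are invented for the example.

def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoaligns(read, index, k):
    """True if every k-mer of the read is indexed and all share a transcript."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            return False
        compatible = hits if compatible is None else compatible & hits
    return bool(compatible)


def pseudoalignment_rate(reads, transcripts, k=5):
    """Fraction of reads that pseudoalign to an index built from transcripts."""
    index = build_index(transcripts, k)
    return sum(pseudoaligns(r, index, k) for r in reads) / len(reads)


gene_a = "ATGGCGTACGTTAGCCTGAA"
gene_b = "ATGCCCGGGTTTAAACGCTA"
rrna_toy = "GGCTACCTTGTTACGACTTC"

cds_only = {"gene_a": gene_a, "gene_b": gene_b}
cds_plus_ncrna = {**cds_only, "rrna_toy": rrna_toy}

# Two reads from coding genes, two reads from the toy ncRNA.
reads = [gene_a[3:15], gene_b[4:16], rrna_toy[2:14], rrna_toy[5:17]]

print(pseudoalignment_rate(reads, cds_only))        # 0.5
print(pseudoalignment_rate(reads, cds_plus_ncrna))  # 1.0
```

With the CDS-only index, half of the toy reads match; adding the ncRNA sequence brings it to all of them. In a real library, reads from rRNA or other ncRNAs can play the same role if those sequences are in the paper's index but not in yours.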

ADD REPLY
0

Thanks for your additional words.

I think I can learn how to express my questions better on Biostars with your generous help.

Also thanks for your answer, I will try it later.

Hope you have a nice day.

ADD REPLY
2

No need to apologize -- we all understand that bioinformaticians and bioinformatics students come from all over the world.

The .fa file and the .gtf files used can definitely affect pseudoalignment rate, especially if you're using the incorrect annotations (which may very well be the case). For example, you shouldn't be using CDS -- you should be using cDNA.

And yes, kallisto maps reads to transcripts -- but the mapping rate will be influenced by the index used.

Anyhow, my best guess given the current information is that your index is incorrect. I recommend constructing the index based on how it's done in the literature.
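
A quick sanity check before rebuilding the index is to compare what actually goes into it. The sketch below only counts records and bases in FASTA-formatted lines; the function name `summarize_fasta` and the example records are hypothetical, and it assumes a plain, uncompressed FASTA file.

```python
def summarize_fasta(lines):
    """Count records and total bases in an iterable of FASTA lines."""
    n_records = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # each header line starts a new record
        else:
            n_bases += len(line)
    return n_records, n_bases


# Made-up records standing in for a real file.
example = [
    ">cds_1 hypothetical protein",
    "ATGGCGTACG",
    "TTAGCCTGAA",
    ">cds_2 hypothetical protein",
    "ATGCCCGGGTTT",
]
print(summarize_fasta(example))  # (2, 32)
```

For a real file you could pass an open file handle instead of the list, e.g. `with open(path) as fh: summarize_fasta(fh)`. If your file has 3598 records and the paper's reference has noticeably more (CDS plus ncRNAs), that difference is a likely suspect.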

ADD REPLY
1

Never mind, Michael has given me a better understanding of what you're trying to say (and I agree with Michael: yes, those things will affect the pseudoalignment rate).

ADD REPLY
0

Thanks for your reply.

I agree with you that cDNA rather than CDS should be used as the reference transcripts, but there is no cDNA file for this bacterial strain in NCBI or Ensembl.

Besides, I tried the Bowtie2-Samtools pipeline, which uses the reference genome to construct the index, and the alignment rate is lower (40%) than kallisto (60%). This pipeline (Bowtie2-Samtools) is not related to the CDS file. So I wonder if the error occurred to me during the pre-processing stage like Trimmomatic.

That confuses me.

ADD REPLY
1

Oh, I see -- in that case, it could definitely be due to some pre-processing step.

But still, don't rule out an index issue.

Maybe the paper made their index in some special way. Ideally, the paper should tell you exactly what files were used for creating the index and link to them and give the exact code used. Otherwise, all we're doing here is guessing.

ADD REPLY
1

So I wonder if the error occurred to me during the pre-processing stage like Trimmomatic.

It would be unlikely that something you did is causing this but it is possible, if you are a new user.

Besides, I tried the Bowtie2-Samtools pipeline, which uses the reference genome to construct the index, and the alignment rate is lower (40%) than kallisto (60%)

You can't compare alignment rates of those two programs since they are fundamentally different methods. One is a normal NGS aligner whereas the other does abundance estimation.

ADD REPLY
0

Thanks for your reply.

I'm sure that the probability of problems during pre-processing is low, because I ran FastQC for quality control before and after Trimmomatic.

For the second thing, I still haven't figured out what you mean. In my opinion, the alignment rate should represent how many reads are usable, regardless of the alignment method. That means slight differences are acceptable, such as 70% versus 79%.

I'm not quite sure I'm understanding this correctly. If not, please let me know.

Thanks again for your reply. Hope you have a nice day.

ADD REPLY
1

Did you check the software versions? Are they the same?
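
If it helps, kallisto writes a `run_info.json` into each output directory, and the version and read counts can be read from there. This is only a sketch: the field names (`n_processed`, `n_pseudoaligned`, `kallisto_version`) are as I recall them from recent kallisto output, so please check them against your own file. The helper name `summarize_run` and the numbers in the example are made up.

```python
import json


def summarize_run(run_info):
    """Extract version and pseudoalignment rate from a parsed run_info.json dict.

    Assumes the keys n_processed, n_pseudoaligned and kallisto_version;
    verify these against your own kallisto output.
    """
    processed = run_info["n_processed"]
    aligned = run_info["n_pseudoaligned"]
    rate = 100.0 * aligned / processed if processed else 0.0
    return {
        "kallisto_version": run_info.get("kallisto_version", "unknown"),
        "rate_percent": round(rate, 1),
    }


# Invented values standing in for the contents of a real run_info.json.
example_text = '{"n_processed": 1000, "n_pseudoaligned": 600, "kallisto_version": "0.46.0"}'
summary = summarize_run(json.loads(example_text))
print(summary)
```

For real data you would load each sample's file with `json.load` and compare the summaries across samples and against what the paper reports.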

ADD REPLY
0

Thanks for your reply.

I did check the version of kallisto. For my reference literature, it's 0.45. For me, I used the newest version 0.46. I think the new version should perform better.

Thanks again for your reply.

ADD REPLY
0

Note that better won't always mean a higher alignment rate, as it could be "better" by virtue of fewer spurious mappings, thereby resulting in a lower alignment rate while still being more accurate.

ADD REPLY
0

Thanks for your reply.

Your suggestion sounds right. I think I'm going to use the old version to run my pipeline.

Have a nice day.

ADD REPLY
