Similarity of yeast strains at the nucleotide level - RNA-Seq
0
0
Entering edit mode
22 months ago
palu • 0

Dear Community, I need some help with an analysis I need to do for the reviewers before my manuscript can be published. My Bioinformatics level is beginner. See the case, pipeline and the questions below.

Case:
I did an RNA-seq differential gene expression study using two yeast strains (strain A and B) and three media conditions (M1, M2, M3) in Triplicates (I-III) resulting in the following samples:

StrainA_M1_I
StrainA_M1_II
StrainA_M1_III
StrainA_M2_I
StrainA_M2_II
StrainA_M2_III
StrainA_M3_I
StrainA_M3_II
StrainA_M3_III
StrainB_M1_I
StrainB_M1_II
StrainB_M1_III
StrainB_M2_I
StrainB_M2_II
StrainB_M2_III
StrainB_M3_I
StrainB_M3_II
StrainB_M3_III

My Pipeline was:

I needed to build an index for the reference genome prior read mapping.

Questions:
My manuscript was accepted but the reviewers want some additional analysis:

1.) Quantification of the number of SNPs identified by RNA seq: Similarity of the strains at the nucleotide level?
2.) Unmapped reads: Are there any transcripts that don't map to the reference?
3.) Which of the unmapped reads are involved in the metabolic pathways I am investigating?

Concerning 1.):
How would a pipeline look like quantifying the number of SNPs identified by RNA seq?
I heard exactSNP from the Subread package could do the job.
Does it make sense to use all samples for this kind of analysis or only a subset?

Concerning 2.):
How can I identify unmapped reads?
I guess the results are different for each sample.

Concerning 3.):
What is a good way to check, if unmapped reads belong to a subset of genes of interest?

For the analyses, I have the reference genome + the above mentioned RNA-Seq data of the samples.

Thanks a lot

rna-seq SNP sequencing • 430 views
0
Entering edit mode

Quantification of the number of SNPs identified by RNA seq

Surprised reviewers are asking you to do that. SNP calling using RNAseq data is not a norm. If subread has SNP calling functionality then try it first. If strains are closely related there would be a small number of SNP's, IF they are present.

GATK has a pipeline available for RNAseq SNP calling.

How can I identify unmapped reads?

No familiar with subread but it may have an option to collect upmapped reads in a separate file. Otherwise you can filter your alignment file to collect unmapped reads using samtools, provided subread includes them in the alignment file.

Which of the unmapped reads are involved in the metabolic pathways I am investigating?

You will need to try and assemble unmapped reads to first identify new transcripts. Then annotate them and finally try to map on to pathways.

if unmapped reads belong to a subset of genes of interest?

If they did originate from genes in reference, then the reads should not remain unmapped (as long as the aligner did its job well)?

Having said all this, with yeast the number unmapped reads in your samples should be relatively small (< 10%) so I am not sure if you are going to get a lot out of this whole lot of work. Have you tried to push back against reviewers suggestion?

0
Entering edit mode

Did subread provide you with stats about the % of reads that remained unaligned?

0
Entering edit mode

@genomax: Thanks for your answer. So far we have not pushed back against the reviewers. They do not expect an in-depth analysis but want at least some statements about similarity of the strains on nucleotide level - based on the RNA-seq reads. We used subreads exactSNP on the BAM files and the samples have around 2000 SNPs compared to the reference genome.

Related to question 2.) Unfortunately, subread seems to have no function to collect unmapped reads, it can just show the number of unmapped reads. I will try samtools or STAR aligner.

0
Entering edit mode

Please use ADD REPLY/ADD COMMENT to keep the threads logically organized. SUBMIT ANSWER is only for new answers to original question.

2000 SNP's seems reasonable. I assume that is the total across all strains?