Hello, I am a beginner in RNAseq analysis and ran into a couple of challenges.
I am trying to process the Desulfivibrio Vulgaris H. RNAseq dataset (SRP071071) https://www.ncbi.nlm.nih.gov/sra?term=SRP071071, from the Illumina platform. The Reference genome(.fna) and annotation (.gtf) files were obtained from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195755.1#/qa
Below are the steps I took and the results I got.
- I performed a quick check of the quality of the reads using
qualityScores(input_format = ".fasq files",offset = 33) - which shows that the Phred score is just one number (30). I dont know if this is anything to worry about. It is common to find it in range of 0- 41 for Illumina.
I also checked the first 4 lines
with this, I did not trim the reads since they all have a uniform score.
I built the index with buildindex() function from Rsubread using default settings
I performed alignment using align () from Rsubread using default settings The align function generated the following warning
All six samples reported these warnings. I don't really know how to resolve this problem. I read a few blogs and I could not figure out what might be wrong so I proceeded with the analysis.
Below is the summary of the alignment for one of the samples
- I performed feature counts with featureCounts() using the annot.ext and isGTFAnnotation =TRUE parameters. other settings remain at default. I got the following matrix (dimension = 90 x 6) for the six samples
I felt this is not normal. I was expecting a data matrix of about 3000 features(genes). Why did I get this result? why is featureCounts() detecting only 90 features from the annotation (gtf) file?
- Among other things, I have tried to compare the .BAM files with the .GTF file to see if there is difference in chromosome naming (as suggested in featureCounts: Low percentage of assigned fragments). The following are the .BAM and GTF few rows;
Is there something wrong with these files?
Thank you, Mathew
Based on the screenshot above (please copy paste data in future, while your screenshots are clear they don't allow people to grab data for testing if needed), 90 genomic features is obviously a problem.
Can you actually confirm where you downloaded the genome sequence and annotation? There are multiple possibilities. You could have gone to:
Genome page: https://www.ncbi.nlm.nih.gov/genome/?term=txid881[orgn]
RefSeq folder: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/755/GCF_000195755.1_ASM19575v1/
to get the sequence and annotation.
As a reference if I look at the RefSeq GTF file I see 3343 gene entries.
I actually got my reference genome and annotation file from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195755.1#/qa and it contains 3343 genes. But
featureCounts()
is detecting 90 genes and I don't know how to fix that.I assume you used
Download Assembly
button and then got the sequence and GTF files? You could create a simple annotation format (SAF) file by choosing only those rows that saygene
from annotation file and then choosing appropriate columns to make a file like this. Use your own data.featureCounts
can use this file format. That should help fix the issue.In
featureCount()
the parameterstrandSpecific=0
by default (unstranded). If you check the M&M:Try to set
strandSpecific=1
instead.Finally, no rRNA depletion is mentioned in the M&M, so it is possible that only a small percentage of reads will map against your CDS.
Thank you @andres.firrincieli, I set the
strandSpecific=1
but the successfully assigned alignment percentage then reduced