Question

RNAseq Processing abnormal results

0

Entering edit mode

22 months ago

mathew • 0

Hello, I am a beginner in RNAseq analysis and ran into a couple of challenges.

I am trying to process the Desulfivibrio Vulgaris H. RNAseq dataset (SRP071071) https://www.ncbi.nlm.nih.gov/sra?term=SRP071071, from the Illumina platform. The Reference genome(.fna) and annotation (.gtf) files were obtained from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195755.1#/qa

Below are the steps I took and the results I got.

I performed a quick check of the quality of the reads using

qualityScores(input_format = ".fasq files",offset = 33) - which shows that the Phred score is just one number (30). I dont know if this is anything to worry about. It is common to find it in range of 0- 41 for Illumina.

I also checked the first 4 lines screenshop of first 4 lines of the read showing that the Phred score is 30

with this, I did not trim the reads since they all have a uniform score.

I built the index with buildindex() function from Rsubread using default settings
I performed alignment using align () from Rsubread using default settings The align function generated the following warning

All six samples reported these warnings. I don't really know how to resolve this problem. I read a few blogs and I could not figure out what might be wrong so I proceeded with the analysis.

Below is the summary of the alignment for one of the samples

alignment sumarry

I performed feature counts with featureCounts() using the annot.ext and isGTFAnnotation =TRUE parameters. other settings remain at default. I got the following matrix (dimension = 90 x 6) for the six samples

count results

I felt this is not normal. I was expecting a data matrix of about 3000 features(genes). Why did I get this result? why is featureCounts() detecting only 90 features from the annotation (gtf) file?

Among other things, I have tried to compare the .BAM files with the .GTF file to see if there is difference in chromosome naming (as suggested in featureCounts: Low percentage of assigned fragments). The following are the .BAM and GTF few rows;

.BAM file

.GTF file Is there something wrong with these files?

Thank you, Mathew

annotation Rsubread featureCounts • 764 views

ADD COMMENT • link updated 22 months ago by GenoMax 141k • written 22 months ago by mathew • 0

0

Entering edit mode

Based on the screenshot above (please copy paste data in future, while your screenshots are clear they don't allow people to grab data for testing if needed), 90 genomic features is obviously a problem.

Can you actually confirm where you downloaded the genome sequence and annotation? There are multiple possibilities. You could have gone to:

Genome page: https://www.ncbi.nlm.nih.gov/genome/?term=txid881[orgn]
RefSeq folder: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/755/GCF_000195755.1_ASM19575v1/

to get the sequence and annotation.

As a reference if I look at the RefSeq GTF file I see 3343 gene entries.

$ zcat GCF_000195755.1_ASM19575v1_genomic.gtf.gz | awk -F "\t" '$3 == "gene"  {print $0}' | wc -l

ADD REPLY • link 22 months ago by GenoMax 141k

0

Entering edit mode

I actually got my reference genome and annotation file from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195755.1#/qa and it contains 3343 genes. But featureCounts() is detecting 90 genes and I don't know how to fix that.

ADD REPLY • link 22 months ago by mathew • 0

0

Entering edit mode

I assume you used Download Assembly button and then got the sequence and GTF files? You could create a simple annotation format (SAF) file by choosing only those rows that say gene from annotation file and then choosing appropriate columns to make a file like this. Use your own data.

GeneID      Chr Start   End Strand
497097      chr1    3204563 3207049 -
497097      chr1    3411783 3411982 -
497097      chr1    3660633 3661579 -
100503874   chr1    3637390 3640590 -
100503874   chr1    3648928 3648985 -
100038431   chr1    3670236 3671869 -

featureCounts can use this file format. That should help fix the issue.

ADD REPLY • link 22 months ago by GenoMax 141k

0

Entering edit mode

In featureCount() the parameter strandSpecific=0 by default (unstranded). If you check the M&M:

Strand-specific cDNA libraries were constructed, and Illumina paired-end sequencing (HiSeq 2000; Illumina, Inc., San Diego, CA, USA) was performed (Macrogen, Seoul, South Korea) on the extracted RNA.

Try to set strandSpecific=1 instead.

Finally, no rRNA depletion is mentioned in the M&M, so it is possible that only a small percentage of reads will map against your CDS.

ADD REPLY • link 22 months ago by andres.firrincieli 3.6k

0

Entering edit mode

Thank you @andres.firrincieli, I set the strandSpecific=1 but the successfully assigned alignment percentage then reduced enter image description here

ADD REPLY • link 22 months ago by mathew • 0