Entering edit mode
4 weeks ago
rajdeepboral00
▴
70
The commands i used for my RNASeq bulk data analysis are
1. trim_galore --quality 30 --length 30 --cores 8 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --basename CC1 --paired --fastqc CC1_S16_L002_R1_001.fastq.gz CC1_S16_L002_R2_001.fastq.gz
2. STAR --runThreadN 8 --runMode genomeGenerate --genomeDir Index --genomeFastaFiles GRCm39.genome.fa --sjdbGTFfile gencode.vM37.basic.annotation.gtf --sjdbOverhang 149
3. STAR --genomeDir /Index --runThreadN 10 --readFilesIn CC1_val_1.fq.gz CC1_val_2.fq.gz --outFileNamePrefix CC1_ --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --quantMode GeneCounts
Using these commands i am getting only 37%mapped and others are not...37% is way too less a number...Maybe i am making an error somewhere, can anyone please suggest what more i can include to get a better and more mapped result.
that is a low number indeed, however it can also be reflecting your data quality (and thus not necessarily mistakes in your bioinfo part) . The commands you include look OK to me.
The biggest issue is that 'noFeature' category which means they are mapped but not assigned to a (gene) feature.
Can you asses your input data quality? Eg. (and most importantly) was the data rRNA depleted for instance? Is the RNAseq derived from the same species/strain , ...
Where this noFeature reads can align to then? How to assess the quality> yes the method for library prep itself included rRNA depletion...RNASeq derived from the same species means?
The mouse genome is only about 2% coding exons. 98% of the genome is sequence that is not present in a mature mRNA transcript. Most of your reads are aligning to this sequence.
The three most common causes of this would be:
1)I have checked for rRNA, but no such contamination was found. 2) & 3) Yes that could be the possibility though.
Thanks.
Hopefully, because there aren't any other explanations :)
just out of curiosity, and for completeness, can you provide the numbers that you used to get to the 'only 37% mapped' result?
42439450 these are the reads that got mapped.
that is the number when summing all the reads assigned to genes then? (== mapped & assigned)
what is the number of reads in the initial input file (fastq) ?
The initial fastq after trimming had 107653608 reads
Aha, I suspected something like this already :) :
as you have 107653608 input reads and only have 3098536 unmapped reads, you actually have a mapping rate of ~97%!! (far from the 37% you mentioned initially).
Yes, only have ~39% reads assigned , for explanations on that aspect see i.sudbery 's answer
Thank you for the reply