Question: Why I have too many reads non uniquely aligned?
0
gravatar for Angel
17 days ago by
Angel3.5k
Angel3.5k wrote:

Hi

By this code I have obtained my raw read counts from RNA-seq

Alignment by STAR

STAR --genomeDir ./STAR_hg38_Genome --readFilesIn $fastq1 $fastq2

Sorting sam

samtools view -h -o aligned.sam sort.bam

Bam 2 Sam

samtools view -h -o aligned.sam ligned.sorted.bam

GeneCount

htseq-count --stranded=no -q aligned.sam ./gtf > counts.txt

But I have this results; Too many reads non uniquely mapped

__no_feature    427159
__ambiguous 1250237
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  10648052

I tried to summarized raw read counts by Featurecounts I saw rate of summarization was low

Load annotation file Homo_sapiens.GRCh38.dna.primary_assembly.gtf ...      ||
||    Features : 1262162                                                      ||
||    Meta-features : 58735                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process SAM file aligned.sam...                                            ||
||    Paired-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 47361817                                             ||
||    Successfully assigned alignments : 23092843 (48.8%)                     ||
||    Running time : 1.34 minutes

Should I worry about htseq results? You think problem comes from where? When I tried to sort sam by picard I saw less alignment_not_unique, Does it make any sense?

rna-seq alignment • 144 views
ADD COMMENTlink modified 17 days ago by boaty90 • written 17 days ago by Angel3.5k

What is the read length?

ADD REPLYlink written 17 days ago by ATpoint23k

Sorry how I should know the read length? Yes my ultimate goal is differential expression analysis

ADD REPLYlink written 17 days ago by Angel3.5k

Just do head your.fastq and count how long the read in line 1 is. If the fastq is compressed do zcat your.fq.gz | head

ADD REPLYlink written 17 days ago by ATpoint23k

Thank you

[fi1d18@cyan01 ~]$ head ./1.fastq
@NB501007:194:HNKMVBGXB:1:11101:20881:1030 1:N:0:GTCCGC
GGGGCNGTGTTGCCATTATTAACTCTTTTGCTTTCATTTAATGCTCTCGCCCTGCCGCCGCCGGTGCTCCGTC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:11582:1032 1:N:0:GTCCGC
GTTGGNGGCCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATCCTGCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:20558:1032 1:N:0:GTCCGC
CCGCANGTGGTGGTTCTCCCCGAGGTCGTGGTTGTGCAGGTCAATGAGGGCCTGGACCGCCTCCTCCACGGAGCC
[fi1d18@cyan01 ~]$
ADD REPLYlink written 17 days ago by Angel3.5k

Ok, and the reverse read, so 2.fastq?

ADD REPLYlink written 17 days ago by ATpoint23k
[fi1d18@cyan01 ~]$ head ./2.fastq
@NB501007:194:HNKMVBGXB:1:11101:20881:1030 2:N:0:GTCCGC
NACGGAGCACCGGCGGCGGCAGGGCGAGAGCATTAAATGAAAGCAAAAGAGTTAATAATGGCAACACGGCCCC
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:11582:1032 2:N:0:GTCCGC
NAAGGCCTTCGATACGGGATAATCCTATTTATTACCTCAGAAGTTTTTTTCTTCGCAGGATTTTTCTGAGCCTTT
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB501007:194:HNKMVBGXB:1:11101:20558:1032 2:N:0:GTCCGC
NCTGTTTTCCAGCAATGGGGGCGTCGTCAAAGGATTCAAGTTCTTCCAGAAGGACCGCAAGATGGCACTGATCC
[fi1d18@cyan01 ~]$
ADD REPLYlink written 17 days ago by Angel3.5k
2

Ok, so normal 2x75bp. I was asking because overly short reads like 2x25 often lead to multimappers but in this case it probably comes down to what swbarnes2 says, having rRNA or similar contaminations. I would not worry too much about this. Proceed with the normal analysis, use something like PCA to check for potential batch effects (DESeq2 vignette has good manual on how to do that) and then see if results are reasonable after DEG.

ADD REPLYlink modified 17 days ago • written 17 days ago by ATpoint23k

For dif. gene expression analysis (I assume you do it because you count the reads), multimappers are usually not counted, though having high proportion of multimapped reads is not desired. What is the read length of your data? STAR has several options (e.g. --outFilterMismatchNmax) to restrict the number of mismatches to the reference, which can influence the mapping rates.

ADD REPLYlink written 17 days ago by grant.hovhannisyan1.7k

Additionally, the overlap resolution mode from htseq can influence your final read count stats, chech the docs to see which mode better fits your data.

ADD REPLYlink written 17 days ago by leaodel100
3
gravatar for swbarnes2
17 days ago by
swbarnes26.5k
United States
swbarnes26.5k wrote:

Reads that align to rRNA are typically multimappers. Your sample might for some reason be enriched for those kinds of reads. This would be a sample type/library prep issue.

RSEM is a program like feature counts that works on STAR output ( in this case, you have to tell STAR to make a transcriptome output bam) that is smarter about proportionately assigning multimappers to genes.

ADD COMMENTlink written 17 days ago by swbarnes26.5k

Thank you so much both; Sorry @swbarnes2 what you mean by transcriptome output bam? Should I add additional command to STAR alignment code for that?

You think if I just use feature counts instead, the problem would solve?

ADD REPLYlink modified 17 days ago • written 17 days ago by Angel3.5k

6 Output in transcript coordinates. With --quantMode TranscriptomeSAM option STAR will outputs alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file (in addition to alignments in genomic coordinates in Aligned.*.sam/bam files). These transcriptomic alignments can be used with various transcript quantification software that require reads to be mapped to transcriptome, such as RSEM or eXpress. For example, RSEM command line would look as follows: rsem-calculate-expression ... --bam Aligned.toTranscriptome.out.bam /path/to/RSEM/reference RSEM

ADD REPLYlink written 17 days ago by swbarnes26.5k
2
gravatar for boaty
17 days ago by
boaty90
boaty90 wrote:

Hi,

I think it's fine, depends on what kind of analysis you want to perform? Here the main question is, do you have enough read depth or statistical power to do a GDE?

you have 23M reads out of aligner and 10M of them are not unique so 13M unique reads for analysis. I think it's OK for GDE.

RNA-seq normally has issue with rRNA, so we treat them with ribo-zero kit. And one of the important indicator for a good RNA-seq is ratio of number of rRNA over number of total RNA. this is called annotation-based quality control.

Even your ribo-zero faild, having 90% of rRNA overall, but you have > 10M mapped unique reads, still you can get something out of water in GDE

ADD COMMENTlink modified 17 days ago • written 17 days ago by boaty90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1883 users visited in the last hour