Question: high % of reads mapped to multiple loci after STAR mapping
1
gravatar for Lila M
3.0 years ago by
Lila M 860
UK
Lila M 860 wrote:

Hi everybody, I have a set of RNA-seq (single end files). After mapping with STAR (himan genome)

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file1.fastq.gz --runThreadN 16 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

I have a huge % of reads mapped to multiple loci

                          Number of input reads |   49078760
                      Average input read length |   74
                                    UNIQUE READS:
                   Uniquely mapped reads number |   9806908
                        Uniquely mapped reads % |   19.98%
                          Average mapped length |   73.42
                       Number of splices: Total |   1022832
            Number of splices: Annotated (sjdb) |   985106
                       Number of splices: GT/AG |   1005417
                       Number of splices: GC/AG |   6850
                       Number of splices: AT/AC |   496
               Number of splices: Non-canonical |   10069
                      Mismatch rate per base, % |   0.94%
                         Deletion rate per base |   0.02%
                        Deletion average length |   2.33
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.22
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   36919747
             % of reads mapped to multiple loci |   75.23%
        Number of reads mapped to too many loci |   186914
             % of reads mapped to too many loci |   0.38%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   3.97%
                     % of reads unmapped: other |   0.45%

I would be grateful for any suggestions.

rna-seq single-end star • 6.2k views
ADD COMMENTlink modified 3.0 years ago by Hussain Ather940 • written 3.0 years ago by Lila M 860
2

First thing to check is if those are rRNA reads.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k

How can I check it?

ADD REPLYlink written 3.0 years ago by Lila M 860
1

Using BBDuk.

ADD REPLYlink written 3.0 years ago by h.mon31k
1

You can find entire sequence of human rDNA repeat here. Use it with bbduk.sh.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k
1

Was anything suspicious in the FastQC report? Have you checked for rRNA or other contamination?

For RNA-Seq, I'd use the ENCODE settings described in the STAR manual. This parameter set worked well for me in various experiments.

ADD REPLYlink written 3.0 years ago by michael.ante3.6k
2
gravatar for Hussain Ather
3.0 years ago by
Hussain Ather940
National Institutes of Health, Bethesda, MD
Hussain Ather940 wrote:

Reads that map to more than 10 loci are counted as mapping to too many loci. You can change --outFilterMultimapNmax to increase the threshold. Make sure to also increase winAnchorMultimapNmax if you increase it to greater than 50.

ADD COMMENTlink written 3.0 years ago by Hussain Ather940

And that may be too many to begin with. OP is not trying to make the multi-mapping go away :)

ADD REPLYlink written 3.0 years ago by GenoMax92k

I've added the options:

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file.fastq.gz  --runThreadN 16 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --alignIntronMax 500000 --alignMatesGapMax 1000000 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

But the results are the same (any other parameter to include?) ... so I will try with BBDuk...what options do I have if those are rRNA?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Lila M 860
2

You can to ignore those reads (they won't be counted by featureCounts/STAR, if an entry is not in your GTF and/or since they are multi-mapped). That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k
1

There are more than 20 (though I don't know how many more) rRNA positions on the assembled human genome. I got rRNA mappings when I set --outFilterMultimapNmax 200, though it is possible you will get them with less than 200. Did you check for rRNA contamination with BBDuk, anyway?

P.S.: I completely agree with genomax.

That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by h.mon31k

yes, I did

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

been human_ribosomal.fa

and I get:

Input:                      49078760 reads      3667028636 bases.
Contaminants:               38486994 reads (78.42%)     2895991336 bases (78.97%)
Total Removed:              38486994 reads (78.42%)     2895991336 bases (78.97%)
Result:                     10591766 reads (21.58%)     771037300 bases (21.03%)

So for sure they are rRNA right?

ADD REPLYlink written 3.0 years ago by Lila M 860

Ouch! Are all samples like this or just this one?

ADD REPLYlink written 3.0 years ago by GenoMax92k

All of them I think :( How bad is to work with ~10.000.000 reads?

ADD REPLYlink written 3.0 years ago by Lila M 860
1

No harm in giving it a try. Don't expect to find anything significant if the difference is expected to be subtle.

BTW: Were these samples not ribo-depleted or that process did not work well? Do all samples have similar level of rRNA?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k

Yes, all of them have a similar contamination level because of the kit (I think). One more question, regarding the command line

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

which is better, working with the file.fastq.gz (for STAR/salmon) or use directly nonribo.fa to be sure that all the contamination has been removed?

ADD REPLYlink written 3.0 years ago by Lila M 860
1

Just to make sure. This isn't an experiment about rRNA correct? Otherwise we would be wrong.

If you have already done STAR alignments you could ignore rRNA counts since the alignment % more or less appears to match bbduk's evaluation. You could use nonribo.fa, if you were going to redo the alignments.

ADD REPLYlink written 3.0 years ago by GenoMax92k

Nop, is an experiment about nascent RNA :) . Thank you very much for all the information and all the tips that have been very very useful!!!

ADD REPLYlink written 3.0 years ago by Lila M 860

That may need to include rRNA. If you are just analyzing the data you may want to check with the folks who generated the data.

ADD REPLYlink written 3.0 years ago by GenoMax92k

Yes, they agreed about rRNA contamination as well....

ADD REPLYlink written 3.0 years ago by Lila M 860
1

Then you are all set for the next step.

ADD REPLYlink written 3.0 years ago by GenoMax92k

Regarding this answer, apparently STAR has not ignore the rRNA, because after generating the bam file and run samtools view Aligned.sortedByCoord.out.bam | cut -f1 |sort | uniq |wc -l I get a total of 44795631 that I assume are the mapped reads. So STAR is not ignoring at all the rRNA, rigth?

ADD REPLYlink written 3.0 years ago by Lila M 860

STAR outputs unmapped reads to the bam file, and your samtools view is showing all reads from the bam.

ADD REPLYlink written 3.0 years ago by h.mon31k

so how can I get only the mapped reads?

ADD REPLYlink written 3.0 years ago by Lila M 860
1

STAR appears to include unmapped reads in the BAM (unless --outReadsUnmapped option is used). So follow @h.mon's suggestions below to filter your files, if you need to separate the mapped reads.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k

I think so, because I've used the Genome sequence (GRCh38.p10) and Comprehensive gene annotation (gencode.v27.annotation.gtf)

ADD REPLYlink written 3.0 years ago by Lila M 860
1

See my edits to the post above.

You could remove the lines for rRNA from the GTF file when you do the counting with featureCounts or if you used STAR to get counts directly then the relevant lines for rRNA counts could be deleted from the count matrix.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k

I think that is easier if I use the non_ribosomal.fastq.gz output from bbduk to do the mapping and perform the downstream analysis.

ADD REPLYlink written 3.0 years ago by Lila M 860
1

That would work. Decontaminated datasets are much smaller so the alignments would complete faster. Consider using the option for unmapped reads if you don't want them in your alignments but they are not really doing any harm, as far as counting goes.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k

I've tried, but as I am using the whole genome, the rRNA should be included in the gtf because the size is exactly the same for both bam files.

ADD REPLYlink written 3.0 years ago by Lila M 860
1

If you use decontaminated files (non_ribo.fq.gz) then you should have only 25% of the original reads in them so the resulting BAM's should be smaller, even when unaligned reads are included. rRNA should not matter now since most reads would not align to it. Am I missing something?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k

sorry, my mistake. I've run STAR with --outReadsUnmapped in the original fastq.gz file, and the result is the same I get when run STAR without that parameter, so this is why I'm thinking that the genome that I am using for mapping should include rRNA. I hope a much smaller bam file for non_ribo.fq.gz

ADD REPLYlink written 3.0 years ago by Lila M 860
1

That is not making sense in terms of size of the BAM. They should be smaller with decontaminated data. Did you get nothing in the unmapped files? That means most (all?) of your decontaminated data maps.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by GenoMax92k
1

With samtools flags, filtering options (-f, -F and -G) and the mighty power of internet search engines:

Question: How To Filter Mapped Reads With Samtools

Question: How to extract all the mapped reads from bwa output

SAM and BAM filtering oneliners

Separating mapped and unmapped reads from libraries

ADD REPLYlink written 3.0 years ago by h.mon31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2267 users visited in the last hour