Question: high % of reads mapped to multiple loci after STAR mapping
0
gravatar for Lila M
16 months ago by
Lila M 470
UK
Lila M 470 wrote:

Hi everybody, I have a set of RNA-seq (single end files). After mapping with STAR (himan genome)

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file1.fastq.gz --runThreadN 16 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

I have a huge % of reads mapped to multiple loci

                          Number of input reads |   49078760
                      Average input read length |   74
                                    UNIQUE READS:
                   Uniquely mapped reads number |   9806908
                        Uniquely mapped reads % |   19.98%
                          Average mapped length |   73.42
                       Number of splices: Total |   1022832
            Number of splices: Annotated (sjdb) |   985106
                       Number of splices: GT/AG |   1005417
                       Number of splices: GC/AG |   6850
                       Number of splices: AT/AC |   496
               Number of splices: Non-canonical |   10069
                      Mismatch rate per base, % |   0.94%
                         Deletion rate per base |   0.02%
                        Deletion average length |   2.33
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.22
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   36919747
             % of reads mapped to multiple loci |   75.23%
        Number of reads mapped to too many loci |   186914
             % of reads mapped to too many loci |   0.38%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   3.97%
                     % of reads unmapped: other |   0.45%

I would be grateful for any suggestions.

rna-seq single-end star • 2.5k views
ADD COMMENTlink modified 16 months ago by Hussain Ather910 • written 16 months ago by Lila M 470
2

First thing to check is if those are rRNA reads.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k

How can I check it?

ADD REPLYlink written 16 months ago by Lila M 470
1

Using BBDuk.

ADD REPLYlink written 16 months ago by h.mon24k
1

You can find entire sequence of human rDNA repeat here. Use it with bbduk.sh.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k
1

Was anything suspicious in the FastQC report? Have you checked for rRNA or other contamination?

For RNA-Seq, I'd use the ENCODE settings described in the STAR manual. This parameter set worked well for me in various experiments.

ADD REPLYlink written 16 months ago by michael.ante3.2k
2
gravatar for Hussain Ather
16 months ago by
Hussain Ather910
National Institutes of Health, Bethesda, MD
Hussain Ather910 wrote:

Reads that map to more than 10 loci are counted as mapping to too many loci. You can change --outFilterMultimapNmax to increase the threshold. Make sure to also increase winAnchorMultimapNmax if you increase it to greater than 50.

ADD COMMENTlink written 16 months ago by Hussain Ather910

And that may be too many to begin with. OP is not trying to make the multi-mapping go away :)

ADD REPLYlink written 16 months ago by genomax65k

I've added the options:

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file.fastq.gz  --runThreadN 16 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --alignIntronMax 500000 --alignMatesGapMax 1000000 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

But the results are the same (any other parameter to include?) ... so I will try with BBDuk...what options do I have if those are rRNA?

ADD REPLYlink modified 16 months ago • written 16 months ago by Lila M 470
2

You can to ignore those reads (they won't be counted by featureCounts/STAR, if an entry is not in your GTF and/or since they are multi-mapped). That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k
1

There are more than 20 (though I don't know how many more) rRNA positions on the assembled human genome. I got rRNA mappings when I set --outFilterMultimapNmax 200, though it is possible you will get them with less than 200. Did you check for rRNA contamination with BBDuk, anyway?

P.S.: I completely agree with genomax.

That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLYlink modified 16 months ago • written 16 months ago by h.mon24k

yes, I did

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

been human_ribosomal.fa

and I get:

Input:                      49078760 reads      3667028636 bases.
Contaminants:               38486994 reads (78.42%)     2895991336 bases (78.97%)
Total Removed:              38486994 reads (78.42%)     2895991336 bases (78.97%)
Result:                     10591766 reads (21.58%)     771037300 bases (21.03%)

So for sure they are rRNA right?

ADD REPLYlink written 16 months ago by Lila M 470

Ouch! Are all samples like this or just this one?

ADD REPLYlink written 16 months ago by genomax65k

All of them I think :( How bad is to work with ~10.000.000 reads?

ADD REPLYlink written 16 months ago by Lila M 470
1

No harm in giving it a try. Don't expect to find anything significant if the difference is expected to be subtle.

BTW: Were these samples not ribo-depleted or that process did not work well? Do all samples have similar level of rRNA?

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k

Yes, all of them have a similar contamination level because of the kit (I think). One more question, regarding the command line

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

which is better, working with the file.fastq.gz (for STAR/salmon) or use directly nonribo.fa to be sure that all the contamination has been removed?

ADD REPLYlink written 16 months ago by Lila M 470
1

Just to make sure. This isn't an experiment about rRNA correct? Otherwise we would be wrong.

If you have already done STAR alignments you could ignore rRNA counts since the alignment % more or less appears to match bbduk's evaluation. You could use nonribo.fa, if you were going to redo the alignments.

ADD REPLYlink written 16 months ago by genomax65k

Nop, is an experiment about nascent RNA :) . Thank you very much for all the information and all the tips that have been very very useful!!!

ADD REPLYlink written 16 months ago by Lila M 470

That may need to include rRNA. If you are just analyzing the data you may want to check with the folks who generated the data.

ADD REPLYlink written 16 months ago by genomax65k

Yes, they agreed about rRNA contamination as well....

ADD REPLYlink written 16 months ago by Lila M 470
1

Then you are all set for the next step.

ADD REPLYlink written 16 months ago by genomax65k

Regarding this answer, apparently STAR has not ignore the rRNA, because after generating the bam file and run samtools view Aligned.sortedByCoord.out.bam | cut -f1 |sort | uniq |wc -l I get a total of 44795631 that I assume are the mapped reads. So STAR is not ignoring at all the rRNA, rigth?

ADD REPLYlink written 16 months ago by Lila M 470

STAR outputs unmapped reads to the bam file, and your samtools view is showing all reads from the bam.

ADD REPLYlink written 16 months ago by h.mon24k

so how can I get only the mapped reads?

ADD REPLYlink written 16 months ago by Lila M 470
1

STAR appears to include unmapped reads in the BAM (unless --outReadsUnmapped option is used). So follow @h.mon's suggestions below to filter your files, if you need to separate the mapped reads.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k

I think so, because I've used the Genome sequence (GRCh38.p10) and Comprehensive gene annotation (gencode.v27.annotation.gtf)

ADD REPLYlink written 16 months ago by Lila M 470
1

See my edits to the post above.

You could remove the lines for rRNA from the GTF file when you do the counting with featureCounts or if you used STAR to get counts directly then the relevant lines for rRNA counts could be deleted from the count matrix.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k

I think that is easier if I use the non_ribosomal.fastq.gz output from bbduk to do the mapping and perform the downstream analysis.

ADD REPLYlink written 16 months ago by Lila M 470
1

That would work. Decontaminated datasets are much smaller so the alignments would complete faster. Consider using the option for unmapped reads if you don't want them in your alignments but they are not really doing any harm, as far as counting goes.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k

I've tried, but as I am using the whole genome, the rRNA should be included in the gtf because the size is exactly the same for both bam files.

ADD REPLYlink written 16 months ago by Lila M 470
1

If you use decontaminated files (non_ribo.fq.gz) then you should have only 25% of the original reads in them so the resulting BAM's should be smaller, even when unaligned reads are included. rRNA should not matter now since most reads would not align to it. Am I missing something?

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k

sorry, my mistake. I've run STAR with --outReadsUnmapped in the original fastq.gz file, and the result is the same I get when run STAR without that parameter, so this is why I'm thinking that the genome that I am using for mapping should include rRNA. I hope a much smaller bam file for non_ribo.fq.gz

ADD REPLYlink written 16 months ago by Lila M 470
1

That is not making sense in terms of size of the BAM. They should be smaller with decontaminated data. Did you get nothing in the unmapped files? That means most (all?) of your decontaminated data maps.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax65k
1

With samtools flags, filtering options (-f, -F and -G) and the mighty power of internet search engines:

Question: How To Filter Mapped Reads With Samtools

Question: How to extract all the mapped reads from bwa output

SAM and BAM filtering oneliners

Separating mapped and unmapped reads from libraries

ADD REPLYlink written 16 months ago by h.mon24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1613 users visited in the last hour