Question: high % of reads mapped to multiple loci after STAR mapping
0
gravatar for Lila M
21 months ago by
Lila M 790
UK
Lila M 790 wrote:

Hi everybody, I have a set of RNA-seq (single end files). After mapping with STAR (himan genome)

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file1.fastq.gz --runThreadN 16 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

I have a huge % of reads mapped to multiple loci

                          Number of input reads |   49078760
                      Average input read length |   74
                                    UNIQUE READS:
                   Uniquely mapped reads number |   9806908
                        Uniquely mapped reads % |   19.98%
                          Average mapped length |   73.42
                       Number of splices: Total |   1022832
            Number of splices: Annotated (sjdb) |   985106
                       Number of splices: GT/AG |   1005417
                       Number of splices: GC/AG |   6850
                       Number of splices: AT/AC |   496
               Number of splices: Non-canonical |   10069
                      Mismatch rate per base, % |   0.94%
                         Deletion rate per base |   0.02%
                        Deletion average length |   2.33
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.22
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   36919747
             % of reads mapped to multiple loci |   75.23%
        Number of reads mapped to too many loci |   186914
             % of reads mapped to too many loci |   0.38%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   3.97%
                     % of reads unmapped: other |   0.45%

I would be grateful for any suggestions.

rna-seq single-end star • 3.4k views
ADD COMMENTlink modified 21 months ago by Hussain Ather940 • written 21 months ago by Lila M 790
2

First thing to check is if those are rRNA reads.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k

How can I check it?

ADD REPLYlink written 21 months ago by Lila M 790
1

Using BBDuk.

ADD REPLYlink written 21 months ago by h.mon27k
1

You can find entire sequence of human rDNA repeat here. Use it with bbduk.sh.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k
1

Was anything suspicious in the FastQC report? Have you checked for rRNA or other contamination?

For RNA-Seq, I'd use the ENCODE settings described in the STAR manual. This parameter set worked well for me in various experiments.

ADD REPLYlink written 21 months ago by michael.ante3.4k
2
gravatar for Hussain Ather
21 months ago by
Hussain Ather940
National Institutes of Health, Bethesda, MD
Hussain Ather940 wrote:

Reads that map to more than 10 loci are counted as mapping to too many loci. You can change --outFilterMultimapNmax to increase the threshold. Make sure to also increase winAnchorMultimapNmax if you increase it to greater than 50.

ADD COMMENTlink written 21 months ago by Hussain Ather940

And that may be too many to begin with. OP is not trying to make the multi-mapping go away :)

ADD REPLYlink written 21 months ago by genomax71k

I've added the options:

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file.fastq.gz  --runThreadN 16 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --alignIntronMax 500000 --alignMatesGapMax 1000000 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

But the results are the same (any other parameter to include?) ... so I will try with BBDuk...what options do I have if those are rRNA?

ADD REPLYlink modified 21 months ago • written 21 months ago by Lila M 790
2

You can to ignore those reads (they won't be counted by featureCounts/STAR, if an entry is not in your GTF and/or since they are multi-mapped). That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k
1

There are more than 20 (though I don't know how many more) rRNA positions on the assembled human genome. I got rRNA mappings when I set --outFilterMultimapNmax 200, though it is possible you will get them with less than 200. Did you check for rRNA contamination with BBDuk, anyway?

P.S.: I completely agree with genomax.

That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLYlink modified 21 months ago • written 21 months ago by h.mon27k

yes, I did

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

been human_ribosomal.fa

and I get:

Input:                      49078760 reads      3667028636 bases.
Contaminants:               38486994 reads (78.42%)     2895991336 bases (78.97%)
Total Removed:              38486994 reads (78.42%)     2895991336 bases (78.97%)
Result:                     10591766 reads (21.58%)     771037300 bases (21.03%)

So for sure they are rRNA right?

ADD REPLYlink written 21 months ago by Lila M 790

Ouch! Are all samples like this or just this one?

ADD REPLYlink written 21 months ago by genomax71k

All of them I think :( How bad is to work with ~10.000.000 reads?

ADD REPLYlink written 21 months ago by Lila M 790
1

No harm in giving it a try. Don't expect to find anything significant if the difference is expected to be subtle.

BTW: Were these samples not ribo-depleted or that process did not work well? Do all samples have similar level of rRNA?

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k

Yes, all of them have a similar contamination level because of the kit (I think). One more question, regarding the command line

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

which is better, working with the file.fastq.gz (for STAR/salmon) or use directly nonribo.fa to be sure that all the contamination has been removed?

ADD REPLYlink written 21 months ago by Lila M 790
1

Just to make sure. This isn't an experiment about rRNA correct? Otherwise we would be wrong.

If you have already done STAR alignments you could ignore rRNA counts since the alignment % more or less appears to match bbduk's evaluation. You could use nonribo.fa, if you were going to redo the alignments.

ADD REPLYlink written 21 months ago by genomax71k

Nop, is an experiment about nascent RNA :) . Thank you very much for all the information and all the tips that have been very very useful!!!

ADD REPLYlink written 21 months ago by Lila M 790

That may need to include rRNA. If you are just analyzing the data you may want to check with the folks who generated the data.

ADD REPLYlink written 21 months ago by genomax71k

Yes, they agreed about rRNA contamination as well....

ADD REPLYlink written 21 months ago by Lila M 790
1

Then you are all set for the next step.

ADD REPLYlink written 21 months ago by genomax71k

Regarding this answer, apparently STAR has not ignore the rRNA, because after generating the bam file and run samtools view Aligned.sortedByCoord.out.bam | cut -f1 |sort | uniq |wc -l I get a total of 44795631 that I assume are the mapped reads. So STAR is not ignoring at all the rRNA, rigth?

ADD REPLYlink written 21 months ago by Lila M 790

STAR outputs unmapped reads to the bam file, and your samtools view is showing all reads from the bam.

ADD REPLYlink written 21 months ago by h.mon27k

so how can I get only the mapped reads?

ADD REPLYlink written 21 months ago by Lila M 790
1

STAR appears to include unmapped reads in the BAM (unless --outReadsUnmapped option is used). So follow @h.mon's suggestions below to filter your files, if you need to separate the mapped reads.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k

I think so, because I've used the Genome sequence (GRCh38.p10) and Comprehensive gene annotation (gencode.v27.annotation.gtf)

ADD REPLYlink written 21 months ago by Lila M 790
1

See my edits to the post above.

You could remove the lines for rRNA from the GTF file when you do the counting with featureCounts or if you used STAR to get counts directly then the relevant lines for rRNA counts could be deleted from the count matrix.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k

I think that is easier if I use the non_ribosomal.fastq.gz output from bbduk to do the mapping and perform the downstream analysis.

ADD REPLYlink written 21 months ago by Lila M 790
1

That would work. Decontaminated datasets are much smaller so the alignments would complete faster. Consider using the option for unmapped reads if you don't want them in your alignments but they are not really doing any harm, as far as counting goes.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k

I've tried, but as I am using the whole genome, the rRNA should be included in the gtf because the size is exactly the same for both bam files.

ADD REPLYlink written 21 months ago by Lila M 790
1

If you use decontaminated files (non_ribo.fq.gz) then you should have only 25% of the original reads in them so the resulting BAM's should be smaller, even when unaligned reads are included. rRNA should not matter now since most reads would not align to it. Am I missing something?

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k

sorry, my mistake. I've run STAR with --outReadsUnmapped in the original fastq.gz file, and the result is the same I get when run STAR without that parameter, so this is why I'm thinking that the genome that I am using for mapping should include rRNA. I hope a much smaller bam file for non_ribo.fq.gz

ADD REPLYlink written 21 months ago by Lila M 790
1

That is not making sense in terms of size of the BAM. They should be smaller with decontaminated data. Did you get nothing in the unmapped files? That means most (all?) of your decontaminated data maps.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax71k
1

With samtools flags, filtering options (-f, -F and -G) and the mighty power of internet search engines:

Question: How To Filter Mapped Reads With Samtools

Question: How to extract all the mapped reads from bwa output

SAM and BAM filtering oneliners

Separating mapped and unmapped reads from libraries

ADD REPLYlink written 21 months ago by h.mon27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1670 users visited in the last hour