Question: STAR Unmapped reads
11 months ago by
prathyushareddy870 wrote:

Hi all,

I am a newbie working on RNA-Seq analysis. I have paired-end samples processed using the Illumina RNA Exome and Illumina TruSeq library protocols. I initially did the quality control and I don't see any adapter contamination. I trimmed the first 15 bases of each read and performed the alignment step using the STAR aligner. The samples are human, so I aligned against the hg38 reference genome. But the alignment rate is low: I got around 64% uniquely mapped reads. I am not sure if this is rRNA contamination, considering that the two library protocols lack the rRNA depletion step.

I tried to output the unmapped reads using --outReadsUnmapped Fastx. The output I get with this option is unmapped.out.mate1 and unmapped.out.mate2. I am not sure if these are SAM, BAM or FASTQ files. From the manual I see that you get either FASTA or FASTQ files, but all I see is unmapped.out.mate1 and unmapped.out.mate2, with no file extension. I want to run BLAST on these unmapped reads to see if anything matches. Could somebody help me with converting the unmapped.out.mate1 file to a FASTQ file?

Thanks in advance.

Best, Prat

ADD COMMENTlink written 11 months ago by prathyushareddy870

I trimmed the first 15 bases of each read

Please don't do that. You are throwing away good data.

that the two library protocols lack the rRNA depletion step

Does the genome used to make STAR index have rDNA repeat in it?

Have you done a head -8 unmapped.out.mate1? That file may already be fastq reads.
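A quick way to make that check explicit is to look at the first character of the file: FASTQ records start with `@` and FASTA records start with `>`. The sketch below is only an illustration; the function name `guess_seq_format` is made up here, and it assumes the file is uncompressed text.

```shell
# Guess the format of a sequence file from the first character of its first line.
# '@' suggests FASTQ, '>' suggests FASTA; anything else is reported as unknown.
# guess_seq_format is a hypothetical helper name, not part of STAR.
guess_seq_format() {
  first_char=$(head -n 1 "$1" | cut -c 1)
  case "$first_char" in
    "@") echo "fastq" ;;
    ">") echo "fasta" ;;
    *)   echo "unknown" ;;
  esac
}
# Usage: guess_seq_format unmapped.out.mate1
```

If it reports fastq, no conversion should be needed; copying or renaming the file with a .fastq extension is enough for tools that look at the file name.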

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax64k

Hi Devon, here is my FastQC report. The input reads are 2 x 150 bp. I am unable to paste the FastQC image here. The mean quality value starts at 30 and goes to 40, and it is a straight line from there without any plateaus. I see a plateau from 30 to 40 for the first 10 bp. So I have head-cropped the first 10 bases.

ADD REPLYlink written 11 months ago by prathyushareddy870

Don't head crop, leave it as is.

ADD REPLYlink written 11 months ago by Devon Ryan88k

prathyushareddy87 : See How to add images to a Biostars post

ADD REPLYlink written 11 months ago by genomax64k

Look at the first few lines of the unmapped files with head. Then you'll know if they're fastq or fasta or something else.

What was the overall alignment rate? Usually it's something like 98%, so there's no point in blasting the small percent of junk that didn't align.

ADD REPLYlink written 11 months ago by Devon Ryan88k

I am pasting my STAR output here.

                                 Started job on |   4/22/18 3:57
                             Started mapping on |   4/22/18 3:58
                                    Finished on |   4/22/18 4:00
       Mapping speed, Million of reads per hour |   542.18

                          Number of input reads |   16867704
                      Average input read length |   209
                                    UNIQUE READS:
                   Uniquely mapped reads number |   10989950
                        Uniquely mapped reads % |   65.15%
                          Average mapped length |   210.66
                       Number of splices: Total |   9111815
            Number of splices: Annotated (sjdb) |   0
                       Number of splices: GT/AG |   9041706
                       Number of splices: GC/AG |   50166
                       Number of splices: AT/AC |   4470
               Number of splices: Non-canonical |   15473
                      Mismatch rate per base, % |   0.19%
                         Deletion rate per base |   0.00%
                        Deletion average length |   1.78
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.42
                             MULTI-MAPPING READS:   
        Number of reads mapped to multiple loci |   2526914
             % of reads mapped to multiple loci |   14.98%
        Number of reads mapped to too many loci |   276783
             % of reads mapped to too many loci |   1.64%
                                  UNMAPPED READS:   
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   16.49%
                     % of reads unmapped: other |   1.73%
                                  CHIMERIC READS:   
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

I was wondering whether the number of reads mapped to multiple loci and the % of reads unmapped: too short are high.
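For the overall alignment rate, one common convention is to add the uniquely mapped and multi-mapped percentages; for the log above that gives 65.15% + 14.98% = 80.13%. The sketch below pulls the two values out of a STAR Log.final.out file with awk. The function name `star_mapped_percent` is hypothetical, and it assumes the label text matches the log shown above.

```shell
# Sum "Uniquely mapped reads %" and "% of reads mapped to multiple loci"
# from a STAR Log.final.out file. Reads mapped to too many loci are not counted.
# star_mapped_percent is a hypothetical helper name, not part of STAR.
star_mapped_percent() {
  awk -F'|' '
    /Uniquely mapped reads %/            { gsub(/[ \t%]/, "", $2); unique = $2 }
    /% of reads mapped to multiple loci/ { gsub(/[ \t%]/, "", $2); multi = $2 }
    END { printf "%.2f\n", unique + multi }
  ' "$1"
}
# Usage: star_mapped_percent Log.final.out
```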

ADD REPLYlink modified 11 months ago by Devon Ryan88k • written 11 months ago by prathyushareddy870

The multimapping rate is quite normal. The too-short rate seems high, though I expect that those are junk reads. We've had a few machines start spitting out low complexity sequence that will sort of align if you soft-clip it enough (but it's junk, so it's best not to). Have a look at a few reads and see what they look like. If you blast them and they turn out to be random sequencer junk then don't worry about it.
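To look at a few reads and BLAST them, the first handful of records can be converted from FASTQ to FASTA, which is the usual BLAST input format. This is a minimal sketch, assuming plain four-line FASTQ records; the function name `fastq_head_to_fasta` and the default of 10 reads are illustrative choices.

```shell
# Convert the first N records of a FASTQ file to FASTA on stdout.
# Assumes each record is exactly 4 lines (no wrapped sequence lines).
# fastq_head_to_fasta is a hypothetical helper name.
fastq_head_to_fasta() {
  fastq_file=$1
  n_reads=${2:-10}
  head -n $((n_reads * 4)) "$fastq_file" |
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }  # header: swap leading @ for >
         NR % 4 == 2 { print }'                   # sequence line, unchanged
}
# Usage: fastq_head_to_fasta unmapped.out.mate1 20 > unmapped_sample.fasta
```

The resulting FASTA is small enough to paste into the BLAST web interface.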

ADD REPLYlink written 11 months ago by Devon Ryan88k

Thank you so much for your feedback and comments. I will try aligning my FASTQ files to the reference genome without head-cropping, to see if that improves mapping and avoids losing good data. I have tried the head command on the out.mate1 files and they are FASTQ files. I will run BLAST on these reads and see whether they are random sequencer junk.

ADD REPLYlink written 11 months ago by prathyushareddy870

Not sure if you are going to be able to get a big improvement.

% of reads unmapped: too short |   16.49%

Reads that are not mapping are too short.

ADD REPLYlink written 11 months ago by genomax64k

I am not sure how to deal with % of reads unmapped: too short | 16.49%. I am wondering if there is any parameter in the STAR aligner that I could use to reduce the % of unmapped reads.

ADD REPLYlink written 11 months ago by prathyushareddy870

There is such an option, but you should first see if it's worthwhile to map those reads.
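The option is not named here. In STAR, "too short" means the aligned part of a read (or read pair) was too short relative to its length, and the parameters usually lowered for this are --outFilterScoreMinOverLread and --outFilterMatchNminOverLread, which both default to 0.66. The sketch below only prints a command for review rather than running it; the function name, the paths, the output prefix, and the 0.3 threshold are all placeholders, not recommendations.

```shell
# Print (do not run) a STAR command with relaxed "too short" filters.
# print_relaxed_star_cmd, the prefix relaxed_, and the 0.3 default are
# hypothetical placeholders for illustration only.
print_relaxed_star_cmd() {
  genome_dir=$1
  mate1=$2
  mate2=$3
  min_frac=${4:-0.3}
  printf '%s\n' "STAR --genomeDir $genome_dir --readFilesIn $mate1 $mate2 --outFilterScoreMinOverLread $min_frac --outFilterMatchNminOverLread $min_frac --outReadsUnmapped Fastx --outFileNamePrefix relaxed_"
}
# Usage: print_relaxed_star_cmd /path/to/star_index sample_R1.fastq sample_R2.fastq
```

Lowering these thresholds lets shorter partial alignments through, so it will also let more junk align.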

ADD REPLYlink written 11 months ago by Devon Ryan88k

I am just wondering whether I could improve my mapping by being more liberal with the STAR parameters. But I understand that the low rate might instead be caused by poor quantity or quality of the input RNA. I have samples processed using different library protocol kits (Illumina RNA Exome, Lexogen QuantSeq 3-prime sequencing). The STAR alignment rate for the Lexogen QuantSeq samples was very low, just 45%, with % of reads unmapped: other | 27%. The input read length is 75 bp for QuantSeq and 150 bp for Illumina. My FastQC results look good, but the alignment rate is very low.

ADD REPLYlink written 11 months ago by prathyushareddy870

Ah Lexogen, that explains things. You're not going to get better then, those libraries produce a fair amount of junk sequence.

ADD REPLYlink written 11 months ago by Devon Ryan88k