STAR Unmapped reads
6.0 years ago

Hi all,

I am a newbie working on RNA-Seq analysis. I have paired-end samples prepared with the Illumina RNA Exome and Illumina TruSeq library protocols. I did the initial quality control and I don't see any adapter contamination. I trimmed the first 15 bases of each read and ran the alignment step with the STAR aligner. The samples are human, so I aligned against the hg38 reference genome.

But the alignment rate is low: I got around 64% uniquely mapped reads. I am not sure whether this is rRNA contamination, considering that the two library protocols lack an rRNA depletion step.

I was trying to output the unmapped reads using --outReadsUnmapped Fastx. The output I get with that option is unmapped.out.mate1 and unmapped.out.mate2. I am not sure whether these are SAM, BAM, or FASTQ files. From the manual I see that you get either FASTA or FASTQ files, but I just see unmapped.out.mate1 and unmapped.out.mate2. I am trying to run BLAST on these unmapped reads to see if anything matches. Could somebody help me convert the unmapped.out.mate1 file to a FASTQ file?

Thanks in advance.

Best, Prat

RNA-Seq alignment unmapped reads STAR • 6.2k views
ADD COMMENT

I trimmed first 15 random reads

Please don't do that. You are throwing away good data.

that the two library protocols lack the rRNA depletion step

Does the genome used to build the STAR index have the rDNA repeat in it?

Have you done a head -8 unmapped.out.mate1? That file may already be fastq reads.
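If you would rather check this from a script than by eye, here is a minimal sketch of the same idea as `head -8`: read the first few lines and look at the record markers. The function name `sniff_read_format` is made up for illustration, and the check is only a heuristic (a `@` header with a `+` on the third line suggests FASTQ, a `>` header suggests FASTA).

```python
from itertools import islice


def sniff_read_format(path, n_lines=8):
    """Guess whether a text file holds FASTQ or FASTA records.

    Heuristic only: inspects the first n_lines lines, like `head -8`.
    """
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in islice(handle, n_lines)]
    if not lines:
        return "empty"
    # FASTQ records are 4 lines: @header, sequence, +, qualities.
    if lines[0].startswith("@") and len(lines) >= 3 and lines[2].startswith("+"):
        return "fastq"
    # FASTA records start with a > header line.
    if lines[0].startswith(">"):
        return "fasta"
    return "unknown"
```

Calling `sniff_read_format("unmapped.out.mate1")` on the file from the question should return `"fastq"` if the input reads were FASTQ. In that case no conversion is needed; the file just lacks a `.fastq` extension.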

ADD REPLY

Hi Devon, here is my FastQC report. The input read length is 2 x 150 bp. I am unable to paste the FastQC image here. The mean quality value starts at 30 and rises to 40, and from there it is a flat line with no dips. The rise from 30 to 40 spans the first 10 bp, so I head-cropped those 10 bases.

ADD REPLY

Don't head crop, leave it as is.

ADD REPLY

Look at the first few lines of the unmapped files with head. Then you'll know if they're fastq or fasta or something else.

What was the overall alignment rate? Usually it's something like 98%, so there's no point in blasting the small percent of junk that didn't align.

ADD REPLY

I am pasting my STAR output here.

                                  Started job on |   4/22/18 3:57
                              Started mapping on |   4/22/18 3:58
                                     Finished on |   4/22/18 4:00
        Mapping speed, Million of reads per hour |   542.18

                           Number of input reads |   16867704
                       Average input read length |   209
                                   UNIQUE READS:
                    Uniquely mapped reads number |   10989950
                         Uniquely mapped reads % |   65.15%
                           Average mapped length |   210.66
                        Number of splices: Total |   9111815
             Number of splices: Annotated (sjdb) |   0
                        Number of splices: GT/AG |   9041706
                        Number of splices: GC/AG |   50166
                        Number of splices: AT/AC |   4470
                Number of splices: Non-canonical |   15473
                       Mismatch rate per base, % |   0.19%
                          Deletion rate per base |   0.00%
                         Deletion average length |   1.78
                         Insertion rate per base |   0.00%
                        Insertion average length |   1.42
                            MULTI-MAPPING READS:
         Number of reads mapped to multiple loci |   2526914
              % of reads mapped to multiple loci |   14.98%
         Number of reads mapped to too many loci |   276783
              % of reads mapped to too many loci |   1.64%
                                 UNMAPPED READS:
        % of reads unmapped: too many mismatches |   0.00%
                  % of reads unmapped: too short |   16.49%
                      % of reads unmapped: other |   1.73%
                                 CHIMERIC READS:
                        Number of chimeric reads |   0
                             % of chimeric reads |   0.00%

I was wondering whether the number of reads mapped to multiple loci and the % of reads unmapped: too short are high.
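To put those percentages side by side, here is a minimal sketch that parses the `key | value` lines of a STAR final log like the one pasted above and adds up the mapped and unmapped categories. The function names are made up for illustration, and the key strings are taken from this log; other STAR versions may print additional lines.

```python
def parse_star_final_log(text):
    """Turn the 'key | value' lines of a STAR final log into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Read a value such as '65.15%' as a float."""
    return float(stats[key].rstrip("%"))


def summarize_mapping(stats):
    """Add up the percentage categories reported in the log."""
    mapped = percent(stats, "Uniquely mapped reads %") + percent(
        stats, "% of reads mapped to multiple loci"
    )
    unmapped = sum(
        percent(stats, key)
        for key in (
            "% of reads unmapped: too many mismatches",
            "% of reads unmapped: too short",
            "% of reads unmapped: other",
        )
    )
    too_many = percent(stats, "% of reads mapped to too many loci")
    return {
        "mapped_pct": round(mapped, 2),
        "too_many_loci_pct": round(too_many, 2),
        "unmapped_pct": round(unmapped, 2),
    }
```

For the numbers in this log, unique plus multi-mapped reads come to 80.13%, reads mapped to too many loci to 1.64%, and the three unmapped categories to 18.22%, most of which is the 16.49% "too short" fraction.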

ADD REPLY

The multimapping rate is quite normal. The too-short rate seems high, though I expect that those are junk reads. We've had a few machines start spitting out low complexity sequence that will sort of align if you soft-clip it enough (but it's junk, so it's best not to). Have a look at a few reads and see what they look like. If you blast them and they turn out to be random sequencer junk then don't worry about it.
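As a rough way to "have a look at a few reads", here is a minimal sketch that takes the first reads from an unmapped FASTQ file, writes them as FASTA so they can be pasted into BLAST, and counts how many look low complexity. It assumes plain 4-line FASTQ records. The function names are made up for illustration, and the 0.8 single-base threshold is an arbitrary choice, not an established cutoff.

```python
from collections import Counter


def read_fastq(path, limit=None):
    """Yield (name, sequence) pairs from a 4-line-per-record FASTQ file."""
    with open(path) as handle:
        count = 0
        while limit is None or count < limit:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line, not needed here
            yield header[1:], seq  # drop the leading '@'
            count += 1


def top_base_fraction(seq):
    """Fraction of the read made up of its single most common base."""
    if not seq:
        return 0.0
    return Counter(seq.upper()).most_common(1)[0][1] / len(seq)


def write_fasta_sample(fastq_path, fasta_path, limit=100, low_complexity=0.8):
    """Write up to `limit` reads as FASTA; return (written, flagged_low_complexity)."""
    written = flagged = 0
    with open(fasta_path, "w") as out:
        for name, seq in read_fastq(fastq_path, limit):
            if top_base_fraction(seq) >= low_complexity:
                flagged += 1
            out.write(f">{name}\n{seq}\n")
            written += 1
    return written, flagged
```

For example, `write_fasta_sample("unmapped.out.mate1", "unmapped_sample.fa", limit=50)` would write the first 50 unmapped mate-1 reads to a FASTA file (the output name is hypothetical). If most of them are flagged, or BLAST returns nothing sensible, that supports the junk explanation.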

ADD REPLY

Thank you so much for your feedback and comments. I will try aligning my FASTQ files to the reference genome without head-cropping, to see whether that improves mapping while not losing good data. I have tried the head command on the out.mate1 files and they are FASTQ files. I will run BLAST on these FASTQ files and see whether they are random sequencer junk.

ADD REPLY

Not sure if you are going to be able to get a big improvement.

% of reads unmapped: too short |   16.49%

Reads that are not mapping are too short.

ADD REPLY

I am not sure how to deal with % of reads unmapped: too short | 16.49%. I am wondering if there is any parameter in the STAR aligner that I could use to reduce the % of unmapped reads.

ADD REPLY

There is such an option, but you should first check whether those reads are worth mapping.
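For reference, in STAR "too short" means the best alignment covered too small a fraction of the read, not that the read itself was short. The relevant filters are `--outFilterScoreMinOverLread` and `--outFilterMatchNminOverLread`, both 0.66 by default and normalized to the read length (the sum of both mates for paired-end data). Here is a minimal sketch that only builds the command line, so the relaxed setting can be compared with the default run. The helper name, paths, thread count, and any relaxed value you pass are placeholders, not recommendations.

```python
import shlex


def build_star_command(genome_dir, mate1, mate2, prefix,
                       min_over_lread=0.66, threads=8):
    """Return a STAR argument list; nothing is executed here.

    min_over_lread is applied to both --outFilterScoreMinOverLread and
    --outFilterMatchNminOverLread (STAR's default for each is 0.66).
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", mate1, mate2,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outReadsUnmapped", "Fastx",  # keep unmapped mates for inspection
        "--outFilterScoreMinOverLread", str(min_over_lread),
        "--outFilterMatchNminOverLread", str(min_over_lread),
    ]


if __name__ == "__main__":
    # Hypothetical paths; a lower threshold lets shorter partial alignments through.
    cmd = build_star_command("hg38_index", "s1_R1.fastq", "s1_R2.fastq",
                             "s1_relaxed_", min_over_lread=0.3)
    print(shlex.join(cmd))
```

Lowering the threshold will raise the mapped percentage almost by construction, so it is worth checking the extra alignments before trusting them.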

ADD REPLY

I am just wondering whether I could improve my mapping by being more liberal with the STAR parameters. But I understand that the quantity or quality of the RNA used might be causing this. I have samples processed using different library protocol kits (Illumina RNA Exome, Lexogen QuantSeq 3' sequencing). The STAR alignment rate for Lexogen QuantSeq was very low, just 45%, with % of reads unmapped: other | 27%. The input read length is 75 bp for QuantSeq and 150 bp for Illumina. My FastQC results look good, but the alignment rate is very low.

ADD REPLY

Ah Lexogen, that explains things. You're not going to get better then, those libraries produce a fair amount of junk sequence.

ADD REPLY
