Question

short RNA reads alignment

0

Entering edit mode

6.0 years ago

debitboro ▴ 260

Hi biostars,

I've short RNA reads (18-47 nt). I want to perform the alignment, for that I've two questions: 1- Can I use STAR to perform the alignment ? if no, what alignment program do you recommend me ? 2- Shall I align the reads against the reference genome (for example Homo_sapiens.GRCh38.dna.primary_assembly.fa), or It is recommended to include other databases for miRNAs (miRBase) for example ?

Thanks in advance

RNA-Seq short reads alignment • 3.9k views

ADD COMMENT • link 6.0 years ago by debitboro ▴ 260

0

Entering edit mode

Segemehl is also a good option, since you can specify the minimum length of the spliced fragment (-Z option) in your case and also increase the accuracy (-A) and decrease evalues for split alignments (-w).

ADD REPLY • link 6.0 years ago by Rohit ★ 1.5k

0

Entering edit mode

Hi all,

Thank you for your replies. Finally I've tried STAR and Bowtie (not Bowtie2).

1- I've used STAR by setting the more important parameters as follows:

--outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4  --outFilterMatchNmin 15

Here is the final log file content generated by STAR:

                       Number of input reads |       5845995
                  Average input read length |       27
                                UNIQUE READS:
               Uniquely mapped reads number |       446299
                    Uniquely mapped reads % |       7.63%
                      Average mapped length |       25.10
                   Number of splices: Total |       4043
        Number of splices: Annotated (sjdb) |       622
                   Number of splices: GT/AG |       3930
                   Number of splices: GC/AG |       108
                   Number of splices: AT/AC |       5
           Number of splices: Non-canonical |       0
                  Mismatch rate per base, % |       1.74%
                     Deletion rate per base |       0.08%
                    Deletion average length |       1.36
                    Insertion rate per base |       0.04%
                   Insertion average length |       1.07
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       4346149
         % of reads mapped to multiple loci |       74.34%
    Number of reads mapped to too many loci |       963965
         % of reads mapped to too many loci |       16.49%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |       0.00%
             % of reads unmapped: too short |       0.31%
                 % of reads unmapped: other |       1.22%
                              CHIMERIC READS:
                   Number of chimeric reads |       1503
                        % of chimeric reads |       0.03%

According to this, I got a very low rate of uniquely mapped reads (I used the parameter --outFilterMultimapNmax 10).

How can I improve the uniquely mapped reads rate ?

2 - I've also used Bowtie to map the same sample to the same reference genome. Bowtie didn't output the number of uniquely mapped reads and the number of multimapped reads as STAR did. I checked the log file generated by Bowtie:

    # reads processed: 5845995
    # reads with at least one reported alignment: 4206928 (71.96%)
    # reads that failed to align: 1639067 (28.04%)
    Reported 98304317 alignments to 1 output stream(s)

I want to output the number of uniquely mapped reads and multimapped reads from the sam file generated by bowtie. So I have done the following:

samtools view -Sb  myfile.sam > myfile.bam 
samtools view -F 4 myfile.bam | grep -v "XS:" | wc -l

But it didn't work for me, and I got a wrong number of uniquely mapped reads. I know that those commands work for files generated by bowtie2.

Do you have any solution able to retrieve the number of uniquely mapped, and multimapped reads from the sam file generated by bowtie ?

Thanks a lot

ADD REPLY • link 6.0 years ago by debitboro ▴ 260

score 1 · Answer 1 · 2018-04-12

1

Entering edit mode

6.0 years ago

Antonio R. Franco ★ 5.1k

The database you align with depends on your interest. Choose the best suitable for your purpose since you can map virtually with any sequences you want In relation to size, take a look to this thread. You can do it and even change some settings for your interest

ADD COMMENT • link 6.0 years ago by Antonio R. Franco ★ 5.1k

score 1 · Answer 2 · 2018-04-12

1

Entering edit mode

6.0 years ago

Buffo ★ 2.4k

For miRNA-seq reads I would recommend Bowtie (Not bowtie2) because it is more sensitive mapping short reads.

ADD COMMENT • link 6.0 years ago by Buffo ★ 2.4k

score 0 · Answer 3 · 2018-04-12

0

Entering edit mode

6.0 years ago

GenoMax 141k

For small RNA reads you want to use a program that does ungapped alignments. You could use bbmap.sh from BBMap suite with following options ambig=all vslow perfectmode maxsites=1000. If you are not interested in identifying new RNA's then you could align to miRBase.

ADD COMMENT • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

Hi genomax,

Thank you for your suggestion, I've used BBMap with the values of parameters as you suggested, and I got the following results (I used another sample different from the one used for STAR alignment, but both of them are generated using the same protocol):

   ------------------   Results   ------------------

  Genome:                 1
  Key Length:             13
  Max Indel:              0
  Minimum Score Ratio:    1.0
  Mapping Mode:           perfect
  Reads Used:             4935550 (134428109 bases)

  Mapping:                979.437 seconds.
  Reads/sec:              5039.17
  kBases/sec:             137.25


  Read 1 data:            pct reads       num reads       pct bases          num bases

  mapped:                  54.1572%         2672955        50.0876%           67331779
  unambiguous:              4.5721%          225659         5.6734%            7626610
  ambiguous:               49.5851%         2447296        44.4142%           59705169
  low-Q discards:           0.0335%            1652         0.0394%              52966

  perfect best site:       54.1572%         2672955        50.0876%           67331779
  semiperfect site:        54.1572%         2672955        50.0876%           67331779

  Match Rate:                   NA               NA       100.0000%           67331779
  Error Rate:               0.0000%               0         0.0000%                  0
  Sub Rate:                 0.0000%               0         0.0000%                  0
  Del Rate:                 0.0000%               0         0.0000%                  0
  Ins Rate:                 0.0000%               0         0.0000%                  0
  N Rate:                   0.0000%               0         0.0000%                  0


  Total time:             2169.841 seconds.

The multimapped (ambiguous) reads are still present with high rate of 49.5851%, and only 4.5721% of uniquely mapped reads !!!

ADD REPLY • link 5.9 years ago by debitboro ▴ 260

0

Entering edit mode

What database did you align to? Are you sure your reads are completely free of extraneous sequence?

ADD REPLY • link 5.9 years ago by GenoMax 141k

0

Entering edit mode

I mapped my reads against Human genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa). Did you mean my reads are contaminated by rRNA sequences ?

ADD REPLY • link 5.9 years ago by debitboro ▴ 260

0

Entering edit mode

I suggest that you align to miRBase. You can download miRBase sequences here. Remember to change the U to T after you download miRBase data before using BBMap to align.

Depending on type of small RNA kit used you may need to follow kit specific recommendations to remove adapters and other extraneous sequence. Or did you not use a special kit for small RNA prep?

ADD REPLY • link 5.9 years ago by GenoMax 141k

0

Entering edit mode

Even, I'm working on mirnas. And when I align the mirna reads to the mirbase mature file by SHRiMP, the mapping percentage is very low (2.77%). Is this because the miRBase mature file contains U instead of T? If yes, then how can we solve this?

ADD REPLY • link 5.5 years ago by glady ▴ 320

0

Entering edit mode

Change the U's in miRBase fasta by doing something like sed 's/U/T/g' miRBase.fa > new.fa. Then reindex and remap. Make sure you also remove any adpater sequences from your input data.

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Entering edit mode

Thank you. I have another question, what would be better? aligning the mirna reads to mature.fa or to hairpin.fa? And to get the read counts which file should we use in HTSeq? Since, HtSeq requires an gtf file, while the file from mirbase is a gff file.

ADD REPLY • link 5.5 years ago by glady ▴ 320