Question: short RNA reads alignment
0
gravatar for debitboro
12 months ago by
debitboro110
Belgium
debitboro110 wrote:

Hi biostars,

I've short RNA reads (18-47 nt). I want to perform the alignment, for that I've two questions: 1- Can I use STAR to perform the alignment ? if no, what alignment program do you recommend me ? 2- Shall I align the reads against the reference genome (for example Homo_sapiens.GRCh38.dna.primary_assembly.fa), or It is recommended to include other databases for miRNAs (miRBase) for example ?

Thanks in advance

short reads rna-seq alignment • 753 views
ADD COMMENTlink modified 12 months ago • written 12 months ago by debitboro110

Segemehl is also a good option, since you can specify the minimum length of the spliced fragment (-Z option) in your case and also increase the accuracy (-A) and decrease evalues for split alignments (-w).

ADD REPLYlink modified 12 months ago • written 12 months ago by Rohit1.3k

Hi all,

Thank you for your replies. Finally I've tried STAR and Bowtie (not Bowtie2).

1- I've used STAR by setting the more important parameters as follows:

--outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4  --outFilterMatchNmin 15

Here is the final log file content generated by STAR:

                       Number of input reads |       5845995
                  Average input read length |       27
                                UNIQUE READS:
               Uniquely mapped reads number |       446299
                    Uniquely mapped reads % |       7.63%
                      Average mapped length |       25.10
                   Number of splices: Total |       4043
        Number of splices: Annotated (sjdb) |       622
                   Number of splices: GT/AG |       3930
                   Number of splices: GC/AG |       108
                   Number of splices: AT/AC |       5
           Number of splices: Non-canonical |       0
                  Mismatch rate per base, % |       1.74%
                     Deletion rate per base |       0.08%
                    Deletion average length |       1.36
                    Insertion rate per base |       0.04%
                   Insertion average length |       1.07
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       4346149
         % of reads mapped to multiple loci |       74.34%
    Number of reads mapped to too many loci |       963965
         % of reads mapped to too many loci |       16.49%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |       0.00%
             % of reads unmapped: too short |       0.31%
                 % of reads unmapped: other |       1.22%
                              CHIMERIC READS:
                   Number of chimeric reads |       1503
                        % of chimeric reads |       0.03%

According to this, I got a very low rate of uniquely mapped reads (I used the parameter --outFilterMultimapNmax 10).

How can I improve the uniquely mapped reads rate ?

2 - I've also used Bowtie to map the same sample to the same reference genome. Bowtie didn't output the number of uniquely mapped reads and the number of multimapped reads as STAR did. I checked the log file generated by Bowtie:

    # reads processed: 5845995
    # reads with at least one reported alignment: 4206928 (71.96%)
    # reads that failed to align: 1639067 (28.04%)
    Reported 98304317 alignments to 1 output stream(s)

I want to output the number of uniquely mapped reads and multimapped reads from the sam file generated by bowtie. So I have done the following:

samtools view -Sb  myfile.sam > myfile.bam 
samtools view -F 4 myfile.bam | grep -v "XS:" | wc -l

But it didn't work for me, and I got a wrong number of uniquely mapped reads. I know that those commands work for files generated by bowtie2.

Do you have any solution able to retrieve the number of uniquely mapped, and multimapped reads from the sam file generated by bowtie ?

Thanks a lot

ADD REPLYlink written 12 months ago by debitboro110
1
gravatar for Antonio R. Franco
12 months ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

The database you align with depends on your interest. Choose the best suitable for your purpose since you can map virtually with any sequences you want In relation to size, take a look to this thread. You can do it and even change some settings for your interest

ADD COMMENTlink written 12 months ago by Antonio R. Franco4.0k
1
gravatar for Buffo
12 months ago by
Buffo1.5k
Buffo1.5k wrote:

For miRNA-seq reads I would recommend Bowtie (Not bowtie2) because it is more sensitive mapping short reads.

ADD COMMENTlink written 12 months ago by Buffo1.5k
0
gravatar for genomax
12 months ago by
genomax65k
United States
genomax65k wrote:

For small RNA reads you want to use a program that does ungapped alignments. You could use bbmap.sh from BBMap suite with following options ambig=all vslow perfectmode maxsites=1000. If you are not interested in identifying new RNA's then you could align to miRBase.

ADD COMMENTlink modified 12 months ago • written 12 months ago by genomax65k

Hi genomax,

Thank you for your suggestion, I've used BBMap with the values of parameters as you suggested, and I got the following results (I used another sample different from the one used for STAR alignment, but both of them are generated using the same protocol):

   ------------------   Results   ------------------

  Genome:                 1
  Key Length:             13
  Max Indel:              0
  Minimum Score Ratio:    1.0
  Mapping Mode:           perfect
  Reads Used:             4935550 (134428109 bases)

  Mapping:                979.437 seconds.
  Reads/sec:              5039.17
  kBases/sec:             137.25


  Read 1 data:            pct reads       num reads       pct bases          num bases

  mapped:                  54.1572%         2672955        50.0876%           67331779
  unambiguous:              4.5721%          225659         5.6734%            7626610
  ambiguous:               49.5851%         2447296        44.4142%           59705169
  low-Q discards:           0.0335%            1652         0.0394%              52966

  perfect best site:       54.1572%         2672955        50.0876%           67331779
  semiperfect site:        54.1572%         2672955        50.0876%           67331779

  Match Rate:                   NA               NA       100.0000%           67331779
  Error Rate:               0.0000%               0         0.0000%                  0
  Sub Rate:                 0.0000%               0         0.0000%                  0
  Del Rate:                 0.0000%               0         0.0000%                  0
  Ins Rate:                 0.0000%               0         0.0000%                  0
  N Rate:                   0.0000%               0         0.0000%                  0


  Total time:             2169.841 seconds.

The multimapped (ambiguous) reads are still present with high rate of 49.5851%, and only 4.5721% of uniquely mapped reads !!!

ADD REPLYlink modified 11 months ago • written 11 months ago by debitboro110

What database did you align to? Are you sure your reads are completely free of extraneous sequence?

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax65k

I mapped my reads against Human genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa). Did you mean my reads are contaminated by rRNA sequences ?

ADD REPLYlink written 11 months ago by debitboro110

I suggest that you align to miRBase. You can download miRBase sequences here. Remember to change the U to T after you download miRBase data before using BBMap to align.

Depending on type of small RNA kit used you may need to follow kit specific recommendations to remove adapters and other extraneous sequence. Or did you not use a special kit for small RNA prep?

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax65k

Even, I'm working on mirnas. And when I align the mirna reads to the mirbase mature file by SHRiMP, the mapping percentage is very low (2.77%). Is this because the miRBase mature file contains U instead of T? If yes, then how can we solve this?

ADD REPLYlink modified 6 months ago • written 6 months ago by glady240

Change the U's in miRBase fasta by doing something like sed 's/U/T/g' miRBase.fa > new.fa. Then reindex and remap. Make sure you also remove any adpater sequences from your input data.

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax65k

Thank you. I have another question, what would be better? aligning the mirna reads to mature.fa or to hairpin.fa? And to get the read counts which file should we use in HTSeq? Since, HtSeq requires an gtf file, while the file from mirbase is a gff file.

ADD REPLYlink modified 6 months ago • written 6 months ago by glady240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1696 users visited in the last hour