Question: Strand specific protocol for bowtie2 alignment
2
gravatar for nalandaatmi
3.7 years ago by
nalandaatmi60
United States
nalandaatmi60 wrote:

Dear All,

I have a doubt, I am working on strand specific protocol (fr-firststrand). When aligning the reads using bowtie2, do I need to explicitly mention as fr-firststrand.

If I am not mentioning in my bowtie2 command, will it affect my alignment result?

From Bowtie2 manual:


 
--fr/--rf/--ff - The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if --fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints (-I and -X) are met, that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 and all other constraints are met, that too is valid. --rf likewise requires that an upstream mate1 be reverse-complemented and a downstream mate2 be forward-oriented. --ff requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's Paired-end Sequencing Assay).

 

rna-seq alignment bowtie2 rnaseq • 3.7k views
ADD COMMENTlink written 3.7 years ago by nalandaatmi60

Are you aligning against the transcriptome? If not, don't use bowtie2, use an aligner meant for RNAseq (hisat, STAR, etc.).

ADD REPLYlink written 3.7 years ago by Devon Ryan91k

Just I am aligning against human ribosomal RNA genes.

This is the bowtie2 output

27921333 reads; of these:
  27921333 (100.00%) were paired; of these:
    27671472 (99.11%) aligned concordantly 0 times
    239838 (0.86%) aligned concordantly exactly 1 time
    10023 (0.04%) aligned concordantly >1 times
    ----
    27671472 pairs aligned concordantly 0 times; of these:
      59349 (0.21%) aligned discordantly 1 time
    ----
    27612123 pairs aligned 0 times concordantly or discordantly; of these:
      55224246 mates make up the pairs; of these:
        55105160 (99.78%) aligned 0 times
        58050 (0.11%) aligned exactly 1 time
        61036 (0.11%) aligned >1 times
1.32% overall alignment rate

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by nalandaatmi60

Was your protocol only including rRNA or was it normal RNA Sequencing data?If it is meant for rRNA, then you need to really check your analysis for your alignment rate is really low... However, if you were performing polyA capture or rRNA reduction RNA Seq, then you should expect to have a low alignment rate on the rRNA.

ADD REPLYlink written 3.7 years ago by Sam2.4k

Actually we are using a new sequencing kit for RNAseq, so the kit is supposed to remove 28S, 18S, 5.8S and 5S rRNA (RNA28S, RNA18S, RNA5-8S and RNA5S genes) in addition to 12S and 16S mitochondrial rRNA (MT-RNR1 and MT-RNR2).

So what I have done is, I collected all the rRNA genes from ensembl (565 fasta sequences were in the ensembl fasta file). Then I have created the index files for rRNA fasta file. Finally using bowtie2, I have aligned my reads in fastq format against these rRNA index files. Above is the bowtie2 output I received after running the following command.

$ bowtie2 -N 0 -L 15 --fr -x Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/rRNA_index/rRNA_mtRNA -1 R1.fastq   -2 R2.fastq -S output.sam

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by nalandaatmi60
1

Depending on the species, the real rRNA genes may not be in the fasta file. This is the case, for example, in the mouse reference genome.

ADD REPLYlink written 3.7 years ago by Devon Ryan91k

Hi Devon,

As per your first comment, you suggested me to use hisat or STAR for RNAseq. I used TopHat. 

When I downloaded the rRNA genes from ensembl, I got 565 fasta sequences. Some of the gene associate name looks like below.

5_8S_rRNA rRNA
5_8S_rRNA rRNA……...
5S_rRNA rRNA
5S_rRNA rRNA
5S_rRNA rRNA……...
MT-RNR1 Mt_rRNA
MT-RNR2 Mt_rRNA
RNA5-8S5 rRNA
RNA5-8S5 rRNA……….
RNA5-8SP2 rRNA
RNA5-8SP3 rRNA
RNA5-8SP4 rRNA………..
RNA5S1 rRNA
RNA5S2 rRNA
RNA5S3 rRNA
RNA5S4 rRNA……..
RNA5SP523 rRNA

5S_rRNA - 28 sequences

5_8S_rRNA - 3 sequences

RNA5SP - 500 sequences

RNA5S   - 17 sequences (RNA5S1-17)

RNA5-8SP - 6 sequences

RNA5-8S5 - 8 sequences

Actually we are using a new sequencing kit for RNAseq, so the kit is supposed to remove 28S, 18S, 5.8S and 5S rRNA (RNA28S, RNA18S, RNA5-8S and RNA5S genes) in addition to 12S and 16S mitochondrial rRNA (MT-RNR1 and MT-RNR2). 

​Do I need to consider only 31 highlighted sequences or the entire list for alignment?

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by nalandaatmi60
1

You might want to keep the Mt_rRNA too (or at least one copy). You probably need to download the 48S sequence from NCBI, assuming that's appropriate for your organism.

ADD REPLYlink written 3.7 years ago by Devon Ryan91k

What will happen if you just align your reads to human genome reference using STAR and use HTSeq to count the read coverage at your target rRNA regions? Will you have better coverage? And will the coverage for your rRNA be better?

If the reads still doesn't align, there might be some problem with the protocol...

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Sam2.4k

Hi Sam,

First I carried out alignment of my entire reads against human genome reference from UCSC/hg19/Sequence/genome using Tophat. Then I used the accepted_hits.bam as input along with human GTF file for htseq_count to count the read coverage for all genes. From the list of ~24,000 genes, I couldn't find rRNA genes in that list. 

Then I checked biostar, I found some posts, where they suggested to collect all the rRNA genes from ensembl and create a bowtie2 index. Also they suggested to compare my reads to the rRNA_mtrRNA bowtie2 index created. The overall alignment rate from bowtie2 alignment provides the percentage of reads mapped to rRNA genes. So I thought 1.32% from above post is the percentage of reads mapped to rRNA genes.

But I dint use STAR tool. So do you want me to replace STAR and TopHat and repeat the process.

 

ADD REPLYlink written 3.7 years ago by nalandaatmi60
1

If that is what you have done, then no, you don't need to repeat it with STAR, that is just another aligner. 

What is your alignment rate for the TopHat alignment?

Instead of using the human GTF file, go to biomart and select Human genome. Under filter, there is a field call gene type where you can select the rRNA. That will help you to generate the gtf file for rRNA for download. Try to use that with HTSeq and see if you can get the desired results

Also, make sure the sequence is presented in your fasta file as Devon suggested

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Sam2.4k

Hi Sam,

Human Reads SampleA
rRNA & mtRNA genes   1.32%
TopHat-Left reads-Input 27,921,333  
TopHat-Left reads-Mapped 14,609,588 52.30%
TopHat-Left reads-Mapped-Multiple 1,399,754 9.60%
TopHat-Right reads-Input 27,921,333  
TopHat-Right reads-Mapped 12,000,823 43.00%
TopHat-Right reads-Mapped-Multiple 1,103,760 9.20%
TopHat- overall read mapping rate   47.70%
TopHat-Aligned Pairs 11,123,082  
TopHat-Aligned Pairs-Multiple 1,073,764 9.70%
TopHat-Aligned Pairs-Discordant 385,912 3.50%
TopHat-concordant pair alignment rate   38.50%

I am working on biomart to generate rRNAgenes.gtf file now. Kindly see my reply for Devon's post below. I have a doubt that, do I need to consider all 565 rRNA sequences or just 31 highlighted rRNA genes for building GTF?

 

ADD REPLYlink written 3.7 years ago by nalandaatmi60

Sorry about the late reply, was busy writing my thesis lately. Basically it all depends on what you want to test.

I try to read your whole post again and I think I am rather lost in what you really want to achieve.

At the beginning of your question, you were asking about the strand specificity from bowtie2, however, in the subsequent replies, you said you have done TopHat. So for TopHat, yes, if you have perform strand specific RNA Sequencing, specifying the strand will help you go a long way. If I remember correctly, the dUTP protocol will be fr for TopHat. 

Then in the comment, you said your protocol removes rRNA. If that is the case, your protocol is doing a great job in making sure most of your reads does not align to the rRNA. If you want the percentage of reads align to the rRNA merely as an index to the quality of the library, you can then use the unaligned read from your original alignment to the human genome and try to align that to the rRNA genome. In that case, you will want to consider most if not all the rRNA sequence and see how they align. To total number of reads aligned to the rRNA / total number of reads in your fastq X 100% will be the % reads to rRNA. 

Finally, from my experience, the GTF file does differ slightly from different source and I usually prefer the one from Ensembl because in the early version of HTSeq (I am not sure if they have fixed it), they are more compatible to the ensembl gtf. You can simply download the full gtf from Ensembl, I believe they do contain the rRNA regions e.g. "ENSG00000222952"

However, if that is not what you want, you might have to rephrase your question.

ADD REPLYlink written 3.6 years ago by Sam2.4k

Thanks Sam for getting back to me. all the best for your thesis.

Reason for doing TopHat:

First, I carried out TopHat analysis. Then the bam file from TopHat used along with GTF file in HTseq-count, to get the number of reads aligned to each gene. But, from Htseq-count I couldn't get the number of reads for rRNA genes (because human GTF file did not have annotation for rRNA genes)

Secondly from biostar posts, I come to know that I can download the rRNA genes from ensembl and aligned it with my reads using Bowtie2. 

As you mentioned for RNAseq the strand specificity is very important. Does Bowtie2 also requires it (---fr)?

In your second paragraph, you mentioned it correctly. I am trying to get the percentage of rRNA reads as an index to the quality of my library. I am learning NGS tools now,

1) From tophat output, I have a file called unmapped.bam. Do I need to use the reads from this unmapped.bam file? 

2) How do I extract the aligned reads from original fastq file?

3) If I need to align my unaligned reads to rRNA genomes, do I need to consider just the highlighted 31 rRNA sequences in addition to the 45S sequences mentioned by Devon

or 

4) Should I need to consider all 565 rRNA sequences list below?

5S_rRNA - 28 sequences

5_8S_rRNA - 3 sequences

RNA5SP - 500 sequences

RNA5S   - 17 sequences (RNA5S1-17)

RNA5-8SP - 6 sequences

RNA5-8S5 - 8 sequences

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by nalandaatmi60

1. I don't use bowtie2 for RNA Seq so I am not sure about that. However, if you are certain that your reads are stranded, there will be no harm for you to use the --fr option

2. Yes, the unmapped.bam should contain the reads that you want to align to the rRNA genome

3. The most straight forward way will be the get the name of the reads in the bam file and just extract them from the original fastq file. However, there might as well be more efficient methods, you might want to search biostar for more information

4. That depends what you want to do. If you want to ratio of all rRNA, then include everything, If you only want to know the alignment percentage to the specific rRNA, then just align to them. This all depends on your end goal

 

ADD REPLYlink written 3.6 years ago by Sam2.4k

Hi, I have used human hg19 version of genes.GTF file for my tophat alignment. But in biomart, I could select only GRch38.p3.

Does it impact my analysis?

After selecting GRCh38.p3 human dataset, under filter, I enabled Gene type then selected Mt_rRNA and rRNA. Finally under attributes, under Gene,I selected Ensembl Geneid, Ensembl Transcript id, Chromosome name, Gene start, Gene end, Strand, Associated gene name, Gene type. When I selected results tab, either I could download the csv or html file. 

How do I generate the GTF file for rRNA genes.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by nalandaatmi60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1517 users visited in the last hour