Strand specific protocol for bowtie2 alignment
0
2
Entering edit mode
7.1 years ago
nalandaatmi ▴ 90

Dear All,

I have a doubt, I am working on strand specific protocol (fr-firststrand). When aligning the reads using bowtie2, do I need to explicitly mention as fr-firststrand?

If I am not mentioning in my bowtie2 command, will it affect my alignment result?

From Bowtie2 manual:

--fr/--rf/--ff - The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if --fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints (-I option and -X option) are met, that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 and all other constraints are met, that too is valid. --rf likewise requires that an upstream mate1 be reverse-complemented and a downstream mate2 be forward-oriented. --ff requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's Paired-end Sequencing Assay).

bowtie2 rnaseq alignment RNA-Seq • 6.9k views
0
Entering edit mode

Are you aligning against the transcriptome? If not, don't use bowtie2, use an aligner meant for RNAseq (hisat, STAR, etc.).

0
Entering edit mode

Just I am aligning against human ribosomal RNA genes.

This is the bowtie2 output

27921333 reads; of these:
27921333 (100.00%) were paired; of these:
27671472 (99.11%) aligned concordantly 0 times
239838 (0.86%) aligned concordantly exactly 1 time
10023 (0.04%) aligned concordantly >1 times
----
27671472 pairs aligned concordantly 0 times; of these:
59349 (0.21%) aligned discordantly 1 time
----
27612123 pairs aligned 0 times concordantly or discordantly; of these:
55224246 mates make up the pairs; of these:
55105160 (99.78%) aligned 0 times
58050 (0.11%) aligned exactly 1 time
61036 (0.11%) aligned >1 times
1.32% overall alignment rate

0
Entering edit mode

Was your protocol only including rRNA or was it normal RNA Sequencing data?If it is meant for rRNA, then you need to really check your analysis for your alignment rate is really low... However, if you were performing polyA capture or rRNA reduction RNA Seq, then you should expect to have a low alignment rate on the rRNA.

0
Entering edit mode

Actually we are using a new sequencing kit for RNAseq, so the kit is supposed to remove 28S, 18S, 5.8S and 5S rRNA (RNA28S, RNA18S, RNA5-8S and RNA5S genes) in addition to 12S and 16S mitochondrial rRNA (MT-RNR1 and MT-RNR2).

So what I have done is, I collected all the rRNA genes from ensembl (565 fasta sequences were in the ensembl fasta file). Then I have created the index files for rRNA fasta file. Finally using bowtie2, I have aligned my reads in fastq format against these rRNA index files. Above is the bowtie2 output I received after running the following command.

\$ bowtie2 -N 0 -L 15 --fr -x Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/rRNA_index/rRNA_mtRNA -1 R1.fastq -2 R2.fastq -S output.sam
#                    ^^^^
#____________________||||

1
Entering edit mode

Depending on the species, the real rRNA genes may not be in the fasta file. This is the case, for example, in the mouse reference genome.

0
Entering edit mode

Hi Devon,

As per your first comment, you suggested me to use hisat or STAR for RNAseq. I used TopHat.

When I downloaded the rRNA genes from ensembl, I got 565 fasta sequences. Some of the gene associate name looks like below.

5_8S_rRNA   rRNA
5_8S_rRNA   rRNA……...
5S_rRNA rRNA
5S_rRNA rRNA
5S_rRNA rRNA……...
MT-RNR1 Mt_rRNA
MT-RNR2 Mt_rRNA
RNA5-8S5    rRNA
RNA5-8S5    rRNA……….
RNA5-8SP2   rRNA
RNA5-8SP3   rRNA
RNA5-8SP4   rRNA………..
RNA5S1  rRNA
RNA5S2  rRNA
RNA5S3  rRNA
RNA5S4  rRNA……..
RNA5SP523   rRNA
5S_rRNA - 28 sequences #<-- these
5_8S_rRNA - 3 sequences #<-- and these
RNA5SP - 500 sequences
RNA5S   - 17 sequences (RNA5S1-17)
RNA5-8SP - 6 sequences
RNA5-8S5 - 8 sequences


Actually we are using a new sequencing kit for RNAseq, so the kit is supposed to remove 28S, 18S, 5.8S and 5S rRNA (RNA28S, RNA18S, RNA5-8S and RNA5S genes) in addition to 12S and 16S mitochondrial rRNA (MT-RNR1 and MT-RNR2).

Do I need to consider only 31 highlighted sequences or the entire list for alignment?

1
Entering edit mode

You might want to keep the Mt_rRNA too (or at least one copy). You probably need to download the 48S sequence from NCBI, assuming that's appropriate for your organism.

0
Entering edit mode

What will happen if you just align your reads to human genome reference using STAR and use HTSeq to count the read coverage at your target rRNA regions? Will you have better coverage? And will the coverage for your rRNA be better?

If the reads still doesn't align, there might be some problem with the protocol...

0
Entering edit mode

Hi Sam,

First I carried out alignment of my entire reads against human genome reference from UCSC/hg19/Sequence/genome using Tophat. Then I used the accepted_hits.bam as input along with human GTF file for htseq_count to count the read coverage for all genes. From the list of ~24,000 genes, I couldn't find rRNA genes in that list.

Then I checked biostar, I found some posts, where they suggested to collect all the rRNA genes from ensembl and create a bowtie2 index. Also they suggested to compare my reads to the rRNA_mtrRNA bowtie2 index created. The overall alignment rate from bowtie2 alignment provides the percentage of reads mapped to rRNA genes. So I thought 1.32% from above post is the percentage of reads mapped to rRNA genes.

But I dint use STAR tool. So do you want me to replace STAR and TopHat and repeat the process.

1
Entering edit mode

If that is what you have done, then no, you don't need to repeat it with STAR, that is just another aligner.

What is your alignment rate for the TopHat alignment?

Instead of using the human GTF file, go to biomart and select Human genome. Under filter, there is a field call gene type where you can select the rRNA. That will help you to generate the gtf file for rRNA for download. Try to use that with HTSeq and see if you can get the desired results

Also, make sure the sequence is presented in your fasta file as Devon suggested

0
Entering edit mode

Hi Sam,

Human                                    Reads            SampleA
rRNA & mtRNA genes                                        1.32%     <----
TopHat- overall read mapping rate                         47.70%    <----
TopHat-Aligned Pairs                     11,123,082
TopHat-Aligned Pairs-Multiple            1,073,764        9.70%
TopHat-Aligned Pairs-Discordant          385,912          3.50%
TopHat-concordant pair alignment rate                     38.50%    <----


I am working on biomart to generate rRNAgenes.gtf file now. Kindly see my reply for Devon's post below. I have a doubt that, do I need to consider all 565 rRNA sequences or just 31 highlighted rRNA genes for building GTF?

0
Entering edit mode

Sorry about the late reply, was busy writing my thesis lately. Basically it all depends on what you want to test.

I try to read your whole post again and I think I am rather lost in what you really want to achieve.

At the beginning of your question, you were asking about the strand specificity from bowtie2, however, in the subsequent replies, you said you have done TopHat. So for TopHat, yes, if you have perform strand specific RNA Sequencing, specifying the strand will help you go a long way. If I remember correctly, the dUTP protocol will be fr for TopHat.

Finally, from my experience, the GTF file does differ slightly from different source and I usually prefer the one from Ensembl because in the early version of HTSeq (I am not sure if they have fixed it), they are more compatible to the ensembl gtf. You can simply download the full gtf from Ensembl, I believe they do contain the rRNA regions e.g. "ENSG00000222952"

However, if that is not what you want, you might have to rephrase your question.

0
Entering edit mode

Thanks Sam for getting back to me. all the best for your thesis.

Reason for doing TopHat:

First, I carried out TopHat analysis. Then the bam file from TopHat used along with GTF file in HTseq-count, to get the number of reads aligned to each gene. But, from Htseq-count I couldn't get the number of reads for rRNA genes (because human GTF file did not have annotation for rRNA genes).

Secondly from biostar posts, I come to know that I can download the rRNA genes from ensembl and aligned it with my reads using Bowtie2.

As you mentioned for RNAseq the strand specificity is very important. Does Bowtie2 also requires it (--fr)?

In your second paragraph, you mentioned it correctly. I am trying to get the percentage of rRNA reads as an index to the quality of my library. I am learning NGS tools now,

1. From tophat output, I have a file called unmapped.bam. Do I need to use the reads from this unmapped.bam file?
2. How do I extract the aligned reads from original fastq file?
3. If I need to align my unaligned reads to rRNA genomes, do I need to consider just the highlighted 31 rRNA sequences in addition to the 45S sequences mentioned by Devon?

or

1. Should I need to consider all 565 rRNA sequences list below?

5S_rRNA - 28 sequences
5_8S_rRNA - 3 sequences
RNA5SP - 500 sequences
RNA5S   - 17 sequences (RNA5S1-17)
RNA5-8SP - 6 sequences
RNA5-8S5 - 8 sequences

0
Entering edit mode
1. I don't use bowtie2 for RNA Seq so I am not sure about that. However, if you are certain that your reads are stranded, there will be no harm for you to use the --fr option
2. Yes, the unmapped.bam should contain the reads that you want to align to the rRNA genome
3. The most straight forward way will be the get the name of the reads in the bam file and just extract them from the original fastq file. However, there might as well be more efficient methods, you might want to search biostar for more information
4. That depends what you want to do. If you want to ratio of all rRNA, then include everything, If you only want to know the alignment percentage to the specific rRNA, then just align to them. This all depends on your end goal
0
Entering edit mode

Hi, I have used human hg19 version of genes.GTF file for my tophat alignment. But in biomart, I could select only GRch38.p3.

Does it impact my analysis?

After selecting GRCh38.p3 human dataset, under filter, I enabled Gene type then selected Mt_rRNA and rRNA. Finally under attributes, under Gene,I selected Ensembl Geneid, Ensembl Transcript id, Chromosome name, Gene start, Gene end, Strand, Associated gene name, Gene type. When I selected results tab, either I could download the csv or html file.

How do I generate the GTF file for rRNA genes.