Tutorial: How To Separate Illumina Based Strand Specific Rna-Seq Alignments By Strand
32
gravatar for Istvan Albert
3.8 years ago by
Istvan Albert ♦♦ 74k
University Park, USA
Istvan Albert ♦♦ 74k wrote:

Today we have run into the task of having to split strand specific RNA-Seq data by strand and we had to make an effort to get it right (hopefully it is right). Maybe there is even an easier way to do it.

The Illumina strand specific protocol is such that only the fragment of the original orientation is sequenced. But as it happens the reverse complement of it will actually be sequenced and is placed into file R1. The mate pair that goes into file R2 will be the reverse complement of the fragment starting in R1 and thus its orientation will correspond to the original orientation. This makes the protocol produce a R2-R1 read orientation.

Don't forget to perform a sanity check. The total coverage should be equal to the sum of coverages for the split data.

Here is our script that splits a bam file into two fwd.bam and rev.bam each containing the alignments that are in the transcriptional direction. Run int as:

sh split.sh input.bam

Shell script using samtools:

set -ue

# Get the bam file from the command line
DATA=$1

# Forward strand.
#
# 1. alignments of the second in pair if they map to the forward strand
# 2. alignments of the first in pair if they map to the reverse  strand
#
samtools view -b -f 128 -F 16 $DATA > fwd1.bam
samtools index fwd1.bam

samtools view -b -f 80 $DATA > fwd2.bam
samtools index fwd2.bam

#
# Combine alignments that originate on the forward strand.
#
samtools merge -f fwd.bam fwd1.bam fwd2.bam
samtools index fwd.bam

# Reverse strand
#
# 1. alignments of the second in pair if they map to the reverse strand
# 2. alignments of the first in pair if they map to the forward strand
#
samtools view -b -f 144 $DATA > rev1.bam
samtools index rev1.bam

samtools view -b -f 64 -F 16 $DATA > rev2.bam
samtools index rev2.bam

#
# Combine alignments that originate on the reverse strand.
#
samtools merge -f rev.bam rev1.bam rev2.bam
samtools index rev.bam

If keeping full pairs is essential then the filtering steps need to be performed to operate on mates like so:

#
# 1. alignments of the second in pair if they map to the forward strand
# 2. alignments of the first in pair if their mate maps to the forward strand
#
samtools view -b -f 128 -F 16 $DATA > fwd1.bam
samtools index fwd1.bam

samtools view -b -f 64 -F 32 $DATA > fwd2.bam
samtools index fwd2.bam

and similarly:

# 1. alignments of the second in pair if it maps to the reverse strand
# 2. alignments of the first in pair if their mates map to the reverse strand
#
samtools view -b -f 144 $DATA > rev1.bam
samtools index rev1.bam

samtools view -b -f 96 $DATA > rev2.bam
samtools index rev2.bam
tutorial rna-seq • 11k views
ADD COMMENTlink modified 20 months ago by fizer20 • written 3.8 years ago by Istvan Albert ♦♦ 74k
1

Hi Istvan Albert! Just to tell u a big THANK YOU for the amazing script. It works wonders :) Looked around everywhere for this..until I read yours. The antisense reads obtained now also correlate to the RSeQC infer_experiment.py script that determines strandedness. so thats a good proof check too. Thanks!

ADD REPLYlink written 14 months ago by nalini150710

glad to hear it helped - it is one of those things that should not be this complicated.

ADD REPLYlink written 14 months ago by Istvan Albert ♦♦ 74k

Hi Albert, I wanted to check with you if this script for seperating forward and reverse will also work on RNA sequencing done using dUTP method for strand specificity? Or is it only when we use different adaptors as means for strand specificity (like the one described in the ScriptSeq protocol)? Thanks!

ADD REPLYlink written 14 months ago by nalini150710

I am not sufficiently familiar with the alternative library prep for various protocols. The separation in the script above applies only when the the second in pair (read 2) is oriented in the original transcript direction.

When the 1 st pair is oriented in the original transcript direction the flags need to be changed. The logic is the same only you need to select the exact opposite.

ADD REPLYlink written 14 months ago by Istvan Albert ♦♦ 74k

Hey..ur script worked for my data where the protocol I had used for RNAseq was called scriptseq - which uses adaptors. Is that what you mean by "when the the second in pair (read 2) is oriented in the original transcript direction."? Sorry, am a little confused now!

ADD REPLYlink written 14 months ago by nalini150710

Hello! As mentioned above, the strand specific script worked really well and I have used that for my work. I am happy to inform you that we are writing a paper on the data analysis soon, so wanted to check with you how should I cite you? As in do you have this script published any where or could I mark this URL in references? Thanks :)

ADD REPLYlink written 3 months ago by nalini150710
1

That won't be necessary - think of it as a book that you've learned certain basic skills from rather than a source of a scientific insight.

ADD REPLYlink written 3 months ago by Istvan Albert ♦♦ 74k

Hi Istavan, was looking at your script and wondering if it will account for the second read and change its orientation to map on the same strand as the first read so as to get the complete directional fidelity of the illumina strand-specific RNA-seq data with pair-end reads? Maybe some background to my question is that, i extracted stranded reads from my bam files, but it appeared that i was getting only picking 2 of my pair-end reads as single reads on each strand thus ended up with almost the same coverage plot for both strands..any suggestions on how to best get a true representation of each strand for my illumina pair-end reads?

Thanx

ADD REPLYlink modified 3.6 years ago by Istvan Albert ♦♦ 74k • written 3.6 years ago by stan50
1

what gets sequenced is a DNA that comes from a template so you will necessarily get (almost) equal amounts of reads for both strands.  What the strand specific protocol tells you is what was the original orientation of the template. The read that comes from file 2 tells the story of what was the original orientation and the data in file 1 is the reverse complement of the other end of the template.

ADD REPLYlink written 3.6 years ago by Istvan Albert ♦♦ 74k

Thanx Istvan, that really clarified a lot of things for me

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by stan50

Hey Istavan

I was working on this script. I divided my reads inot fwd and reverse strand reads using your code. When I visualized it using IGV, the reads of the reverse strand (i.e from file rev.bam) had an orientation F1R2 and the reads from the forward strand (fwd.bam) had an orientation F2R1. But some literature say that F2R1 is created from the antisense strand and F1R2 is created from sense strand. Can you please clear this confusion

 

ADD REPLYlink written 3.1 years ago by AriyurD0

it depends on the library preparation step. Over the course of the library prep there are typically many steps of duplicating and degrading the single stranded molecules. Depending on how many steps there are the sequenced product may be the original sequence or the reverse complement of it and thus the actual 5' end will be placed into file1 or file2. The typical Illumina protocol is the one described above. Here are more details:

Comprehensive comparative analysis of strand-specific RNA sequencing methods, Nature Methods, 2010
 

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Istvan Albert ♦♦ 74k

Nice, this is what I want. by the way, it is fit for dUTP strand-specific RNA-seq.

Here is another problem when I create Bedgraph files using these BAM files:

(1) what if the length of reads (read1 + read2) shorter than the insert seq? what is your choice for the gap that not sequenced that will not shown in the bedgraph?

(2) what if the length of reads (read1 + read2) longer that the insert seq? the overlap region of the paired reads will double counted, is that OK?

ADD REPLYlink written 2.5 years ago by wm90
1

in general it is not advisable to ask a new question in a comment because most people won't notice that question. When you have paired end reads and you count them you'd be always be double counting regardless of the insert size -> one fragment leads to two counts.  If the coverage is reasonable uniform the size of the insert should not matter as it should uniformly cover all regions.

ADD REPLYlink written 2.5 years ago by Istvan Albert ♦♦ 74k

Can it be used for alignment file generated from stranded single-end read data (ion proton)?

ADD REPLYlink written 20 months ago by fizer20

I am not entirely sure how the protocol works for ion-proton. It would only be applicable if the sequenced read comes from the reverse complement of the transcript.

ADD REPLYlink written 20 months ago by Istvan Albert ♦♦ 74k
0
gravatar for kenietz78
2.3 years ago by
kenietz780
kenietz780 wrote:

Hi just found this script and wanted to say thanks. It works fine and saved me a lot of trouble :)

ADD COMMENTlink written 2.3 years ago by kenietz780
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1355 users visited in the last hour