Today we have run into the task of having to split strand specific RNA-Seq data by strand and we had to make an effort to get it right (hopefully it is right). Maybe there is even an easier way to do it.
The Illumina strand specific protocol is such that only the fragment of the original orientation is sequenced. But as it happens the reverse complement of it will actually be sequenced and is placed into file
R1. The mate pair that goes into file
R2 will be the reverse complement of the fragment starting in
R1 and thus its orientation will correspond to the original orientation. This makes the protocol produce a
R2-R1 read orientation.
Don't forget to perform a sanity check. The total coverage should be equal to the sum of coverages for the split data.
Here is our script that splits a bam file into two
rev.bam each containing the alignments that are in the transcriptional direction. Run int as:
sh split.sh input.bam
Shell script using samtools:
set -ue # Get the bam file from the command line DATA=$1 # Forward strand. # # 1. alignments of the second in pair if they map to the forward strand # 2. alignments of the first in pair if they map to the reverse strand # samtools view -b -f 128 -F 16 $DATA > fwd1.bam samtools index fwd1.bam samtools view -b -f 80 $DATA > fwd2.bam samtools index fwd2.bam # # Combine alignments that originate on the forward strand. # samtools merge -f fwd.bam fwd1.bam fwd2.bam samtools index fwd.bam # Reverse strand # # 1. alignments of the second in pair if they map to the reverse strand # 2. alignments of the first in pair if they map to the forward strand # samtools view -b -f 144 $DATA > rev1.bam samtools index rev1.bam samtools view -b -f 64 -F 16 $DATA > rev2.bam samtools index rev2.bam # # Combine alignments that originate on the reverse strand. # samtools merge -f rev.bam rev1.bam rev2.bam samtools index rev.bam
If keeping full pairs is essential then the filtering steps need to be performed to operate on mates like so:
# # 1. alignments of the second in pair if they map to the forward strand # 2. alignments of the first in pair if their mate maps to the forward strand # samtools view -b -f 128 -F 16 $DATA > fwd1.bam samtools index fwd1.bam samtools view -b -f 64 -F 32 $DATA > fwd2.bam samtools index fwd2.bam
# 1. alignments of the second in pair if it maps to the reverse strand # 2. alignments of the first in pair if their mates map to the reverse strand # samtools view -b -f 144 $DATA > rev1.bam samtools index rev1.bam samtools view -b -f 96 $DATA > rev2.bam samtools index rev2.bam
Hi Istvan Albert! Just to tell u a big THANK YOU for the amazing script. It works wonders :) Looked around everywhere for this..until I read yours. The antisense reads obtained now also correlate to the RSeQC infer_experiment.py script that determines strandedness. so thats a good proof check too. Thanks!
glad to hear it helped - it is one of those things that should not be this complicated.
Hi Albert, I wanted to check with you if this script for seperating forward and reverse will also work on RNA sequencing done using dUTP method for strand specificity? Or is it only when we use different adaptors as means for strand specificity (like the one described in the ScriptSeq protocol)? Thanks!
I am not sufficiently familiar with the alternative library prep for various protocols. The separation in the script above applies only when the the second in pair (read 2) is oriented in the original transcript direction.
When the 1 st pair is oriented in the original transcript direction the flags need to be changed. The logic is the same only you need to select the exact opposite.
Hey..ur script worked for my data where the protocol I had used for RNAseq was called scriptseq - which uses adaptors. Is that what you mean by "when the the second in pair (read 2) is oriented in the original transcript direction."? Sorry, am a little confused now!
Hello! As mentioned above, the strand specific script worked really well and I have used that for my work. I am happy to inform you that we are writing a paper on the data analysis soon, so wanted to check with you how should I cite you? As in do you have this script published any where or could I mark this URL in references? Thanks :)
That won't be necessary - think of it as a book that you've learned certain basic skills from rather than a source of a scientific insight.
Hi Istavan, was looking at your script and wondering if it will account for the second read and change its orientation to map on the same strand as the first read so as to get the complete directional fidelity of the illumina strand-specific RNA-seq data with pair-end reads? Maybe some background to my question is that, i extracted stranded reads from my bam files, but it appeared that i was getting only picking 2 of my pair-end reads as single reads on each strand thus ended up with almost the same coverage plot for both strands..any suggestions on how to best get a true representation of each strand for my illumina pair-end reads?
what gets sequenced is a DNA that comes from a template so you will necessarily get (almost) equal amounts of reads for both strands. What the strand specific protocol tells you is what was the original orientation of the template. The read that comes from file 2 tells the story of what was the original orientation and the data in file 1 is the reverse complement of the other end of the template.
Thanx Istvan, that really clarified a lot of things for me
I was working on this script. I divided my reads inot fwd and reverse strand reads using your code. When I visualized it using IGV, the reads of the reverse strand (i.e from file rev.bam) had an orientation F1R2 and the reads from the forward strand (fwd.bam) had an orientation F2R1. But some literature say that F2R1 is created from the antisense strand and F1R2 is created from sense strand. Can you please clear this confusion
it depends on the library preparation step. Over the course of the library prep there are typically many steps of duplicating and degrading the single stranded molecules. Depending on how many steps there are the sequenced product may be the original sequence or the reverse complement of it and thus the actual 5' end will be placed into file1 or file2. The typical Illumina protocol is the one described above. Here are more details:
Comprehensive comparative analysis of strand-specific RNA sequencing methods, Nature Methods, 2010
Nice, this is what I want. by the way, it is fit for dUTP strand-specific RNA-seq.
Here is another problem when I create Bedgraph files using these BAM files:
(1) what if the length of reads (read1 + read2) shorter than the insert seq? what is your choice for the gap that not sequenced that will not shown in the bedgraph?
(2) what if the length of reads (read1 + read2) longer that the insert seq? the overlap region of the paired reads will double counted, is that OK?
in general it is not advisable to ask a new question in a comment because most people won't notice that question. When you have paired end reads and you count them you'd be always be double counting regardless of the insert size -> one fragment leads to two counts. If the coverage is reasonable uniform the size of the insert should not matter as it should uniformly cover all regions.
Hi just found this script and wanted to say thanks. It works fine and saved me a lot of trouble :)
Can it be used for alignment file generated from stranded single-end read data (ion proton)?
I am not entirely sure how the protocol works for ion-proton. It would only be applicable if the sequenced read comes from the reverse complement of the transcript.
Wondering if this or a similar approach can be taken for lectures coming from a stranded (dUTP) library..
Hi Istvan. I am facing a task of separating sense from antisense transcripts and this post looks very helpful. Could you please clarify for me whether the forward and reverse strands are relative to a chromosome or they actually means sense and antisense strands? I am getting confused..Thank you very much and I look forward to your reply!
Hi, I'm new to bioinformatics so I'm trying to understand how the SAM flags work in this script. My original SAM file has flags: 99, 147, 163, and 83. But this script uses flags: 128, 80, 144, and 64. Can someone help me understand the difference between these flags? Would this script still separate my BAM file into forward and reverse strands with the flags it lists or should I edit the script to reflect my flags (99, 147, 163, 83)? Thank you for any information and help you can provide about these questions so I can understand this better!
Any odd flag has the byte 1 on because paired end reads were used in the mapping
Obviously any even flag number means you were mapping single reads
Thanks much! It helped!