Question

RNA-Seq - Strand-specific coverage for a paired library

0

Entering edit mode

3 months ago

npb27 ▴ 20

Hi,

Apologies if it's an obvious question, I couldn't find a definitive answer

My aim is to quantify RNASeq coverage before/after independently determined termination sites

I have aligned my (stranded and paired end) RNSeq data to the reference with Bowtie2. I then split the resulting bam files by strand:

samtools view -bh -F 16 $file".bam" | samtools sort > Split_stranded/$file"_F.bam"
samtools view -bh -f 16 $file".bam" | samtools sort > Split_stranded/$file"_R.bam"

And generated a strand-specific depth table:

~/alice/NICK/Software/samtools/samtools depth -aH $files

My issue is that the apparent 3' termini from the RNASeq don't seem to match the independently determined termini. The transcripts seem to be split, with the 5' end covered by reads on the expected strand (e.g. + strand transcript covered by F reads) and the 3' end covered by reads on the other strand (e.g. + strand transcript covered by R reads).

I have read on several forums that for paired libraries, the two reads in a pair align to opposite strands, so am I then splitting the reads for a single transcript when I split the bam files by read?

How should I get proper coverage for the correct strand? Does read 1 align to the correct strand and read 2 to the opposite?

Should I then separate 4 files based on strand and read, and then merge like this:

F read 1 + R read 2
R read 1 and F read 2

Thanks!

Bowtie2 bam RNA-Seq alignment • 12k views

ADD COMMENT • link updated 12 weeks ago by swbarnes2 15k • written 3 months ago by npb27 ▴ 20

0

Entering edit mode

12 weeks ago

swbarnes2 15k

I then split the resulting bam files by strand:

Umm, why?

for paired libraries, the two reads in a pair align to opposite strands, so am I then splitting the reads for a single transcript when I split the bam files by read?

Yes.

What percentage of your reads have the "properly paired" 2 flag set? If that percentage is high, then its all fine.

ADD COMMENT • link 12 weeks ago by swbarnes2 15k

score 2 · Accepted Answer · 2025-08-22

I think you got the idea with the "4 separate files" and combining them first two and second two. A good point of reference is the IGV "first of pair strand" color scheme. In this case, the reads are paired, but the second of the pair adopts the same color as the first of pair strand. This ends up coloring "stranded paired-end RNA-seq" correctly.

Here is a little script I made with the help of google gemini that kind of recapitulates the same idea, and does the process in a single combined bash line

#!/bin/bash

# This script calculates strand-specific coverage from a BAM file,
# with a special rule: for reads that are NOT the first in a pair,
# their strand orientation is flipped for coverage calculation.
# It generates two output files in BEDGRAPH format:
# one for effective positive strand coverage and one for effective negative strand coverage.

# Check if samtools is installed
if ! command -v samtools &>/dev/null; then
    echo "samtools could not be found. Please ensure it is installed and in your PATH."
    exit 1
fi

# Check for correct number of arguments
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <input_bam_file>"
    echo "Example: $0 alignment.bam"
    exit 1
fi

INPUT_BAM="$1"

# Check if the input BAM file exists
if [ ! -f "$INPUT_BAM" ]; then
    echo "Error: Input BAM file '$INPUT_BAM' not found."
    exit 1
fi

# Derive output file prefix from the input BAM file name
# This removes the .bam or .sam extension to create a base name for output files.
FILE_PREFIX=$(basename "$INPUT_BAM")
FILE_PREFIX="${FILE_PREFIX%.bam}" # Remove .bam extension
FILE_PREFIX="${FILE_PREFIX%.sam}" # Remove .sam extension (just in case)

OUTPUT_POS_STRAND_COV="${FILE_PREFIX}_effective_positive_strand_coverage.bedgraph"
OUTPUT_NEG_STRAND_COV="${FILE_PREFIX}_effective_negative_strand_coverage.bedgraph"

echo "Processing BAM file: $INPUT_BAM"
echo "Output for effective positive strand coverage will be: $OUTPUT_POS_STRAND_COV"
echo "Output for effective negative strand coverage will be: $OUTPUT_NEG_STRAND_COV"
echo ""

# --- Calculate coverage for "Effective Positive Strand" reads ---
# Effective Positive Strand reads include:
# 1. Reads that are FIRST IN PAIR (-f 0x40) AND NOT REVERSE COMPLEMENTED (-F 0x10) (i.e., forward strand)
# 2. Reads that are NOT FIRST IN PAIR (-F 0x40) AND ARE REVERSE COMPLEMENTED (-f 0x10) (i.e., flipped reverse strand)
echo "Calculating effective positive strand coverage and formatting to bedGraph..."
{
    samtools view -b -f 0x40 -F 0x10 "$INPUT_BAM"
    samtools view -b -F 0x40 -f 0x10 "$INPUT_BAM"
} | samtools depth - | awk '{print $1"\t"$2-1"\t"$2"\t"$3}' >"$OUTPUT_POS_STRAND_COV"

# Check the exit status of the previous command
if [ $? -eq 0 ]; then
    echo "Successfully generated effective positive strand coverage file: $OUTPUT_POS_STRAND_COV"
else
    echo "Error: Failed to generate effective positive strand coverage for '$INPUT_BAM'."
fi
echo ""

# --- Calculate coverage for "Effective Negative Strand" reads ---
# Effective Negative Strand reads include:
# 1. Reads that are FIRST IN PAIR (-f 0x40) AND ARE REVERSE COMPLEMENTED (-f 0x10) (i.e., reverse strand)
# 2. Reads that are NOT FIRST IN PAIR (-F 0x40) AND NOT REVERSE COMPLEMENTED (-F 0x10) (i.e., flipped forward strand)
echo "Calculating effective negative strand coverage and formatting to bedGraph..."
{
    samtools view -b -f 0x40 -f 0x10 "$INPUT_BAM"
    samtools view -b -F 0x40 -F 0x10 "$INPUT_BAM"
} | samtools depth - | awk '{print $1"\t"$2-1"\t"$2"\t"$3}' >"$OUTPUT_NEG_STRAND_COV"

# Check the exit status of the previous command
if [ $? -eq 0 ]; then
    echo "Successfully generated effective negative strand coverage file: $OUTPUT_NEG_STRAND_COV"
else
    echo "Error: Failed to generate effective negative strand coverage for '$INPUT_BAM'."
fi
echo ""

echo "Script finished."