R1 R2 orientation (RNA-seq)
1
2
Entering edit mode
9.8 years ago
biotech ▴ 570

Hi,

Will always R2 file contain sense transcript and R1 reverse complement of transcript?

If R2 file contains sense transcript, I will have to run HTSeq.scripts.count with "reverse" setting to quantify sense transcription. Quite confusing, isn't it?

Library creation kit -> E7420S NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina®

Thanks, Bernardo

P.S. Here is the pipeline:

set -ue
#################################################
#BWA test aligner with paired-end data
#################################################

# Get the genome file from the command line
genome_file=$1
# Get the fastq file from the command line
fastq_file_R1=$2
# Get the fastq file from the command line
fastq_file_R2=$3
# Get the fastq file from the command line
fastq_file_R3=$4
# Get the fastq file from the command line
fastq_file_R4=$5
#get gff
GFF=$6

#BWA default settings
bwa index $genome_file
#BWA input
bwa mem -t 8 $genome_file $fastq_file_R1 $fastq_file_R2 | gzip -3 > P_S1_L001_aln-pe.sam.gz
bwa mem -t 8 $genome_file $fastq_file_R3 $fastq_file_R4 | gzip -3 > V_S1_L001_aln-pe.sam.gz

################################################################
#Flagstat
################################################################
#Convert .sam to .bam to input to Flagstat
samtools view -b -S -o P_S1_L001_aln-pe.bam P_S1_L001_aln-pe.sam.gz
samtools flagstat P_S1_L001_aln-pe.bam

samtools view -b -S -o V_S1_L001_aln-pe.bam V_S1_L001_aln-pe.sam.gz
samtools flagstat V_S1_L001_aln-pe.bam

################################################################
#Count reads mapped with htseq-count
################################################################

samtools sort -n V_S1_L001_aln-pe.bam invivo.sorted
python -m HTSeq.scripts.count -m intersection-nonempty -f bam -a 10 -t mRNA -i Parent -r name -s reverse invivo.sorted.bam $GFF | awk 'n>=5 { print a[n%5] } { a[n++%5]=$0 }' > invivo_R.counts

samtools sort -n P_S1_L001_aln-pe.bam plate.sorted
python -m HTSeq.scripts.count -m intersection-nonempty -f bam -a 10 -t mRNA -i Parent -r name -s reverse plate.sorted.bam $GFF | awk 'n>=5 { print a[n%5] } { a[n++%5]=$0 }' > plate_R.counts


################################################################
#IGV
################################################################
#samtools sort -n sorts by name, not coordinate.. index requires sort by coordinate..
samtools sort V_S1_L001_aln-pe.bam V_S1_L001_aln-pe.bam.sorted
samtools index V_S1_L001_aln-pe.bam.sorted.bam

samtools sort P_S1_L001_aln-pe.bam P_S1_L001_aln-pe.bam.sorted
samtools index P_S1_L001_aln-pe.bam.sorted.bam

RNA-seq HTSeq pipeline bash • 7.4k views
ADD COMMENT
0
Entering edit mode

A file should never contain 3'->5' nucleotide sequence. Whether read #1 or read #2 will dictate the originating strand when a directional kit is used will depend on the kit. When in doubt, just look at a couple genes in IGV or another browser. It should be immediately obvious which which read dictates strand.

ADD REPLY
0
Entering edit mode

Thanks for your answer Devon, the question has been reformulated. You answered to the first question of my post. I performed the IGV inspection step and it's clear that R2 contains sense transcript. I will have to answer how library was prepared to confirm this IGV test.

ADD REPLY
3
Entering edit mode
9.8 years ago

I believe the "how" is answered on this website (1st Figure under product information tab):

https://www.neb.com/products/e7420-nebnext-ultra-directional-rna-library-prep-kit-for-illumina

Seems like it is a dUTP protocol, which would mean you should be using "--library-type fr-firststrand" in TopHat:

http://ccb.jhu.edu/software/tophat/manual.shtml#toph

http://ccb.jhu.edu/software/tophat/faq.shtml#library_type

ADD COMMENT
0
Entering edit mode

I'm following BWA, samtools, HtSeq pipeline. R2 file would contain the sense strand. Right?

ADD REPLY
0
Entering edit mode

You can double-check with the vendor - in the picture, it appeared that dUTPs were being added and then degraded, meaning that only the original strand was sequenced. The notation is a bit confusing, but this means the reverse complement is the first Illumina sequence that you get from the sense template (meaning R1 is anti-sense and R2 is sense, matching what you expect).

ADD REPLY
0
Entering edit mode

I will double-check with the vendor but I'm 99% sure R2 is sense because of the larger number of reads counted. I added the pipeline I'm using. The most funny thing is that the technician who did the library told me: "Well, you only have to analyse data with TopHat, Cufflinks, etc and will tell you is is positive or not..." :-O

ADD REPLY

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6