Question: R1 R2 orientation (RNA-seq)
2
gravatar for biotech
5.2 years ago by
biotech540
United States
biotech540 wrote:

Hi,

Will always R2 file contain sense transcript and R1 reverse complement of transcript?

If R2 file contains sense transcript, I will have to run HTSeq.scripts.count with "reverse" setting to quantify sense transcription. Quite confusing, isn't it?

Library creation kit -> E7420S NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina®

Thanks, Bernardo

P.S. Here is the pipeline:

set -ue
#################################################
#BWA test aligner with paired-end data
#################################################

# Get the genome file from the command line
genome_file=$1
# Get the fastq file from the command line
fastq_file_R1=$2
# Get the fastq file from the command line
fastq_file_R2=$3
# Get the fastq file from the command line
fastq_file_R3=$4
# Get the fastq file from the command line
fastq_file_R4=$5
#get gff
GFF=$6

#BWA default settings
bwa index $genome_file
#BWA input
bwa mem -t 8 $genome_file $fastq_file_R1 $fastq_file_R2 | gzip -3 > P_S1_L001_aln-pe.sam.gz
bwa mem -t 8 $genome_file $fastq_file_R3 $fastq_file_R4 | gzip -3 > V_S1_L001_aln-pe.sam.gz

################################################################
#Flagstat
################################################################
#Convert .sam to .bam to input to Flagstat
samtools view -b -S -o P_S1_L001_aln-pe.bam P_S1_L001_aln-pe.sam.gz
samtools flagstat P_S1_L001_aln-pe.bam

samtools view -b -S -o V_S1_L001_aln-pe.bam V_S1_L001_aln-pe.sam.gz
samtools flagstat V_S1_L001_aln-pe.bam

################################################################
#Count reads mapped with htseq-count
################################################################

samtools sort -n V_S1_L001_aln-pe.bam invivo.sorted
python -m HTSeq.scripts.count -m intersection-nonempty -f bam -a 10 -t mRNA -i Parent -r name -s reverse invivo.sorted.bam $GFF | awk 'n>=5 { print a[n%5] } { a[n++%5]=$0 }' > invivo_R.counts

samtools sort -n P_S1_L001_aln-pe.bam plate.sorted
python -m HTSeq.scripts.count -m intersection-nonempty -f bam -a 10 -t mRNA -i Parent -r name -s reverse plate.sorted.bam $GFF | awk 'n>=5 { print a[n%5] } { a[n++%5]=$0 }' > plate_R.counts


################################################################
#IGV
################################################################
#samtools sort -n sorts by name, not coordinate.. index requires sort by coordinate..
samtools sort V_S1_L001_aln-pe.bam V_S1_L001_aln-pe.bam.sorted
samtools index V_S1_L001_aln-pe.bam.sorted.bam

samtools sort P_S1_L001_aln-pe.bam P_S1_L001_aln-pe.bam.sorted
samtools index P_S1_L001_aln-pe.bam.sorted.bam

bash rna-seq pipeline htseq • 5.0k views
ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by biotech540

A file should never contain 3'->5' nucleotide sequence. Whether read #1 or read #2 will dictate the originating strand when a directional kit is used will depend on the kit. When in doubt, just look at a couple genes in IGV or another browser. It should be immediately obvious which which read dictates strand.

ADD REPLYlink written 5.2 years ago by Devon Ryan91k

Thanks for your answer Devon, the question has been reformulated. You answered to the first question of my post. I performed the IGV inspection step and it's clear that R2 contains sense transcript. I will have to answer how library was prepared to confirm this IGV test.

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by biotech540
3
gravatar for Charles Warden
5.2 years ago by
Charles Warden7.2k
Duarte, CA
Charles Warden7.2k wrote:

I believe the "how" is answered on this website  (1st Figure under product information tab):

https://www.neb.com/products/e7420-nebnext-ultra-directional-rna-library-prep-kit-for-illumina

Seems like it is a dUTP protocol, which would mean you should be using "--library-type fr-firststrand" in TopHat:

http://ccb.jhu.edu/software/tophat/manual.shtml#toph

http://ccb.jhu.edu/software/tophat/faq.shtml#library_type

 

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Charles Warden7.2k

I'm following BWA, samtools, HtSeq pipeline. R2 file would contain the sense strand. Right?

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by biotech540

You can double-check with the vendor - in the picture, it appeared that dUTPs were being added and then degraded, meaning that only the original strand was sequenced.  The notation is a bit confusing, but this means the reverse complement is the first Illumina sequence that you get from the sense template (meaning R1 is anti-sense and R2 is sense, matching what you expect).

ADD REPLYlink written 5.2 years ago by Charles Warden7.2k

I will double-check with the vendor but I'm 99% sure R2 is sense because of the larger number of reads counted. I added the pipeline I'm using. The most funny thing is that the technician who did the library told me: "Well, you only have to analyse data with TopHat, Cufflinks, etc and will tell you is is positive or not..." :-O

ADD REPLYlink written 5.2 years ago by biotech540
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 808 users visited in the last hour