1
1
Entering edit mode
5.6 years ago
Genosa ▴ 150

Sorry if this may appear like a really dumb question, but I am new to RNAseq and feeling lost about some of the terminologies and calculations used in literature.

I understand what's the difference between the principle behind paired-end and single-end reads, however I need to clarify the method to calculate number of reads in paired end vs. single end reads.

For single end reads, I was told that I could calculate the no. of reads by: wc -l <name of="" fastq="" file="">, divided by 4.

For paired end reads, because 2 files are generated (R1, R2), so is it correct that I calculate the number of reads by adding the read count derived from R1 and R2?

Thank you very much for your clarification.

RNA-Seq • 6.3k views
6
Entering edit mode
5.6 years ago

In paired-end sequencing its often less confusing to talk about fragment numbers rather than the more ambiguous read numbers. So count the number of reads in R1. That's the number of fragments sequenced (it's also the number of reads in R2). If you really prefer the words "reads", you could use "we sequenced XXX read pairs", so it's clear what's meant.

1
Entering edit mode

Hi Devon,

Thank you very much for your reply. Just for clarification, I calculated the number of fragments sequenced in the fastQ file for R1 and R2 for one sample and derived this: R1 - 499385 fragments. R2 - 487958 fragments. Because there's a difference between the no. of fragments sequenced for R1 and R2, so how should I correctly describe the number of fragments sequenced? Thank you very much again.

1
Entering edit mode

You apparently screwed up the read syncing between the files at some point, likely during trimming. You'll need to resync the files. The reformat.sh tool from BBMap is convenient for this.

1
Entering edit mode

Small correction. @Genosa: Tool from BBMap you would want is repair.sh.

repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq


bbduk.sh is also a great option for trimming (and it is PE aware so this sort of situation would not happen).

0
Entering edit mode

Thank you very much. I'll go and take a look at BBMap and try to fix it. What do you mean by error in 'read syncing' and what is the reason why it was introduced in the first place? I got the files back from the sequencing service and went through all the samples. So apparently, ALL the samples R1/R2 have very different fragment counts and I am not sure why this is happening. If I do not run reformat.sh, will this adversely affect the subsequent QC/mapping/Alignment/Differential Gene Expression analysis? Thank you!

0
Entering edit mode

The Nth read in each of the files should originate from the same cluster. This was the case when the files were originally created and pretty much all aligners assume that this is true and will give non-sensical results if you, for example, delete a read in only one file. Having the Nth read in each file arising from the same cluster (btw, you can tell if this is the case because the read names are more or less the same) is referred to as "the reads being in sync."

The most common way of getting paired-end files out of sync is by using a trimmer that processes only one fastq file at a time (e.g., fastx tools). If you use something like "Trim Galore!" or "trimmomatic" then this shouldn't happen.

0
Entering edit mode

Hi Ryan, I am not sure if I've done it correctly but I'm having a problem running the bbmap program. I have zero computer commands knowledge and I'm self-teaching as I'm working along. So I downloaded the BBmap package from source forge, and basically followed the read-me.txt and did as it instructed by unzipping. I am currently using Mac OS X. So I entered terminal and changed the parent directory to where the BBMAP *.sh files re kept. Then I entered "run repair.sh" A new X-code screen now pops up and something like this shows up:

==================================
#!/bin/bash
#repair in=<infile> out=<outfile>

usage(){
echo "
Written by Brian Bushnell

Usage:  repair.sh in=<input file> out=<pair output> outs=<singleton output>

Input may be fasta or fastq, compressed or uncompressed.

Parameters:
in=<file>       The 'in=' flag is needed if the input file is not the first
parameter.  'in=stdin' will pipe from standard in.
in2=<file>      Use this if 2nd read of pairs are in a different file.
out=<file>      The 'out=' flag is needed if the output file is not the second
parameter.  'out=stdout' will pipe to standard out.
out2=<file>     Use this to write 2nd read of pairs to a different file.
outs=<file>     (outsingle) Write singleton reads here.
overwrite=t     (ow) Set to false to force the program to abort rather than
overwrite an existing file.
showspeed=t     (ss) Set to 'f' to suppress display of processing speed.
ziplevel=2      (zl) Set to 1 (lowest) through 9 (max) to change compression
level; lower compression is faster.
fint=f          (fixinterleaving) Fixes corrupted interleaved files using read
names.  Only use on files with broken interleaving - correctly
interleaved files from which some reads were removed.
names.  Uses much more memory than 'fint' mode.
ain=f           (allowidenticalnames) When detecting pair names, allows
identical names, instead of requiring /1 and /2 or 1: and 2:
monitor=f       Kill this process if it crashes.  monitor=600,0.01 would kill
after 600 seconds under 1% usage.

Java Parameters:
-Xmx            This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

"
}

pushd . > /dev/null
DIR="${BASH_SOURCE[0]}" while [ -h "$DIR" ]; do
cd "$(dirname "$DIR")"
DIR="$(readlink "$(basename "$DIR")")" done cd "$(dirname "$DIR")" DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" CP="$DIR""current/"

z="-Xmx4g"
z2="-Xms4g"
EA="-ea"
set=0

if [ -z "$1" ] || [[$1 == -h ]] || [[ $1 == --help ]]; then usage exit fi calcXmx () { source "$DIR""/calcmem.sh"
parseXmx "$@" if [[$set == 1 ]]; then
return
fi
freeRam 4000m 84
z="-Xmx${RAM}m" z2="-Xms${RAM}m"
}
calcXmx "$@" repair() { #module unload oracle-jdk #module load oracle-jdk/1.7_64bit #module load pigz local CMD="java$EA $z -cp$CP jgi.SplitPairsAndSingles rp $@" echo$CMD >&2
eval $CMD ===================================  Then at this stage I become a bit stuck. I tried entering the command (I have tried both terminal and X-code): "repair.sh in1=<file name="" of="" r1="">.fq in2=<file name="" of="" r2="">.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq" However, none works. My files are named in this format : SAMPLENAME_REPLICATE#_R1.fastq.gz. I wonder what have I done wrong in any of the steps? Thank you very much ADD REPLY 1 Entering edit mode Don't do run repair.sh. Do you have java installed on your Mac? BBMap is a java program. As shown in the example above you just need to run (provided the BBMap directory is in your$PATH otherwise use full path to repair.sh). There should be no spaces between any of the names, options and = sign.

/path_to/repair.sh in1=SAMPLENAME_REPLICATE#_R1.fastq.gz in2=SAMPLENAME_REPLICATE#_R2.fastq.gz out1=fixed_SAMPLENAME_REPLICATE#_R1.fastq.gz out2=fixed_SAMPLENAME_REPLICATE#_R2.fastq.gz outsingle=singletons_SAMPLENAME_REPLICATE#.fastq.gz

0
Entering edit mode

If your sample files are not in the same directory as BBMap programs then modify your command to accommodate the file paths.

/path_to/repair.sh in1=/path_to/SAMPLENAME_REPLICATE#_R1.fastq.gz in2=/path_to/SAMPLENAME_REPLICATE#_R2.fastq.gz out1=/path_to/fixed_SAMPLENAME_REPLICATE#_R1.fastq.gz out2=/path_to/fixed_SAMPLENAME_REPLICATE#_R2.fastq.gz outsingle=/path_to/singleteons_SAMPLENAME_REPLICATE#.fastq.gz

Spend a couple of hours here: http://korflab.ucdavis.edu/Unix_and_Perl/current.html#part1 familiarizing yourself with a bit of command line.