Question: need to separate read1 and read2 to run tophat2
Ann (2.2k) wrote:

I have paired-end RNA-Seq data - read1 and read2 - stored in the same fastq file.

I'd like to align the reads using tophat.

Do I have to separate the data into two different files before running tophat?

rna-seq tophat2 pairedend • 3.5k views
modified 4.3 years ago • written 4.4 years ago by Ann (2.2k)

If you mean to separate an interleaved fastq ((2n-1)-th read to one file; (2n)-th to another):

seqtk seq -1 interleaved.fq.gz > read1.fq
seqtk seq -2 interleaved.fq.gz > read2.fq

If tophat2 supports streaming, you can do something like the following without creating temporary files (bash only):

tophat2 ref.fa <(seqtk seq -1 reads.fq) <(seqtk seq -2 reads.fq)
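
If seqtk is not installed, the same positional split can be sketched with plain awk. This is only an illustration, not part of seqtk: the function name deinterleave is hypothetical, and it assumes every record is exactly 4 lines and that read1 and read2 records strictly alternate.

```shell
# deinterleave SRC OUT1 OUT2
# Hypothetical helper: split an interleaved FASTQ by record position.
# Assumes 4 lines per record and strictly alternating read1/read2 records.
deinterleave() {
    local src=$1 out1=$2 out2=$3
    # Create (or truncate) both outputs so they exist even for empty input.
    : > "$out1"
    : > "$out2"
    awk -v o1="$out1" -v o2="$out2" '
        # Records 1, 3, 5, ... (lines 1-4, 9-12, ...) go to OUT1; the others go to OUT2.
        { if (int((NR - 1) / 4) % 2 == 0) print > o1; else print > o2 }
    ' "$src"
}
```

Usage would look like: deinterleave interleaved.fq read1.fq read2.fq (decompress a .gz input first).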
modified 4.4 years ago • written 4.4 years ago by lh3 (31k)

geek_y (9.4k) wrote:

If the read names contain a pattern that identifies the mate (like 1:N:#### or /1 or _1, etc.), you could use fastq-grep to match the pattern and extract R1 and R2 into two separate files.

Or a simple awk pattern match will do, something like:

zcat fastq.gz | paste - - - - | awk -F'\t' '$1 ~ /<R1 pattern here>/ { print $1"\n"$2"\n"$3"\n"$4 }' | gzip > Read_1.fastq.gz
zcat fastq.gz | paste - - - - | awk -F'\t' '$1 ~ /<R2 pattern here>/ { print $1"\n"$2"\n"$3"\n"$4 }' | gzip > Read_2.fastq.gz

But the reads must end up in the same order in both files, so reorder them if they are not.
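
One way to confirm that two split files list the mates in the same order is to compare the read names record by record. The sketch below is only an illustration: the function names read_names and check_pair_order are hypothetical, and it assumes 4-line records whose headers differ between mates only by a trailing /1 or /2 or by a space-separated comment such as " 1:N:0:ACGT".

```shell
# read_names FILE
# Print one normalized read name per record: the first word of the header
# line, with a trailing /1 or /2 removed.
read_names() {
    awk 'NR % 4 == 1 { name = $1; sub(/\/[12]$/, "", name); print name }' "$1"
}

# check_pair_order R1 R2
# Hypothetical helper: return 0 only if both files contain the same read
# names in the same order, non-zero otherwise.
check_pair_order() {
    local n1 n2 status
    n1=$(mktemp) || return 2
    n2=$(mktemp) || { rm -f "$n1"; return 2; }
    read_names "$1" > "$n1"
    read_names "$2" > "$n2"
    # cmp -s is silent and only reports through its exit status.
    cmp -s "$n1" "$n2"
    status=$?
    rm -f "$n1" "$n2"
    return "$status"
}
```

For example: check_pair_order Read_1.fastq Read_2.fastq && echo "pairs are in the same order"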

modified 4.4 years ago • written 4.4 years ago by geek_y (9.4k)

Hi, a bit late, but this may be useful for others. I made some corrections to the good suggestion from Goutham Atla; it's faster than a lot of scripts. It assumes headers of the form "@name 1:N:..." with a single space before the read number:

zcat my_jgi_interleaved_file.fastq.gz | paste - - - - | awk '$2 ~ /1:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > my_jgi_read1.fastq
zcat my_jgi_interleaved_file.fastq.gz | paste - - - - | awk '$2 ~ /2:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > my_jgi_read2.fastq

If you want to run it on many files in the same directory and have the outputs named after the original JGI file with read1 and read2 appended, and your files are already unzipped (if not, change .fastq to .fastq.gz in the glob and use zcat instead of cat), do both splits in one loop so that a second loop does not pick up the read1 outputs, which also match ./*.fastq:

for f in ./*.fastq ; do
  cat "$f" | paste - - - - | awk '$2 ~ /1:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > "$f._read1.fastq"
  cat "$f" | paste - - - - | awk '$2 ~ /2:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > "$f._read2.fastq"
done
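
The two passes over each file can also be merged into one, since awk can write to two output files at once. A minimal sketch, assuming Illumina-style headers whose comment field starts with "1:" or "2:" followed by the Y/N filter flag (so, unlike the commands above, it keeps reads flagged Y as well); the function name split_by_flag is hypothetical:

```shell
# split_by_flag SRC OUT1 OUT2
# Hypothetical helper: route each record by the read number in the header
# comment (" 1:N:..." or " 2:N:...") in a single pass over the file.
split_by_flag() {
    local src=$1 out1=$2 out2=$3
    # Create (or truncate) both outputs so they exist even for empty input.
    : > "$out1"
    : > "$out2"
    paste - - - - < "$src" | awk -F'\t' -v o1="$out1" -v o2="$out2" '
        # With tab as the separator, $1 is the whole header line and
        # $2..$4 are the sequence, the "+" line and the quality string.
        { rec = $1 "\n" $2 "\n" $3 "\n" $4 }
        $1 ~ / 1:[YN]:/ { print rec > o1; next }
        $1 ~ / 2:[YN]:/ { print rec > o2 }
    '
}
```

Records whose header matches neither pattern are silently dropped in this sketch, so compare the record counts of input and outputs afterwards.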

written 3.1 years ago by sebastien.cecillon (0)

Brian Bushnell (16k) wrote:

I don't see anything about processing interleaved files in Tophat's manual; it only mentions paired reads in two files.  But you can de-interleave the file very fast with my reformat tool (written in Java):

reformat.sh in=reads.fq out1=r1.fq out2=r2.fq

 

modified 4.4 years ago • written 4.4 years ago by Brian Bushnell (16k)

The classpath looks correct, but I had a problem running it:

$ bbmap/reformat.sh  -Xmx2g in=1.fastq.gz out=1_1.fastq out2=1_2.fastq
java -ea -Xmx2g -cp /lustre/groups/lorainelab/data/d/fastq/bbmap/current/ jgi.ReformatReads -Xmx2g in=1.fastq.gz out=1_1.fastq out2=1_2.fastq
Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/ReformatReads : Unsupported major.minor version 51.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: jgi.ReformatReads. Program will exit.

modified 4.4 years ago • written 4.4 years ago by Ann (2.2k)

Ann,

That means you have a very old version of Java installed.  You can either install (or request that your sysadmin install) Java 7, or use the latest Java 6-compiled version of BBTools.  I apologize for the inconvenience.
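
To see the mismatch directly, the class-file version in the error can be compared with the installed Java release. This is a minimal sketch with hypothetical function names; it relies on the fact that, from Java 5 onward, the release number is the class-file major version minus 44, so "51.0" corresponds to Java 7.

```shell
# class_major_to_java MAJOR
# Map a class-file major version (the 51 in "major.minor version 51.0")
# to the Java release that targets it. Valid for Java 5 (major 49) and later.
class_major_to_java() {
    echo $(( $1 - 44 ))
}

# parse_java_release
# Read the output of `java -version` on stdin and print the release number.
# Handles both the old scheme ("1.6.0_45" -> 6) and the new one ("11.0.2" -> 11).
parse_java_release() {
    awk -F'"' '/version/ {
        split($2, v, /[._]/)
        print (v[1] == 1 ? v[2] : v[1])
        exit
    }'
}
```

Since `java -version` prints to stderr, it would be used as: java -version 2>&1 | parse_java_release. If that prints a number lower than the result of class_major_to_java 51, the installed Java is too old for the class files.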

-Brian

written 4.4 years ago by Brian Bushnell (16k)

I'm trying to run this on a cluster where I don't have root. Not sure how to update java in that setting.

But thank you for sending the link to the code. I appreciate the help!

written 4.4 years ago by Ann (2.2k)

The Java 6 version should work.  But normally, in your situation, it's best to request that the sysadmin update Java; it's their responsibility to keep core software on the cluster up-to-date.

written 4.4 years ago by Brian Bushnell (16k)

Ann (2.2k) wrote:

Follow-up:

I ended up writing a script to do this - before Geek_y's post. (Sorry, I haven't tried Geek_y's solution. It looks simpler & easier.)

Code is: splitPairs.py

https://bitbucket.org/lorainelab/genomes_src

Warning: splitPairs.py has test coverage, but other scripts may not.

written 4.3 years ago by Ann (2.2k)

That's great. A robust script that handles any kind of data will help you a lot. But make sure that the script keeps the pairs in the same order and writes orphan reads to a separate file.
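
As an illustration of that requirement (not a description of splitPairs.py), here is a minimal sketch of a split that diverts orphans. The function name split_with_orphans is hypothetical, and it assumes 4-line records, mates that sit next to each other in the input with read1 first, and names that differ only by a trailing /1 or /2 or by a space-separated comment.

```shell
# split_with_orphans SRC OUT1 OUT2 ORPHANS
# Hypothetical helper: write complete pairs to OUT1/OUT2 in matching order
# and send any record without an adjacent mate to ORPHANS.
split_with_orphans() {
    local src=$1 out1=$2 out2=$3 orphans=$4
    : > "$out1"
    : > "$out2"
    : > "$orphans"
    paste - - - - < "$src" | awk -F'\t' -v o1="$out1" -v o2="$out2" -v orph="$orphans" '
        # Shared read name: header up to the first space, minus /1 or /2.
        function base(h,   n) { n = h; sub(/[ \t].*$/, "", n); sub(/\/[12]$/, "", n); return n }
        # Turn the tab-joined record back into 4 lines and append it to file.
        function emit(rec, file) { gsub(/\t/, "\n", rec); print rec > file }
        {
            if (held != "" && base($1) == held_name) {
                # This record completes the pair started by the held record.
                emit(held, o1); emit($0, o2); held = ""
            } else {
                # The held record was not followed by its mate: it is an orphan.
                if (held != "") emit(held, orph)
                held = $0; held_name = base($1)
            }
        }
        END { if (held != "") emit(held, orph) }
    '
}
```

Because each pair is written to both outputs at the same moment, the two files stay in the same order by construction.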

written 4.3 years ago by geek_y (9.4k)