need to separate read1 and read2 to run tophat2
3
2
Entering edit mode
9.3 years ago
Ann ★ 2.4k

I have paired-end RNA-Seq data - read1 and read2 - stored in the same fastq file.

I'd like to align the reads using tophat.

Do I have to separate the data into two different files before running tophat?

RNA-Seq tophat2 pairedend • 6.3k views
ADD COMMENT
1
Entering edit mode

If you mean to separate an interleaved fastq ((2n-1)-th read to one file; (2n)-th to another):

seqtk seq -1 interleaved.fq.gz > read1.fq
seqtk seq -2 interleaved.fq.gz > read2.fq

If tophat2 support streaming, you can do something like the following without creating temporary files (bash only):

tophat2 ref.fa <(seqtk seq -1 reads.fq) <(seqtk seq -2 reads.fq)
ADD REPLY
2
Entering edit mode
9.3 years ago

If you have any pattern in the read name ( like 1:N:#### or /1 or _1 etc) you could use the fastq-grep to match the pattern to extract the R1 and R2 into two separate files.

Or a simple Awk patter match will do. something like:

zcat fastq.gz | paste - - - - | awk '{ if $1 ~ /< R1 pattern here>/ { print $1"\n"$2"\n"$3"\n"$4" }' | gzip > Read_1.fastq.gz

zcat fastq.gz | paste - - - - | awk '{ if $1 ~ /< R2 pattern here>/ { print $1"\n"$2"\n"$3"\n"$4" }' | gzip > Read_2.fastq.gz

But they need to be kept in order, if they are not.

ADD COMMENT
0
Entering edit mode

hi,

a bit late, but it can be useful for others. I just did some corrections to the good suggestion from Goutham Atla, it's faster than a lot of scripts:

zcat my_jgi_interleaved_file.fastq.gz | paste - - - - | awk '$2~ /1:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > my_jgi_read1.fastq
zcat my_jgi_interleaved_file.fastq.gz | paste - - - - | awk '$2~ /2:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > my_jgi_read1.fastq

If you want to run it on a ton of files in the same directory and have your outputs with the name of your original jgi file included with read1 and read2, and if your files are already unzipped (if not just change the first .fastq for fastq.gz and use zcat instead of cat:

for f in ./*.fastq ; do cat "$f"  | paste - - - - | awk '$2~ /1:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}'  > "$f._read1.fastq"; done
for f in ./*.fastq ; do cat "$f"  | paste - - - - | awk '$2~ /2:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}'  > "$f._read2.fastq"; done
ADD REPLY
0
Entering edit mode
9.3 years ago

I don't see anything about processing interleaved files in Tophat's manual; it only mentions paired reads in two files. But you can de-interleave the file very fast with my reformat tool (written in Java):

reformat.sh in=reads.fq out1=r1.fq out2=r2.fq
ADD COMMENT
0
Entering edit mode

Looks like classpath is correct but had problem running it:

$ bbmap/reformat.sh -Xmx2g in=1.fastq.gz out=1_1.fastq out2=1_2.fastq
java -ea -Xmx2g -cp /lustre/groups/lorainelab/data/d/fastq/bbmap/current/ jgi.ReformatReads -Xmx2g in=1.fastq.gz out=1_1.fastq out2=1_2.fastq
Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/ReformatReads : Unsupported major.minor version 51.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: jgi.ReformatReads. Program will exit.
ADD REPLY
0
Entering edit mode

Ann,

That means you have a very old version of Java installed. You can either install (or request that your sysadmin install) Java 7, or use the latest Java 6-compiled version of BBTools. I apologize for the inconvenience.

-Brian

ADD REPLY
0
Entering edit mode

I'm trying to run this on a cluster where I don't have root. Not sure how to update java in that setting.

But thank you for sending the link to the code. I appreciate the help!

ADD REPLY
0
Entering edit mode

The Java 6 version should work. But normally, in your situation, it's best to request that the sysadmin update Java; it's their responsibility to keep core software on the cluster up-to-date.

ADD REPLY
0
Entering edit mode
9.3 years ago
Ann ★ 2.4k

Follow-up:

I ended up writing a script to do this - before Geek_y's post. (Sorry, I haven't tried Geek_y's solution. It looks simpler & easier.)

Code is: splitPairs.py

https://bitbucket.org/lorainelab/genomes_src

Warning: splitPairs.py has test coverage, but other scripts may not.

ADD COMMENT
0
Entering edit mode

Thats great. If you have a robust script to handle any kind of data, that will helps you a lot. But make sure that the script keeps the pairs in same order, and keeps the orphan reads into separate file.

ADD REPLY

Login before adding your answer.

Traffic: 2487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6