Question

Fixing Encode Rna-Seq Fastq Files

1

Entering edit mode

12.3 years ago

dario.garvan ▴ 520

Looking at the older ENCODE RNA-seq FASTQs, it seems that a staff member has pasted together two paired-end FASTQ files into one, and removed all of the pairing identifiers. I inferred this from the FASTQC quality chart, which shows low quality from about 35 to about 76 bases, then a sudden jump to a high quality, then dropping again towards the end. Also, these runs are from 2009, so 152 base single end reads on the Genome Analyser II were not possible. Another element of concatenation is that this file has 80 million records, which implies concatenating of lanes. The maximum reads per lane was about 20 million in those days. Hopefully, the biases are the same across lanes.

For example, consider :

$ zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | head -n 4
@TUPAC:1:1:5:710#0/1
GTGGCGTTCAGCCACNCGAGATTGAGCAATNACNGGTCTGTGATNCNCTTAGATGTCCGGGGCTGCACGAGCGCCAAAAGACGGGGCGGTGTGTACAAAGGGCAGGGACTTAATCAACGCAAGCTTATGACTCGCCATTCATNNNANNNTCN
+TUPAC:1:1:5:710#0/1
a`a_\V\__\Q\aaZDZ`V`^```\^\ZZ[DPYDZM\``^V]]VDYD[__[O]`a^VJTUWY`ZWZBBBBBBBBBBabbbaabbaa``aW_\\WT__]OTZ\[QLWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Does anyone have a script which can reconstitute the two FASTQs, as they would have originally been ? It'd help with trimming the ends, which are of awful quality. I think all that is required is splitting every second line down the middle, and also changing the #0/1 to #0/2 for the other FASTQ.

There are no newer datasets of whole cell, total RNA for the ENCODE Tier 1 cell lines, so I would like to work with these files.

fastq awk • 3.8k views

ADD COMMENT • link updated 12.3 years ago by JC 13k • written 12.3 years ago by dario.garvan ▴ 520

score 1 · Answer 1 · 2013-03-13

What you probably want is to separate the reads into different files. Here is a quick script I whipped up with some code duplication but all in the name of being a little more efficient.

Run it as:

python split.py data.fq 1 > pair1.fq

or

python split.py data.fq 2 > pair2.fq

to split the pairs into different files.

import sys

fname, mode = sys.argv[1:]

def parse(stream):
    for id in stream:
        id = id.strip()
        seq = stream.next().strip()
        tmp = stream.next().strip()
        qual = stream.next().strip()
        yield id, seq, tmp, qual

stream = file(fname)

if mode == "1":

    for id, seq, tmp, qual in parse(stream):

        half = len(seq)/2
        seq  = seq[:half]
        qual = qual[:half]

        print id
        print seq
        print tmp
        print qual

else:

    for id, seq, tmp, qual in parse(stream):

        half = len(seq)/2
        seq  = seq[-half:]
        qual = qual[-half:]

        id = id[:-1] + "2"
        tmp = tmp[:-1] + "2"

        print id
        print seq
        print tmp
        print qual

score 1 · Answer 2 · 2013-03-13

Awk (long) one liner:

zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | awk '{OFS="\n"; \
print $0 >"pair.R1.fq"; \ 
sub("#0/1$","#0/2",$0); \ 
print $0 >"pair.R2.fq";  \
getline seqs; \
seqs1 =substr(seqs,0,length(seqs)/2); \
seqs2 =substr(seqs,1+length(seqs)/2,length(seqs)/2); \
getline; \
getline quals; quals1 =substr(quals,0,length(quals)/2);  \
quals2 =substr(quals,1+length(quals)/2,length(quals)/2); \
print seqs1, "+", quals1 >>"pair.R1.fq" ; print seqs2, "+", quals2 >> "pair.R2.fq"  }'

score 1 · Answer 3 · 2013-03-13

1

Entering edit mode

12.3 years ago

JC 13k

Perl option:

*edited to write output in separated files, not stdout/stderr

#!/usr/bin/perl

use strict;
use warnings;

my $file1 = shift @ARGV;
my $file2 = shift @ARGV;
open O1, ">$file1" or die;
open O2, ">$file2" or die;
my $n = 0;
while (<>) {
    $n++; 
    chomp; 
    if ($n % 2 == 0) { 
        print O1 substr($_, 0, int((length $_) / 2)), "\n"; 
        print O2 substr($_, int((length $_) / 2) + 1), "\n"; 
    }
    else { 
        print O1 "$_\n"; 
        s/1$/2/; 
        print O2 "$_\n"; 
    }
}
close O1; 
close O2;

Execute in bash as:

zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | perl splitFastq.pl reads_1.fq reads_2.fq

ADD COMMENT • link 12.3 years ago by JC 13k

0

Entering edit mode

@JC: be cautious: if any run time warnings of any kind ,they will end up in read_2.fq

ADD REPLY • link 12.3 years ago by Rm 8.3k

0

Entering edit mode

true, but can be avoided writing in separated files with minor code changes.

ADD REPLY • link 12.3 years ago by JC 13k

0

Entering edit mode

Can use FIle handles

ADD REPLY • link 12.3 years ago by Rm 8.3k