Fixing Encode Rna-Seq Fastq Files
3
1
Entering edit mode
11.1 years ago
dario.garvan ▴ 530

Looking at the older ENCODE RNA-seq FASTQs, it seems that a staff member has pasted together two paired-end FASTQ files into one, and removed all of the pairing identifiers. I inferred this from the FASTQC quality chart, which shows low quality from about 35 to about 76 bases, then a sudden jump to a high quality, then dropping again towards the end. Also, these runs are from 2009, so 152 base single end reads on the Genome Analyser II were not possible. Another element of concatenation is that this file has 80 million records, which implies concatenating of lanes. The maximum reads per lane was about 20 million in those days. Hopefully, the biases are the same across lanes.

For example, consider :

$ zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | head -n 4
@TUPAC:1:1:5:710#0/1
GTGGCGTTCAGCCACNCGAGATTGAGCAATNACNGGTCTGTGATNCNCTTAGATGTCCGGGGCTGCACGAGCGCCAAAAGACGGGGCGGTGTGTACAAAGGGCAGGGACTTAATCAACGCAAGCTTATGACTCGCCATTCATNNNANNNTCN
+TUPAC:1:1:5:710#0/1
a`a_\V\__\Q\aaZDZ`V`^```\^\ZZ[DPYDZM\``^V]]VDYD[__[O]`a^VJTUWY`ZWZBBBBBBBBBBabbbaabbaa``aW_\\WT__]OTZ\[QLWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Does anyone have a script which can reconstitute the two FASTQs, as they would have originally been ? It'd help with trimming the ends, which are of awful quality. I think all that is required is splitting every second line down the middle, and also changing the #0/1 to #0/2 for the other FASTQ.

There are no newer datasets of whole cell, total RNA for the ENCODE Tier 1 cell lines, so I would like to work with these files.

fastq awk • 3.3k views
ADD COMMENT
1
Entering edit mode
11.1 years ago

What you probably want is to separate the reads into different files. Here is a quick script I whipped up with some code duplication but all in the name of being a little more efficient.

Run it as:

python split.py data.fq 1 > pair1.fq

or

python split.py data.fq 2 > pair2.fq

to split the pairs into different files.

import sys

fname, mode = sys.argv[1:]

def parse(stream):
    for id in stream:
        id = id.strip()
        seq = stream.next().strip()
        tmp = stream.next().strip()
        qual = stream.next().strip()
        yield id, seq, tmp, qual

stream = file(fname)

if mode == "1":

    for id, seq, tmp, qual in parse(stream):

        half = len(seq)/2
        seq  = seq[:half]
        qual = qual[:half]

        print id
        print seq
        print tmp
        print qual

else:

    for id, seq, tmp, qual in parse(stream):

        half = len(seq)/2
        seq  = seq[-half:]
        qual = qual[-half:]

        id = id[:-1] + "2"
        tmp = tmp[:-1] + "2"

        print id
        print seq
        print tmp
        print qual
ADD COMMENT
1
Entering edit mode
11.1 years ago
Rm 8.3k

Awk (long) one liner:

zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | awk '{OFS="\n"; \
print $0 >"pair.R1.fq"; \ 
sub("#0/1$","#0/2",$0); \ 
print $0 >"pair.R2.fq";  \
getline seqs; \
seqs1 =substr(seqs,0,length(seqs)/2); \
seqs2 =substr(seqs,1+length(seqs)/2,length(seqs)/2); \
getline; \
getline quals; quals1 =substr(quals,0,length(quals)/2);  \
quals2 =substr(quals,1+length(quals)/2,length(quals)/2); \
print seqs1, "+", quals1 >>"pair.R1.fq" ; print seqs2, "+", quals2 >> "pair.R2.fq"  }'
ADD COMMENT
1
Entering edit mode
11.1 years ago
JC 13k

Perl option:

*edited to write output in separated files, not stdout/stderr

#!/usr/bin/perl

use strict;
use warnings;

my $file1 = shift @ARGV;
my $file2 = shift @ARGV;
open O1, ">$file1" or die;
open O2, ">$file2" or die;
my $n = 0;
while (<>) {
    $n++; 
    chomp; 
    if ($n % 2 == 0) { 
        print O1 substr($_, 0, int((length $_) / 2)), "\n"; 
        print O2 substr($_, int((length $_) / 2) + 1), "\n"; 
    }
    else { 
        print O1 "$_\n"; 
        s/1$/2/; 
        print O2 "$_\n"; 
    }
}
close O1; 
close O2;

Execute in bash as:

zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | perl splitFastq.pl reads_1.fq reads_2.fq
ADD COMMENT
0
Entering edit mode

@JC: be cautious: if any run time warnings of any kind ,they will end up in read_2.fq

ADD REPLY
0
Entering edit mode

true, but can be avoided writing in separated files with minor code changes.

ADD REPLY
0
Entering edit mode

Can use FIle handles

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6