Question: Fixing Encode Rna-Seq Fastq Files
1
gravatar for dario.garvan
6.9 years ago by
dario.garvan460
Australia
dario.garvan460 wrote:

Looking at the older ENCODE RNA-seq FASTQs, it seems that a staff member has pasted together two paired-end FASTQ files into one, and removed all of the pairing identifiers. I inferred this from the FASTQC quality chart, which shows low quality from about 35 to about 76 bases, then a sudden jump to a high quality, then dropping again towards the end. Also, these runs are from 2009, so 152 base single end reads on the Genome Analyser II were not possible. Another element of concatenation is that this file has 80 million records, which implies concatenating of lanes. The maximum reads per lane was about 20 million in those days. Hopefully, the biases are the same across lanes.

For example, consider :

$ zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | head -n 4
@TUPAC:1:1:5:710#0/1
GTGGCGTTCAGCCACNCGAGATTGAGCAATNACNGGTCTGTGATNCNCTTAGATGTCCGGGGCTGCACGAGCGCCAAAAGACGGGGCGGTGTGTACAAAGGGCAGGGACTTAATCAACGCAAGCTTATGACTCGCCATTCATNNNANNNTCN
+TUPAC:1:1:5:710#0/1
a`a_\V\__\Q\aaZDZ`V`^```\^\ZZ[DPYDZM\``^V]]VDYD[__[O]`a^VJTUWY`ZWZBBBBBBBBBBabbbaabbaa``aW_\\WT__]OTZ\[QLWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Does anyone have a script which can reconstitute the two FASTQs, as they would have originally been ? It'd help with trimming the ends, which are of awful quality. I think all that is required is splitting every second line down the middle, and also changing the #0/1 to #0/2 for the other FASTQ.

There are no newer datasets of whole cell, total RNA for the ENCODE Tier 1 cell lines, so I would like to work with these files.

fastq awk • 2.2k views
ADD COMMENTlink modified 6.9 years ago by JC9.4k • written 6.9 years ago by dario.garvan460
1
gravatar for Istvan Albert
6.9 years ago by
Istvan Albert ♦♦ 82k
University Park, USA
Istvan Albert ♦♦ 82k wrote:

What you probably want is to separate the reads into different files. Here is a quick script I whipped up with some code duplication but all in the name of being a little more efficient.

Run it as:

python split.py data.fq 1 > pair1.fq

or

python split.py data.fq 2 > pair2.fq

to split the pairs into different files.

import sys

fname, mode = sys.argv[1:]

def parse(stream):
    for id in stream:
        id = id.strip()
        seq = stream.next().strip()
        tmp = stream.next().strip()
        qual = stream.next().strip()
        yield id, seq, tmp, qual

stream = file(fname)

if mode == "1":

    for id, seq, tmp, qual in parse(stream):

        half = len(seq)/2
        seq  = seq[:half]
        qual = qual[:half]

        print id
        print seq
        print tmp
        print qual

else:

    for id, seq, tmp, qual in parse(stream):

        half = len(seq)/2
        seq  = seq[-half:]
        qual = qual[-half:]

        id = id[:-1] + "2"
        tmp = tmp[:-1] + "2"

        print id
        print seq
        print tmp
        print qual
ADD COMMENTlink written 6.9 years ago by Istvan Albert ♦♦ 82k
1
gravatar for Rm
6.9 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Awk (long) one liner:

zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | awk '{OFS="\n"; \
print $0 >"pair.R1.fq"; \ 
sub("#0/1$","#0/2",$0); \ 
print $0 >"pair.R2.fq";  \
getline seqs; \
seqs1 =substr(seqs,0,length(seqs)/2); \
seqs2 =substr(seqs,1+length(seqs)/2,length(seqs)/2); \
getline; \
getline quals; quals1 =substr(quals,0,length(quals)/2);  \
quals2 =substr(quals,1+length(quals)/2,length(quals)/2); \
print seqs1, "+", quals1 >>"pair.R1.fq" ; print seqs2, "+", quals2 >> "pair.R2.fq"  }'
ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Rm7.9k
1
gravatar for JC
6.9 years ago by
JC9.4k
Mexico
JC9.4k wrote:

Perl option:

*edited to write output in separated files, not stdout/stderr

#!/usr/bin/perl

use strict;
use warnings;

my $file1 = shift @ARGV;
my $file2 = shift @ARGV;
open O1, ">$file1" or die;
open O2, ">$file2" or die;
my $n = 0;
while (<>) {
    $n++; 
    chomp; 
    if ($n % 2 == 0) { 
        print O1 substr($_, 0, int((length $_) / 2)), "\n"; 
        print O2 substr($_, int((length $_) / 2) + 1), "\n"; 
    }
    else { 
        print O1 "$_\n"; 
        s/1$/2/; 
        print O2 "$_\n"; 
    }
}
close O1; 
close O2;

Execute in bash as:

zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | perl splitFastq.pl reads_1.fq reads_2.fq
ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by JC9.4k

@JC: be cautious: if any run time warnings of any kind ,they will end up in read_2.fq

ADD REPLYlink written 6.9 years ago by Rm7.9k

true, but can be avoided writing in separated files with minor code changes.

ADD REPLYlink written 6.9 years ago by JC9.4k

Can use FIle handles

ADD REPLYlink written 6.9 years ago by Rm7.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2342 users visited in the last hour