How to subset fastq data based on leading nt of sequences?
2
0
Entering edit mode
4.7 years ago
ddzhangzz ▴ 90

I wanted to extract reads from a fastq format file for the reads that the 1-8 nt with a format of "NNNNNGGG" and saved as a fastq file as well for further alignment. for example for the three reads below:

@SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
GCAGGGGGACCCCATCTCTACTAAAAATACAAAAATTAGACAGACGTGATGGGGCATTTCTCTAATCCCAGCTACT
+SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
AAAAA//EEEEEEE/EEEA/EEEE/E/E6EEAEAE/6EE/E///A/AA</A<</6<A<6EE/A6AEEEE//E6AE/
@SRR8105603.10 NS500418:833:HNY2CBGX5:1:11101:23704:1138 length=75
TGCTCGCGGATCGCTTGAGTCCAGGAGTTCAAGACCAGCCTGGGTAACATGGCAAAACCTCATCTCTACAAAAAA
+SRR8105603.10 NS500418:833:HNY2CBGX5:1:11101:23704:1138 length=75
AAAAAA//EAEAEAEE/AAEEEAEEEEAEEEAEEAEEEA<A<</EA/E6//<EAEE<AAE<EEAAEEEE///</<
@SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
TGCGAGGGGCAAGTTCCTGTTTCCAAACAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGAGGG
+SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76

The 1st and 3rd reads the 1-8 nt meet the "NNNNNGGG" format (GCAGGGGG and TGCGAGGG respectively) so the extracted would be:

@SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
GCAGGGGGACCCCATCTCTACTAAAAATACAAAAATTAGACAGACGTGATGGGGCATTTCTCTAATCCCAGCTACT
+SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
AAAAA//EEEEEEE/EEEA/EEEE/E/E6EEAEAE/6EE/E///A/AA</A<</6<A<6EE/A6AEEEE//E6AE/
@SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
TGCGAGGGGCAAGTTCCTGTTTCCAAACAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGAGGG
+SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
RNA-Seq • 1.9k views
ADD COMMENT
0
Entering edit mode

it sounds like the real issue is that the files did not demultiplex correctly with bcl2fastq?

ADD REPLY
5
Entering edit mode
4.7 years ago

try seqkit grep:

seqkit grep -p '^NNNNNGGG' -d read_1.fq.gz -o out_1.fq.gz

or

seqkit grep -R 6:8 -p GGG read_1.fq.gz -o out_1.fq.gz
ADD COMMENT
1
Entering edit mode
4.7 years ago
steve ★ 3.5k

If I was to do this, I would probably write a simple Python script with Biopython using SeqIO; https://biopython.org/wiki/SeqIO

I have an example script here where I subset by the 'qname' of each read: https://github.com/stevekm/fastq-bed-subset/blob/6f19f4e71c70776970b487bf673f54df6adae0f1/bin/subset_fastq.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Subsets a .fastq.gz file to extract only the reads with the matching qnames
"""
import sys
import gzip
from Bio import SeqIO

def main():
    args = sys.argv[1:]
    input_fastq = args[0]
    qnames_txt = args[1]
    output_fastq = args[2]

    qnames = []
    with open(qnames_txt) as fin:
        for line in fin:
            qnames.append(line.strip())

    with gzip.open(input_fastq) as gz_in, gzip.open(output_fastq, 'wb') as gz_out:
        input_seq_iterator = SeqIO.parse(gz_in, "fastq")
        seq_iterator = (record for record in input_seq_iterator if record.id in qnames)
        SeqIO.write(seq_iterator, gz_out, "fastq")

if __name__ == '__main__':
    main()

you would probably want to do something like this but instead check for the first n bases to match some pattern.

Do not expect fast performance though

ADD COMMENT

Login before adding your answer.

Traffic: 2111 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6