Question: How to subset fastq data based on leading nt of sequences?
0
gravatar for ddzhangzz
11 months ago by
ddzhangzz90
United States
ddzhangzz90 wrote:

I wanted to extract reads from a fastq format file for the reads that the 1-8 nt with a format of "NNNNNGGG" and saved as a fastq file as well for further alignment. for example for the three reads below:

@SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
GCAGGGGGACCCCATCTCTACTAAAAATACAAAAATTAGACAGACGTGATGGGGCATTTCTCTAATCCCAGCTACT
+SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
AAAAA//EEEEEEE/EEEA/EEEE/E/E6EEAEAE/6EE/E///A/AA</A<</6<A<6EE/A6AEEEE//E6AE/
@SRR8105603.10 NS500418:833:HNY2CBGX5:1:11101:23704:1138 length=75
TGCTCGCGGATCGCTTGAGTCCAGGAGTTCAAGACCAGCCTGGGTAACATGGCAAAACCTCATCTCTACAAAAAA
+SRR8105603.10 NS500418:833:HNY2CBGX5:1:11101:23704:1138 length=75
AAAAAA//EAEAEAEE/AAEEEAEEEEAEEEAEEAEEEA<A<</EA/E6//<EAEE<AAE<EEAAEEEE///</<
@SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
TGCGAGGGGCAAGTTCCTGTTTCCAAACAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGAGGG
+SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76

The 1st and 3rd reads the 1-8 nt meet the "NNNNNGGG" format (GCAGGGGG and TGCGAGGG respectively) so the extracted would be:

@SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
GCAGGGGGACCCCATCTCTACTAAAAATACAAAAATTAGACAGACGTGATGGGGCATTTCTCTAATCCCAGCTACT
+SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
AAAAA//EEEEEEE/EEEA/EEEE/E/E6EEAEAE/6EE/E///A/AA</A<</6<A<6EE/A6AEEEE//E6AE/
@SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
TGCGAGGGGCAAGTTCCTGTTTCCAAACAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGAGGG
+SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
rna-seq • 367 views
ADD COMMENTlink modified 11 months ago by steve2.6k • written 11 months ago by ddzhangzz90

it sounds like the real issue is that the files did not demultiplex correctly with bcl2fastq?

ADD REPLYlink written 11 months ago by steve2.6k
5
gravatar for shenwei356
11 months ago by
shenwei3565.2k
China
shenwei3565.2k wrote:

try seqkit grep:

seqkit grep -p '^NNNNNGGG' -d read_1.fq.gz -o out_1.fq.gz

or

seqkit grep -R 6:8 -p GGG read_1.fq.gz -o out_1.fq.gz
ADD COMMENTlink written 11 months ago by shenwei3565.2k
1
gravatar for steve
11 months ago by
steve2.6k
United States
steve2.6k wrote:

If I was to do this, I would probably write a simple Python script with Biopython using SeqIO; https://biopython.org/wiki/SeqIO

I have an example script here where I subset by the 'qname' of each read: https://github.com/stevekm/fastq-bed-subset/blob/6f19f4e71c70776970b487bf673f54df6adae0f1/bin/subset_fastq.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Subsets a .fastq.gz file to extract only the reads with the matching qnames
"""
import sys
import gzip
from Bio import SeqIO

def main():
    args = sys.argv[1:]
    input_fastq = args[0]
    qnames_txt = args[1]
    output_fastq = args[2]

    qnames = []
    with open(qnames_txt) as fin:
        for line in fin:
            qnames.append(line.strip())

    with gzip.open(input_fastq) as gz_in, gzip.open(output_fastq, 'wb') as gz_out:
        input_seq_iterator = SeqIO.parse(gz_in, "fastq")
        seq_iterator = (record for record in input_seq_iterator if record.id in qnames)
        SeqIO.write(seq_iterator, gz_out, "fastq")

if __name__ == '__main__':
    main()

you would probably want to do something like this but instead check for the first n bases to match some pattern.

Do not expect fast performance though

ADD COMMENTlink written 11 months ago by steve2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1495 users visited in the last hour