Question: How to subset fastq data based on leading nt of sequences?
0
gravatar for ddzhangzz
4 weeks ago by
ddzhangzz90
United States
ddzhangzz90 wrote:

I wanted to extract reads from a fastq format file for the reads that the 1-8 nt with a format of "NNNNNGGG" and saved as a fastq file as well for further alignment. for example for the three reads below:

@SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
GCAGGGGGACCCCATCTCTACTAAAAATACAAAAATTAGACAGACGTGATGGGGCATTTCTCTAATCCCAGCTACT
+SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
AAAAA//EEEEEEE/EEEA/EEEE/E/E6EEAEAE/6EE/E///A/AA</A<</6<A<6EE/A6AEEEE//E6AE/
@SRR8105603.10 NS500418:833:HNY2CBGX5:1:11101:23704:1138 length=75
TGCTCGCGGATCGCTTGAGTCCAGGAGTTCAAGACCAGCCTGGGTAACATGGCAAAACCTCATCTCTACAAAAAA
+SRR8105603.10 NS500418:833:HNY2CBGX5:1:11101:23704:1138 length=75
AAAAAA//EAEAEAEE/AAEEEAEEEEAEEEAEEAEEEA<A<</EA/E6//<EAEE<AAE<EEAAEEEE///</<
@SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
TGCGAGGGGCAAGTTCCTGTTTCCAAACAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGAGGG
+SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76

The 1st and 3rd reads the 1-8 nt meet the "NNNNNGGG" format (GCAGGGGG and TGCGAGGG respectively) so the extracted would be:

@SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
GCAGGGGGACCCCATCTCTACTAAAAATACAAAAATTAGACAGACGTGATGGGGCATTTCTCTAATCCCAGCTACT
+SRR8105603.9 NS500418:833:HNY2CBGX5:1:11101:10498:1122 length=76
AAAAA//EEEEEEE/EEEA/EEEE/E/E6EEAEAE/6EE/E///A/AA</A<</6<A<6EE/A6AEEEE//E6AE/
@SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
TGCGAGGGGCAAGTTCCTGTTTCCAAACAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGAGGG
+SRR8105603.11 NS500418:833:HNY2CBGX5:1:11101:26002:1139 length=76
rna-seq • 157 views
ADD COMMENTlink modified 4 weeks ago by steve2.3k • written 4 weeks ago by ddzhangzz90

it sounds like the real issue is that the files did not demultiplex correctly with bcl2fastq?

ADD REPLYlink written 4 weeks ago by steve2.3k
6
gravatar for shenwei356
4 weeks ago by
shenwei3564.8k
China
shenwei3564.8k wrote:

try seqkit grep:

seqkit grep -p '^NNNNNGGG' -d read_1.fq.gz -o out_1.fq.gz

or

seqkit grep -R 6:8 -p GGG read_1.fq.gz -o out_1.fq.gz
ADD COMMENTlink written 4 weeks ago by shenwei3564.8k
1
gravatar for steve
4 weeks ago by
steve2.3k
United States
steve2.3k wrote:

If I was to do this, I would probably write a simple Python script with Biopython using SeqIO; https://biopython.org/wiki/SeqIO

I have an example script here where I subset by the 'qname' of each read: https://github.com/stevekm/fastq-bed-subset/blob/6f19f4e71c70776970b487bf673f54df6adae0f1/bin/subset_fastq.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Subsets a .fastq.gz file to extract only the reads with the matching qnames
"""
import sys
import gzip
from Bio import SeqIO

def main():
    args = sys.argv[1:]
    input_fastq = args[0]
    qnames_txt = args[1]
    output_fastq = args[2]

    qnames = []
    with open(qnames_txt) as fin:
        for line in fin:
            qnames.append(line.strip())

    with gzip.open(input_fastq) as gz_in, gzip.open(output_fastq, 'wb') as gz_out:
        input_seq_iterator = SeqIO.parse(gz_in, "fastq")
        seq_iterator = (record for record in input_seq_iterator if record.id in qnames)
        SeqIO.write(seq_iterator, gz_out, "fastq")

if __name__ == '__main__':
    main()

you would probably want to do something like this but instead check for the first n bases to match some pattern.

Do not expect fast performance though

ADD COMMENTlink written 4 weeks ago by steve2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1922 users visited in the last hour