Filter nanopore fastq files by start time
4.8 years ago
sendhelp ▴ 10

Hi,

I have a set of nanopore fastq files with headers in the following format:

@29d30612-b8b4-4ab5-967f-531fc2851541 runid=0a82096db062bcf8041358b407ab5e0aa5de7138 read=23606 ch=186 start_time=2019-03-11T16:16:33Z flow_cell_id= protocol_group_id=190311_ang_1 sample_id=190311_ang_1
TCAGTGTACTTCGTTCAATCGCGTTTTCGTGCGCTTTCAGCAAATCTGCCTCACCTCTCCTCTCCACA

I'd like to filter the fastq reads and extract only those that occurred in a given time period (e.g. the first hour). I've tried the poretools --end option but can't work out the correct timestamp format to use. Any help would be much appreciated, either with the correct format for poretools or with a command that would let me do this!

Thanks!

nanopore fastq filter start time • 2.3k views
4.8 years ago
GenoMax 141k

There are a couple of solutions in this thread: Extract fastq sequences based on date/time (which is in the header)


Thank you! That's solved it!

4.8 years ago

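You might try grep, printing each matching header plus the three lines that follow it (a FASTQ record is four lines per read).

So if you want data from 16:00 to 16:59, try

grep -A3 --no-group-separator "start_time=2019-03-11T16" x.fastq > test.fastq

(--no-group-separator is a GNU grep option. Without it, grep writes a "--" line between non-adjacent matches, which breaks the FASTQ format.)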

Then, to see how many reads were kept, compare the line counts (divide each by four, since every read is four lines)

wc -l x.fastq

wc -l test.fastq

Check the format is ok too

head test.fastq

Not tested, but the approach may still work :-)
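Note that the grep pattern selects a clock hour (16:00 to 16:59), not the first hour of the run. If the window should be measured from the earliest read instead, something like the Python sketch below might do it. It is only a sketch under a few assumptions: every record is exactly four lines, the timestamp looks like the one in the question (2019-03-11T16:16:33Z, no fractional seconds or UTC offset), and the file fits in memory. The function names are made up for illustration and do not come from poretools or any other tool.

```python
import re
from datetime import datetime, timedelta

# Matches the start_time=... field in a header like the one in the question.
START_RE = re.compile(r"start_time=(\S+)")


def parse_start(header):
    """Return the header's start_time as a datetime, or None if it has none."""
    match = START_RE.search(header)
    if match is None:
        return None
    # Assumption: timestamps are formatted exactly like 2019-03-11T16:16:33Z.
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples, four lines per read."""
    it = iter(lines)
    # zip over the same iterator groups consecutive lines into fours;
    # a truncated final record is silently dropped.
    return zip(it, it, it, it)


def filter_by_window(records, window):
    """Keep reads that start less than `window` after the earliest read."""
    records = list(records)  # held in memory so the earliest time can be found
    times = [parse_start(rec[0]) for rec in records]
    known = [t for t in times if t is not None]
    if not known:
        return []
    cutoff = min(known) + window
    return [rec for rec, t in zip(records, times)
            if t is not None and t < cutoff]


def filter_file(in_path, out_path, hours=1):
    """Write the reads from the first `hours` hours of in_path to out_path."""
    with open(in_path) as fh:
        kept = filter_by_window(read_fastq(fh), timedelta(hours=hours))
    with open(out_path, "w") as out:
        for rec in kept:
            out.writelines(rec)
```

Calling filter_file("x.fastq", "test.fastq") would then write the first hour of reads to test.fastq. Reads without a start_time field are dropped rather than guessed at.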


I found a Python script from the link below that works well, so I haven't tried this, but thank you for the help!
