Subset Fastq file from list of read numbers
2
0
Entering edit mode
4.5 years ago
johnsonn573 ▴ 10

I have a txt file each line of which is a number corresponding to a specific read in a fastq file. I would like to make a subsetted fastq file from my larger fastq file with just the reads corresponding to the numbers in the txt file. Is there a simple way to do this? Thank you!

fastq subset reads • 2.0k views
ADD COMMENT
0
Entering edit mode

not clear: what is that number ? the line number starting from 0 ? from 1 ? the fastq record in the file ? starting from 0 ? from 1 ?

ADD REPLY
0
Entering edit mode

Please post few records from input files: fastq and text. In the absense of them, i would suggest to use seqkit grep/range function @ johnsonn573

ADD REPLY
0
Entering edit mode

Problem is OP here only has record numbers (odd) and not fastq headers, as far as I see.

ADD REPLY
0
Entering edit mode

some thing like this? @ genomax @ johnsonn573. Example is with .fasta file. Same code works for fastq file and user needs to replace input fasta with input fastq. For fastq code would be: parallel seqkit range -r {}:{} test.fq :::: test.txt

input:

$ cat test.fa
>abc
atgc
>cdg
atgc
>def
atgc

$ cat test.txt 
2
3

output:

$ parallel seqkit range -r {}:{} test.fa :::: test.txt
>cdg
atgc
>def
atgc
ADD REPLY
2
Entering edit mode
4.5 years ago
GenoMax 142k

Doing this by just (record?) number is going to be tricky. If you have read headers it would be much simpler to use filterbyname.sh from BBMap suite.

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).

Run filterbyname.sh without any options to see in-line help.

ADD COMMENT
0
Entering edit mode

Thank you for letting me know about this. I changed my script so that I could filter using filterbyname.sh.

ADD REPLY
2
Entering edit mode
4.5 years ago

assuming the number are the 1-based index of each record in the fastq.

join -t $'\t' -1 1 -2 1 \
   <(gunzip -c input.fq.gz | paste - - - - | awk '{printf("%d\t%s\n",NR,$0);}' | sort -T . -t $'\t' -k1,1 ) \
   <(sort -T . numbers.txt) | \
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n" > subset.fastq
ADD COMMENT
0
Entering edit mode

Yes, the numbers start from 1. Thank you!

ADD REPLY
0
Entering edit mode

My fastq file is not gzipped. Is there a way to modify this script so it works on an unzipped input.fq file?

ADD REPLY
2
Entering edit mode

At this point, I'm afraid you'll have to learn it by yourself.

ADD REPLY
0
Entering edit mode

That's fine. I tried gzipping the fastq file, and running as input.fq.gz, but the script still wouldn't run. So there must be another problem. I will post when I have a functioning script.

ADD REPLY
0
Entering edit mode

but the script still wouldn't run.

You'll have to be more precise if you expect any help.

ADD REPLY

Login before adding your answer.

Traffic: 1889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6