Question: Subset Fastq file from list of read numbers
12 months ago, johnsonn573 wrote:

I have a txt file each line of which is a number corresponding to a specific read in a fastq file. I would like to make a subsetted fastq file from my larger fastq file with just the reads corresponding to the numbers in the txt file. Is there a simple way to do this? Thank you!


Not clear: what is that number? A line number, starting from 0 or from 1? Or the index of the fastq record in the file, starting from 0 or from 1?

written 12 months ago by Pierre Lindenbaum

@johnsonn573 Please post a few records from the input files (fastq and text). In their absence, I would suggest using the seqkit grep/range functions.

written 12 months ago by cpad0112 (modified 12 months ago)

The problem is that the OP only has record numbers (odd) and not fastq headers, as far as I can see.

written 12 months ago by genomax

Something like this? @genomax @johnsonn573. The example uses a .fasta file. The same command works for a fastq file; just replace the input fasta with the input fastq. For fastq the command would be: parallel seqkit range -r {}:{} test.fq :::: test.txt

input:

$ cat test.fa
>abc
atgc
>cdg
atgc
>def
atgc

$ cat test.txt 
2
3

output:

$ parallel seqkit range -r {}:{} test.fa :::: test.txt
>cdg
atgc
>def
atgc

written 12 months ago by cpad0112 (modified 12 months ago)

12 months ago, genomax (United States) wrote:

Doing this by just (record?) number is going to be tricky. If you have the read headers, it would be much simpler to use filterbyname.sh from the BBMap suite.

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).

Run filterbyname.sh without any options to see in-line help.
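
If only record numbers are available, one possible bridge to filterbyname.sh is to look up the read names for those records first. The following is a minimal sketch, assuming the numbers are 1-based record indices and every fastq record is exactly four lines. The file names (input.fq, numbers.txt, names.txt) and the three demo reads are made up for illustration. Whether filterbyname.sh matches the full header or only the part before the first space is worth checking in its in-line help.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo inputs: a 3-record fastq and a list of 1-based record numbers.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > input.fq
printf '2\n3\n' > numbers.txt

# Pass 1 (numbers.txt): remember the wanted record numbers.
# Pass 2 (input.fq): the header of record n is line 4n-3, so for header
# lines (FNR % 4 == 1) the record number is (FNR + 3) / 4.
# substr($0, 2) drops the leading "@" from the header.
awk 'NR == FNR { want[$1 + 0]; next }
     (FNR % 4 == 1) && (((FNR + 3) / 4) in want) { print substr($0, 2) }' \
    numbers.txt input.fq > names.txt

cat names.txt

# The names file could then be passed to filterbyname.sh, for example:
#   filterbyname.sh in=input.fq out=subset.fq names=names.txt include=t
```

The awk two-file idiom assumes numbers.txt is not empty; with an empty list the first-pass condition would also match the fastq lines.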


Thank you for letting me know about this. I changed my script so that I could filter using filterbyname.sh.

written 12 months ago by johnsonn573

12 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Assuming the numbers are the 1-based indices of the records in the fastq:

join -t $'\t' -1 1 -2 1 \
   <(gunzip -c input.fq.gz | paste - - - - | awk '{printf("%d\t%s\n",NR,$0);}' | sort -T . -t $'\t' -k1,1 ) \
   <(sort -T . numbers.txt) | \
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n" > subset.fastq

Yes, the numbers start from 1. Thank you!

written 12 months ago by johnsonn573

My fastq file is not gzipped. Is there a way to modify this script so it works on an unzipped input.fq file?

written 12 months ago by johnsonn573
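
For uncompressed input, the only part of the pipeline above that depends on gzip is `gunzip -c input.fq.gz`; replacing it with `cat input.fq` should be enough. Note that `<( ... )` and `$'\t'` are bash features, so the script has to be run with bash rather than plain sh. As an alternative, here is a minimal awk sketch under the same assumptions (the numbers are 1-based record indices and every record is exactly four lines). The file names and the three-record demo input are made up for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo inputs: a 3-record fastq and a list of 1-based record numbers.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > input.fq
printf '3\n1\n' > numbers.txt

# Pass 1 (numbers.txt): remember the wanted record numbers.
# Pass 2 (input.fq): line FNR belongs to record int((FNR - 1) / 4) + 1;
# print the line if that record was requested.
awk 'NR == FNR { want[$1 + 0]; next }
     (int((FNR - 1) / 4) + 1) in want' numbers.txt input.fq > subset.fastq

cat subset.fastq
```

Records come out in the order they appear in the fastq, not in the order of numbers.txt, and the list is assumed to be non-empty. For a gzipped file the same awk program can read from a pipe: `gunzip -c input.fq.gz | awk '...' numbers.txt -`.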

At this point, I'm afraid you'll have to learn it by yourself.

written 12 months ago by Pierre Lindenbaum

That's fine. I tried gzipping the fastq file and running the script on input.fq.gz, but the script still wouldn't run, so there must be another problem. I will post when I have a functioning script.

written 12 months ago by johnsonn573

"but the script still wouldn't run."

You'll have to be more precise if you expect any help.

written 12 months ago by WouterDeCoster