Question: Subset Fastq file from list of read numbers
0
gravatar for johnsonn573
21 days ago by
johnsonn5730 wrote:

I have a txt file each line of which is a number corresponding to a specific read in a fastq file. I would like to make a subsetted fastq file from my larger fastq file with just the reads corresponding to the numbers in the txt file. Is there a simple way to do this? Thank you!

subset fastq reads • 200 views
ADD COMMENTlink modified 21 days ago by Pierre Lindenbaum124k • written 21 days ago by johnsonn5730

not clear: what is that number ? the line number starting from 0 ? from 1 ? the fastq record in the file ? starting from 0 ? from 1 ?

ADD REPLYlink written 21 days ago by Pierre Lindenbaum124k

Please post few records from input files: fastq and text. In the absense of them, i would suggest to use seqkit grep/range function @ johnsonn573

ADD REPLYlink modified 19 days ago • written 19 days ago by cpad011212k

Problem is OP here only has record numbers (odd) and not fastq headers, as far as I see.

ADD REPLYlink written 19 days ago by genomax75k

some thing like this? @ genomax @ johnsonn573. Example is with .fasta file. Same code works for fastq file and user needs to replace input fasta with input fastq. For fastq code would be: parallel seqkit range -r {}:{} test.fq :::: test.txt

input:

$ cat test.fa
>abc
atgc
>cdg
atgc
>def
atgc

$ cat test.txt 
2
3

output:

$ parallel seqkit range -r {}:{} test.fa :::: test.txt
>cdg
atgc
>def
atgc
ADD REPLYlink modified 18 days ago • written 18 days ago by cpad011212k
2
gravatar for genomax
21 days ago by
genomax75k
United States
genomax75k wrote:

Doing this by just (record?) number is going to be tricky. If you have read headers it would be much simpler to use filterbyname.sh from BBMap suite.

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).

Run filterbyname.sh without any options to see in-line help.

ADD COMMENTlink modified 21 days ago • written 21 days ago by genomax75k

Thank you for letting me know about this. I changed my script so that I could filter using filterbyname.sh.

ADD REPLYlink written 18 days ago by johnsonn5730
2
gravatar for Pierre Lindenbaum
21 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

assuming the number are the 1-based index of each record in the fastq.

join -t $'\t' -1 1 -2 1 \
   <(gunzip -c input.fq.gz | paste - - - - | awk '{printf("%d\t%s\n",NR,$0);}' | sort -T . -t $'\t' -k1,1 ) \
   <(sort -T . numbers.txt) | \
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n" > subset.fastq
ADD COMMENTlink written 21 days ago by Pierre Lindenbaum124k

Yes, the numbers start from 1. Thank you!

ADD REPLYlink written 21 days ago by johnsonn5730

My fastq file is not gzipped. Is there a way to modify this script so it works on an unzipped input.fq file?

ADD REPLYlink written 19 days ago by johnsonn5730
2

At this point, I'm afraid you'll have to learn it by yourself.

ADD REPLYlink written 19 days ago by Pierre Lindenbaum124k

That's fine. I tried gzipping the fastq file, and running as input.fq.gz, but the script still wouldn't run. So there must be another problem. I will post when I have a functioning script.

ADD REPLYlink written 19 days ago by johnsonn5730

but the script still wouldn't run.

You'll have to be more precise if you expect any help.

ADD REPLYlink written 19 days ago by WouterDeCoster42k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 989 users visited in the last hour