Hello, I can't find an easy way to extract the last N bases of every read in a FASTQ file. Is there an easy way to do it with prinseq-lite, cutadapt or FASTX-Toolkit, or with another tool? I know I could do it on my own with a Java or Perl script, but it seems so obvious that one of those programs should be able to do it in a few seconds, and I can't find how. Thanks a lot for saving me time, and thanks for this very resourceful forum. Kevin
I don't have time to check it on a FASTQ file, but in python:
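    N = 5  # number of bases to keep from the end of each read

    with open('input.fastq', 'r') as f:
        toggle = False
        for line in f:
            line = line.rstrip('\n')
            if toggle:
                # 2nd and 4th line of each record: sequence and qualities
                print(line[-N:])
            else:
                # 1st and 3rd line of each record: header and '+' separator
                print(line)
            toggle = not toggle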
Change the path of input.fastq, save the code as dangerous.py, and run
python dangerous.py > new.fastq
Works on reads of different lengths, as long as each record spans exactly four lines (no wrapped sequences), and outputs the format you were looking for.
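If you want something a bit more defensive than the toggle trick, here is a minimal sketch that reads the file record by record instead. The function names `open_maybe_gzip` and `last_n_bases` are made up for this example, and it still assumes plain four-line FASTQ records; it has not been run on real sequencing data.

```python
import gzip
import io
from itertools import islice


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed FASTQ file as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def last_n_bases(handle, n):
    """Yield (header, seq, plus, qual) with seq and qual cut to their last n characters.

    Assumes four-line records; reads shorter than n are returned unchanged.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            break  # end of file
        if (len(record) < 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("input does not look like four-line FASTQ")
        header, seq, plus, qual = record
        yield header, seq[-n:], plus, qual[-n:]


if __name__ == "__main__":
    # Small in-memory demo; replace with open_maybe_gzip("input.fastq") for a real file.
    demo = io.StringIO("@seq\nACGTNacgtn\n+\nGGGGGCCCCC\n")
    for rec in last_n_bases(demo, 6):
        print("\n".join(rec))
```

The main difference from the script above is that a truncated or wrapped file raises an error instead of silently trimming the wrong lines.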
Give bbduk.sh from the BBMap suite a try. It will be something like:
cat myfile.fq | bbduk.sh in=stdin.fq forcetrimright=$N out=stdout.fq minlength=$N
AFAIK, bbduk is 0-based; so if you want to have the last 5 nt, your N has to be 4.
I'm wondering why the No.1 user Biostar modified some threads and brought them to homepage these days.
But seqkit can do this easily.
    N=100
    gzip -d -c read_1.fq.gz | seqkit subseq -r -$N:-1 | gzip -c > read_1.t.fq.gz
Regions are 1-based and inclusive at both ends, with some custom design: negative positions count from the end of the sequence.
     1-based index    1 2 3 4 5 6 7 8 9 10
    negative index    0-9-8-7-6-5-4-3-2-1
               seq    A C G T N a c g t n
               1:1    A
               2:4      C G T
             -4:-2                c g t
             -4:-1                c g t n
             -1:-1                      n
              2:-2      C G T N a c g t
              1:-1    A C G T N a c g t n
              1:12    A C G T N a c g t n
            -12:-1    A C G T N a c g t n
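To make the region rules concrete, here is a rough Python sketch that reproduces the rows of the table. It is not seqkit's code: `subseq` is a name made up for this example, and rejecting position 0 is my own choice for the sketch, not something taken from seqkit.

```python
def subseq(seq, start, end):
    """Return seq[start..end], 1-based and inclusive; negative positions count from the end.

    Hypothetical re-implementation for illustration only, not seqkit's actual code.
    """
    if start == 0 or end == 0:
        # Assumption of this sketch: 0 is not treated as a valid position.
        raise ValueError("positions are 1-based; 0 is not accepted here")
    length = len(seq)
    # Convert negative positions to their 1-based positive equivalents.
    if start < 0:
        start = length + start + 1
    if end < 0:
        end = length + end + 1
    # Clamp out-of-range positions to the sequence, as in the 1:12 and -12:-1 rows.
    start = max(start, 1)
    end = min(end, length)
    return seq[start - 1:end]


if __name__ == "__main__":
    seq = "ACGTNacgtn"
    for region in [(1, 1), (2, 4), (-4, -2), (-4, -1), (-1, -1), (2, -2), (1, 12), (-12, -1)]:
        print("%d:%d" % region, subseq(seq, *region))
```

With this convention, "the last N bases" is simply the region -N:-1, which is what the seqkit command uses.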
    $ echo -e "@seq\nACGTNacgtn\n+\nGGGGGCCCCC"
    @seq
    ACGTNacgtn
    +
    GGGGGCCCCC
first 6 bases:
    $ echo -e "@seq\nACGTNacgtn\n+\nGGGGGCCCCC" | seqkit subseq -r 1:6
    @seq
    ACGTNa
    +
    GGGGGC
last 6 bases:
    $ echo -e "@seq\nACGTNacgtn\n+\nGGGGGCCCCC" | seqkit subseq -r -6:-1
    @seq
    Nacgtn
    +
    GCCCCC