Question

Simple shell command for extracting indexed reads from fastq?

0

Entering edit mode

9.5 years ago

mbk0asis ▴ 680

Hi.

All my reads in fastq files are supposed to have custom index sequences at 5th nucleotide from 5' end.

(for example, 5' NNNN-IndexSeq-restReadSequences)

I tried to demultiplexed them with several tools on the web, but I couldn't get consistent results from those tools.

For just "sanity check", I want to actually count or extract indexed reads with simple shell commands.

I'm not sure if I'm saying right, but most tools search indexes in the whole reads, but I want to look for indexes at specific position (i.e. 5th - 9th nucleotides).

I hope you guys could understand my poor English.

Thank you.

fastq demutiplexing • 3.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by mbk0asis ▴ 680

Ram · Answer 1 · 2014-11-01

3

Entering edit mode

9.5 years ago

Pierre Lindenbaum 161k

use awk. For example to filter the reads having bases 1-5 = "TCTCT"

gunzip -c your.fastq.gz | paste - - - - |\
awk -F '\t'  '(substr($2,1,5)=="TCTCT")' |\
tr "\t" "\n"

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 9.5 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Shouldn't substr($2,1,5) be substr($2,5,5)? (Get substring from position 5 for length 5)

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by dariober 14k

0

Entering edit mode

I said "bases 1-5"

ADD REPLY • link updated 3.1 years ago by Ram 43k • written 9.5 years ago by Pierre Lindenbaum 161k

Ram · Answer 2 · 2014-11-02

1

Entering edit mode

9.5 years ago

dariober 14k

Expanding Pierre's answer... You might want to check which are the most represented indexes in, say, the first 100000 reads:

zcat reads.fq.gz \
| head -n 100000 \
| paste  - - - - \
| awk '{idx=substr($2, 5, 5); print idx}' \
| sort \
| uniq -c \
| sort -k1,1nr

Remove the last three lines if you just want to see the indexes of each read.

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 9.5 years ago by dariober 14k

Ram · Answer 3 · 2014-11-02

0

Entering edit mode

9.5 years ago

smilefreak ▴ 420

Using grep before and after switches.

# extracts reads.
zcat reads.fq.gz | grep -A 2 -B 1 "^NNNNTCTCTC" > reads_out.txt
# count number of matches
zcat reads.fq.gz | grep "^NNNNTCTCTC" | wc -l

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 9.5 years ago by smilefreak ▴ 420