Simple shell command for extracting indexed reads from fastq?
3
0
Entering edit mode
9.5 years ago
mbk0asis ▴ 680

Hi.

All my reads in fastq files are supposed to have custom index sequences at 5th nucleotide from 5' end.

(for example, 5' NNNN-IndexSeq-restReadSequences)

I tried to demultiplexed them with several tools on the web, but I couldn't get consistent results from those tools.

For just "sanity check", I want to actually count or extract indexed reads with simple shell commands.

I'm not sure if I'm saying right, but most tools search indexes in the whole reads, but I want to look for indexes at specific position (i.e. 5th - 9th nucleotides).

I hope you guys could understand my poor English.

Thank you.

fastq demutiplexing • 3.8k views
ADD COMMENT
3
Entering edit mode
9.5 years ago

use awk. For example to filter the reads having bases 1-5 = "TCTCT"

gunzip -c your.fastq.gz | paste - - - - |\
awk -F '\t'  '(substr($2,1,5)=="TCTCT")' |\
tr "\t" "\n"
ADD COMMENT
1
Entering edit mode

Shouldn't substr($2,1,5) be substr($2,5,5)? (Get substring from position 5 for length 5)

ADD REPLY
0
Entering edit mode

I said "bases 1-5"

ADD REPLY
1
Entering edit mode
9.5 years ago

Expanding Pierre's answer... You might want to check which are the most represented indexes in, say, the first 100000 reads:

zcat reads.fq.gz \
| head -n 100000 \
| paste  - - - - \
| awk '{idx=substr($2, 5, 5); print idx}' \
| sort \
| uniq -c \
| sort -k1,1nr

Remove the last three lines if you just want to see the indexes of each read.

ADD COMMENT
0
Entering edit mode
9.5 years ago
smilefreak ▴ 420

Using grep before and after switches.

# extracts reads.
zcat reads.fq.gz | grep -A 2 -B 1 "^NNNNTCTCTC" > reads_out.txt
# count number of matches
zcat reads.fq.gz | grep "^NNNNTCTCTC" | wc -l
ADD COMMENT

Login before adding your answer.

Traffic: 4027 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6