Count and location of strings in fastq file reads
7.7 years ago
rubic ▴ 270

Hi,

I would like to count the occurrences of a set of strings in the reads of FASTQ files, broken down by the location of the match within the read sequence. More specifically, I'd like to count matches to a string such as ACAGTA**TATAAGTATGG, where each * can be any nucleotide.

So the result would be, for each read position, how many matches were counted for the query.

Is there any tool, preferably an R package, that can do that efficiently?

RNA-Seq kmer suffix tree

See @Brian's answer below.

I assume you have generated the 150 bp reads referred to in the other thread. You can use seal.sh to try and demultiplex them as shown below.


See this thread: searching reads with a certain sequence in fastq file. Adding the -c option to grep should get you counts.
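
A minimal sketch of that approach, assuming gzipped input named reads.fastq.gz; each . in the extended regex stands for one wildcard position, and this only gives a total count, not per-position counts:

# keep only the sequence lines, then count reads containing the pattern
zcat reads.fastq.gz | paste - - - - | cut -f 2 | grep -Ec "ACAGTA..TATAAGTATGG"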

7.7 years ago

Here's a solution using seqkit.

Prepare a pattern file in FASTA format, replacing * with N:

$ more patterns.fa 
>Pattern_A
ACAGTANNTATA
>Pattern_B
ATACGACGTANNCG

Search a FASTQ file containing 9 million single-end 100 bp reads:

$ seqkit locate --degenerate --pattern-file patterns.fa dataset_C.fq > result.tsv

elapsed time: 1m:22s
peak rss: 15.05 MB

The computational bottleneck is finding all matches with regular expressions, which is not particularly efficient in Go. Run time scales linearly with the number of patterns:

1 pattern   1m:01s
2 patterns  1m:21s
3 patterns  1m:44s

Result:

seqID                                     patternName   pattern          strand   start   end   matched
K00137:236:H7NLVBBXX:6:2103:25702:14344   Pattern_A     ACAGTANNTATA     +        35      46    ACAGTATCTATA
K00137:236:H7NLVBBXX:6:1104:15412:4462    Pattern_A     ACAGTANNTATA     -        32      43    ACAGTACCTATA
K00137:236:H7NLVBBXX:6:1107:31081:13429   Pattern_A     ACAGTANNTATA     -        9       20    ACAGTAAATATA
K00137:236:H7NLVBBXX:6:1117:29894:16295   Pattern_B     ATACGACGTANNCG   -        58      71    ATACGACGTAAACG
K00137:236:H7NLVBBXX:6:1226:5152:25931    Pattern_A     ACAGTANNTATA     -        26      37    ACAGTATCTATA

Counting needs some help from the shell:

$ seqkit fx2tab patterns.fa | cut -f 2 | while read pattern; do echo -e "$pattern\t$(grep -c "$pattern" result.tsv)"; done
ACAGTANNTATA    1241
ATACGACGTANNCG  24
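
Since the original question asks for counts per read position, the start column of result.tsv can also be tallied per pattern; a small sketch, assuming the column layout shown above (pattern name in column 2, match start in column 5, one header line):

$ awk 'NR>1 {print $2"\t"$5}' result.tsv | sort | uniq -c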
7.7 years ago

You can count how many reads match the pattern with BBDuk (and capture them), like this:

bbduk.sh in=reads.fq outm=matched.fq literal=ACAGTANNTATAAGTATGG k=19 copyundefined mm=f

It's extremely fast. That will also match reverse-complements, so if you don't want that, add "rcomp=f".
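
For example, to restrict matching to the forward strand only:

bbduk.sh in=reads.fq outm=matched.fq literal=ACAGTANNTATAAGTATGG k=19 copyundefined mm=f rcomp=f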


Can this be used to demultiplex data? (That is where @rubic's question is coming from: there are inline barcodes involved, and every read needs to be trimmed by a fixed length on the right.) I seem to vaguely recall that being a possibility.


Seal can demultiplex, and has the same "copyundefined" functionality for generating all possible variations of degenerate bases. For example:

seal.sh in=reads.fq ref=barcodes.fa pattern=out_%.fq outu=unmatched.fq mm=f k=19

For the file "barcodes.fa":

>X
ACAGTANNTATAAGTATGG
>Y
ACAGTANNTATAAGTACCC
>Z
ACAGTANNTATAAGTAAAA

...it would produce 4 output files:

out_X.fq
out_Y.fq
out_Z.fq
unmatched.fq

If the match is only desired in the first 19 bp, due to inline barcodes, the flag "restrictleft=19" or "restrictright=19" could be added. The inline barcodes could then be trimmed in a second pass with the "ftr" or "ftl" flag, as sketched below.
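
A sketch of that two-pass approach, assuming the inline barcode occupies the first 19 bp of each read (output file names are illustrative):

# pass 1: demultiplex, only allowing matches within the first 19 bp
seal.sh in=reads.fq ref=barcodes.fa pattern=out_%.fq outu=unmatched.fq mm=f k=19 restrictleft=19

# pass 2: trim the 19 bp inline barcode from the left of each demultiplexed file
bbduk.sh in=out_X.fq out=out_X_trimmed.fq ftl=19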

7.7 years ago
# keep only the sequence lines, record the 1-based start position of ACAGTA whenever
# TATAAGTATGG starts exactly two bases after it (the two wildcard positions; the first
# occurrence of each anchor is used), then count how many reads matched at each position
gunzip -c input.fastq.gz |\
paste - - - - | cut -f 2 |\
awk '{x1=index($0,"ACAGTA"); if(x1==0) next; x2=index($0,"TATAAGTATGG"); if(x2==x1+8) print x1;}' |\
sort -n | uniq -c