Count specific sequences in a .fasta file
20 months ago by x_ma_x

This is probably a super-simple problem, but I must be searching it wrong because I can't find anything useful!

I basically have an NGS .fastq file with millions of reads from a PCR amplicon library. I want to check the coverage of my library against a reference .fasta of 100k sequences and count the frequencies of all those reference sequences. At this point I don't care about mutations, so this is a simple find-all-and-count problem (just at a rather large scale).

What's the simplest way of doing it? Or what should I be searching for?

frequency library fasta count

https://github.com/afombravo/2FAST2Q, a Python3 program that counts sequence occurrences in raw FASTQ files, seems like it could do this.

20 months ago

Map your reads to the reference fasta file, then count reads per reference from the mapping results.

You should be searching for: "NGS read mapping", "counting reads", "coverage plots", ...

More specifically, you can search for tools such as: STAR, HISAT2, BBMap, samtools, ...
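
For example, a minimal sketch of the map-then-count idea using BBMap and samtools (file names are placeholders, and any of the mappers above would work in much the same way):

    # map reads against the reference library (bbmap.sh builds its own index)
    bbmap.sh ref=reference.fasta in=reads.fastq out=mapped.sam
    # sort and index the alignments, then report per-reference counts
    samtools sort -o mapped.bam mapped.sam
    samtools index mapped.bam
    # idxstats columns: reference name, length, mapped reads, unmapped reads
    samtools idxstats mapped.bam > per_reference_counts.txt

The third column of per_reference_counts.txt is the number of reads mapped to each of the 100k reference sequences.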


Though if your reference sequences are shorter than your reads, you'll probably want to trim the reads or pad the refs.


Sure thing; unfortunately, my answer pre-dated the addition of that info.

20 months ago by GenoMax

Are the 100K reference sequences unique, i.e. do they have no sequence overlap? What is the length of the sample sequences, and is it fixed?

If they are unique, then you could use bbduk.sh in filter mode to find sequences that match each one of your reference sequences (you can search using the reference). You could also use clumpify.sh to count reads and reduce them to a single-sequence representation (and then search with your reference).

Guide for BBDuk: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/
Guide for Clumpify: "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And Remove Duplicates."

Or you could simply use blat to search the two files against each other and parse the result file to get counts.
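
For example, a rough sketch of the bbduk.sh filter-mode idea (file names are placeholders; note that BBDuk kmers are capped at k=31, so reads are matched by shared 31-mers rather than by the full reference length):

    # keep reads that share a 31-mer with any reference sequence;
    # the stats file lists, per reference name, how many reads it matched
    bbduk.sh in=reads.fastq ref=reference.fasta k=31 outm=matched.fastq outu=unmatched.fastq stats=per_reference_hits.txt

If many of the 100k references share 31-mers with each other, some reads will be attributed ambiguously, so it is worth checking how distinct the references are first.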


The reference sequences are all 120bp, of which 87bp in the middle are unique to each sequence. The remaining 33bp are primer handles used to create the library. I can in theory trim those and leave only the unique sequences.

The actual sample fasta is messier, with sequences of different lengths that in most cases are longer than the reference. For this reason, I'm not looking for reads that exactly equal particular strings (i.e. the reference sequences), if that makes sense. In other words, if my reference is GGGG then TGGGGT should still be counted.


I can in theory trim those and leave only the unique sequences.

It may be best to do that and then use the unique sequences with bbduk.sh in filter mode. It would allow you to satisfy the following requirement:

if my reference is GGGG then TGGGGT should still be counted.
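
For example, a rough sketch (the trimming coordinates, file names, and the use of seqkit are assumptions; adjust to wherever the 87 bp unique region actually sits within each 120 bp reference):

    # extract the unique middle region from every reference
    # (here assumed to span bases 17-103, i.e. 87 bp)
    seqkit subseq -r 17:103 reference_120bp.fasta > unique_87bp.fasta
    # count reads containing a 31-mer from each trimmed reference
    # (31 is BBDuk's maximum kmer length)
    bbduk.sh in=reads.fastq ref=unique_87bp.fasta k=31 outm=matched.fastq stats=per_sequence_counts.txt

Because bbduk.sh looks for kmer containment rather than whole-read identity, reads that contain a reference's unique region inside a longer sequence still get counted, which is the TGGGGT-contains-GGGG behaviour you described.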
