Question

Alignment Software To Only Find Number Of Short Read Alignments?

11.2 years ago

You heard me: I want all exact matches, and I do not care where they are or what they look like. I only need to know how many there are; the output should be a single integer.

Bowtie can do this, but it is incredibly demanding because I can't find a way to turn off the alignment output.

I just want to know that there are 4674e99999 perfect hits for my short-read library, not where they are. The map file simply becomes too large.

The software also needs to support building indexes for only parts of the genome, i.e. the rmsk and promoter regions.

Is there any software that will let me do this? Or can bowtie be hacked to do what I want?

P.S. --quiet turns off the parts of the results I am interested in and keeps the parts I do not care about.

short read alignment • 4.2k views

Pipe the output to /dev/null and the overall statistics (# reads mapped) will show up on stderr. You can build your own indices from whatever fasta files you make (promoters-only, etc.).
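If the run is scripted rather than typed into a shell, the same trick can be sketched in Python: discard stdout (the alignments) and keep stderr (the summary). This is only a minimal sketch; `run_discarding_stdout` is a hypothetical helper name, and the bowtie arguments in the comment (`promoters_index`, `reads.fq`) are placeholder names, with `-v 0` asking bowtie 1 for zero-mismatch alignments.

```python
import subprocess


def run_discarding_stdout(cmd):
    """Run a command, throw away its stdout, and return its stderr as text.

    Hypothetical helper: it is the scripted equivalent of `cmd > /dev/null`
    in a shell, with stderr captured instead of printed to the terminal.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # alignments are never stored anywhere
        stderr=subprocess.PIPE,     # summary statistics are kept
        text=True,
        check=True,                 # raise if the command exits with an error
    )
    return result.stderr


# Assumed usage (index and read file names are placeholders):
# summary = run_discarding_stdout(["bowtie", "-v", "0", "promoters_index", "reads.fq"])
# print(summary)
```

The read counts still have to be parsed out of the returned text; the exact wording of the summary lines is not reproduced here.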

matted · 11.2 years ago

Thanks for teaching me about /dev/null! Just what I needed!

Ram · Answer 1 · 2013-01-28

1. If your reference is promoter-only, you can write a simplistic hash-table-based mapper by hashing every k-mer in your small reference. Depending on your read lengths and genome sizes, this may be the fastest solution.
2. Eland version 1. It reports the number of exact hits. It does not work with reads longer than 32bp, though.
3. BWA. It gives you the number of best hits in the X0 tag. One caveat is that contigs are concatenated into a single sequence. You may need to add, say, 1000 A's to the end of your contigs if they are too small. BWA will not be very efficient, but that probably does not matter if you do not have a huge data set.
4. BWA fastmap. It will be faster than BWA as it only considers partial exact matches, but the speed is still not optimal, since what you want is full-length exact matches only. You still need to add a long run of A's to avoid cross-contig matches.
5. Fermi exact. With BWA and fastmap, you need to re-index the reference every time you change it. "Fermi exact" indexes your reads first and then maps the reference sequence against the read index. If your reference genome changes frequently, fermi exact may be more convenient. It is also possible to index the genome with fermi. You won't have the concatenated-contig problem, but it will be much slower than BWA fastmap.
6. Bowtie default. As I remember, bowtie by default gives you the count on at least one strand of the reference. I forget whether it gives the count for both strands. Bowtie also has a similar problem to BWA: contigs are internally concatenated. Naively, I think there is no fast solution. Its count is probably also occasionally inaccurate.
7. Bowtie -a. This asks bowtie to output all hits. It is not recommended unless you are working on small data sets. For all FM-index aligners, reporting the positions of all hits makes them much slower. Also, as I remember, an early version of bowtie reported fewer hits than Eland and BWA (these two agreed). I have not done a similar experiment with more recent versions.

My recommendation is option 4 (BWA fastmap).
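For the first option in the list (the hash-table mapper), a rough Python sketch might look like the following. It assumes all reads have the same length k, holds the whole k-mer table in memory, and counts both strands; the function names are made up for illustration and this is not any existing tool's implementation. Because each contig is hashed separately, there are no cross-contig matches and no padding is needed.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_index(contigs, k):
    """Count how often each k-mer occurs in the reference.

    Each contig is scanned on its own, so no k-mer spans two contigs.
    """
    index = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] += 1
    return index


def count_exact_hits(reads, index, k):
    """Return the total number of full-length exact hits over all reads.

    Assumption: reads whose length is not k are skipped. Hits on the reverse
    strand are found by looking up the read's reverse complement; a read that
    equals its own reverse complement is looked up only once.
    """
    total = 0
    for read in reads:
        read = read.upper()
        if len(read) != k:
            continue
        total += index[read]
        rc = reverse_complement(read)
        if rc != read:
            total += index[rc]
    return total


if __name__ == "__main__":
    # Toy data standing in for promoter sequences and a read library.
    promoters = ["ACGTACGT", "TTTT"]
    reads = ["ACGT", "AAAA", "CGTA"]
    index = build_kmer_index(promoters, k=4)
    print(count_exact_hits(reads, index, k=4))
```

The output is the single integer asked for in the question. Memory grows with the number of distinct k-mers in the reference, so this only makes sense for small references such as promoter-only sets.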