Alignment Software To Only Find Number Of Short Read Alignments?
11.2 years ago

You heard me: I want to count all exact matches, and I do not care where they are or what they look like. I only need to know how many there are; the output should be a single integer.

Bowtie can do this, but it is incredibly demanding because I can't find a way to turn off the alignment output.

I just want to know that there are 4674e99999 perfect hits of my short read library, not where they are. The map file simply becomes too large.

Said software also needs to support creating indexes for only parts of the genome, i.e. the rmsk and promoter regions.

Any software that will let me do this? Or can bowtie be hacked to do what I want?

P.S. --quiet turns off the parts of the output I am interested in and keeps the parts I do not care about.

short read alignment • 4.2k views

Redirect the alignment output (stdout) to /dev/null and the overall statistics (# reads mapped) will still show up on stderr. You can build your own indices from whatever fasta files you make (promoters-only, etc.).
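For anyone who wants to script this, here is a rough Python sketch of the same idea: discard stdout, keep stderr, and pull the counts out of the summary. The function names are made up for illustration, and the summary wording the regexes look for is an assumption based on bowtie 1 output, so check it against what your version actually prints. Note that this gives the number of reads with at least one alignment, not the total number of hits.

```python
import re
import subprocess

# Summary lines as printed by bowtie 1 on stderr (assumed wording; it has
# varied slightly between versions, hence the optional "reported").
_SUMMARY_PATTERNS = {
    "processed": re.compile(r"^# reads processed:\s*(\d+)", re.M),
    "aligned": re.compile(
        r"^# reads with at least one (?:reported )?alignment:\s*(\d+)", re.M
    ),
    "failed": re.compile(r"^# reads that failed to align:\s*(\d+)", re.M),
}


def parse_bowtie_summary(stderr_text):
    """Return the counts found in bowtie's stderr summary as a dict of ints."""
    counts = {}
    for name, pattern in _SUMMARY_PATTERNS.items():
        match = pattern.search(stderr_text)
        if match:
            counts[name] = int(match.group(1))
    return counts


def count_aligned_reads(index_prefix, reads_path, bowtie="bowtie"):
    """Run bowtie with zero mismatches allowed (-v 0), throw away the
    alignments, and return the parsed summary.

    index_prefix and reads_path are placeholders for your own files.
    """
    result = subprocess.run(
        [bowtie, "-v", "0", index_prefix, reads_path],
        stdout=subprocess.DEVNULL,  # same effect as "> /dev/null"
        stderr=subprocess.PIPE,     # the summary is written here
        text=True,
        check=True,
    )
    return parse_bowtie_summary(result.stderr)
```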


Thanks for teaching me about /dev/null! Just what I needed!

11.2 years ago
lh3 33k
  1. If your reference is promoter only, you can write a simple hash-table-based mapper by hashing every k-mer in your small reference. Depending on your read lengths and genome sizes, this may be the fastest solution.
  2. Eland version 1. It reports the number of exact hits. It does not work with reads longer than 32 bp, though.
  3. BWA. It gives you the number of best hits in the X0 tag. One caveat is that contigs are concatenated into a single sequence. You may need to add, say, 1000 A's to the end of your contigs if they are too small. BWA will not be very efficient, but that probably does not matter if you do not have a huge data set.
  4. BWA fastmap. It will be faster than BWA as it only considers the partial exact matches, but the speed is still not optimal because what you want is full-length exact matches only. You still need to add a long run of A's to avoid cross-contig matches.
  5. Fermi exact. With BWA and fastmap, you need to re-index the reference every time you change it. "Fermi exact" indexes your reads first and then maps the reference sequence against the read index. If your reference genome is frequently changing, fermi exact may be more convenient. It is also possible to index the genome with fermi. You won't have the concatenated contig problem, but it will be much slower than BWA fastmap.
  6. Bowtie default. As I remember, bowtie by default gives you the count on one strand of the reference at least. I forget whether it gives the count for both strands. Bowtie also has a similar problem to bwa: contigs are internally concatenated. Naively, I think there is no fast solution. Probably its count is also inaccurate occasionally.
  7. Bowtie -a. This asks bowtie to output all hits. It is not recommended unless you are working on small data sets. For all FM-index aligners, reporting the positions of all hits makes them much slower. Also, as I remember, an early version of bowtie reported fewer hits than Eland and bwa (these two agreed). I have not done a similar experiment with more recent versions.
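
To make option 1 concrete, here is a minimal Python sketch of a k-mer hash table counter. It assumes all reads have the same length k, that the reference is small enough to hold every k-mer in memory, and that a hit on either strand counts; the function names are illustrative only. Each contig is hashed separately, so no k-mer spans two contigs.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_table(contigs, k):
    """Count every forward-strand k-mer, one contig at a time."""
    table = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]] += 1
    return table


def count_exact_hits(reads, contigs, k, both_strands=True):
    """Total number of full-length exact hits of the reads in the contigs.

    Reads whose length is not k are skipped (an assumption of this sketch).
    A palindromic read is counted once per strand at each position.
    """
    table = build_kmer_table(contigs, k)
    total = 0
    for read in reads:
        read = read.upper()
        if len(read) != k:
            continue
        total += table[read]  # forward-strand hits (0 if absent)
        if both_strands:
            # A reverse-strand hit is a forward-strand hit of the
            # reverse complement.
            total += table[reverse_complement(read)]
    return total
```

The output is the single integer the question asks for, e.g. `count_exact_hits(reads, promoter_contigs, k=36)`, where `reads` and `promoter_contigs` are lists of sequence strings you have loaded yourself.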

My recommendation is 4.
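
Since options 3 and 4 need padded contigs, here is a small Python sketch of the padding step. It appends a run of A's to every record of a FASTA file before indexing. The function name, the default pad length of 1000 and the 60-column line width are illustrative choices, not anything BWA requires. Keep in mind that a read consisting only of A's would then match the padding.

```python
def _emit(header, seq, width):
    """Yield one FASTA record with the sequence wrapped at `width` columns."""
    yield header
    for i in range(0, len(seq), width):
        yield seq[i:i + width]


def pad_fasta(lines, pad_len=1000, pad_base="A", width=60):
    """Yield FASTA lines with pad_len copies of pad_base appended to the
    end of each record, so exact matches cannot run across contig ends.

    `lines` is any iterable of text lines, e.g. an open file. Sequence
    lines that appear before the first header are ignored.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield from _emit(header, "".join(chunks) + pad_base * pad_len, width)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield from _emit(header, "".join(chunks) + pad_base * pad_len, width)
```

Usage would be along the lines of writing `"\n".join(pad_fasta(open("promoters.fa")))` to a new file and indexing that file instead (the file name here is a placeholder).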


+1 Very thorough response. I would add that Vmatch can be used for exact matches and might be a fast, easy alternative to #1 above. I don't know how it compares to the other tools for alignment, but for custom matching tasks it is a great tool. I'd avoid indexing the reads with Vmatch, though: that would create very large files and would not be the most efficient approach.
