Entering edit mode
6.9 years ago
Anand Rao
▴
640
I have the following tasks that I need to carry out
a. BAM mapping of Illumina HiSeq 4000 paired end 2150bp reads to *de novo assembly inferred from those reads,
b. Sort BAM file by coordinates,
c. Index BAM file,
d. Mark or Remove Duplicates in BAM file.
When I was looking into software to generate my reads <=> assembly BAM file, there are so many options: https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software
Are there any top recommendation(s) from the community, and rationale behing those recommendations (based on differences amongst this long list of software)?
Why don't you answer the take home of your link:
Given you have standard Illumina data, use BWA mem for alignment, piped into sambamba sort. The indexing is done on the fly. Use sambamba markdup to remove duplicates. What data you work on, and what are the downstream tasks?
This is a personal choice to a large extent. You should get more or less similar answers for most aligners (assuming you are mindful of the options as to how the aligners handle multi-mappers, what seed they use and if they can produce deterministic output).
I say, remove duplicates by clumpify, map using bbmap, pipe into samtools to sort and finally index.
BTW: Is "reproducing someone else's analysis" phase over or not yet?
Not replicating prior results any more, good memory genomax. Thanks to useful inputs from forum members, including yourself, I've figured why and how I was able to improve upon prior results in terms of contiguity (QUAST, no reference) and completeness (BUSCO) - I'm now working on assessing correctness with REAPR. Have strange results, so looking to use a different mapper to check if SMALT mapper might be introducing some artefacts...
I need to map my reads to the assembly with exact matches - does that mean I need an exhaustive search mapper rather a heuristics based one? Is that what you alluded to in your response? And I suppose bbmap is deterministic, which would fit my purposes well, yes?
You could do only an exact match but the assembly you have is likely a consensus of multiple reads so you should consider allowing at least one or two errors over 150 bp read. BBMap can be deterministic with
deterministic=t
flag. This flag is only applicable for paired-end reads.bwa-mem appears to have no flag to specify how many maximum numbers of mismatches are allowed, someone please correct me if I am mistaken. Regardless, for bbmap.sh, what would that flag be, to specify max. no. of mismatches?
My university HPCC is inaccessible because of generator test - so I am unable to look at command line menu right now. Meanwhile at https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/, it's not entirely clear which flags/parameters to use to specify <= 2 mismatches per read in PE pair. Could you guide me in the right direction?
BBMap does not have a specific number of mismatchs flag. Instead it has has an "idfilter" flag - e.g. "idfilter=0.90", which will discard reads that don't map with at least 90% identity, regardless of their length. It also has an "idtag" flag, which will annotate all the alignments in the sam file with the percent identity, so you can manually filter them if you want. There are also mind/minratio and subfilter options. Take a look at the inline help.
Tagging: Brian Bushnell He can add more information.
If you're an experienced analyst, bioinformatician, or computational biologist (or whatever else 'they' call us), then you'll know that the different aligners are generally targeting different types of reads. As genomax says, they'll all mostly produce the same result in any case.
For example,
bowtie
/bowtie2
is good for short reads >30bp and is also useful if you only want uniquely-mapped reads (you have to specify this setting in the parameters though), whereasBWA mem
is designed for reads > 70bp in length (however, you can use bowtie for these reads too). Then you have specialized aligners likeTopHat
(which ultimately uses bowtie) that is tailoured for RNA-seq and the search for splice-site junctions. There are many other aligners that are specialized, likeKallisto
('pseudo'-aligners for RNA-seq that align your reads using a de Bruijn graph to a reference FASTA CDS transcriptome).As you become more experienced, you'll instantly know which one to employ in which situation.
I cannot comment on aligners for long (>1000bp reads).
What you should do is check your read length and then choose your aligner. You can use this quick script to check the read lengths that exist in any fastq.gz file:
zcat MyReads.fastq.gz | awk '{if(NR%4==2) print length($1)}' | uniq
...or just use a tool like
FastQC
(or look at the HiSeq report for the run).Given the use of an Illumina HiSeq 4000 sequencer, I'm pretty sure they meant 2x150bp, and not 2150bp reads.
For your need, the recommended tool is Bowtie2