Sequence alignment help: BWA is not my friend!
2
2
Entering edit mode
7.7 years ago
ab.tsubaki ▴ 50

Hi all

I need some help with sequence alignment using BWA as I'm at my wits end.

Background: I've sequenced wheat DNA (note, highly repetitive) using the Illumina HiSeq 2000. I have contigs and scaffolds as assembled by the sequencing company and I have two reference genomes that should be pretty similar to my sequences. I have attempted sequence alignments of contigs and scaffolds against both reference genomes using BWA bwasw with default parameters. Note that the two reference genomes are in large scaffolds themselves.

Problem: I keep getting really bad alignments! Aligned sequences are heavily truncated in every instance, with less than half the sequence actually aligning to the reference! And if something does align, it's repetitive sequence. I have masked repetitive sequences and this alleviates the problem a little but doesn't solve the massive truncating problem. In addition, the majority of my sequence data doesn't align to anything and is clustered under "unaligned sequences".

Anyone with BWA experience or any experience with sequence alignment? I'd really appreciate a push in the right direction as this has been keeping me busy for much too long!

Thanks!

next-gen alignment • 4.9k views
1
Entering edit mode

Hi, could you please add some examples of the commands you used, and of the version of the data formats you are using? It is difficult to know what is the cause of your problem otherwise. Can you check if the quality scores in your fasta file are encoded using the phred+33 scheme, and not phred+64 (see wikipedia - fastq)? Another possibility is to use a software like Trinity for doing assembly without a reference genome.

0
Entering edit mode

The scripts I was using to align is as follows:

/opt/exp_soft/bwa-0.7.5a/bwa bwasw -t 3 /reference genome /fasta file 1 \
| /opt/exp_soft/samtools-0.1.19/samtools view -bS -h -o /output file.bam -


The data formats are in fasta. I did create an index for the reference genome before hand.

I'm not sure about the quality scores for my fasta file - what impact could that have?

I'm not really using the mapping to assemble, I need to compare to the reference genome for comparison's sake.

0
Entering edit mode

The quality scores would affect your results because if they are not encoded correctly, bwa may think that the sequences have a lower quality than the real one, and align them more poorly. But I don't think that this is your case anyway. In any case, you should look at the distribution of quality scores in your dataset (e.g. check the option QualityScoreDistribution in Picard Tools).

Another question is if there any specific reason why you used the bwa bwasw algorithm. The bwa mem should be more accurate, did you try that?

p.s. in the command line you posted there are some errors in the paths to the reference genome index file and the fasta files.

1
Entering edit mode
7.7 years ago

BWA is a short read aligner that was designed to quickly match short reads to genomes. It won't work all that well when you are aligning contigs to other contigs.

You will need to use other tools: for example lastz or MUMMER or do pairwise alignments with blast.

Unfortunately these tools in turn may have their own limitations (lastz works best in pairwise mode, and the output of MUMMER is not in SAM format) and may not produce outputs that are that are as easy to interpret as that of bwa. Other tools to explore are those that fall into assembly evaluation like

1
Entering edit mode

Both bwa-sw and bwa-mem work with scaffold-to-scaffold alignment as long as the sequence divergence is not so large. I suspect the problem is caused by the input data. Also the wheat genome is much larger than human. Mummer and Quast which uses mummer probably won't work.

0
Entering edit mode

So now I'm confused. The difference between my sequence and my reference shouldn't be that great - they're just different cultivars. Also, I only sequenced one arm of one chromosome and I'm comparing it to a reference genome, scaffolds of that same chromosome arm only. Would Mummer work alright with a smaller sequencing subset? My chromosome arm should be roundabout 200Mb?

0
Entering edit mode

I would first check if these are really wheat sequences (e.g. mapping wheat cDNA with blat) and whether there are low-level errors in data processing. Bwa-mem is also recommended over bwa-sw, though I doubt this can make a difference. If you are aligning 200Mb against 200Mb, Mummer should work. It is worth trying. A question though: how did you sequence one arm of a chromosome?

0
Entering edit mode

Thanks. I think I should try a couple of different things and see what works best. I am worried about my data quality though - the sequencing stats all look good, but who's to say it's all wheat sequences?!?!

Oh, for the chromosome sequencing: I made metaphase preparations on microscope slides and excised the chromosomes I wanted with a laser using this fancy microscope called the PALM® Robot-Microbeam system (P.A.L.M. Microlaser Technologies AG, Bernried, Germany).

0
Entering edit mode
7.7 years ago
ab.tsubaki ▴ 50

Thanks for the help guys!

So, BWA doesn't seem like the best option then.

I'll take a look at some other options. Does anyone have any experience with Bowtie? Would that be a viable option?

0
Entering edit mode

no, sorry, bowtie also is only for short-reads alignment. In your case you should use one of the tools suggested by Istvan.

1
Entering edit mode

Thanks!

Guess I should get cracking!