Alignment using BWA tool
2
0
Entering edit mode
2.4 years ago

Hi All,

I am doing alignment using BWA mem using command

 bwa index -p H37Rv_reference_genome -a bwtsw GCF_000195955.2_ASM19595v2_genomic.fasta

bwa mem H37Rv_reference_genome ERR3566420_1.fastq ERR3566420_2_trimmed.fastq >ERR3566420.sam
samtools view -bS ERR3566420.sam >ERR3566420.bam
samtools flagstat ERR3566420.bam


Now when I am seeing the percentage of the mapped reads I am seeing really less alignment to be precise (8.83%) paired and (7.19%) properly paired. My question is if both are related species of Mycobacterium how they are having such low alignment. Do I need to make come changes in the parameters?

Also I used Kraken tool for the alignment using this command

  ./kraken2 --db mycobacterium_database/ --threads 20 SRR8594817_1.fastq SRR8594817_2.fastq >SRR8594817_kraken.txt


So the results from this is (99.07%) alignment.

I am not able to understand that why am I getting such low alignment using BWA tool having reference genome as H37Rv genome and why the results from the Kraken tool are so contradictory.

Kindly help me with the possible suggestion, Thanks.

alignment metagenomics kraken BWA tool • 784 views
0
Entering edit mode

Are you sure trimming one fastq and not the other is what you want?

0
Entering edit mode

Hi, I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?

0
Entering edit mode

I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?

Absolutely. If you don't do that then if a read is removed from the trimmed file then that will throw off the order of reads in the file pair. Aligners do not check to see if read pairs are present in corresponding proper order. It would be best to re-trim your data by starting over and using both files at the same time when you trim.

0
Entering edit mode

Given that 99% of the untrimmed reads aligned to the Kraken database, they probably shouldn't bother removing any reads at all.

0
Entering edit mode

OK thanks, I'll once again repeat the procedure.

0
Entering edit mode
2.4 years ago

I don't see a contradiction. Aligning to a bunch of mycobacterial genomes is not the same as aligning to a single one. And you aligned some kind of trimmed fastq to the single genome, which might have messed up your paired end matching.

0
Entering edit mode

Even after doing the alignment for the good quality data where I didn't require to trim the reads, I could not find the significant alignment result when comparing it with the reference genome which belonged to the same genus.

0
Entering edit mode
2.4 years ago
GenoMax 115k

kriti.awasthi23 : I had a look at the dataset you posted originally (ERR3566420). With (GCF_000195955.2_ASM19595v2_genomic.fna:>NC_000962.3 Mycobacterium tuberculosis H37Rv) alignment stats were as follows using bbmap.sh aligner from BBMap suite.

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:              5.5300%             673         5.5300%             203246


This dataset has only ~3.6M reads.

I noticed that ERR* entry mentioned this to be Mycobacterium avium genome so with (GCF_000007865.1_ASM786v1_genomic.fna:>NC_002944.2 Mycobacterium avium subsp. paratuberculosis K-10) things improved a bit.

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:             20.6081%            2508        20.6081%             757416


SRR8594817 data is from yet another genome (GCF_000015005.1_ASM1500v1_genomic.fna:>NC_008596.1 Mycobacterium smegmatis str. MC2 155 chromosome) and alignment of this data to that genome results in (with ~10.1M reads)

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:             96.0233%         4862842        96.0368%          990823412


So pretty good overall.

Conclusion: ERR3566420 is probably a bad dataset with not enough reads.

0
Entering edit mode

Hi Sir, Thank you so much for providing so much clarity. I am new to this kind of analysis. Can you please guide me as to how you have done the analysis and how you are viewing the stats result (precisely commands)? Also my basic question is after alignment ho much depth and coverage we should check for the alignment samples. Also can you please guide me with how you got to know 3.1million reads and 10.1million reads respectively.

0
Entering edit mode

Also I want to know as to how to understand which data is good and which data to consider for further analysis ? I am checking the quality of the data by FastQC but still even after checking the quality of the data the alignment doesn't happens. What is the criteria for selecting the data ?