Question: Alignment using BWA tool
0
gravatar for kriti.awasthi23
7 months ago by
kriti.awasthi2310 wrote:

Hi All,

I am doing alignment using BWA mem using command

 bwa index -p H37Rv_reference_genome -a bwtsw GCF_000195955.2_ASM19595v2_genomic.fasta

 bwa mem H37Rv_reference_genome ERR3566420_1.fastq ERR3566420_2_trimmed.fastq >ERR3566420.sam
 samtools view -bS ERR3566420.sam >ERR3566420.bam
 samtools flagstat ERR3566420.bam

Now when I am seeing the percentage of the mapped reads I am seeing really less alignment to be precise (8.83%) paired and (7.19%) properly paired. My question is if both are related species of Mycobacterium how they are having such low alignment. Do I need to make come changes in the parameters?

Also I used Kraken tool for the alignment using this command

  ./kraken2 --db mycobacterium_database/ --threads 20 SRR8594817_1.fastq SRR8594817_2.fastq >SRR8594817_kraken.txt

So the results from this is (99.07%) alignment.

I am not able to understand that why am I getting such low alignment using BWA tool having reference genome as H37Rv genome and why the results from the Kraken tool are so contradictory.

Kindly help me with the possible suggestion, Thanks.

ADD COMMENTlink modified 7 months ago by swbarnes28.2k • written 7 months ago by kriti.awasthi2310

Are you sure trimming one fastq and not the other is what you want?

ADD REPLYlink written 7 months ago by swbarnes28.2k

Hi, I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?

ADD REPLYlink written 7 months ago by kriti.awasthi2310

I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?

Absolutely. If you don't do that then if a read is removed from the trimmed file then that will throw off the order of reads in the file pair. Aligners do not check to see if read pairs are present in corresponding proper order. It would be best to re-trim your data by starting over and using both files at the same time when you trim.

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax87k

Given that 99% of the untrimmed reads aligned to the Kraken database, they probably shouldn't bother removing any reads at all.

ADD REPLYlink written 7 months ago by swbarnes28.2k

OK thanks, I'll once again repeat the procedure.

ADD REPLYlink written 7 months ago by kriti.awasthi2310
0
gravatar for swbarnes2
7 months ago by
swbarnes28.2k
United States
swbarnes28.2k wrote:

I don't see a contradiction. Aligning to a bunch of mycobacterial genomes is not the same as aligning to a single one. And you aligned some kind of trimmed fastq to the single genome, which might have messed up your paired end matching.

ADD COMMENTlink written 7 months ago by swbarnes28.2k

Even after doing the alignment for the good quality data where I didn't require to trim the reads, I could not find the significant alignment result when comparing it with the reference genome which belonged to the same genus.

ADD REPLYlink written 7 months ago by kriti.awasthi2310
0
gravatar for genomax
7 months ago by
genomax87k
United States
genomax87k wrote:

kriti.awasthi23 : I had a look at the dataset you posted originally (ERR3566420). With (GCF_000195955.2_ASM19595v2_genomic.fna:>NC_000962.3 Mycobacterium tuberculosis H37Rv) alignment stats were as follows using bbmap.sh aligner from BBMap suite.

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:              5.5300%             673         5.5300%             203246

This dataset has only ~3.6M reads.

I noticed that ERR* entry mentioned this to be Mycobacterium avium genome so with (GCF_000007865.1_ASM786v1_genomic.fna:>NC_002944.2 Mycobacterium avium subsp. paratuberculosis K-10) things improved a bit.

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:             20.6081%            2508        20.6081%             757416

SRR8594817 data is from yet another genome (GCF_000015005.1_ASM1500v1_genomic.fna:>NC_008596.1 Mycobacterium smegmatis str. MC2 155 chromosome) and alignment of this data to that genome results in (with ~10.1M reads)

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:             96.0233%         4862842        96.0368%          990823412

So pretty good overall.

Conclusion: ERR3566420 is probably a bad dataset with not enough reads.

ADD COMMENTlink modified 7 months ago • written 7 months ago by genomax87k

Hi Sir, Thank you so much for providing so much clarity. I am new to this kind of analysis. Can you please guide me as to how you have done the analysis and how you are viewing the stats result (precisely commands)? Also my basic question is after alignment ho much depth and coverage we should check for the alignment samples. Also can you please guide me with how you got to know 3.1million reads and 10.1million reads respectively.

ADD REPLYlink modified 7 months ago • written 7 months ago by kriti.awasthi2310

Also I want to know as to how to understand which data is good and which data to consider for further analysis ? I am checking the quality of the data by FastQC but still even after checking the quality of the data the alignment doesn't happens. What is the criteria for selecting the data ?

ADD REPLYlink modified 7 months ago • written 7 months ago by kriti.awasthi2310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1459 users visited in the last hour