Interpreting mapping contaminants
1
0
Entering edit mode
5.4 years ago
cecilio11 ▴ 110

Dear Biostars,

I am posting this question in this forum because I found this wonderful site is populated mostly by very kind and helpful people

I am a beginner on the field of bioinformatics and genome assembly.

I am working on assembling several bacterial genomes. The sequences were obtained from pacbio. I found that there may be some contamination in the sequences of the bacteria I am trying to assemble. I used Minimap2 to align potential contaminants to the bam files (converted to fastq) . I used as reference for minimap2 the genomes of potential contaminants.

For example, i am trying to assemble a genome from the genus Ignatzschineria (Ignatz). The potential contaminants are: Moellerella (Moell) , Kurthia, Providencia (Prov). I also mapped the query to two genomes from the genus Ignatzschineria.

An example of the command line I used to align the fastq to a reference genome is:

minimap2 -ax map-pb KurthiaSp11kri321_CG_RefSeq_NZ_CP013217_1.fasta m54092_180525_143105.subreadsFQ.fastq > KurthiaRefVsIgnatz_alignmentpacbio.sam

The stderr and stdout from minimap2 seem to me unusable (am I correct?)

After the alignment, I used samtools to obtain the flagstas for each minimpa2-outputed .sam file: For example:

samtools flagstat KurthiaRefVsIgnatz_alignmentpacbio.sam

I need your help to interpret the results I obtained from flagstats for the 6 reference genomes:


samtools flagstat MoellRefVsIgnatz_alignmentpacbio.sam              
740728 + 0 in total (QC-passed reads + QC-failed reads)             
5523 + 0 secondary              
129702 + 0 supplementary                
0 + 0 duplicates                
599270 + 0 mapped (80.90% : N/A)                
0 + 0 paired in sequencing              
0 + 0 read1             
0 + 0 read2             
0 + 0 properly paired (N/A : N/A)               
0 + 0 with itself and mate mapped               
0 + 0 singletons (N/A : N/A)                
0 + 0 with mate mapped to a different chr               
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat KurthiaRefVsIgnatz_alignmentpacbio.sam                
621596 + 0 in total (QC-passed reads + QC-failed reads)             
15296 + 0 secondary             
797 + 0 supplementary               
0 + 0 duplicates                
20226 + 0 mapped (3.25% : N/A)              
0 + 0 paired in sequencing              
0 + 0 read1             
0 + 0 read2             
0 + 0 properly paired (N/A : N/A)               
0 + 0 with itself and mate mapped               
0 + 0 singletons (N/A : N/A)                
0 + 0 with mate mapped to a different chr               
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat IgnatzUreRefVsIgnatz_alignmentpacbio.sam          
613190 + 0 in total (QC-passed reads + QC-failed reads)         
4455 + 0 secondary          
3232 + 0 supplementary          
0 + 0 duplicates            
15101 + 0 mapped (2.46% : N/A)          
0 + 0 paired in sequencing          
0 + 0 read1         
0 + 0 read2         
0 + 0 properly paired (N/A : N/A)           
0 + 0 with itself and mate mapped           
0 + 0 singletons (N/A : N/A)            
0 + 0 with mate mapped to a different chr           
0 + 0 with mate mapped to a different chr (mapQ>=5)         

samtools flagstat IgnatzLARVAERefVsIgnatz_alignmentpacbio.sam           
609223 + 0 in total (QC-passed reads + QC-failed reads)         
3430 + 0 secondary          
290 + 0 supplementary           
0 + 0 duplicates            
8303 + 0 mapped (1.36% : N/A)           
0 + 0 paired in sequencing          
0 + 0 read1         
0 + 0 read2         
0 + 0 properly paired (N/A : N/A)           
0 + 0 with itself and mate mapped           
0 + 0 singletons (N/A : N/A)            
0 + 0 with mate mapped to a different chr           
0 + 0 with mate mapped to a different chr (mapQ>=5)         

samtools flagstat ProvALCALRefVsIgnatz_alignmentpacbio.sam
664124 + 0 in total (QC-passed reads + QC-failed reads)
49931 + 0 secondary
8690 + 0 supplementary
0 + 0 duplicates
129215 + 0 mapped (19.46% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat ProvRETTRefVsIgnatz_alignmentpacbio.sam
659123 + 0 in total (QC-passed reads + QC-failed reads)
12368 + 0 secondary
41252 + 0 supplementary
0 + 0 duplicates
133975 + 0 mapped (20.33% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I believe the results above indicated that Ignatz may not be Ignatz, but Moellerella, numeral "1." (80% of the queried sequences map to this genus).

Also, the 2 Ignatzschineria used as reference, IgnatzUre (numeral "3.") and IgnatzLarvae (numeral ".4") produce a low % of queried sequences mapping to them.

The 2 Providencia show about 20% of the queried sequences mapping to them. This may reflect their phylogenetic relationship.

Is my interpretation correct?

Regards,

Cecilio

alignment • 2.7k views
ADD COMMENT
0
Entering edit mode

The stderr and stdout from minimap2 seem to me unusable (am I correct?)

Not sure what you mean here. With > you are writing to stdout, your sam file is stdout. Note that you could also do the following:

minimap2 -ax map-pb genome.fasta reads.fastq | samtools sort -o alignment.bam

and get a sorted bam without having to do additional commands with intermediate files.

ADD REPLY
0
Entering edit mode

Also, you could save stderr with 2>stderr.txt. I never had problems with stdout / stderr and minimap2, what is the problem you are facing? You are not using nohup, by any chance?

ADD REPLY
0
Entering edit mode

WoterDeCoster and h.mon,

I am running minimap2 in a cluster that use an LSF. The alignment output is written to a directory. The stdout is an empty file (because the .sam file is written to the directory). The stderr looks like this:

[M::mm_idx_gen::0.1500.78] collected minimizers [M::mm_idx_gen::0.1791.11] sorted minimizers [M::main::0.1791.11] loaded/built the index for 108 target sequence(s) [M::mm_mapopt_update::0.1861.10] mid_occ = 5 [M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 108 [M::mm_idx_stat::0.1901.10] distinct minimizers: 425615 (99.34% are singletons); average occurrences: 1.008; average spacing: 7.699 [M::worker_pipeline::102.5472.99] mapped 75378 sequences [M::worker_pipeline::201.6473.02] mapped 78093 sequences [M::worker_pipeline::302.7553.03] mapped 78270 sequences [M::worker_pipeline::403.2203.03] mapped 78717 sequences [M::worker_pipeline::503.4563.04] mapped 78069 sequences [M::worker_pipeline::603.3333.04] mapped 78079 sequences [M::worker_pipeline::702.1573.04] mapped 77556 sequences [M::worker_pipeline::780.775*3.03] mapped 61341 sequences [M::main] Version: 2.14-r883 [M::main] CMD: minimap2 -ax map-pb Moellerella_wisconsensis_2_GCF_001294465.1_Mwi35017_DRAFTv2_genomic.fasta m54092_180525_143105.subreadsFQ.fastq [M::main] Real time: 780.796 sec; CPU: 2367.487 sec; Peak RSS: 4.082 GB

This stderr file seems to be uninformative for my purpose and I wanted to know if experienced people agrees with me.

My main concern is the interpretation of the output. As I mention above, "I believe the results above indicated that Ignatz may not be Ignatz, but Moellerella, numeral "1." (80% of the queried sequences map to this genus)". Is this conclusion in agreement with the flagstats I show above?

Thank you very much, again

Cecilio

ADD REPLY
0
Entering edit mode

WouterDeCoster,

Thank you for your comment. I will apply your suggested "pipe" to my next alignments.

What is your opinion on using the options "-G0" (no introns because these are bacteria) and "-N 5 --secondary=no" (no secondary alignments)?

So, I would have to drop the option -x (setting multiple parameters at the same time) in order for -G0 to work.

The line command will look like this:

minimap2 -a -G0 -N 5 --secondary=no RefSeq.fasta reads.fastq > alignment.sam

Do you think it is worth to drop the map-pb (setting for pacbio) to include the parameters G0 and "--secondary=no"?

Regards,

Cecilio

ADD REPLY
0
Entering edit mode
5.4 years ago
h.mon 35k

A shortcoming in your approach is to compare to a limited number of contaminants, and to just one at a time.

I like to use BlobTools (blasting against NCBI NT) to explore the taxonomic assignment of an assembly, and detect possible contamination - that is, I check for contaminants post-assembly.

You can also use sketches to analyse contamination either on your raw data (pre-assembly) or on assemblies, see:

Mash Screen: what's in my sequencing run?

What’s in my metagenome?

Tool: BBSketch - A Tool for Rapid Sequence Comparison

Finally, you can also use kmer screening tools like Kraken or Centrifuge to screen and filter out contaminants.

ADD COMMENT
0
Entering edit mode

h.mon, That sounds great. There are so many tools for cleaning contaminants. I wonder if I have closely related bacterial species, are those tools sensitive enough to distinguish between genes that have been horizontally transmitted between bacteria and are not in fact contaminants? Regards, Cecilio1

ADD REPLY
0
Entering edit mode

Are the bacterial genus above closely related? What is the relation of horizontal transmission to your original question? Detecting horizontal transmission with accuracy is hard, detecting horizontal transmission from closely related species is even harder. Now, if you have horizontal transmission and technical contamination, you made the task impossible.

ADD REPLY
0
Entering edit mode

I see. It may be better to sequence the bacteria again, making sure that there is no contamination.

ADD REPLY

Login before adding your answer.

Traffic: 3228 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6