Question: Troubleshooting low quality mapping
h.l.wong50 (Australia) wrote 2.9 years ago:

Hi all, I have been working on a metagenomic analysis of a microbial community. It was sequenced on Illumina with Nextera library prep, 2 x 150 bp. I have obtained low map reading quality (properly paired ~23%), and I would like to know what can be done to raise the mapping quality. Here is what I have done:

  1. Used Trimmomatic to trim away low-quality reads
  2. Concatenated the forward files into a single forward fastq file, and likewise for the reverse files
  3. Ran FastQC to check quality
  4. Assembled with IDBA-UD (mink = 20, maxk = 100)
  5. Mapped with Bowtie2, then ran samtools flagstat for statistics, which showed low mapping (~23% properly paired)

The flagstat looks like this:

41641174 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
18804112 + 0 mapped (45.16% : N/A)
41641174 + 0 paired in sequencing
20820587 + 0 read1
20820587 + 0 read2
9847366 + 0 properly paired (23.65% : N/A)
17193728 + 0 with itself and mate mapped
1610384 + 0 singletons (3.87% : N/A)
1906784 + 0 with mate mapped to a different chr
1813704 + 0 with mate mapped to a different chr (mapQ>=5)
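For reference, the percentages flagstat reports can be recomputed directly from the raw counts above; a minimal awk sketch (the three numbers are copied from the flagstat output):

```shell
# Recompute the flagstat percentages from the raw counts
total=41641174    # QC-passed reads in total
mapped=18804112   # reads mapped
proper=9847366    # reads properly paired
awk -v t="$total" -v m="$mapped" -v p="$proper" \
    'BEGIN { printf "mapped: %.2f%%, properly paired: %.2f%%\n", 100*m/t, 100*p/t }'
# prints: mapped: 45.16%, properly paired: 23.65%
```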

I am not sure what caused the low mapping quality. Do I need to concatenate the files first and then use Trimmomatic? Doing it the other way might have caused the forward and reverse files to lose reads at different positions (e.g. forward losing bases at position 1000, but reverse losing bases at position 9000).

What could I do to raise the mapping quality?

Cheers and thanks,

Alan

Tags: alignment, assembly
written 2.9 years ago by h.l.wong50

I am not sure how Trimmomatic works, but for IDBA-UD the fastq reads are usually combined into a single fasta file. IDBA-UD expects the reads to be in the same order in both files before they are combined. Did you check that the reads are in the same order in both fastq files before combining them into the fasta input for IDBA-UD? If you miss this step, IDBA-UD will not generate good-quality assemblies.

written 2.9 years ago by Sej Modha4.5k
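A quick way to perform the order check Sej describes is to compare the read-ID field of every header in the two files; a minimal sketch (the filenames and the two tiny example records are invented, and it assumes modern headers like "@id 1:N:0:...", not the older "@id/1" style):

```shell
# Check that R1 and R2 fastq files list reads in the same order, by
# comparing the ID field (text before the first space) of each header.
# Toy example files standing in for the real data:
printf '@r1 1:N:0:1\nACGT\n+\nFFFF\n@r2 1:N:0:1\nCCGG\n+\nFFFF\n' > read1.fq
printf '@r1 2:N:0:1\nTGCA\n+\nFFFF\n@r2 2:N:0:1\nGGCC\n+\nFFFF\n' > read2.fq

awk 'NR % 4 == 1 { print $1 }' read1.fq > ids1.txt   # header IDs of R1
awk 'NR % 4 == 1 { print $1 }' read2.fq > ids2.txt   # header IDs of R2
if cmp -s ids1.txt ids2.txt; then
    echo "read order OK"
else
    echo "read order MISMATCH"
fi
# prints: read order OK
```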

Thank you Sej, I have checked my scripts and confirmed that the reads are in the same order.

I have concatenated the samples as below

cat S1_F.fastq S2_F.fastq S3_F.fastq S4_F.fastq

Then for reverse reads

cat S1_R.fastq S2_R.fastq S3_R.fastq S4_R.fastq

written 2.9 years ago by h.l.wong50

That looks fine to me, for the concatenation step, at least.

written 2.9 years ago by Brian Bushnell17k

You are sending the cat'ed results to a new file, correct?

cat S1_F.fastq S2_F.fastq S3_F.fastq S4_F.fastq > total_F.fastq

written 2.9 years ago by genomax75k

I tend to use the fq2fa utility provided in the idba_ud suite with the --merge option, as the recommended step to get a fasta file from paired fastq reads in two separate files. I am not sure whether cat formats the output the same way fq2fa does; it is worth checking, though.

written 2.9 years ago by Sej Modha4.5k

In the example above, cat is only concatenating the files together. It is not changing the format from fastq to fasta.

written 2.9 years ago by genomax75k

That's correct, and cat alone also would not put the sequences into the input format IDBA-UD expects for assembly.

fq2fa --merge generates an output fasta file where the reads from file_1.fq and file_2.fq are interleaved in order:

>HISEQ:84:MACHINE:2:1101:1439:2046 1:N:0:TAGCTT
CAGACACACAAACACGCACACATACACACAAACACACACACAGAGACACACACAAAACACAAACACATACACACAGAAACACACACAGAAACACACACACACGCACAAACACACACAAACACACACACACACACACACACACACACACA
>HISEQ:84:MACHINE:2:1101:1439:2046 2:N:0:TAGCTT
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTTTTGTGTGTGTTTGTGTTTGTTTTGTGTGTGTGTGTGTGTGTGTGTGT
>HISEQ:84:MACHINE:2:1101:1435:2092 1:N:0:TAGCTT
CACAACCTTCGTGGCGAGTGGTTCAATGAGCTCCTGCAGGGCCGTGTCCGCCCCAACCAGGTCAGCACATGGGAGAAAGGACTACTCCACCACGCCCAGCTGGTCGTGCTCGACGAGCTCCCAAAAATGCCTCGCGGTTACCTCGAC
>HISEQ:84:MACHINE:2:1101:1292:2094 2:N:0:TAGCTT
AGGACACGTTCCTGCGTGCGTCGGCACATCTAGTTAACACTCCAAGTACCGCCATCTCGCATGCCCTGTGGCACATACACGCTTTATGCGTCCATCTAGCCGTTCTTCCGTCTGCTAGTCTGTCCGTCCGTCCGTGCGTCAATCTAGTG
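For illustration, a rough coreutils sketch of the same interleaving (toy records and filenames; this is not a substitute for fq2fa, which remains the recommended tool):

```shell
# Interleave two paired fastq files into a single fasta, read N of file 1
# immediately followed by read N of file 2 (the layout fq2fa --merge
# produces). The records below are toy examples.
printf '@r1 1:N:0:1\nACGT\n+\nFFFF\n' > read1.fq
printf '@r1 2:N:0:1\nTGCA\n+\nFFFF\n' > read2.fq

paste read1.fq read2.fq | awk -F '\t' '
    NR % 4 == 1 { h1 = $1; h2 = $2 }             # header lines
    NR % 4 == 2 { printf ">%s\n%s\n>%s\n%s\n",   # sequence lines
                  substr(h1, 2), $1, substr(h2, 2), $2 }' > merged.fa
cat merged.fa
```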
written 2.9 years ago by Sej Modha4.5k

In metagenomics, you should check your assembly before going into alignment.

written 2.9 years ago by SmallChess500
Brian Bushnell17k (Walnut Creek, USA) wrote 2.9 years ago:

I have obtained low map reading quality (properly paired ~23%), and I would like to know what can be done to raise the mapping quality.

Strictly speaking, I would not call this mapping quality, but mapping rate and pairing rate. Mapping quality refers to the confidence of the alignments of the reads that do map.

Typically, for metagenomes, this is a problem of the assembly, not the reads or the aligner. What kind of N50/L50 do you have? What's the average read depth? How diverse is the community, and how much did you sequence? What's your insert size distribution?

If you assembled a big community from insufficient data and your contigs are typically shorter than your insert size, then most reads can't possibly map as proper pairs.

I am not sure what caused the low mapping quality, do I need to concatenate files first then use trimmomatic?

There's not enough detail to know what you're asking, but I'm pretty sure the answer is no.

As this might have caused forward and reverse file losing reads at different position (eg. Forward losing bases at position 1000, but reverse reads losing bases at position 9000).

Can you rephrase this? I'm not sure what you mean. Do you mean reads are trimmed differently? That's typically expected, but without your Trimmomatic command, it's hard to evaluate.

Things are probably working as expected, but that answer could be clarified with more information.

written 2.9 years ago by Brian Bushnell17k

Thanks Brian.

I have assembled using IDBA-UD and my command line is listed below:

idba_ud -r SSD3.fasta --mink 20 --maxk 124 --num_threads 24 -o SSD3Min20Max124/

The sequences are from a microbial community and previous 16S study showed a very diverse community. I got ~13 million sequences after trimming.

I used the contig.fa generated from the assembly as the fasta reference indexed by Bowtie2; is this correct? My colleagues argue I should use scaffold.fa for Bowtie2, but I have no idea which one is more suitable.

Cheers and many thanks

Alan

According to QUAST, N50 is 1566, L50 is 51287, and the number of contigs is 297599, of which 105403 are >=1000 bp.
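For reference, N50/L50 can be recomputed from the contig lengths alone; a minimal sketch (the lengths file here holds toy numbers, not the real assembly):

```shell
# N50 = contig length at which the running sum of lengths (sorted
# descending) first reaches half the total assembly size;
# L50 = the number of contigs needed to get there.
printf '8\n8\n4\n3\n3\n2\n2\n2\n' > lengths.txt   # toy contig lengths
sort -rn lengths.txt | awk '
    { len[NR] = $1; total += $1 }
    END {
        half = total / 2
        for (i = 1; i <= NR; i++) {
            run += len[i]
            if (run >= half) { printf "N50=%d L50=%d\n", len[i], i; exit }
        }
    }'
# prints: N50=8 L50=2
```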

written 2.9 years ago by h.l.wong50

Best to use the scaffolds; they can have substantially better contiguity.

I got ~13 million sequences after trimming

That's... very few reads for a very diverse community. Normally, we would try to use hundreds of millions of reads for a diverse metagenome; 13 million is about how many we might target for a single bacterial isolate. What sequencing platform are you using, and how many reads did you start with (before trimming)?

written 2.9 years ago by Brian Bushnell17k

We got 20 samples sequenced and each sample contains ~13 - 20 million sequences.

Before trimming the samples got about 17 - 26 million sequences.

We used Illumina Nextera sequencing, 2 x 150bp

The Trimmomatic command line I used is listed below:

java -jar /share/apps/trimmomatic/0.33/trimmomatic-0.33.jar PE -phred33 sample.fastq sample_R2.fastq sample_R1_P.fastq S3_R1_U.fastq S3_R2_P.fastq S3_R2_U.fastq SLIDINGWINDOW:4:25 MINLEN:50
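As a side note, a quick way to count how many reads survive a trimming step (a minimal sketch; the toy fastq here stands in for the real Trimmomatic output file):

```shell
# A fastq record is 4 lines, so read count = line count / 4.
# Toy stand-in for the real trimmed output file:
printf '@r1\nACGT\n+\nFFFF\n@r2\nTGCA\n+\nFFFF\n' > sample_R1_P.fastq
echo "$(( $(wc -l < sample_R1_P.fastq) / 4 )) reads"
# prints: 2 reads
```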

And thank you for the advice. I will use the scaffolds for mapping, and I will let you know how it goes.

Cheers

Alan

written 2.9 years ago by h.l.wong50

You appear not to be doing any adapter trimming, and you are quality-trimming far too harshly with a non-optimal trimmer. Overtrimming will, as you have noticed, unnecessarily discard a huge amount of data. While extremely aggressive trimming can occasionally improve results in specific scenarios (like 1000x+ coverage of a neutral-GC isolate organism), it can be disastrous for metagenomics, due to low coverage and the Illumina platform's various sequence-related biases, which affect both sequence quantity and quality.

I recommend that you download BBMap and Megahit and try this:

1) Concatenate all of your files together, as you have done, so you have read1.fq and read2.fq.

2) Trim adapters and quality at once:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=trimmed1.fq out2=trimmed2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=r trimq=8 minlen=70 maxns=0

3) Remove contaminants; the exact step depends on what contaminants are present. But, for example:

bbduk.sh -Xmx1g in1=trimmed1.fq in2=trimmed2.fq out1=clean1.fq out2=clean2.fq ref=sequencing_artifacts.fa k=31

4) Assemble with Megahit.

adapters.fa and sequencing_artifacts.fa are both distributed with BBMap, in the /bbmap/resources/ directory.

For other assemblers, there are other steps for an optimal assembly (for example, with IDBA-UD you should use Tadpole for error-correction), but Megahit produces good metagenomic assemblies quickly and efficiently, and this is the best pipeline I've found for it, given this level of detail.

written 2.9 years ago by Brian Bushnell17k

Thank you so much Brian.

I have a couple of questions regarding your reply:

  1. I have checked with FastQC and there is no adapter contamination, and my sequencing provider told me that adapters are removed automatically by default in Nextera sequencing. So will the bbduk.sh command still work if I remove "ref=adapters.fa" and "tbo"?

  2. After assembling with Megahit, which output files should I use for, say, Prodigal for functional annotation and Bowtie2 for mapping?

Cheers and many thanks

Alan

written 2.9 years ago by h.l.wong50

1) Even if FastQC doesn't show it, the reads still have adapter sequence that should be trimmed. Even 0.3% or so will interfere with metagenome assembly because there is a lot of low-coverage material. But to answer your question: yes, you can remove "ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo", and then it will just do quality-trimming and removal of reads with Ns.

2) Hmmm, if I recall correctly, Megahit produces a file called "contigs.fa" which is what you will want to use. Incidentally, I recommend BBMap for mapping :)

written 2.9 years ago by Brian Bushnell17k

Thank you Brian,

bbduk.sh works great, but I am stuck on a bbmap error. Here is the link to the new question: BBMap error with no output

Thank you

Alan

written 2.8 years ago by h.l.wong50