Error Correction with Tadpole and BBMerge
20 days ago
Jon • 0

I am trying to error-correct shotgun reads so that they will assemble better. Due to memory limits I can't process the whole read set at once, so I was wondering about breaking it into smaller subsets, but I don't know whether this would create different problems.

In my reads there are some known species for which I have already obtained good contigs at 100% identity. Can I use BBDuk, for example, against those references to pull out reads that match specific species, error-correct them separately, and then concatenate everything back together again for de novo assembly? And then do the same thing with Kraken2, taking everything it classifies as "bacteria", for example, or some other group.

I was thinking that BBMerge, which I believe works directly on each pair, shouldn't have an effect on other reads. Since Tadpole uses k-mers from all of the reads supplied for error correction, though, wouldn't it need to be run on everything I eventually want to assemble de novo, or would it matter if I ran it on subsets of, say, bacteria/fungi/etc.?

Is there a consensus on whether to run BBMerge or Tadpole first? I had previously seen the author recommend running BBMerge error correction before Tadpole, but I've also seen others talk about running Tadpole first.

Tadpole extend reads: I have read that this apparently improves assembly with metaSPAdes. If I use the extend-reads option, is it recommended to use BBMerge to error-correct the potential overlap afterwards?

Thanks!!


Not exactly an answer to your question, but it still may be useful.

If you are going to assemble with SPAdes in the end, note that it does its own error correction, so you probably shouldn't do it beforehand.
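
If you do pre-correct the reads anyway, SPAdes/metaSPAdes can be told to skip its own read-correction stage. A minimal sketch, assuming already-corrected paired files and a placeholder output directory:

# Skip the built-in read error correction and go straight to assembly
metaspades.py -1 corrected_R1.fq.gz -2 corrected_R2.fq.gz --only-assembler -o metaspades_out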

A small C program for error correction that does an excellent job, is multithreaded, and likely won't cause memory problems:

https://github.com/lh3/bfc

This normally works well for me and is very fast; I can't remember ever having memory issues with it.

bfc -1 -k 21 -t 20 reads.fastq.gz > reads.bfc_corrected.fastq
19 days ago
GenoMax 152k

Can I use BBDuk, for example, against those references to pull out reads that match specific species, error-correct them separately

Filtering reads that match one or more reference genomes is a task bbduk does. I don't understand the error correction part, though: error correction uses the data in your dataset. There is a page here that describes how tadpole.sh works (it is actually a k-mer-based assembler): https://archive.jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/tadpole-guide/
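
If you do go the filtering route, a rough sketch of the two steps could look like this (file names are placeholders, and the k/hdist values are examples to tune, not recommendations):

# Pull out read pairs that match a reference; unmatched pairs go to the "rest" files
bbduk.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz outm=speciesX_R1.fq.gz outm2=speciesX_R2.fq.gz out=rest_R1.fq.gz out2=rest_R2.fq.gz ref=speciesX_contigs.fasta k=31 hdist=1

# k-mer based error correction of that subset with Tadpole
tadpole.sh in=speciesX_R1.fq.gz in2=speciesX_R2.fq.gz out=speciesX_ecc_R1.fq.gz out2=speciesX_ecc_R2.fq.gz mode=correct k=50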

BBMerge is used to create a longer representation of library fragments when the paired-end reads overlap in the middle (the illustration in this answer should help with visualization: What is the difference between paired end reads and overlapping reads, and then why merge overlapping reads before assembly?). In normal genomic libraries this rarely happens unless the library inserts are short; if it does, that indicates poorly made libraries, since one normally aims for 350-450 bp inserts (longer than common Illumina read lengths). While bbmerge can use error correction to improve the paired-end read merge, it is not an error correction tool per se for the entire dataset. See more: https://archive.jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmerge-guide/
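
For reference, the overlap-based correction you mentioned (fixing mismatches in the overlap without actually merging the pairs) is bbmerge's ecco mode. A minimal sketch with placeholder file names; check the flags against the guide linked above:

# Error-correct overlapping pairs via their overlap but keep them unmerged (ecco mix), with strict overlap detection
bbmerge.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz out=ecco_R1.fq.gz out2=ecco_R2.fq.gz ecco mix vstrict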

Ref: I am going to leave a link to the prior thread where we discussed the experiment behind this data extensively: Host Removal Issues Human/Dog with samples containing Eukaryotic species

While we did not discuss this previously, what is the read length of your datasets, and are they paired-end for all samples?


All of the sets I'm working with are 2 x 150 bp reads.

I have 2 sets that were done on a NovaSeq 6000; I believe these are called high-quality mate-pair reads, as they are in FR orientation. My other sets were done on the AVITI sequencer.

I understand the target insert length, and overall they are probably in that range.

My assumption is that, given 150 bp reads, most reads that fall below 150 bp after adapter removal are likely mergeable, and the BBMerge overlap correction can help correct errors on the ends of those reads. Pairs that remain at 150 bp after adapter removal come from fragments somewhere over 300 bp.

It's not my goal to actually merge the reads, just to do error correction for a better assembly, but maybe merging is OK too? I do not believe metaSPAdes can include single reads, so it wants paired reads. Apparently metaSPAdes wants the paired reads to be interleaved: when I tried supplying them as separate pair files and looked at the corrected reads, half of my reads had disappeared. When I ran it again with interleaved input, it looked like all or most of my reads were present.
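
For what it's worth, if interleaving is the sticking point, one way to interleave the pairs and hand them to metaSPAdes is sketched below (file names are placeholders, and the --12 input style should be checked against your SPAdes version):

# Interleave the pair files with BBTools reformat.sh
reformat.sh in1=trimmed_R1.fq.gz in2=trimmed_R2.fq.gz out=interleaved.fq.gz

# Run metaSPAdes on the interleaved pairs
metaspades.py --12 interleaved.fq.gz -o metaspades_out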

I assume it's best to take all of the reads intended for assembly and process them with Tadpole together, but what to do when that won't run because it runs out of memory is my main question. If I can't run them all at the same time, I wondered about running subsets of the reads through error correction and then combining them back together for assembly.
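
As an aside, before splitting the data it may be worth trying Tadpole's memory-reduction options; a sketch with placeholder file names and values (prefilter uses a pre-pass counting filter to ignore low-depth k-mers, and -Xmx caps the Java heap):

# Attempt whole-dataset correction with a lower memory footprint
tadpole.sh in=all_R1.fq.gz in2=all_R2.fq.gz out=ecc_R1.fq.gz out2=ecc_R2.fq.gz mode=correct k=50 prefilter=2 -Xmx60g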


I believe these are called high-quality mate-pair reads

Are you sure? Mate-pair libraries are not common and it would be surprising if you have them. They will need different analysis. By FR you must mean you have normal forward-reverse reads from a normal library.

most reads that fall below 150 bp after adapter removal are likely mergeable

If you have 300+ bp inserts then your reads should not contain adapters. If they do have adapter sequences at the 3' end of Illumina reads (and possibly Aviti, which I don't have experience with), then that means you have inserts that are shorter than ~125 bp.

If you have normal libraries then the reads will not merge, since 150 bp will be shorter than the insert size.
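
A quick way to check the actual insert sizes is to let bbmerge estimate the insert-size distribution by overlap on a subsample; a sketch with placeholder file names:

# Estimate insert sizes by overlap on the first ~1 million pairs; writes a histogram to ihist
bbmerge.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz ihist=insert_size_hist.txt reads=1m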

You are referring to metaSPAdes, so do you feel there is data from more than one species left over after removing the host, or are you using the entire dataset?


Since they are shotgun reads, I'm assuming there is more variability in read lengths? Most of the reads are, I think, over 300 bp, since my ending reads are mostly 150 bp after adapter removal.

As for the mate pair, these were done by Zymo Research, which I believe describes them as mate-pair reads. They contain Nextera transposase adapters, which I believe are used by mate-pair sequencing. Standard mate pair is RF, but since these are FR I had read that they are called high-quality mate pair. These were fairly problematic reads, as there are a LOT of very short reads in the 35-50 bp range, and quite a lot had long poly-G tails, often with nothing left of Read 2.

I think quality is an issue on the mate-pair reads; there are quite a lot of reads that look like the one below, starting out at Q40. Many times, if I BLAST the read it will match a species at 100%, sometimes with only the front 50%. I have often read to only trim the ends, so how would you trim this?

TCCCAAAGTGACGGGATGACAGGCATGAGCCACCGCGCCCGGCCTATTGTATTGTATTGTATTGTATTTTATTGTGTTTTGTTTTGTTTTTTTTTCTTTTCTCCTCTCTTCCCCTCTTTTCCCCCCCCCTCCCCCCCCCCACCCCCCCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIII-II-99-I-I-II-9--9----9-9-99-9--999--999----99---------9---9-----99-9-I---9-9999-999-999-

I'm assuming there is more variability in read lengths?

There should be no variability in read lengths (for Illumina) before scanning and trimming. After scanning and trimming, reads will get shorter if adapter sequence is found and removed.

Most of the reads are, I think, over 300 bp, since my ending reads are mostly 150 bp after adapter removal.

Fragments (not reads) are going to be over 300 bp (which is expected for normal genomic libraries). Reads are going to remain 150 bp if no adapters are present.

These were fairly problematic reads, as there are a LOT of very short reads in the 35-50 bp range, and quite a lot had long poly-G tails, often with nothing left of Read 2.

Did you check the link above for mate-pair libraries (see the graphic to confirm)? Since these libraries bring two fragment ends (that are 2-5 kb apart) near each other after circularization, the TLEN in alignments will be large. Otherwise there may be some issue with these libraries. BTW: a quick look does not seem to bring up any kits for mate-pair libraries from Zymo, but that may be wrong.


As for the example read above, it does not seem to contain any Illumina adapters (I don't know about Aviti). The first half of the read maps to the human genome. The second half of the read has poor quality, and if you were to trim keeping a minimum quality of 15 then you end up with:

TCCCAAAGTGACGGGATGACAGGCATGAGCCACCGCGCCCGGCCTATTGTATTGTATTGTATTGTATTTTATT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIII-II-99-I-I-II

At a minimum of Q20 the read gets trimmed back further:

TCCCAAAGTGACGGGATGACAGGCATGAGCCACCGCGCCCGGCCTATTGTATTGTATTGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIII
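
If you want to reproduce this kind of trimming yourself, a sketch with bbduk (quality-trimming the right/3' end; the exact cut points may differ slightly from the example above, and trimq and minlength are values to adjust):

# Quality-trim the 3' end at Q15 (use trimq=20 for the stricter trim) and drop reads that get too short
bbduk.sh in=reads.fq.gz out=qtrimmed.fq.gz qtrim=r trimq=15 minlength=50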

Yes, I had read about mate-pair reads previously. Working with Zymo Research was problematic, as it didn't seem like anyone I talked to actually knew anything, and after seeing all the error issues with their shotgun sequencing it wasn't worth sending them any more samples. It's entirely possible they are just standard paired-end reads; it was just that, from what I read, after trimming mate-pair reads there are going to be a lot of very short reads once the adapters are removed.

So anyway, after trimming I'm left with quite a lot of single-end reads, many of which are non-human. How can I include these in the assembly? I had thought I could possibly reverse-complement them, convert both to BAM, and convert back to interleaved forward/reverse. I'm assuming metaSPAdes would correct any errors? I know it's not optimal.


from what I read, after trimming mate-pair reads there are going to be a lot of very short reads once the adapters are removed.

Again that would only be true in cases where the insert sizes are smaller than read lengths. Otherwise that should not happen.

So anyway, after trimming I'm left with quite a lot of single-end reads, many of which are non-human.

This is not making immediate sense. Investigate why the second read is getting discarded. Are you doing Q score filtering and is that the reason for this?

I had thought I could possibly reverse-complement them, convert both to BAM, and convert back to interleaved forward/reverse.

Not sure what you are proposing here.

If you have a lot of single-end reads and want to include them, then perhaps you can use only the trimmed R1 files as input for metaSPAdes.
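
Not something we discussed above, but if the goal is to keep proper pairs and orphans cleanly separated after trimming, BBTools repair.sh can re-pair the files and send the orphans to their own file; a sketch with placeholder names:

# Re-pair R1/R2 after trimming; reads whose mate was discarded go to the singletons file
repair.sh in1=trimmed_R1.fq.gz in2=trimmed_R2.fq.gz out1=paired_R1.fq.gz out2=paired_R2.fq.gz outs=singletons.fq.gz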


I was just thinking: is it OK to add single reads to the end of the interleaved reads?


Don't do that unless a program tells you that it is OK to do so. Otherwise programs will assume that all reads are interleaved, and you will get odd results and/or errors.

Note: Please use ADD REPLY/ADD COMMENT when responding to existing posts. SUBMIT ANSWER should be used for new answers to the original question.


GOT IT!!! I didn't notice the button.

Thanks for your help

