HiSeq 4000 PhiX screening and removal using bbduk
0
1
Entering edit mode
3.7 years ago
Anand Rao ▴ 360

I have Illumina HiSeq 4000 paired end reads. I am trying to follow the step-by-step procedure outlined at http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

step1 is complete Force-Trim Modulo:

bbduk.sh in=reads.fq out=clean.fq ftm=5


> Input:                    16596362 reads      2489454300 bases. FTrimmed:
> (0.00%)   0 bases (0.00%) Result:                     16596362 reads
> (100.00%)     2489454300 bases (100.00%)


> Input:                    16596362 reads      2489454300 bases. FTrimmed:
> (0.00%)   0 bases (0.00%) Result:                     16596362 reads
> (100.00%)     2489454300 bases (100.00%)


Moving on to step 2 - phix spike-in removal, which is where I think I see something unexpected

bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt


For forward reads, not a huge fraction removed

Input:                      16596362 reads      2489454300 bases.
Contaminants:               831 reads (0.01%)   124650 bases (0.01%)
Total Removed:              831 reads (0.01%)   124650 bases (0.01%)
Result:                     16595531 reads (99.99%)     2489329650 bases (99.99%)


BUT for reverse reads, this fraction seems quite large, doesn't it?

Input:                      16596362 reads      2489454300 bases.
Contaminants:               3088648 reads (18.61%)  463297200 bases (18.61%)
Total Removed:              3088648 reads (18.61%)  463297200 bases (18.61%)
Result:                     13507714 reads (81.39%)     2026157100 bases (81.39%)


File listings are - outm 'matched' output files for reverse and forward inputs

aksrao@farm:~/FUSARIUM/solexa/EthFoc11$ls -lht EthFoc-11_*_matched_phix.fq -rw-rw-r-- 1 aksrao aksrao 1.1G Aug 18 22:43 EthFoc-11_2_matched_phix.fq -rw-rw-r-- 1 aksrao aksrao 292K Aug 18 22:37 EthFoc-11_1_matched_phix.fq  File listings are - out 'unmatched' output files for reverse and forward inputs aksrao@farm:~/FUSARIUM/solexa/EthFoc11$ ls -lht EthFoc-11_*_clean_phix.fq
-rw-rw-r-- 1 aksrao aksrao 4.6G Aug 18 22:43 EthFoc-11_2_clean_phix.fq
-rw-rw-r-- 1 aksrao aksrao 5.6G Aug 18 22:37 EthFoc-11_1_clean_phix.fq


Given the information at https://support.illumina.com/sequencing/sequencing_kits/phix_control_v3/questions.html

What is the minimum PhiX amount that can be spiked in?

For most libraries, Illumina recommends using a low-concentration spike-in (1%) of PhiX Control v3 as an in-lane positive control for alignment calculations and quantification efficiency, my question is:

So 18.61% fraction of PhiX spike-in reads in reverse fastq file is abnormally high, isn't it? Does this seeming problem have a domino effect or has any such adverse effect been effectively avoided through this filtering step or other downstream steps in the BBmap suite, that I have progressed to yet?

My other question is about how to restore pair-wise correspondence in my filtered pair-end reads? (which was the case with my two original input files, but not the case any more I guess, since even the file sizes are vastly different)

Since I am quite new to genome assembly and NGS, more the gory details, the more help you are providing me. Thanks!

spikein filtering bbmap bbduk phix • 3.4k views
2
Entering edit mode

This is certainly very weird. The most likely scenario is that you have a huge amount of PhiX, and read 1 quality is incredibly low due to a lighting failure or something like that, so even though read 1 is also 18% PhiX, it is not matching the reference kmers. If read 2 is PhiX, read 1 should also be PhiX, unless there is chimerism (which should normally be at a very low rate).

You don't need to restore pairing after processing with BBDuk. Rather, you should maintain pairing, like this:

bbduk.sh in1=read1.fq in2=read2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=phix.fa k=31 hdist=1


That will keep or discard pairs together. Of course, your interesting results would not have shown up if you had done it this way, but that's still the way I recommend doing it.

In this specific case, I would recommend mapping to your reference to see what the error rates are like. For example:

bbmap.sh in1=read1.fq in2=read2.fq out=mapped.sam ref=ref.fa mhist=mhist.txt qhist=qhist.txt qahist=qahist.txt


I am guessing that read 1 will demonstrate extremely low quality, and it is likely that the run will need to be discarded and redone. But I'd be interested to see the actual results.

Actually, another possibility is that your R1 and R2 are from different libraries, so I'd double-check that, too.

Even when you spike in PhiX, it's normally under a different barcode (no barcode, actually) so it should not get demultiplexed into the same file as your intended data. For this reason, I've never seen 18% PhiX in a barcoded library; it's normally only seen at a very low rate even if the actual concentration was high, and the PhiX you see in a library is just due to barcode misassignment (which should not occur at a rate of 18%, but I have little experience with HiSeq 4000).

1
Entering edit mode

If the data is already demultiplexed there is small chance of having phiX in your data, unless you have some bleed-through due to .index-swapping that is known to occur with single indexed samples. You should always use dual-indexed samples on HiSeq 4000 when possible,

0
Entering edit mode

This was a single indexed run, genomax...

1
Entering edit mode

Then you likely have some cross-contamination of sample indexes. Not sure what else was in this pool/lane but keep it in mind. Does not sound like you are doing anything sensitive (e.g. low freq SNP from tumors etc) so it may be ok to have some small contamination.

0
Entering edit mode

Thank you for your insights, Brian. I have several followup questions:

1. How does one check whether R1 and R2 are from "different" libraries? These have been already supposedly demultiplexed by the genome center on campus that generated our data (~125 PE libraries) - I am currently working on just 1 of 125 paired sets, to set up my analyses pipeline... Will Illumina index primer, prior to adapter trimming, still contain the index 6-nt sequence, post-demultiplexing?

2. Do chimeras occur only during PCR step of Illumina? (based on Robert Edgar's explanation - http://bit.ly/2wcpm5N) I ask only because the library prep in this case was PCR amplification free - Kapa Hyper Plus Kit (http://bit.ly/1NEP9R6). So, then, cluster generation should've been the 1st amplification step during this Illumina HiSeq 4000 data generation, as I understand it, correct? Does your chimera hypothesis still apply?!

3. This poor quality of read 1 your are hypothesizing, because R1 and R2 should be same % of PhiX (in theory), when they are PE reads - I think I understand what you are saying. But I do not see such poor quality from FastQC of original files. See attached images below for R1 and R2, in that order. Am I missing something in your explanation OR is this possibility ruled out given the average quality per base? Should I instead generate per read average quality and look at that?

4. Comparing stats page for these phix removal runs for PE (this time) Vs individual R1 and R2 files (last time, when I 1st posted), they look like so. Does that look normal to you? Syntax used, based on your recommendation, was

bbduk.sh in1=EthFoc-11_1_clean.fq in2=EthFoc-11_2_clean.fq out1=EthFoc-11_1_clean_PE_phix.fq out2=EthFoc-11_2_clean_PE_phix.fq outm1=EthFoc-11_1_PE_matched_phix.fq outm2=EthFoc-11_2_PE_matched_phix.fq ref=phix_adapters.fa k=31 hdist=1 stats=EthFoc-11_1_clean_phix_stats_PE.txt

aksrao@farm:~/FUSARIUM/solexa/EthFoc11$cat EthFoc-11_1_clean_phix_stats_PE.txt File EthFoc-11_1_clean.fq EthFoc-11_2_clean.fq Total 33192724 Matched 3089479 9.30770% Name Reads ReadsPct PhiX_read2_adapter 3088683 9.30530% PhiX_read1_adapter 796 0.00240% aksrao@farm:~/FUSARIUM/solexa/EthFoc11$ cat EthFoc-11_1_clean_phix_stats.txt
File    EthFoc-11_1_clean.fq
Total   16596362
Matched 831 0.00501%

aksrao@farm:~/FUSARIUM/solexa/EthFoc11$cat EthFoc-11_2_clean_phix_stats.txt File EthFoc-11_2_clean.fq Total 16596362 Matched 3088648 18.61039% Name Reads ReadsPct PhiX_read2_adapter 3088648 18.61039%  5. Mapping to reference to see what error rates, per your recommendation, yielded the following results: Against a reference genome - same species and even fungal pathotype - not the best quality genome though, quite fragmented ~ 1000 contigs, but it's "something" - and assembled independently by a different group with a different approach. But I feel 12-13% mapping of reads is still very low, ain't it? BBmapVs. Ref (super fast run, less sensitive) - http://txt.do/d69ks BBmapVs. Ref (fast run, slightly more sensitive) - http://txt.do/d69kd  And just for curiosity, I mapped reads to the PHI-X 174 genome as well, and it is very very low. So I am confused. If the earlier PHIX decontamination step revealed ~ 18% reads removed from R2 (not R1), then should I still not expect to see the same ~ 18% reads in BBmap as well? Perhaps I am missing something here? BBmapVs. phix174 (super fast run, poor sensitivity) - http://txt.do/d69kk BBmapVs. phix174 (fast run, better sensitivity) - http://txt.do/d69kr  What do you advice? Also, I wonder if I should be performing this PHI-X decontamination for the rest of the 100+ PE read files, and only using those that are "good". Which brings me to the question of what the threshold(s) would be, to decide that PE reads are "good" for downstream pre-processing steps. 6. Given the steep learning curve for things NGS related (my opinion), I wonder if JGI or any such organization conducts "free-of-cost" training sessions for aspirants, but with a hands-on approach. I don't think Coursera and other MOOCs provide a hands-on live training option. The rest of the workshops can easily cost upwards of$1000 for each separate topic under NGS - RNA Seq Vs. Variant Calling Vs. WGS Vs. Assembly Vs. Annotation. Any advice on which resources to best use / leverage? I cannot imagine harassing forum users beyond 1-2 months from now, when I would not consider myself to be a newbie any more to these sorts of topics... Thanks!

2
Entering edit mode

Any advice on which resources to best use / leverage? I cannot imagine harassing forum users beyond 1-2 months from now, when I would not consider myself to be a newbie any more to these sorts of topics... Thanks!

This is a question we have struggled with in various bioinformatics support venues. NGS training that bioinformatics cores do generally uses local resources that are not going to be available outside a local institution. This effectively limits access to local participants. Even if the instructors were to do the training for free elsewhere, there would still be a charge for the facilities/compute etc. Your best bet is to find a local training course at your institution in terms of it being the cheapest (ideally free).

Do not hesitate to ask questions (when you run into a real issue and can't find a solution after searching Biostars/SeqAnswers using google). All of us are newbies in some aspect of knowledge about NGS and continue to learn on an ongoing basis.

0
Entering edit mode

Thank you genomax, I appreciate the willingness of forum members to facilitate the learning process. And I hope to pay this forward.

1
Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

1
Entering edit mode

Ah - you can't determine quality by mapping in "perfectmode" or "semiperfectmode". In those cases only reads with no mismatches will map, which defeats the purpose. Personally, when I want to calculate quality metrics, I use the "slow" flag.

0
Entering edit mode

Oh OK. Thanks, Brian. Here is the modified syntax that I used, based on your correction:

bbmap.sh usejni=f slow t=8 in1=EthFoc-11_1_clean.fq in2=EthFoc-11_2_clean.fq outm=EthFoc-11_Vs_ASM175734v1_mapped3.sam outu=EthFoc-11_Vs_ASM175734v1_UNmapped3.sam ref=GCA_001757345.1_ASM175734v1_genomic.fna mhist=EthFoc-11_Vs_ASM175734v1_BBmap_mhist3.txt qhist=EthFoc-11_Vs_ASM175734v1_BBmap_qhist3.txt qahist=EthFoc-11_Vs_ASM175734v1_BBmap_qahist3.txt


The STDOUT is available at http://textuploader.com/d697t

The file listing for this run is as follows:

-rw-rw-r-- 1 aksrao aksrao  891 Aug 19 22:29 EthFoc-11_Vs_ASM175734v1_BBmap_qahist3.txt
-rw-rw-r-- 1 aksrao aksrao  15K Aug 19 22:29 EthFoc-11_Vs_ASM175734v1_BBmap_mhist3.txt
-rw-rw-r-- 1 aksrao aksrao 6.7K Aug 19 22:29 EthFoc-11_Vs_ASM175734v1_BBmap_qhist3.txt
-rw-rw-r-- 1 aksrao aksrao 2.3G Aug 19 22:29 EthFoc-11_Vs_ASM175734v1_UNmapped3.sam
-rw-rw-r-- 1 aksrao aksrao  14G Aug 19 22:29 EthFoc-11_Vs_ASM175734v1_mapped3.sam


What do you think / advice? Also, if you have any answers to the numbered questions in my previous comment, I welcome them all. Thanks!

1
Entering edit mode

I think I have a possible solution here. I am guessing that your original BBDuk run was not with PhiX (phix174_ill.ref.fa.gz), but with phix_adapters.fa.gz. Those are very different and for different purposes! Most of the time you do not need to use phix_adapters.fa.gz at all.

This became clear when I saw that your first command:

bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt


...reported 3088648 matched R2. And later, you mention a different reference (phix_adapters.fa.gz), but still report the exact same number of Phix adapter 2 observed. This would never happen if the first command actually ran vs PhiX rather than PhiX adapters.

So, probably you have little or no PhiX, but a lot of short inserts, and what you are actually seeing is that R2's adapter coincidentally matches PhiX R2 adapter, but R1's adapter does not match. If this theory is correct, it makes some of your questions no longer applicable.

You should go back and rerun BBDuk (on both reads at once) using phix174_ill.ref.fa.gz as the reference, and see what kind of results you get; probably, there is not much PhiX. Sorry that the naming is a little confusing; I don't know where the name "phix174_ill.ref.fa.gz" came from, but I just kept it (and it is kind of descriptive, anyway).

As for your mapping results - the data may or may not be low quality; 4% is a really high substitution rate. But given that the insert size is low, possibly most of those mismatches come from adapter sequence. I suggest adapter-trimming using the full adapter set:

bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed1.fq out2=trimmed2.fq ftm=5 ref=adapters.fa k=21 mink=11 hdist=1 ktrim=r tpe tbo


...then you can map again and should see a much lower error rate (hopefully). You can already tell that these files go together (are from the same library) because the pairing rate is high. If you use files from different libraries you'll typically get a "proper pair" rate of <0.1%

As for training - JGI does not offer free training. We do have a workshop a couple times a year (for a fee) but it's kind of specialized for using JGI's resources for annotating and submitting microbial assemblies, and has less to do with NGS, so I don't think it would be applicable. However, I think it would be fun to set up an NGS workshop as well. I'll talk to some people about it. Unlikely that it would be completely free, but I'm not really sure what costs are involved with the current workshop (other than catered lunch) so it could probably, at least, be pretty cheap. If there was no lunch, there's no reason it couldn't be free.

1
Entering edit mode

I don't know where the name "phix174_ill.ref.fa.gz" came from, but I just kept it (and it is kind of descriptive, anyway).

As I recall, Illumina's phiX has a SNP (or more) compared to the canonical phiX sequence so I assume the name must have included a reference to Illumina to indicate that difference.

0
Entering edit mode

Yes, I should've used the PHI-X174 genome, not the adapters. Thank you, Brian!

About k, since my adapters are both 34nt, I think I'll be using k=34, right?

And this value of mink, how low can it be / should it be? Could you or someone please explain the logic for that choice of value? I tried with mink=11 vs. mink=3, both times with hdist2=0, and obtained different fastqc results in terms of over-represented k-mers (a few Vs none, respectively)

And hopefully my final question in this thread - how does one determine the minimal read length that is allowed after adapter trimming, quality trimming? Is it determined empirically / theoretically / both / neither? I could NOT find an easily comprehensible answer to this, online. Using kmerexactcount.sh the peak of k-mer distribution is 41nt. So I am guessing my read lengths should be at the very least > 41. Yes / no / may be? Also how large can it be? Obviously not larger than 150nt, which is max length. But 41-150 is a wide size window (assuming I am right). Sorry if this is a naïve question. Thanks!

2
Entering edit mode

how does one determine the minimal read length that is allowed after adapter trimming, quality trimming? Is it determined empirically / theoretically / both / neither?

That is something for you to decide. As the length of the remaining part decreases so does the aligner's ability to uniquely map it on the genome. So depending on how you are treating multi-mapping reads this can either add no useful information, add noise or just not map at all. For a 150 bp read you could say that if the length drops more than two-thirds (say 40 bp, minlen=40 for bbduk) then you will discard it.

1
Entering edit mode

About k, since my adapters are both 34nt, I think I'll be using k=34, right?

No, use "k=21 mink=11"; that's more robust to sequencing error. "k=34" would be more precise but much less sensitive.

As for mink - you can use 3 if you want, but I prefer 9 or higher. You already catch most adapters with "tbo" for paired reads (when you trim both together). A low value of mink will increase the removal of adapters slightly, but also cause false positives which incur sequence bias. If you remove 99% of the adapter sequence present the remaining 1% really does not hurt you, whereas going to, say, "mink=1 hdist2=1" would ensure 100% of the reads would be trimmed by at least 1bp, which is not very useful (but, oddly, some trimming tools do that by default). Still, you could go below 11 if your project is especially sensitive to adapters (such as trying to do an OLC assembly while allowing zero mismatches).

0
Entering edit mode

Brian Bushnell - Based on replies from you and genomax in this thread, I've been able to complete nearly all of the pre-processing step relevant to my dataset, as listed at http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/data-preprocessing/. Thank you!

I have just one step - Error Correction, that I've run into some issues, but I've reached out to the forum members for help again, just now.

In terms of free-of-cost workshops, it occurred to me that may be K-Base may be interested in hosting something where you could be an instructor, OR Cyverse? My understanding is that K-Base and JGI both fall under the same DOE umbrella - so may be there is natural alignment you could explore? Just a thought....

0
Entering edit mode

Yes, Kbase and JGI do fall under the same umbrella. And yes, I am talking to people at JGI about a new workshop, and I think Kbase would be interested as well. I'm also in contact with someone at Kbase.

I'll post on Biostars if I can set something up!