Tool: Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias
10
gravatar for Brian Bushnell
2.9 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

I'd like to introduce a new member of the BBMap package, FilterByTile. It's intended to increase the quality of libraries without incurring bias, or to help salvage libraries with major positional quality problems (like flow-cell bubbles).

Overview

The quality of Illumina reads is dependent on location in a flowcell. Some areas are in poor optical focus, or have weaker circulation, or air bubbles, and thus can have very low-quality reads that nonetheless pass Illumina’s filter criteria. While these reads usually have below-average quality scores, it requires very aggressive quality-filtering to remove all of the reads with positionally-related low quality. Aggressive quality filtering and trimming can, in turn, cause detrimental impacts on analysis because sequence quality is also sequence-dependent; thus, aggressive filtering can incur bias against extreme-GC portions of the genome, or specific motifs. This may yield poor assemblies, incorrect ploidy calls, bad expression quantification, and similar problems.

FilterByTile is designed to filter low-quality reads based on positional information. By removing only a small fraction of reads - those in the lowest-quality areas of the flowcell – the overall quality of the data can be increased without incurring sequence-specific bias. The default settings of FilterByTile typically remove on the order of 2% of the reads, while reducing the overall error rate by substantially more than 2% (on the order of 10%). Essentially, it gets rid of the worst of the worst.

FilterByTile was originally developed after observing spikes in the kmer-uniqueness plot used to calculate library complexity, in what should have been a monotonically-declining exponential decay curve (generated by bbcalcunique.sh); these spikes corresponded to low-quality locations on the flow-cell. Interestingly, the spikes often have a regular period, indicating a structured pattern such as flow-cell edges, tile edges, or a “streak”. The initial goal of FilterByTile was simply to eliminate these spikes to allow better estimation of library size and complexity, but it can be useful for generally improving library quality as well.

Notes

How it works:

Illumina read names contain information about each cluster’s lane, tile, and X,Y coordinates. FilterByTile scans all reads in the file and calculates the average quality score for a given position. Additionally, the average kmer-uniqueness rate is calculated by position; for data with sufficient depth, this can be used as a proxy for error-rate, allowing filtering of data with inaccurate quality scores.

To calculate a useful average quality for a position, sufficient reads are needed. So, reads are aggregated by position into rectangular “micro-tiles”; these micro-tiles are iteratively expanded until the average micro-tile contains at least X reads (default 800). Then, the averages are calculated on a per-micro-tile basis, standard deviations are calculated, and for tiles at least Y standard deviations worse than normal, all reads are discarded together. Thus, smaller micro-tiles allow more precise positional filtering, but larger micro-tiles yield more accurate quality-score averages. Arbitrary shapes such as circles outlining bubbles would be optimal, but there are no plans for this.

How and when to use, or not:

FilterByTile is applicable to any Illumina HiSeq, MiSeq, or NextSeq sequence. However, it depends on large volumes of data for statistics; it’s useless to run it on a set of 4000 reads demultiplexed from a much larger run. In that case, it would be better to use the “dump” flag to dump the statistics from all libraries in the run together, then use the “indump” flag to filter the libraries individually. That way, quality statistics gathered from all reads will be applied to each individual library.

The filtering should be beneficial in most cases - particularly when you want to salvage a library that obviously had bubbles or low-flow streaks in the lane, but also for libraries with no dramatic positional quality issues. However, there are some cases – such as complex metagenomes - in which more coverage is strictly beneficial, so throwing away even low-quality reads is a bad idea. In these cases, or any situation where very low coverage is expected, filtering will often lead to inferior results. With high coverage, FilterByTile should be strictly beneficial, but with extremely low coverage (or with single cell/metagenome data, in which coverage is highly variable and so low coverage is expected), I do not recommend using it. In those cases it might be useful, but at least in my tests of metagenomes, it yields inferior results. All coverage is useful when you have insufficient coverage and are trying to assemble things with 2x coverage, as in the case of single cells and metagenomes. Therefore, any program that removes reads tends to be detrimental in that case. But I recommend it for isolates, or basically, in any situation where you expect >5x average coverage.

Read names:

FilterByTile depends on read headers to identify flowcell location. It has been validated with HiSeq, MiSeq, and NextSeq data, but different Illumina demultiplexing/base-calling software versions have different naming conventions, so please contact me if you see Illumina names that it can’t parse. Renamed reads (such as those in the SRA) probably won’t work.

Memory:

FilterByTile should not need too much memory, but if it runs out of memory it will generally be due to calculating kmer uniqueness for a large genome. In this case, the “usekmers=f” flag will ignore kmers and just use quality scores; in that case, it won’t run out of memory.

Usage Examples

Single-ended or paired/interleaved files:

filterbytile.sh in=reads.fq.gz out=filtered.fq.gz

Paired reads in twin files:

filterbytile.sh in1=r1.fq in2=r2.fq out1=filtered1.fq out2=filtered2.fq

Filtering using a statistical profile from multiple libraries:

cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell

Filtering aggressively (when you know there’s a serious problem):

filterbytile.sh in=x.fq out=y.fq ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5

Disabling kmer uniqueness to increase speed and decrease memory usage:

filterbytile.sh in=x.fq out=y.fq usekmers=f
ADD COMMENTlink modified 15 months ago by AllisonAnne0 • written 2.9 years ago by Brian Bushnell17k

Hello thanks for the nice explanation. I have exactly this problem, indeed in my reads sequenced with Illumina Machine I have noticed (after performing the quality control with fastqc) that specific tiles (or flow cells) had problems during the sequencing (bubbles during the loading etc..). Now from the quality control I know where exactly are these tiles and which read correspond to that tile (fifth value in the illumina reads below in black):

@ST-E00204:114:HHKTJALXX:4:1101:22962:1538_1:N:0:1/1

Is there a way this tool is able to remove reads according to specific reange of "tile" numbers? LIke for example from 1101 to 1400?

Thanks in advance, Cheers

ADD REPLYlink written 2.5 years ago by fusion.slope200
2

If you have access to the original data folder and bcl2fastq software (or MiSeq resporter) you can add an option called --ExcludeTiles 1101-1400 (to the samplesheet in case of MiSeq reporter reanalysis) to exclude tiles between those numbers during the analysis. --ExcludeTilesLaneX (replace X by lane #) should be used for HiSeq FC.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax75k

Thanks for the reply. No I do not have the access to all the files (at the moment) so I was trying to do it directly from my fastq files (already merged in a single one)...

ADD REPLYlink written 2.5 years ago by fusion.slope200
1

Despite the name, FilterByTile does not actually filter out reads from specific tile numbers that you give it. It's all automatic and it filters areas much smaller than tiles. I suggest you try it; even though you can't tell it which tiles to filter, it should automatically detect which areas need to be filtered. I'd be interested in hearing if it does a worse job than you can do manually.

ADD REPLYlink written 2.5 years ago by Brian Bushnell17k
1

let's see I will try.. thanks Brian

ADD REPLYlink written 2.5 years ago by fusion.slope200

Hi Brian, I have tried 'filterbytile' from bbmap and am doing now the quality check with fastqc. Meanwhile i made a unix command to remove all the reads that are in those tiles signed as red in the heat map.

Before removing: Before
pic upload edit

After Removing: After
get url for picture

As you can see there is an improve already removing those tiles. Now I will perform the mapping to see if there is an improve in the mappability of the reads also. Lets wait for the fastqc output based on the bbmap "filterbytile.sh" script. Cheers

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by fusion.slope200
1

It's odd that the results would be identical, if FilterByTile was operating correctly. Would you mind posting the stderr output when running it?

ADD REPLYlink written 2.5 years ago by Brian Bushnell17k

o wait, i did the quality with the same file :((, let me modify the previous report and lets wait for the output of the quality check with the output of "filterbytile.sh" script.. i think in less than 1 hours i have the output in the meantime here is the output of the script:

java -ea -Xmx91147m -Xms91147m -cp /project/jgu-cbdm/andradeLab/scratch/tandrean/Tools_Script/bbmap/current/ hiseq.AnalyzeFlowCell in1=/project/jgu-cbdm/andradeLab/scratch/tandrean/Data/WGBS/test/Fastq.Trimmed/AS-165429-LR-25517_R1_val_1.fq in2=/project/jgu-cbdm/andradeLab/scratch/tandrean/Data/WGBS/test/Fastq.Trimmed/AS-165429-LR-25517_R2_val_2.fq out1=clone.6.Q20.R1.fq out2=clone.6.Q20.R2.fq
Executing hiseq.AnalyzeFlowCell [in1=/project/jgu-cbdm/andradeLab/scratch/tandrean/Data/WGBS/test/Fastq.Trimmed/AS-165429-LR-25517_R1_val_1.fq, in2=/project/jgu-cbdm/andradeLab/scratch/tandrean/Data/WGBS/test/Fastq.Trimmed/AS-165429-LR-25517_R2_val_2.fq, out1=clone.6.Q20.R1.fq, out2=clone.6.Q20.R2.fq]

Set INTERLEAVED to false
Loading kmers:      2061.757 seconds.
Filling tiles:      3172.923 seconds.

Flagged 98315 of 890240 micro-tiles, containing 71816084 reads:
58791 exceeded uniqueness thresholds.
73880 exceeded quality thresholds.
68528 exceeded error-free probability thresholds.
713 had too few reads to calculate statistics.

Filtering reads:    2542.997 seconds.

Time:                           7778.524 seconds.

Reads Processed:        935m    120.23k reads/sec
Bases Processed:     134098m    17.24m bases/sec

Reads Discarded:      59127k    6.322%
Bases Discarded:       8038m    5.995%
ADD REPLYlink modified 2.5 years ago by genomax75k • written 2.5 years ago by fusion.slope200
1

I have formatted your code correctly. In future use the icon shown below (after highlighting the text you want to format as code) when editing.

101010 Button

ADD REPLYlink written 2.5 years ago by genomax75k

yes thanks lets wait for the output also...

ADD REPLYlink written 2.5 years ago by fusion.slope200

Dear Brian, Dear fusion.slope, Could you add the fastqc pictures how FilterbyTile affects the quality? Do I understand right that the pictures above show the results of a script you wrote yourself and that you removed complete tiles? I believe your picture after FilterbyTile is still missing.

I can provide my input fastqc and result fastqc with default settings:

filterbytile.sh in=reads.fq.gz out=filtered.fq.gz

input at https://ibb.co/i2JPzF and result at https://ibb.co/eA3Dta

cutadapt_filterbytile_S9

cutadapt_filterbytile_S9

Is this correct behavior by BBtools and/or FASTQC. It looks almost worse after filtering. Or is this way Fastqc calculates averages for this image?

Any feedback or help appreciated. Best!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by br.m0

It's hard to say; I don't personally use FastQC and I don't know how it chooses colors. But I think the red tiles probably had 100% of data removed. It might be helpful if you could post the screen output of FilterByTile.

If you want to evaluate the quality of the data, and you have a reference, I would recommend mapping the data (before and after filtering) to the reference with BBMap and posting the resulting histograms:

bbmap.sh in=reads.fq ref=ref.fa mhist=mhist.txt ehist=ehist.txt qhist=qhist.txt aqhist=aqhist.txt

BBMap also prints to the screen the mapping rate and error rates. The mapping rate should be high and the error rate should be lower post filtering.

ADD REPLYlink written 2.2 years ago by Brian Bushnell17k

Dear Brian, thanks a lot for your answer. In between I wrote an awk-oneliner and manually filtered the base pairs according to tile and position. Further I checked the fastq-output and I can confirm that the red-lines are generated because there is no data left.

For completeness I added the screen output of your tool, in case this helps. I will map the data and compare pre and post filtering.

Loading kmers:          48.839 seconds.
Filling tiles:          64.783 seconds.
Widening to 500x1000
Widening to 1000x1000
Widening to 1000x2000
Widening to 2000x2000
Widening to 2000x4000
Widening to 4000x4000
Widening to 4000x8000

Flagged 1850 of 7486 micro-tiles, containing 1198243 reads:
1277 exceeded uniqueness thresholds.
447 exceeded quality thresholds.
785 exceeded error-free probability thresholds.
19 had too few reads to calculate statistics.

Filtering reads:     56.119 seconds.

Time:               172.009 seconds.

Reads Processed:       7791k    45.29k reads/sec
Bases Processed:        741m    4.31m bases/sec

Reads Discarded:        861k    11.055%
Bases Discarded:      82529k    11.135%
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by br.m0

Hi br.m,

sorry for the late reply. Basically i did a bash script removing the tiles in the red color. Unfortunately the mappability of my samples removing those tiles remains the same.. As I am understanding from your pictures, you have even worst "per tile sequencing quality". I am sorry but I do not know the reason.

To be sure try to make your own script and remove the tiles from 2100 to 2116 and check if this improve your mappability. In my case it did not..

I can't help more. Best

ADD REPLYlink written 2.2 years ago by fusion.slope200

Mappability issues may be unrelated to tiles. Are you sure that only certain tiles are exhibiting mappability problem? What fraction of reads are not mapping and what exactly is this experiment trying to achieve? Have you taken some of the un-mapped reads and blasted them at NCBI? You may find that you simply may have contamination that is unexpected.

ADD REPLYlink written 2.2 years ago by genomax75k

I took the un mapped reads and blasted them without any success. I have mapped the un mapped reads to the genome of lambda virus and micoplasma with not success (in both cases very few reads mapped ). I am speculating that the mappability is related with the tiles because according to the quality of my tiles i had a linear relationship with the mappability (i.e. the bad the quality of the tiles = the lower the mappability of reads). But then given that removing those tiles did not improve the mappability I thought that it was not related to that.

ADD REPLYlink written 2.2 years ago by fusion.slope200

That is odd. What fraction of reads are unmapped? Are the reads that are mapping doing so with good quality (are you using BBMap for this)?

If you are certain that you are doing everything right then I would suggest contacting the service provider and checking if the run this sample was on ran into some kind of software/hardware issue.

ADD REPLYlink written 2.2 years ago by genomax75k

in two samples we had ~ 60% of the reads un mapped in other 2 we have 35% (each sample had ~ 450 mil reads in paired end so ~900 mil). Why hardware and software issues? I have done all the checksums and the files were transfered in the correct way (output of checksums was ok). Could you please tell me why you think is a software issue?

Also i have used BBMap but I did not have any interesting information.

Thanks in advance

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by fusion.slope200
1

Hardware/software/reagent issues on the sequencer itself. Generally sequencing providers should not release data that had such problems (we don't) but one never knows.

That is a fairly large % of unmapped reads that do not seem to belong to your genome of interest (or to anything interesting after blast) correct? I suggest that you contact the sequence provider and see if they can shed any light on what is going on.

ADD REPLYlink written 2.2 years ago by genomax75k

Hi there,

I am a masters student and this is my first time assembling a genome.

I have paired-end Illumina sequencing data from a plant genome (approximately 2.5Gb, with a high percentage of repetitive sequences).

The quality of the heatmaps from the data obtained from the HiSeq 2500 run, all fail. I ran FilterByTile on each file, and visually the quality improved, but the quality check still failed on FastQC (after adapter removal, trimming and quality filtering were performed).

My question is whether I should rather use the data without running FilterByTile (and have a higher percentage of coverage), or use the data on which I ran FilterByTile (even though the quality check still fails).

When using the aggressive parameter with FilterByTile, I lose 11-14% of my data. When using the normal setting, I lose 5% of data.

Kind regards, Allison

ADD REPLYlink written 15 months ago by AllisonAnne0

@Allison,

Can you post the initial QC plot of your data (How to add images to a Biostars post )?

Failing a test in FastQC is to be taken with a grain of salt. It does not immediately mean that you have bad data. You need to consider those result in context of experiment.

ADD REPLYlink written 15 months ago by genomax75k

Thank you for the reply.

Here are screenshots from one of my datasets.

Per base Per sequence Per tile

ADD REPLYlink written 15 months ago by AllisonAnne0

Are you sure the was no hardware/software problem with this run? I would start by asking the sequence provider first. If the per tile plot is accurate my hunch is there was a problem with this run.

ADD REPLYlink written 15 months ago by genomax75k

So I contacted the sequence provider, and this is the reply:

"It does not appear that we resequenced that library. I pulled the data and it does appear that we have lower clusters passing filter (50-60%) than I would expect from a good run, and a slightly higher error rate. The only other thing that pops out as questionable is the second index (read 3) has pretty low quality. On the off chance, we did look to see if we still had libraries stored, and we do not. We generally only store for ~6 months."

How usable is the data? Should the sequence provider not have re-sequenced the run?

Kind regards, Allison

ADD REPLYlink written 14 months ago by AllisonAnne0

How usable is the data?

Since you are de novo assembling a genome using this data is probably not advisable but if you have no other option then you could with a big lump of salt.

Should the sequence provider not have re-sequenced the run?

They should have but it looks like you (or someone else) waited for 6 months+ before looking at this data set. If you do have some of the sample stored you can always make new libraries and sequence them. If you do have some libraries left then you could ask the sequence provider to sequence them (at least for a reduced cost, if not for free).

ADD REPLYlink modified 14 months ago • written 14 months ago by genomax75k

Thank you for the advice!

ADD REPLYlink written 14 months ago by AllisonAnne0

Hi,

I've downloaded some SE RNA-Seq samples and ran FastQC on the files. After inspection of the output, I decided that I would want to filter out the worst reads from the tile if FastQC reports a "fail" for both the "per base sequence quality" and the "per tile sequence quality" before I map the reads. I hope to this way improve the phred score of the reads to an acceptable level. The pictures below are an example of a sample whose per base/tile quality both resulted in a "failed" FastQC output.

Per Base Sequence Quality SRR1121303 Per Tile Sequence Quality SRR1121303

That said, I have a problem with filterbytile.sh. I tried to use the tool in a loop, but I can't seem to generate any output files, which prevents me from checking if redoing FastQC improves the results. This, because I receive an AssertionError. I unfortunately don't understand how to solve the error. So could someone explain to me what the error is and how I could solve it? I posted my code and the output that I got.

Code that I used:

module load BBMap/x86_64/38.20

for num in {1121292..1121303}
do
        fastqc -o FastQC --extract -f fastq SRR$num.fastq

        base=$(grep -E 'Per base sequence quality' /group/biocomp/users/cadav/Thesis/Transcriptome/RNA-Seq/GSE54153/FastQC/SRR${num}_fastqc/fastqc_data.txt)
        tile=$(grep -E 'Per tile sequence quality' /group/biocomp/users/cadav/Thesis/Transcriptome/RNA-Seq/GSE54153/FastQC/SRR${num}_fastqc/fastqc_data.txt)
        result1=$(echo $base | tail -c 5)
        result2=$(echo $tile | tail -c 5)

        if [ $result1 = 'fail' -a $result2 = 'fail' ]
        then
                filterbytile.sh in=SRR$num.fastq.gz out=SRR${num}_filtered.fastq.gz
        fi
done

Output:

java -ea -Xmx6471m -Xms6471m -cp /software/shared/apps/x86_64/BBMap/38.20/current/ hiseq.AnalyzeFlowCell in=SRR1121292.fastq.gz out=SRR1121292_filtered.fastq.gz
Picked up JAVA_TOOL_OPTIONS: -XX:ParallelGCThreads=1
Executing hiseq.AnalyzeFlowCell [in=SRR1121292.fastq.gz, out=SRR1121292_filtered.fastq.gz]

Loading kmers:          65.198 seconds.
Filling tiles:          Exception in thread "main" java.lang.AssertionError: SRR1121292.1 FCD1FH7ACXX:7:1101:1220:2100/1
        at hiseq.FlowCell.getMicroTile(FlowCell.java:190)
        at hiseq.AnalyzeFlowCell.fillTilesInner(AnalyzeFlowCell.java:585)
        at hiseq.AnalyzeFlowCell.fillTiles(AnalyzeFlowCell.java:326)
        at hiseq.AnalyzeFlowCell.process(AnalyzeFlowCell.java:264)
        at hiseq.AnalyzeFlowCell.main(AnalyzeFlowCell.java:51)
ADD REPLYlink modified 13 months ago • written 13 months ago by Nottuh10

I would suggest just do regular bbduk.sh scan and trimming on this data. While it does look like a few final cycles have reads bad Q-scores the median value still is high (so it must be a few bad reads from that block of tiles). If you are aligning to a good reference then things should work fine.

BTW: It looks like you must have fastq-dumped this data from SRA. Looks like the loop is failing with the first file so you could start the loop at 1121293 and see if the jobs go through. You may have a corrupt file.

ADD REPLYlink written 13 months ago by genomax75k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1791 users visited in the last hour