Question: trim special lenght from special tile in a fastq file
0
gravatar for mary
3.0 years ago by
mary0
mary0 wrote:

when checking the quality of my fastq files from a paired end illumina genome sequence, reads form a special tile show low quality after a special length ( tile 2117 at around 174 to end). I have no idea how to work on this. should I just ignore this and let it be the way it is? should I remove reads from the entire tile? is there a way to trim special length of reads from that tile only and not all reads in fastq file?

this is the heatplot of per tile sequence quality

ADD COMMENTlink modified 3.0 years ago by Petr Ponomarenko2.6k • written 3.0 years ago by mary0

Why are you worried about a single tile? If you feel there is some anomaly with that tile then ignore it. filterbytile.sh from BBMap should help. (Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias : Pay attention to how it works).

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by genomax78k
2
gravatar for Petr Ponomarenko
3.0 years ago by
United States / Los Angeles / ALAPY.com
Petr Ponomarenko2.6k wrote:

This looks like a systematic error of your sequencing equipment. Remove such errors is a good idea if you can. As of CASAVA 1.8 it converts bcl to fastq with a naming convention described in here https://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

So you can use the naming convention to find reads from your tile and trim them from 174 to the end.

for example in read name @D00360:94:H2YT5BCXX:1:2215:16765:82760/2 tile is 2215 and "@D00360" is your instrument

This can be done with awk one-liner for your instrument @D00360, tile 2117 and trim length of 174:

awk 'BEGIN{instrument="@D00360";tile=2117;trim_len=174;FS=":"}($1!=instrument||$5!=tile){print $0}($1==instrument&&$5==tile){print $0;getline; print substr($0,1,trim_len);getline; print $0; getline; print substr($0,1,trim_len)}' your_fastq_file.fastq

awk reads file line by line. BEGIN defines your constants. FS=":" is field separator to split by ":" for the name. ($1!=instrument||$5!=tile) is true for any line other than for your tile in from your instrument name. These lines are printed out. If awk gets a name of the read for your tile of interest it prints it, reads new line and trims it by the length defined. Then awk gets one more line to print "+" from fastq format and finally reads the line of qualities and trims it the same way.

Hope this helps. Also very important to see the distribution of qualities for that tile at different nucleotide positions in the read. The reason is that for systematic errors the mean is not zero and the best way is to remove it by the mean. So if you see some reads are starting to get very bad at position 160 while other at 190 with position 174 on average, then trimmin by 174 is a good idea. If you will trim by 160 you will loose some information that is not noise!

Thank you.

ADD COMMENTlink written 3.0 years ago by Petr Ponomarenko2.6k

Also whether to trim or not and if trimming at 160 or 174 is a better way depends on your experiment. If you try to find differences in the sequence and check if it is heterozygous or homozygous like in WGS or BS-seq, then removing and trimming away more potential errors usually is a good idea (unless you have very little amount of DNA or it is very degraded and short like with ancient DNA). But when the coverage is more important like in ChIP-seq data, removing way too much correct information with the noise can lead to some strange results. So as always this is about sensitivity and specificity and about what hypothesis you are testing =)

ADD REPLYlink written 3.0 years ago by Petr Ponomarenko2.6k

Thanks you helped a lot! just one question: I run awk by adding >outfile.fastq to print the trimmed reads to outfile.fastq Now when I open this outfile.fastq with FastQC, the "per tile sequence quality" fails, showing a red path in tile 2117 from nucleotide number 174 to end. this is because there are no quality and sequences after 174 which is shown as zero quality. generally this should be of no problem I think, right?

ADD REPLYlink written 3.0 years ago by mary0

Fastqc thinks your file has problems with a particular tile, which you do ;)

Depending on what you are planning to do with that trimmed fastq file some other programs might fail or give you warnings as well. You may trim reads from that tile altogether. To do so you can alter that awk one liner by removing print statements in ($1==instrument&&$5==tile){}

ADD REPLYlink written 3.0 years ago by Petr Ponomarenko2.6k

Do you think it is possible to eliminate reads in the fastq file according to specific number of tiles? I know from my quality control that tiles from 1200 to 2000 had problems (bubbles etc..) so I would like to remove them... Thanks in advance for any suggestion

ADD REPLYlink written 2.7 years ago by fusion.slope210

See my answer here. Otherwise you will need to write a custom script to remove that data. That said if the basecalls in those tiles have N's/bad Q scores you may be able to remove them by filtering on those two criteria.

ADD REPLYlink written 2.7 years ago by genomax78k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1458 users visited in the last hour