Question: Trimming A Fastq File Representing Rna-Seq Data
5
gravatar for Varun Gupta
7.4 years ago by
Varun Gupta1.1k
United States
Varun Gupta1.1k wrote:

Hi Everyone So i am using rna seq data. I have raw fastq files which has read length of around 100bp. I quickly ran the fastqc to check for the quality of my fastq file, and i inferred that i need to remove first 10 bases(even though quality scores are high for them , per base % gc content is higher for first 10 bp). Since my read length is 100 bp and as we go along the read the quality decreases, i would like to generate a fastq file which has reads starting from base position 10 till base position 80.

How can i do that??

Any command would be useful??

Hope to hear back soon

Regards

fastq file • 19k views
ADD COMMENTlink modified 6.9 years ago by lh331k • written 7.4 years ago by Varun Gupta1.1k

I have got the same case here, high quality score and strangely biased GC content and base frequencies. Do you have any idea how/why does this problem emerge?

ADD REPLYlink written 7.0 years ago by Biomonika (Noolean)3.1k

Hi this problem i think occurs becoz of the rna seq done during library preparation since we use random hexamer. I think there is a paper on this which explains it. I will post it here as soon as i find it.

Varun

ADD REPLYlink written 6.9 years ago by Varun Gupta1.1k

Thank you. Will be looking forward to it.

ADD REPLYlink written 6.9 years ago by Biomonika (Noolean)3.1k
2

Hi Here's the link

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/

Hope this solves some of your doubts..

ADD REPLYlink written 6.9 years ago by Varun Gupta1.1k
19
gravatar for lh3
7.4 years ago by
lh331k
United States
lh331k wrote:

Seqtk (input files can be optionally gzip'ed):

seqtk trimfq -b 10 -e 20 reads.fq.gz > out.fq

This trims 10bp from the left end and 20bp from the right end. It is over 10X faster than fastx_toolkit/fastx_trimmer:

hengli@xxxx c_elegans$ time seqtk trimfq -b 10 -e 10 /dev/shm/1.fq > /dev/null

real    0m1.751s
user    0m0.606s
sys     0m1.137s

hengli@xxxx c_elegans$ time fastx_trimmer -f 11 -l 90 -Q33 -i /dev/shm/1.fq -o /dev/null

real    0m26.960s
user    0m10.179s
sys     0m16.722s

hengli@xxxx c_elegans$ time gawk 'NR%2==0{print substr($1,11,80)}NR%2==1' /dev/shm/1.fq > /dev/null

real    0m5.316s
user    0m1.803s
sys     0m3.490s

hengli@xxxx c_elegans$ time perl prinseq-lite.pl -fastq /dev/shm/1.fq -trim_left 10 -trim_right 10

real    0m36.002s
user    0m38.517s
sys     0m9.374s

Seqtk also implements the phred quality trimming algorithm, which is the most sophisticated algorithm. It is still much faster than fastx_toolkit/fastq_quality_trimmer:

hengli@xxxx c_elegans$ time seqtk trimfq /dev/shm/1.fq > /dev/null

real    0m2.631s
user    0m0.746s
sys     0m1.885s

hengli@xxxx c_elegans$ time fastq_quality_trimmer -Q33 -l 30 -t 13 -i /dev/shm/1.fq -o /dev/null

real    0m21.065s
user    0m14.494s
sys     0m6.557s

If you have huge fastq files, seqtk is preferred. It is fast, sophisticated, easy to install and feature rich.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by lh331k

Looks lightening fast, may the fastx people want to restructure their code.

ADD REPLYlink written 7.4 years ago by Sukhdeep Singh9.8k

This looks great, but I have a couple of questions. 1) What are the specs for the file used for benchmarking? I'd like to know about the file used so I can have an idea how other trimming methods compare. 2) When you say "phred quality trimming," do you mean seqtk will quality trim files with numerical Phred values or will it handle Sanger/Illumina encoded Fastq files? I assume you mean the latter, but wanted to make sure.

ADD REPLYlink modified 7.4 years ago • written 7.4 years ago by SES8.2k
1

1) It is the first 1 million read1 from SRR065390. Nonetheless, for fixed-length trimming, which file to use does not matter. 2) The phred trimming algorithm is described in the phred (or phrap) documentation. The BWA trimming is derived from that somehow. The phred/bwa trimming algorithms are more robust and theoretically correct than trimming all the bases with quality below a threshold (e.g. the 90th base happen to have higher quality, but bases 80-100bp are generally very poor; we should trim off 20bp instead of 10).

ADD REPLYlink modified 7.4 years ago • written 7.4 years ago by lh331k

seqtk is fantastic, but there is a bug with seqtrim when the trimmed-off length is longer than the reads. The output becomes huge with broken format, which seems from the @@@@@@@@\n string. Could not figure out the fix by myself yet.

ADD REPLYlink written 5.8 years ago by yifangt0

Do you have any plans to add colorspace support?

ADD REPLYlink written 7.4 years ago by Farhat2.9k

Fixed-length trimming should work for color-space. Quality-based trimming does not as it may trim from the 5'-end.

ADD REPLYlink written 7.4 years ago by lh331k
12
gravatar for brentp
7.4 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

Something like this should work

awk 'NR % 2 == 0 { print substr($1, 11, 80 - 10) } NR % 2 == 1' input.fastq > output.trimmed.fastq

commented version

(NR % 2 == 0) { # this is the pattern, match 2nd and 4th lines (sequence and quality)
     print substr($1, 11, 80 - 10) # print the substring starting at 11 (1-based) and length 70
}
(NR % 2 == 1) # these are the name and "+" lines, just print them as-is. in awk,
              # print is implicit if there is no {} block

that said, you likely should use a trimmer that takes quality into account.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by brentp23k

Thanks brent this worked. If possible could you explain what you did, i have a liittle bit idea of awk but not much. Still learning awk.

ADD REPLYlink written 7.4 years ago by Varun Gupta1.1k

@Varun, I added some comments.

ADD REPLYlink written 7.4 years ago by brentp23k

Thanks brent . It was really helpful. Yes i have taken quality into account. I already ran fastqc on my fastq file and came to conclusion that first 10 bases and bases after 80 are not so good for my analysis. I hope that is ok. Thanks a lot

ADD REPLYlink written 7.4 years ago by Varun Gupta1.1k

I also trim the fastq file use this kind of perl script!

ADD REPLYlink written 7.4 years ago by yatung190

Hi

Can you share your perl script. Regards

ADD REPLYlink written 7.4 years ago by Varun Gupta1.1k

Hello Varun:

The perl script is much alike with the version of Brentp provided.

I think they will have the same function

Do you still needed?

Best Regards,

ADD REPLYlink written 7.4 years ago by yatung190

Hi Callsobing

If you could share it, it would be wonderful. Good to learn new things.

Thanks

Varun

ADD REPLYlink written 7.3 years ago by Varun Gupta1.1k
6
gravatar for Damian Kao
7.4 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can use FASTX toolkit: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

It has a trimmer tool. Also you might want to look at their quality filter tool.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Damian Kao15k

Thanks Dk, just saw that, works smoothly.

ADD REPLYlink written 7.4 years ago by Varun Gupta1.1k

I think you should try FASTQ Quality Trimmer insted of FASTQ/A Trimmer, because this script only remove the bad quality tails of the reads.... When I use this script I remove all the tail that have Q score under 10 and keeps only the reads that remains 50 pb o longer after trimming.

ADD REPLYlink written 7.4 years ago by Geparada1.4k
4
gravatar for Sukhdeep Singh
7.4 years ago by
Sukhdeep Singh9.8k
Netherlands
Sukhdeep Singh9.8k wrote:

From FASTX toolkit :

fastx_trimmer -f 11 -l 80 -i file.fastq -o file_trimmed.fastq

If compressed;

gunzip -c file.fastq.gz | fastx_trimmer -f 11 -l 80 -g -o file_trimmed.fastq.gz

You just need to take care about the encoding of the quality score, if its sanger, then you need to add a '-Q33' parameter to the command. Unlike fastqc, it doesnt picks the encoding format automatically. There is another tools called PrinSeq, written in Perl, if you are eager to explore something new.

Cheers

ADD COMMENTlink written 7.4 years ago by Sukhdeep Singh9.8k

Thanks for above information. As I am using fastq_quality_trimmer with below command for single end(single file) by trimming 7 bp from end of sequence and impoving quality of it. So i need to know is it right or not?? fastq_quality_trimmer -t 7 -Q33 -i file1.fastq -o file1trim.fastq

Thanks

ADD REPLYlink written 6.1 years ago by Ram130

What is right or not, the command or the way?? Check the output of fastqc and then remove!!

ADD REPLYlink written 6.1 years ago by Sukhdeep Singh9.8k

Thanks a lot for replying. Actually command is right but it's showing peculiar result like difference in no of reads after using fastq _qualty _trimmer command. For raw fastq file reads were 16573994 and after it was 16356445. However I am not sure reads should remain same or it can vary???

ADD REPLYlink written 6.1 years ago by Ram130

You are using wrong parameter, read the help carefully.

usage: fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

  [-h]         = This helpful help screen.
  [-t N]       = Quality threshold - nucleotides with lower 
                 quality will be trimmed (from the end of the sequence).
  [-l N]       = Minimum length - sequences shorter than this (after trimming)
                 will be discarded. Default = 0 = no minimum length. 
  [-z]         = Compress output with GZIP.
  [-i INFILE]  = FASTQ input file. default is STDIN.
  [-o OUTFILE] = FASTQ output file. default is STDOUT.
  [-v]         = Verbose - report number of sequences.
                 If [-o] is specified,  report will be printed to STDOUT.
                 If [-o] is not specified (and output goes to STDOUT),
                 report will be printed to STDERR.
ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Sukhdeep Singh9.8k

Thanks! As after correcting parameter and trying every possibility also its showing difference in no of reads in case for fastq_quality_trimmer but not in case for fastx_trimmer. So my question IS "Is it possible to have difference in no of reads in case for fastq quality trimmer, before and after trimming"?

ADD REPLYlink written 6.1 years ago by Ram130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1514 users visited in the last hour