Question

Remove masked (N) bases from reads in a fastq library

0

6.6 years ago

TWV ▴ 90

I am trying to remove all masked bases from a fastq library of reads. The reads look like GTCAGTCNNNNNNNNN, with the Ns at the 3' end. Is there a script or command to remove the Ns from the reads?

masked cutadapt reads remove • 2.8k views

updated 6.6 years ago by chen ★ 2.5k • written 6.6 years ago by TWV ▴ 90

score 1 · Answer 1 · 2017-12-22

6.6 years ago

Pierre Lindenbaum 163k

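# Join each 4-line FASTQ record onto one tab-separated line, then cut the sequence and quality at the last non-N base.
gunzip -c input.fastq.gz |\
paste - - - - |\
awk -F '\t' '{L=length($2);for(i=L;i>0;i--) if(substr($2,i,1)!="N") break; printf("%s\n%s\n%s\n%s\n",$1,substr($2,1,i),$3,substr($4,1,i));}'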
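To try the idea on the read from the question, the trimming can be wrapped in a small function and run on a single record. This is only a sketch: the function name `trim_trailing_n`, the read name and the quality string are invented for illustration, and it uses `sub()` on the trailing run of Ns instead of an explicit loop.

```shell
# Hypothetical helper: reads FASTQ on stdin, writes FASTQ with trailing Ns removed.
trim_trailing_n() {
  paste - - - - |
  awk -F '\t' '{
    seq = $2
    sub(/N+$/, "", seq)      # drop the run of Ns at the 3-prime end only
    n = length(seq)
    # cut the quality string to the same length as the trimmed sequence
    printf("%s\n%s\n%s\n%s\n", $1, seq, $3, substr($4, 1, n))
  }'
}

# Toy record built from the read in the question (quality values are made up).
printf '@read1\nGTCAGTCNNNNNNNNN\n+\nIIIIIII#########\n' | trim_trailing_n
# prints:
# @read1
# GTCAGTC
# +
# IIIIIII
```

Ns inside a read are left alone, and a read made only of Ns comes out with empty sequence and quality lines, which some downstream tools may reject, so such reads may need to be filtered afterwards.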

score 1 · Answer 2 · 2017-12-23

fastp may help here. Use its per-read cutting by quality score, and specify the -3 option to enable it at the 3' end.
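A call along those lines might look like the sketch below. The wrapper name `trim_tail_with_fastp` and the file names are placeholders, and it assumes single-end data and that the masked bases carry low quality scores, since the cutting is driven by quality rather than by the N character itself.

```shell
# Hypothetical wrapper around fastp; requires fastp on PATH.
trim_tail_with_fastp() {
  in_fq=$1    # input FASTQ (may be gzipped)
  out_fq=$2   # output FASTQ
  # -i / -o: single-end input and output; -3: quality-based cutting from the 3' end
  fastp -i "$in_fq" -o "$out_fq" -3
}

# Usage (file names are placeholders):
# trim_tail_with_fastp input.fastq.gz trimmed.fastq.gz
```

fastp also applies its own default filters in the same run, so it is worth checking the report to see how many reads were dropped or changed for other reasons.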

fastp is an ultra-fast open-source FASTQ preprocessing tool developed in C++, with the following features:

filter out bad reads (too low quality, too short, or too many Ns...)
cut low-quality bases per read at the 5' and 3' ends by evaluating the mean quality in a sliding window (like Trimmomatic but faster).
trim all reads at the front and tail
cut adapters. Adapter sequences can be detected automatically, which means you don't have to supply them for trimming.
correct mismatched base pairs in the overlapped regions of paired-end reads, if one base has high quality while the other has ultra-low quality
preprocess unique molecular identifier (UMI) enabled data, shifting the UMI to the sequence name.
report results in JSON format for further interpretation.
visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
split the output into multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing. Two modes are available: limiting the total number of split files, or limiting the number of lines in each split file.
support long reads (data from PacBio / Nanopore devices). ...

The project is at: https://github.com/OpenGene/fastp