Question

Biostrings extract read

5 weeks ago

marco.barr ▴ 80

Hello everyone, I am trying to extract the longest possible reads (total length 5698 bp) from a FASTQ file of Nanopore long reads while discarding reads that are exactly 20 bp or 498 bp long. I need to use Biostrings. I have written these commands:

fastq <- readDNAStringSet("HEVA_long.fastq", format = "fastq")
reads_to_remove <- which(width(fastq) == 20 | width(fastq) == 498)
filtered_fastq <- fastq[-reads_to_remove]

Do you have any suggestions? Is this approach correct? Alternatively, how could I achieve the same thing in bash, for instance with seqkit? Thank you very much.

R fastq Biostrings • 413 views

What is the odd requirement with 20 and 498? Are those adapters or some such?

You can use reformat.sh from the BBMap suite to filter reads by length. Adding lhist=filename writes a read-length histogram to that file, which you can then plot.

reformat.sh -Xmx4g in=read.fq out=filtered.fq minlength=499 lhist=file.lst
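
Note that minlength=499 is a threshold: it discards every read shorter than 499 bp, not only the 20 bp and 498 bp ones. To see how much that differs from excluding just those two lengths, it may help to tally read lengths first. Below is a minimal sketch using only awk and sort, assuming a plain FASTQ with 4 lines per record. The toy file and the variable names (toy.fastq, below_499, exact_hits) are made up for illustration.

```shell
#!/bin/sh
# Build a hypothetical toy FASTQ with reads of length 20, 20, 498, 25 and 600.
workdir=$(mktemp -d)
fq="$workdir/toy.fastq"
for len in 20 20 498 25 600; do
    awk -v n="$len" 'BEGIN {
        for (i = 0; i < n; i++) { s = s "A"; q = q "I" }
        printf "@read_%d\n%s\n+\n%s\n", n, s, q
    }'
done > "$fq"

# Tally read lengths: the sequence is line 2 of every 4-line record.
hist=$(awk 'NR % 4 == 2 { n[length($0)]++ }
            END { for (l in n) print l "\t" n[l] }' "$fq" | sort -n)
printf 'length\tcount\n%s\n' "$hist"

# Reads a length threshold of 499 would drop ...
below_499=$(awk 'NR % 4 == 2 && length($0) < 499 { c++ } END { print c + 0 }' "$fq")
# ... versus reads that are exactly 20 or 498 bases long.
exact_hits=$(awk 'NR % 4 == 2 { L = length($0); if (L == 20 || L == 498) c++ }
                  END { print c + 0 }' "$fq")
echo "shorter than 499: $below_499, exactly 20 or 498: $exact_hits"
```

On real data, point fq at the actual file (for example HEVA_long.fastq) and skip the toy-file step. If the two counts differ a lot, a threshold removes more than the two peaks.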

5 weeks ago by GenoMax 141k

The reason is that in IGV I see very uneven coverage peaks from reads of these two lengths. I have to give a presentation and just wanted to clean up the IGV view, so I wanted to try this. Does this command filter out all reads shorter than 498 base pairs?

5 weeks ago by marco.barr ▴ 80

score 0 · Answer 1 · 2024-03-19

5 weeks ago

Pierre Lindenbaum 161k

cat HEVA_long.fastq | paste - - - - | awk -F '\t' '{L=length($2);if(L!=20 && L!=498) print;}' | tr "\t" "\n"
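
The one-liner above can be tried on a small made-up file before running it on the real data. Below is a rough sketch, assuming 4-line FASTQ records (no wrapped sequences). The toy reads, the make_read helper and the file names (demo.fastq, demo.filtered.fastq) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: emit one FASTQ record with the given name and length.
make_read() {  # usage: make_read NAME LENGTH
    awk -v name="$1" -v n="$2" 'BEGIN {
        for (i = 0; i < n; i++) { s = s "A"; q = q "I" }
        printf "@%s\n%s\n+\n%s\n", name, s, q
    }'
}

# Toy input: four reads of length 20, 498, 25 and 600.
demo_dir=$(mktemp -d)
{ make_read r1 20; make_read r2 498; make_read r3 25; make_read r4 600; } \
    > "$demo_dir/demo.fastq"

# Same filter as above, reading the file directly and writing to a new file:
# join each 4-line record onto one tab-separated line, drop records whose
# sequence (field 2) is exactly 20 or 498 bases, then split back into lines.
paste - - - - < "$demo_dir/demo.fastq" |
    awk -F '\t' '{ L = length($2); if (L != 20 && L != 498) print }' |
    tr '\t' '\n' > "$demo_dir/demo.filtered.fastq"

# Count records before and after (4 lines per record).
before=$(( $(wc -l < "$demo_dir/demo.fastq") / 4 ))
after=$(( $(wc -l < "$demo_dir/demo.filtered.fastq") / 4 ))
echo "reads before: $before, after: $after"
```

On the toy input this should report 4 reads before and 2 after, keeping only the 25 bp and 600 bp reads. For the real data, replace demo.fastq with HEVA_long.fastq.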