Biostrings extract read
1
0
Entering edit mode
5 weeks ago
marco.barr ▴ 80

Hello everyone, I am working on extracting the longest possible reads (total length 5698bp) from a fastq file (long reads from nanopore) while discarding those with lengths equal to 20bp and 498bp. I need to use Biostrings. I have written these commands:

fastq <- readDNAStringSet("HEVA_long.fastq", format = "fastq")
reads_to_remove <- which(width(fastq) == 20 | width(fastq) == 498)
filtered_fastq <- fastq[-reads_to_remove]

Do you have any suggestions for me? Is it correct? Alternatively, for instance using seqkit in bash, how could I achieve this? Thank you very much.

R fastq Biostrings • 413 views
ADD COMMENT
0
Entering edit mode

What is the odd requirement with 20 and 498? Are those adapters or some such?

You can use reformat.sh from BBMap suite to filter reads based on size. You can use lhist=filename to plot a histogram of read distribution.

reformat.sh -Xmx4g in=read.fq out=filtered.fq minlength=499 lhist=file.lst
ADD REPLY
0
Entering edit mode

The reason is that on IGV I have very unbalanced peaks of coverage at these base lengths. I just wanted to clean up the IGV visualization because I have to give a presentation. So, I wanted to try this. Does this filter out all reads shorter than 498 base pairs?

ADD REPLY
0
Entering edit mode
5 weeks ago
cat HEVA_long.fastq | paste - - - - | awk -F '\t' '{L=length($2);if(L!=20 && L!=498) print;}' | tr "\t" "\n"
ADD COMMENT
0
Entering edit mode

I tried your commands, but I'm not getting what I want

ADD REPLY

Login before adding your answer.

Traffic: 1654 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6