Question

Extract specific reads from fastq or SRA

0

2.5 years ago

Ankit ▴ 500

Hi everyone,

I found a dataset of interest in SRA, but the file is too big.

Is it possible to obtain specific reads from SRA (preferably) based on a sequence string from my gene or region of interest?

Alternatively, is it possible to do this with a fastq file? That would also be manageable.

Basically, I want to avoid aligning the full dataset, which could be time-consuming and memory-intensive, so I want to focus only on the reads of interest.

Note:

I tried seqtk, but it requires read names and I don't know which read names to use.
Grepping for the sequence is another option, but I would lose the rest of the fastq record (read names and quality scores).

That's why I am looking for tools that select reads by a subsequence/seed sequence, e.g. AAATTCGC.

I would appreciate any help.

Thank you

reads filter fastq sra • 1.1k views

ADD COMMENT • link updated 2.5 years ago by GenoMax 141k • written 2.5 years ago by Ankit ▴ 500

score 1 · Answer 1 · 2021-10-27

1

2.5 years ago

GenoMax 141k

You can use bbduk.sh in filter mode, providing the sequence of interest via the literal= option. The fastq file needs to be available locally.
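For cases where BBTools is not at hand, here is a minimal awk sketch of the same idea: keep every complete fastq record whose sequence line contains the seed, so read names and qualities are not lost the way they are with a plain grep. It assumes strict 4-line records (no wrapped sequences) and only checks the forward strand; the function name `filter_fastq_by_seed` and the demo reads are made up for illustration.

```shell
# Sketch: keep whole 4-line fastq records whose sequence line contains a seed.
# Assumption: strict 4-line records; function and read names are illustrative.
filter_fastq_by_seed() {
    seed="$1"
    fastq="$2"
    awk -v seed="$seed" '
        NR % 4 == 1 { header = $0 }   # @read name
        NR % 4 == 2 { bases = $0 }    # sequence line
        NR % 4 == 3 { plus = $0 }     # "+" separator
        NR % 4 == 0 {                 # quality line completes the record
            if (index(bases, seed) > 0)
                print header "\n" bases "\n" plus "\n" $0
        }
    ' "$fastq"
}

# Tiny demo input: only read2 contains the seed AAATTCGC.
demo=$(mktemp)
printf '%s\n' \
    '@read1' 'ACGTACGTACGT' '+' 'IIIIIIIIIIII' \
    '@read2' 'GGAAATTCGCTT' '+' 'FFFFFFFFFFFF' > "$demo"

matched=$(filter_fastq_by_seed AAATTCGC "$demo")
printf '%s\n' "$matched"
rm -f "$demo"
```

Matching only the sequence line matters because quality strings can contain the letters A, C, G and T. To catch reads from the opposite strand, the filter would need a second pass with the reverse complement of the seed.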

ADD COMMENT • link 2.5 years ago by GenoMax 141k

0

Thanks.

In case somebody else wants to do something similar, the command could be as follows:

/bbmap/bbduk.sh in=sample1.fastq out=unmatch_sample1.fastq outm=match_sample1.fastq literal=AGTTATTTTTATAGTGGAGAGAGATGGCGTTTAAGTGCAAATTTGTTAGTAGTTTTTT k=58

where k is the length of the sequence given to literal= (58 bases here).

ADD REPLY • link 2.5 years ago by Ankit ▴ 500

1

That k would only retrieve reads containing the full-length sequence. If you need partial matches as well, set k to less than half the length of the pattern you are searching for.
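As a rough sketch of that rule, the snippet below derives a k strictly below half the pattern length and prints the resulting bbduk.sh command instead of running it. The file names are the ones from the command earlier in this thread; picking the largest allowed integer is just one possible choice, and a smaller k may suit some data better.

```shell
# Sketch: choose k below half the pattern length so partial matches are kept.
pattern=AGTTATTTTTATAGTGGAGAGAGATGGCGTTTAAGTGCAAATTTGTTAGTAGTTTTTT
len=${#pattern}
k=$(( (len - 1) / 2 ))   # largest integer strictly below len/2

# Print the command rather than running it, so it can be reviewed first.
cmd="bbduk.sh in=sample1.fastq out=unmatch_sample1.fastq outm=match_sample1.fastq literal=$pattern k=$k"
echo "pattern length: $len, k: $k"
echo "$cmd"
```

For the 58-base pattern this gives k=28.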

ADD REPLY • link 2.5 years ago by GenoMax 141k