Question

Remove reads from FASTQ file based on missing fixed base

2.6 years ago

c.heininger ▴ 10

Hello everybody, I have a question regarding processing raw FASTQ files based on a specific UMI approach.

Basically, in our paired-end sequencing experiment we used 6 nt UMIs in the library. Each UMI is followed by a fixed base that is the same for every R1 and the same for every R2 (but differs between R1 and R2, of course).
As a result, most reads have the same base at position 7 (confirmed with FastQC).

An example of a read I want to keep is this:

@NB551018:503:HV5CTAFX2:1:11101:9572:1046 1:N:0:ACNTGA   
TTATTNACTAGCTGCGTTCTTCATCGACGCACGAGCCGAG   
+  
A/AA/#EEEEEEAEAEEEEEEEEEAEEEEEAEE//EEEEE

Here the 7th base is an A, which marks this as a read that is to be kept.

An example of a read to be discarded is this:

@NB551018:503:HV5CTAFX2:1:11101:14783:1047 1:N:0:ACNTGA  
GCGACNCTATCCACCCAAAGGATAAACATTTATCATACCA  
+  
AAA/A#EAEEEEEEEA<AEEEEEAEEAAEEEE/AEEEEAE

The 7th base here is a C, meaning I want to discard this read.

My question now is: how do I remove all reads that do not meet the condition of this "fixed 7th base"? Is there a good way to do this in Linux (maybe with grep, awk...)? I am not yet very well versed in Linux and its built-in tools for manipulating and processing files, hence the question.
Or is there a specialized tool that can do this and that I am not aware of?

I am very grateful for any help!
-Chris

edited: Added examples for what data looks like and what I want to keep/discard.

processing FASTQ linux • 1.0k views

ADD COMMENT • link updated 2.5 years ago by cpad0112 21k • written 2.6 years ago by c.heininger ▴ 10

If you could post example data for a few records, including the UMI sequences and the expected output, that would help in understanding the issue. You can use cutadapt to do the job.

ADD REPLY • link 2.6 years ago by cpad0112 21k

I edited the post, thank you. Though I do not think cutadapt can do exactly this?

ADD REPLY • link 2.5 years ago by c.heininger ▴ 10

Try the following code:

with seqkit:

$ seqkit grep -s -r -i -p '^.{6}A' file.fq

with cutadapt, try this:

$ cutadapt -g '^N{6}A' --discard-untrimmed --action=retain --quiet file.fq

Both commands above are for single-end (SE) data. Since your data is paired-end (PE), use the corresponding paired-end invocation so that R1 and R2 stay in sync.
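
Since the question also asks about plain Linux tools, here is a minimal sketch of the paired-end case using only paste and awk. It is not a tested pipeline and rests on several assumptions: both files are uncompressed, every record is exactly four lines (no wrapped sequences), and R1 and R2 list the reads in the same order. The function name filter_pairs and its arguments are hypothetical, and the R2 base has to be supplied by you since the post only shows the R1 base (A).

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep a read pair only if BOTH mates carry the expected
# fixed base at position 7 (directly after the 6 nt UMI).
# Usage: filter_pairs R1.fq R2.fq BASE_R1 BASE_R2 OUT_R1.fq OUT_R2.fq
filter_pairs() {
    local r1=$1 r2=$2 base1=$3 base2=$4 out1=$5 out2=$6
    # paste puts line i of R1 and line i of R2 side by side, separated by a tab
    paste "$r1" "$r2" | awk -F'\t' -v b1="$base1" -v b2="$base2" \
        -v o1="$out1" -v o2="$out2" '
        BEGIN {
            # create (or truncate) both outputs even if no pair passes
            printf "" > o1
            printf "" > o2
        }
        {
            i = (NR - 1) % 4      # 0 = header, 1 = sequence, 2 = plus, 3 = quality
            m1[i] = $1            # current R1 line
            m2[i] = $2            # current R2 line
        }
        i == 3 {
            # record complete: test the 7th base of each mate
            if (substr(m1[1], 7, 1) == b1 && substr(m2[1], 7, 1) == b2) {
                for (j = 0; j < 4; j++) {
                    print m1[j] > o1
                    print m2[j] > o2
                }
            }
        }'
}
```

For example, `filter_pairs R1.fq R2.fq A G R1.kept.fq R2.kept.fq` would keep pairs where R1 has A and R2 has G at position 7 (G is only a placeholder here). Requiring both mates to pass is a design choice; change the `&&` to `||` if one matching mate is enough for you.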

ADD REPLY • link 2.5 years ago by cpad0112 21k