Remove reads from FASTQ file based on missing fixed base
16 months ago

Hello everybody, I have a question regarding processing raw FASTQ files based on a specific UMI approach.

Basically, our paired-end sequencing library uses 6 nt UMIs. Each UMI is followed by a fixed base that is the same for every R1 and the same for every R2 (but differs between R1 and R2, of course).
This results in most reads having the same base at position 7 (confirmed with FastQC).

An example of a read I want to keep is this:

@NB551018:503:HV5CTAFX2:1:11101:9572:1046 1:N:0:ACNTGA
TTATTNACTAGCTGCGTTCTTCATCGACGCACGAGCCGAG
+
A/AA/#EEEEEEAEAEEEEEEEEEAEEEEEAEE//EEEEE


Here the 7th base is an A, which marks this as a read that is to be kept.

@NB551018:503:HV5CTAFX2:1:11101:14783:1047 1:N:0:ACNTGA
GCGACNCTATCCACCCAAAGGATAAACATTTATCATACCA
+
AAA/A#EAEEEEEEEA<AEEEEEAEEAAEEEE/AEEEEAE


The 7th base here is a C, meaning I want to discard this read.

My question now is: how do I remove all reads that do not meet the condition of this "fixed 7th base"? Is there a good way to do this in Linux (maybe with grep, awk...)? I am not yet well versed in Linux and its built-in tools for manipulating and processing files, hence the question.
Or is there a specialized tool that can do this and that I am not aware of?

I am very grateful for any help!
-Chris

edited: Added examples for what data looks like and what I want to keep/discard.

processing FASTQ linux • 675 views

If you could post example data, including UMI sequences and the expected output for a few records, that would help in understanding the issue. You can use cutadapt to do the job.


I edited the post, thank you. Though I do not think cutadapt can do exactly this?


Try the following.

With seqkit:

$ seqkit grep -srip '^.{6}A' file.fq

With cutadapt:

$ cutadapt -g '^N{6}A' --discard-untrimmed --action=retain --quiet file.fq


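Both commands are for single-end (SE) data and write the matching reads to standard output. Since your data is paired-end (PE), use the corresponding PE form of the command.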
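
If you would rather keep R1 and R2 in sync without an extra tool, the same check can be sketched in plain Python. This is only a rough sketch under a few assumptions that are not stated in the question: both files list the mates in the same order, a pair is kept only when both mates carry their fixed base, and the R2 base is passed in by you (the post only gives A for R1). The function and parameter names are made up for illustration.

```python
import gzip
from itertools import islice


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (header, sequence, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def has_fixed_base(sequence, base, position=7):
    """Return True if the 1-based `position` of `sequence` holds `base`."""
    return len(sequence) >= position and sequence[position - 1].upper() == base.upper()


def filter_pairs(r1_in, r2_in, r1_out, r2_out, r1_base, r2_base, position=7):
    """Write only pairs where both mates carry their fixed base.

    Assumes R1 and R2 list the mates in the same order.
    Returns (pairs kept, pairs read).
    """
    kept = total = 0
    with open_text(r1_in) as in1, open_text(r2_in) as in2, \
            open_text(r1_out, "wt") as out1, open_text(r2_out, "wt") as out2:
        for rec1, rec2 in zip(read_fastq(in1), read_fastq(in2)):
            total += 1
            ok1 = has_fixed_base(rec1[1].strip(), r1_base, position)
            ok2 = has_fixed_base(rec2[1].strip(), r2_base, position)
            if ok1 and ok2:  # assumed policy: drop the pair if either mate fails
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept, total
```

A call could look like filter_pairs("R1.fq.gz", "R2.fq.gz", "R1.filtered.fq.gz", "R2.filtered.fq.gz", "A", "G"), where the file names and the "G" for R2 are placeholders for your own values. If you want to keep a pair when only one mate passes, change the `ok1 and ok2` condition accordingly.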