Hello everybody, I have a question regarding processing raw FASTQ files based on a specific UMI approach.
Basically, we applied a strategy in our paired-end sequencing experiment in which the library contains 6 nt UMIs. The UMI sequence is followed by a fixed base that is the same for every R1 and the same for every R2 (but different between R1 and R2, of course).
This results in most reads having the same base at position 7 (confirmed with FastQC).
An example of a read I want to keep is this:
@NB551018:503:HV5CTAFX2:1:11101:9572:1046 1:N:0:ACNTGA
TTATTNACTAGCTGCGTTCTTCATCGACGCACGAGCCGAG
+
A/AA/#EEEEEEAEAEEEEEEEEEAEEEEEAEE//EEEEE
Here the 7th base is an A, which marks this as a read that is to be kept.
An example of a read to be discarded is this:
@NB551018:503:HV5CTAFX2:1:11101:14783:1047 1:N:0:ACNTGA
GCGACNCTATCCACCCAAAGGATAAACATTTATCATACCA
+
AAA/A#EAEEEEEEEA<AEEEEEAEEAAEEEE/AEEEEAE
The 7th base here is a C, meaning I want to discard this read.
My question now is: how do I remove all reads that do not meet the condition of this "fixed 7th base"? Is there a good way to do this in Linux (maybe with grep, awk...)? I am not yet very well versed in Linux and its built-in tools for manipulating and processing files, hence the question.
Or is there a specialized tool I am not aware of that can do this?
I am very grateful for any help!
-Chris
Edit: added examples of what the data looks like and what I want to keep/discard.
If you could post example data for a few records, including UMI sequences and the expected output, that would help in understanding the issue. You can use cutadapt to do the job.
I edited the post, thank you. Though I do not think cutadapt can do exactly this?
Try the following code:
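One possibility with plain awk, as a minimal sketch rather than a definitive solution. It assumes strict 4-line FASTQ records (sequence not wrapped over several lines). The function name `filter_fixed_base` is made up for illustration; position and expected base are passed as arguments.

```shell
# Hypothetical helper: keep FASTQ records whose base at position $1 equals $2.
# Reads FASTQ on stdin and writes the kept records to stdout.
# Assumes strict 4-line records (header, sequence, "+", qualities).
filter_fixed_base() {
    awk -v pos="$1" -v base="$2" '
        { rec[NR % 4] = $0 }    # buffer the current line of the record
        NR % 4 == 0 && substr(rec[2], pos, 1) == base {
            # rec[1]=header, rec[2]=sequence, rec[3]="+", rec[0]=qualities
            printf("%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0])
        }
    '
}
```

For the R1 example in the question this would be called as `filter_fixed_base 7 A < reads.fastq > kept.fastq` (file names are placeholders). Compressed input can be piped in, e.g. `zcat reads.fastq.gz | filter_fixed_base 7 A | gzip > kept.fastq.gz`. Note that this only checks position 7 and does not trim the UMI or the fixed base.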
with seqkit:
with cutadapt, try this:
These commands are for single-end (SE) data. Since your data is paired-end (PE), use the corresponding PE command.
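For paired-end data, the two files have to stay in sync, so a pair should be kept or dropped as a whole. A rough awk sketch of that idea follows. It assumes both files list the mates in the same order with strict 4-line records, and it keeps a pair only if both mates carry their fixed base (keeping pairs where either mate passes would be a different design choice). The function name `filter_pairs` is hypothetical, and the fixed base for R2 is not given in this thread, so it is passed in as an argument.

```shell
# Hypothetical helper: keep a read pair only if BOTH mates carry their fixed base.
# Usage: filter_pairs POS BASE_R1 BASE_R2 R1_IN R2_IN R1_OUT R2_OUT
# Assumes R1_IN and R2_IN are uncompressed, in the same order, 4 lines per record.
filter_pairs() {
    : > "$6"; : > "$7"    # create/empty the outputs even if no pair passes
    # paste joins line N of R1 and line N of R2 with a tab between them
    paste "$4" "$5" | awk -F '\t' -v pos="$1" -v b1="$2" -v b2="$3" \
        -v out1="$6" -v out2="$7" '
        { r1[NR % 4] = $1; r2[NR % 4] = $2 }    # buffer both mates line by line
        NR % 4 == 0 && substr(r1[2], pos, 1) == b1 && substr(r2[2], pos, 1) == b2 {
            printf("%s\n%s\n%s\n%s\n", r1[1], r1[2], r1[3], r1[0]) > out1
            printf("%s\n%s\n%s\n%s\n", r2[1], r2[2], r2[3], r2[0]) > out2
        }
    '
}
```

A call could look like `filter_pairs 7 A T R1.fastq R2.fastq R1.kept.fastq R2.kept.fastq`, where `T` stands in for whatever the fixed base of R2 actually is and the file names are placeholders.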