How to confirm forward primer matches each read without removing primer sequence?
3
0
Entering edit mode
7.1 years ago

While filtering illumina MiSeq data I want to check whether the forward primer matches the beginning of each read (these are single-end reads). Lots of programs let you do this, but then they trim off the primer sequence. I'd actually like to keep the primer sequences in, but throw out any reads that don't have the primer sequence.

*Edit: note that degenerate primers were used, so there are multiple sequences that could be matched.

Does anyone know of a way to do this with fastq files? Ideally I'd like to use Trimmomatic's ILLUMINACLIP command, but I haven't figured out a way to stop it from clipping the sequence.

The mothur command trim.seqs does this for fasta files (when keepforward=T), but I'd rather not have to waste time converting between fastq and fasta since this is a pipeline I will need to re-run a lot.

Gavin

filtering sequencing read • 5.5k views
2
Entering edit mode

Why not use one of the barcode demultiplexing tools, with the degenerate primer sequences as your barcodes? You could cat all of the matched reads together at the end.

Alternatively, I'm 95% certain that BBMap can do the trick. After all, it seems to do everything else ;-).

2
Entering edit mode

As a matter of fact, it can!

e.g. for the adapter sequence GGACTGANNCGA

bbduk.sh in=reads.fq outm=matching.fq outu=nonmatching.fq literal=GGACTGANNCGA k=12 restrictleft=12 mm=f rcomp=f copyundefined


Here, literal is the literal sequence to look for; k should be set to the exact length of that sequence, restrictleft is optional but in this case will tell it to only look for matches in the first 12 bases (so it will not accept something where the primer is in the middle, for example); rcomp=f tells it to not look for reverse-complemented sequence; and copyundefined tells it to make copies of the literal sequence with all possible combinations of the degenerate bases.

1
Entering edit mode

Thanks, this is perfect!

0
Entering edit mode

Aha, I knew it!

2
Entering edit mode
7.1 years ago
Jon ▴ 360

Here is a potential python script. Use biopython to read in your FASTQ file, and then loop through each record and compare to your primer sequence. This is sort of lifted out of our UFITS package for processing NGS fungal amplicon sequences, where we do something similar, i.e. find primers and trim - but in your case you can just find them and print if match.

​import sys
from Bio import SeqIO

#create dictionary with IUPAC codes
IUPAC = {}
IUPAC['A'] = "A"
IUPAC['C'] = "C"
IUPAC['G'] = "G"
IUPAC['T'] = "T"
IUPAC['M'] = "AC"
IUPAC['R'] = "AG"
IUPAC['W'] = "AT"
IUPAC['S'] = "CG"
IUPAC['Y'] = "CT"
IUPAC['K'] = "GT"
IUPAC['V'] = "ACG"
IUPAC['H'] = "ACT"
IUPAC['D'] = "AGT"
IUPAC['B'] = "CGT"
IUPAC['X'] = "GATC"
IUPAC['N'] = "GATC"

def MatchLetter(a, b):
global IUPAC
try:
sa = IUPAC[a.upper()]
except:
return False
try:
sb = IUPAC[b.upper()]
except:
return False
for ca in sa:
if ca in sb:
return True
return False

def MatchPrefix(Seq, Primer):
L = len(Seq)
n = len(Primer)
if L < n:
n = L
Diffs = 0
for i in range(0, n):
if not MatchLetter(Seq[i], Primer[i]):
Diffs += 1
return Diffs

primer = 'AGGCAATTRCTWGGA'

handle = open(sys.argv[1], 'rU')
SeqRecords = SeqIO.parse(handle, 'fastq')
for rec in SeqRecords:
Seq = str(rec.seq)
Diffs = MatchPrefix(Seq, primer) #count diffs between Seq and primer
if Diffs < 2:  #if Diffs less than some threshold, then just print record
SeqIO.write(rec, sys.stdout, 'fastq')

1
Entering edit mode
7.1 years ago
Renesh ★ 2.1k

As I understand your question correctly, you want to filter out the reads without primer sequences.

For this you can use, grep command on Linux/Unix to achieve it

grep your_primer_sequence fastq_file > output_file


Use grep --help to explore more options for grep command

0
Entering edit mode

Thanks for your response. I should have mentioned that degenerate primers were used, so it is not only one primer sequence to scan for. Also, I want to make sure the primer matches at the beginning of the read (not just anywhere in the sequence) and of course the command as you have typed it would leave out the non-sequence lines of the fastq as well, so I don't think that would work.

1
Entering edit mode

You can loop over all sequences and modify the GREP extracting the line before the hit and then two following ones. While grepping the primer-sequence with '^', you'll get only those starting with it:

for i in $(cat primers.txt); do grep -A 2 -B 1 ^${i} myfastq.fq > $i.fastq ; done  ADD REPLY 0 Entering edit mode Okay, I don't think so any program will do like as per your requirement. here you have to write your own code. ADD REPLY 1 Entering edit mode 7.1 years ago ashwini ▴ 100 The following command may help, zcat infile.fastq.gz | paste - - - - | awk '{if ($0~/\tSEQUENCE/)print}' | awk -F '\t' '{print $1"\n"$2"\n"$3"\n"$4}' > outfile.fastq


The paste command with four hyphen makes four lines into one line separated by tab.

The second column in each line will be the sequence and tab precedes the sequence.

so \tSEQUENCE gives the fastq records which starts with primer sequence.

The last fastq print command prints back in the standard fastq format.

|| symbol can be used in the pattern matching step to include multiple primer sequences.

Hope it helps!