Question: How to confirm forward primer matches each read without removing primer sequence?
0
gravatar for gavinmdouglas
5.0 years ago by
Canada
gavinmdouglas10 wrote:

While filtering illumina MiSeq data I want to check whether the forward primer matches the beginning of each read (these are single-end reads). Lots of programs let you do this, but then they trim off the primer sequence. I'd actually like to keep the primer sequences in, but throw out any reads that don't have the primer sequence. 

*Edit: note that degenerate primers were used, so there are multiple sequences that could be matched. 

Does anyone know of a way to do this with fastq files? Ideally I'd like to use Trimmomatic's ILLUMINACLIP command, but I haven't figured out a way to stop it from clipping the sequence.

The mothur command "trim.seqs" does this for fasta files (when keepforward=T), but I'd rather not have to waste time converting between fastq and fasta since this is a pipeline I will need to re-run a lot.

Thanks in advance!

Gavin

sequencing filtering read • 3.8k views
ADD COMMENTlink modified 5.0 years ago by ashwini100 • written 5.0 years ago by gavinmdouglas10
2

Why not use one of the barcode demultiplexing tools, with the degenerate primer sequences as your barcodes? You could cat all of the matched reads together at the end.

Alternatively, I'm 95% certain that BBMap can do the trick. After all, it seems to do everything else ;-).

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.0 years ago by harold.smith.tarheel4.6k
2

As a matter of fact, it can!

e.g. for the adapter sequence GGACTGANNCGA

bbduk.sh in=reads.fq outm=matching.fq outu=nonmatching.fq literal=GGACTGANNCGA k=12 restrictleft=12 mm=f rcomp=f copyundefined

Here, literal is the literal sequence to look for; k should be set to the exact length of that sequence, restrictleft is optional but in this case will tell it to only look for matches in the first 12 bases (so it will not accept something where the primer is in the middle, for example); rcomp=f tells it to not look for reverse-complemented sequence; and copyundefined tells it to make copies of the literal sequence with all possible combinations of the degenerate bases.

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.0 years ago by Brian Bushnell17k
1

Thanks, this is perfect!

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.0 years ago by gavinmdouglas10

Aha, I knew it!

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.0 years ago by harold.smith.tarheel4.6k
2
gravatar for Jon
5.0 years ago by
Jon340
United States - US FS
Jon340 wrote:

Here is a potential python script. Use biopython to read in your FASTQ file, and then loop through each record and compare to your primer sequence. This is sort of lifted out of our UFITS package for processing NGS fungal amplicon sequences, where we do something similar, i.e. find primers and trim - but in your case you can just find them and print if match.

​import sys
from Bio import SeqIO

#create dictionary with IUPAC codes
IUPAC = {}
IUPAC['A'] = "A"
IUPAC['C'] = "C"
IUPAC['G'] = "G"
IUPAC['T'] = "T"
IUPAC['M'] = "AC"
IUPAC['R'] = "AG"
IUPAC['W'] = "AT"
IUPAC['S'] = "CG"
IUPAC['Y'] = "CT"
IUPAC['K'] = "GT"
IUPAC['V'] = "ACG"
IUPAC['H'] = "ACT"
IUPAC['D'] = "AGT"
IUPAC['B'] = "CGT"
IUPAC['X'] = "GATC"
IUPAC['N'] = "GATC"

def MatchLetter(a, b):
    global IUPAC
    try:
        sa = IUPAC[a.upper()]
    except:
        return False
    try:
        sb = IUPAC[b.upper()]
    except:
        return False
    for ca in sa:
        if ca in sb:
            return True
    return False

def MatchPrefix(Seq, Primer):
    L = len(Seq)
    n = len(Primer)
    if L < n:
        n = L
    Diffs = 0
    for i in range(0, n):
        if not MatchLetter(Seq[i], Primer[i]):
            Diffs += 1
    return Diffs

primer = 'AGGCAATTRCTWGGA'

handle = open(sys.argv[1], 'rU')
SeqRecords = SeqIO.parse(handle, 'fastq')
for rec in SeqRecords:
    Seq = str(rec.seq)
    Diffs = MatchPrefix(Seq, primer) #count diffs between Seq and primer
    if Diffs < 2:  #if Diffs less than some threshold, then just print record
        SeqIO.write(rec, sys.stdout, 'fastq')
ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.0 years ago by Jon340
1
gravatar for Renesh
5.0 years ago by
Renesh1.9k
United States
Renesh1.9k wrote:

As I understand your question correctly, you want to filter out the reads without primer sequences.

For this you can use, grep command on Linux/Unix to achieve it

grep your_primer_sequence fastq_file > output_file

Use grep --help to explore more options for grep command

ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.0 years ago by Renesh1.9k

Thanks for your response. I should have mentioned that degenerate primers were used, so it is not only one primer sequence to scan for. Also, I want to make sure the primer matches at the beginning of the read (not just anywhere in the sequence) and of course the command as you have typed it would leave out the non-sequence lines of the fastq as well, so I don't think that would work.

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.0 years ago by gavinmdouglas10
1

You can loop over all sequences and modify the GREP extracting the line before the hit and then two following ones. While grepping the primer-sequence with '^', you'll get only those starting with it:

for i in $(cat primers.txt); do grep -A 2 -B 1 ^${i} myfastq.fq > $i.fastq ; done
ADD REPLYlink modified 11 months ago by RamRS30k • written 5.0 years ago by michael.ante3.6k

Okay, I don't think so any program will do like as per your requirement. here you have to write your own code.

ADD REPLYlink written 5.0 years ago by Renesh1.9k
1
gravatar for ashwini
5.0 years ago by
ashwini100
India
ashwini100 wrote:

The following command may help,

zcat infile.fastq.gz | paste - - - - | awk '{if ($0~/\tSEQUENCE/)print}' | awk -F '\t' '{print $1"\n"$2"\n"$3"\n"$4}' > outfile.fastq

The paste command with four hyphen makes four lines into one line separated by tab.

The second column in each line will be the sequence and tab precedes the sequence.

so \tSEQUENCE gives the fastq records which starts with primer sequence.

The last fastq print command prints back in the standard fastq format.

|| symbol can be used in the pattern matching step to include multiple primer sequences.

Hope it helps!

ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.0 years ago by ashwini100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 867 users visited in the last hour