Question: how to identify degenerate primers in reads
genya3520 wrote, 8 months ago:

Hello,

I have two adapter-trimmed Illumina fastq files (read1 and read2). I would like to identify the primer sequence. I have some idea what the primer sequences are; however, they appear to be degenerate. So far I've found over 200 variations of the primer sequence in the reads, and I am looking for a tool that would group them and produce a degenerate primer sequence.

Next, I plan to separate the reads into separate files based on the primer sequence using bbtools.

Please suggest an approach/tool to use.

written 8 months ago by genya3520

Can you give some details on what you mean by "primer sequence"? What kind of assay are you using?

written 8 months ago by ATpoint

It's an amplicon-based assay. By primer sequence I mean "NNCTGGGTCCGCCAAGCT". Thanks.

written 8 months ago by genya3520

I have a bunch of sequences for which I'm trying to figure out at which bases they are ambiguous. I wonder if there is a tool that can help with that.

AAGGATCCGGCAGCCCCC
ACGGATCCGGCAGCCCCC
AGGGATCCGGCAGCCCCC
ATGGATCCGGCAGCCCCC
CAGGATCCGGCAGCCCCC
CGGGATCCGGCAGCCCCC
GAGGATCCGGCAGCCCCC
GCGGATCCGGCAGCCCCC
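A minimal sketch of collapsing equal-length variants like these into one IUPAC degenerate sequence (the `degenerate` helper is illustrative, not an existing tool):

```python
# Collapse equal-length primer variants into one IUPAC degenerate sequence.
# Each alignment column is mapped to its IUPAC ambiguity code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def degenerate(seqs):
    """Return the IUPAC consensus of equal-length sequences."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*seqs))

# The eight variants listed above:
variants = [
    "AAGGATCCGGCAGCCCCC", "ACGGATCCGGCAGCCCCC",
    "AGGGATCCGGCAGCCCCC", "ATGGATCCGGCAGCCCCC",
    "CAGGATCCGGCAGCCCCC", "CGGGATCCGGCAGCCCCC",
    "GAGGATCCGGCAGCCCCC", "GCGGATCCGGCAGCCCCC",
]
print(degenerate(variants))  # VNGGATCCGGCAGCCCCC
```

For these eight sequences, position 1 sees {A,C,G} (code V) and position 2 sees all four bases (code N), so the degenerate primer would be VNGGATCCGGCAGCCCCC.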
written 8 months ago by genya3520

@Brian had provided this solution in the past. Change the number 20 depending on the size of the primer you expect.

The following will trim all but the first 20 bases (I assume your primer is at the beginning of the read):

$ reformat.sh in=reads.fq out=trimmed.fq ftr=19

This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a two-column format that is easy to sort (in Excel, if you wish):

$ kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f

If primers are 20 bp long they should be easily identifiable.

ACCGTTACCGTTACCGTTAC    100
AAATTTTTTTCCCCCCCCCC    85
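If you'd rather not open Excel, a small sketch of sorting that two-column counts file by count (the `top_kmers` helper and `sample` data are illustrative):

```python
# Sort kmercountexact.sh output lines ("kmer<TAB>count") by count, descending.
def top_kmers(lines, n=5):
    rows = []
    for line in lines:
        kmer, count = line.split()
        rows.append((kmer, int(count)))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]

# The two example lines shown above:
sample = ["ACCGTTACCGTTACCGTTAC\t100", "AAATTTTTTTCCCCCCCCCC\t85"]
print(top_kmers(sample))
```

In practice you would pass `open("counts.txt")` instead of the `sample` list.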

If this works let me know and I will move it to an answer.

written 8 months ago by genomax

This was really helpful; however, I ended up with 1560 unique sequences. Is there a tool that would group them based on similarity?

written 8 months ago by genya3520

Use unix sort on the first column containing the sequence. That said, did you not get a few clear contenders (with high counts) compared to the rest? Are you describing what you need accurately?

written 8 months ago by genomax

@genomax There is too much variation in the first 3 bases of each read. Is there a way to tell reformat.sh to ignore the first three bases and look at bases 4 to 23? Thanks.

written 8 months ago by genya3520

You can add ftl=3 to remove the first three bases (ftl is a 0-based position, so ftl=3 trims bases 0-2) and extend ftr=22 to still get 20 bases total in the first reformat.sh command above. Then follow the same process.

Just in case both those options don't work right together, you could do two consecutive rounds of trimming:

$ reformat.sh in=reads.fq out=stdout.fq ftr=22 | reformat.sh in=stdin.fq out=trimmed.fq ftl=3
written 8 months ago by genomax
