Question: Designing a primer from reads
1
gravatar for gbl1
11 months ago by
gbl170
gbl170 wrote:

Hello,

I have a lot of reads from an individual, and let's say I am very interested in a short sequence such as "ATCGATCG" I would like to design a primer from my reads that ends by it. NNNNNNNNNNNNATCGATCG and NNNNNNNNNNNNCGATCGAT.

How to find out what is the most present sequence in my read to design such a primer?

Thanks in advance

assembly genome • 483 views
ADD COMMENTlink modified 11 months ago by WouterDeCoster42k • written 11 months ago by gbl170

As you tagged, you should assemble the genome or target sequence region firstly, before designing primers.

Actually, I have no idea about what you want.

ADD REPLYlink written 11 months ago by shenwei3564.9k

Why not just order a library of pseudo-degenerate primers from your supplier? You can order primers which have the sequence nnnnnnnnATCGATCG, which will amplify throughout the genome, if that's what you want.

It's not very clear at the moment what the purpose is, but doing this from reads seems likely to lead to an enormous number of primer sequences - I agree with assembly.

ADD REPLYlink written 11 months ago by Joe15k

Hi jrj.healey and shenwei356

If I had enough reads to assemble the genome, I would not be in need of having more reads or design any primer for it… If the genome was assemblable, I'll do a simple in silico study.

As I tried to patent warm water, and was told someone already published it, let me explain it better:

8 bases oligo should be found in many place in the reads, from all the occurences, there's most likely one in which one of the base before is proheminent

of those 9 bases oligo, , from all the occurences, there's most likely one in which one of the base is proheminent …

bla bla bla until 20 base

degenarate is not a good option, for there's too many solution at the end and I would get too many bullshit if I get a primer with 16 degenerated bases … and to do that, I would not need any reads in first instance…

ADD REPLYlink written 11 months ago by gbl170

It sounds like you are trying to close gaps in your genome in that case?

For what its worth, if your data is so poor that assembly doesn't yield any improvement, you have bigger problems and should start over.

This seems a lot like an XY problem to me, so perhaps explain what you're actually trying to investigate?

ADD REPLYlink written 11 months ago by Joe15k

Idea might be to take a sequence signal that is specific of coding region, then having some primer that might would fit them with S3AP… for basically, I'm working with plants, octoploid species, a normal illumina read of the genome cover just one third, half of the genome is gypsy and 10% copia… so having an idea of what some coding region could be might help in making it more assemblable. please consider it is an orphan crop, and that I would never have enough budget to get enough random reads to make an assembly, but some targeted reads could help in completed it.

ADD REPLYlink written 11 months ago by gbl170

You should still attempt to assemble your reads, as you do not know where the gaps are going to be. You may end up designing primers unnecessarily.

I think you're underestimating how many primers this approach could yield.

Am I to understand that you are trying to identify a specific gene (or partial gene) within your reads?

ADD REPLYlink written 11 months ago by Joe15k

isn't this in principle looking for "ATCGATCG" motif as all the previous bases can be any base? If so, you would have lots of amplification. @ gbl1

ADD REPLYlink written 11 months ago by cpad011212k

So, the "ATCGATCG" might appear either on a random way, or in a more or less conserved sequence. I try to target the "conserved" sequence, and then need to figure out what it is…

ADD REPLYlink written 11 months ago by gbl170

I really don’t think your approach is going to work the way you think it will, nor will it solve the problem as you expect...

ADD REPLYlink written 11 months ago by Joe15k

You have no idea what "the problem" is… Neither if I actually have a problem…

ADD REPLYlink written 11 months ago by gbl170
2

I understand this might be a language barrier, but this thread is dangerously close to becoming personal. Let's keep the discussion to just the science please.

ADD REPLYlink written 11 months ago by RamRS25k

Sorry for sounding rude, but I sound I need to explain a point again, even if I have already told it in this tread:

  • I don't try to assemble the genome
  • I don't have enough read to assemble the genome
  • I cannot afford enough read to assemble the genome
  • I don't need to assemble the genome for my experimental purpose

basically, there's 2 possible ways to get a satisfying result:

For all ATCGATCG sequence, the previous base is more often, let's says G
For all GATCGATCG sequence, the previous base is more often, bla bla bla…

OR

in For ATCGATCG sequences the proportion of each bases at each position from it are …

Even if, when I was a kid, when I asked to my dad why such a thing does not exist, he answered "because Einstein left you things to discover" I sure someone tried such a thing before.

ADD REPLYlink modified 11 months ago by RamRS25k • written 11 months ago by gbl170

Do you have reads in fastq, fasta or bam format? Do you only want one primer, or a list of sequences and how often they occur? How long should this primer be?

ADD REPLYlink written 11 months ago by WouterDeCoster42k

I think the original format is fastq, but I also have them in fasta, and even in a blast DB

Primers should be around 22 nucleotides

I would like one primer each side (it may have some degenerations, but not 16 N ;)

The "CGATCGAT" exemple was a random one, there's several short sequences I might look for!

ADD REPLYlink written 11 months ago by gbl170

Hi,

My message box gave me an answer:

18 hours, 15 minutes ago: genomax wrote on C: Designing a primer from reads : Only way you are going to be able to do this is to align all reads just at ATCGATCG (ignoring bases around this sequence in alignment). Then parse the columnar positions in that alignment to get the information you want.

That indeed look like a good way of doing it… I can't find it in the tread, and don't know how to PM someone here… So can anyone ask genomax what is the best way of doing that? Advising me a soft that would do that?

ADD REPLYlink written 11 months ago by gbl170

Please don't add answers unless you're answering the top level question. Use comments or replies instead. I'm moving both of your "answers" to comments.

ADD REPLYlink written 11 months ago by RamRS25k

Sorry, I though I was commenting…

ADD REPLYlink written 11 months ago by gbl170

You can join the biostars slack to pm users

ADD REPLYlink written 11 months ago by WouterDeCoster42k

gbl1 : I had added that comment yesterday but deleted it later since I was only outlining logic one could use to do what you wanted. I don't recollect any software off the top of my head that would allow you to align a specific chunk of sequence in all reads ignoring sequence around that core but it certainly may exist out there.

This can also be done by writing a custom program which can create a dictionary of bases in positions around that core sequence which you can then use (only in cases where ATCGATCG is more or less in the middle edit: I see now that in original question this sequnce needs to be at the end) to get the info you want.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax75k

At least you understood the logic I was seeking for ;) For sure it might exist… I can't code that myself… I'm not that good in programming, I'm mostly a molecular biologist…

ADD REPLYlink written 11 months ago by gbl170

How many reads do you have? You could take a sample of a couple hundred and try doing this in MEGA. You may have to play with gap opening penalties (and other settings) to see if you can force the program from opening too many gaps when doing the alignment. You may be able to highlight the sequence you want and say don't put any gaps here.

Note: Use your fasta files with MEGA.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax75k

I have got from memory 30Gb of reads, this cover 1/3 of unique region

I'll look at Mega ;) thanks for the hint

ADD REPLYlink written 11 months ago by gbl170

I think you should look at @Wouter's answer below and not worry about MEGA (with 30G of reads). We are not sure what it is you are trying to achieve but after trying the following you can tell us if the solution is going in the right direction.

ADD REPLYlink written 11 months ago by genomax75k
2
gravatar for WouterDeCoster
11 months ago by
Belgium
WouterDeCoster42k wrote:

I believe the code below currently answers the question. I am not sure if it is technically or biologically sane. I don't know how you got these patterns and how exactly making primers from these is going to help you solve your assembly, but that's not my job ;-)

What happens:

  1. Reads are searched in a fastq.gz [--fastq] file with an exact match to (a) given pattern(s) [--pattern]
  2. If pattern is found a sequence of length 22 is extracted (adjustable parameter in the script)
  3. All those sequences which are found are counted and sorted, reporting the 20 most frequent strings (adjustable parameter in the script). Results are printed to stdout.

What does not happen:

  • Looking at reverse complement or strand
  • Allowing nucleotide mismatches
  • Incorporating degenerate nucleotides

This script was only limited tested on a small dataset. I don't know about the size of your dataset and the RAM you have available. You will need python and biopython. Save the code as pattern2primer.py or something like that and execute as e.g. python pattern2primer.py --fastq myreads.fastq.gz --pattern ATCGAGAAG

Let me know if you have additional specifications.

ADD COMMENTlink written 11 months ago by WouterDeCoster42k

looks lovely! I'm going to try this as soon as possible!!!

ADD REPLYlink written 11 months ago by gbl170

Hi,

So, I tried on my first set of reads, it has worked perfectly... I will need to sort that out, but it is great!

I tried on my second set, I had a bug:

Traceback (most recent call last):                                                                                                                                                                                     
  File "pattern2primer.py", line 33, in <module>                                                                                                                                                                       
    main()                                                                                                                                                                                                             
  File "pattern2primer.py", line 11, in main                                                                                                                                                                           
    for read in SeqIO.parse(gzip.open(args.fastq, 'rt'), "fastq"):                                                                                                                                                     
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse                                                                                                                                    
    for r in i:                                                                                                                                                                                                        
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator                                                                                                                     
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):                                                                                                                                        
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 912, in FastqGeneralIterator                                                                                                                    
    seq_string = handle_readline().rstrip()
  File "/usr/lib/python2.7/gzip.py", line 455, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 308, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 347, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x485243c5 != 0x75682b8aL
beleduc@dx3-biotek67:~$
ADD REPLYlink modified 11 months ago by genomax75k • written 11 months ago by gbl170

That error is raised by gzip and biopython, not anything I wrote. My first guess is that your input file is corrupt.

ADD REPLYlink written 11 months ago by WouterDeCoster42k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 700 users visited in the last hour