Question: How to identify contigs containing the degenerate primer sequences using python?
0
gravatar for xupbuy
11 days ago by
xupbuy0
xupbuy0 wrote:

I have a txt file with DNA sequences, each DNA sequence starting with ">Contig..." is called a contig. The file looks like this:

>Contig4679
CGGCGACGCCGGTGAGCCCACCGTTCCAGCGCAATGACAACAGCTGTAGCCCGCCCGAGA
GCGCCGTGAGGAACACGGCGGCGGGCACGAGGATGATGCGGCGGCCCACGGTCATGAGCA
CGATGGATGCGAACGAGGACAGCGCGATGCCGGTGGCGCCGAACGCCAGCACGCGCATGG
CGTCCACGCTCGGGTCATATTTGGGCAACAGGAGGTGCACCATCTCCGGGGCCCACACCG
CGCACAGCCCGGCCACGAGCGGCAACCCCACGGCGACGCCACGCACCAGTCGGTCGACGC
GCTCGCGGATCGCCGCGGGGTCCTGCCCGGCCTCGCTGTAGCGCTTCACCAACTGCGGGT
AGCTCACGTA
>Contig4680
ACCACTCACCCTACCACCTAGTCCTACAGCGTTATGTGGTTGGGCGGGTTGAGATGTTTT
TTAGAGACAACTCGAACTTCTCGCGCTGCTGGGCGGCTAAGTCTGGCTCCGCGTCGGCGA
GTTCGAGAAGCGCCAACTCGATCCGGTCGGCCGACAGCACGAGCTCCCGGGTCGGAATGA
GCTGCACCCGGTCGAGCGGCCGGATCGACCGCTGGTTCATGACGTCGAACTCGCGCATCG
ACAGGATCACGTCGTCGTCGATTTCCACGCGGATCGGATTCGGCGCCGAGGGGGGATAGA
CGGCGCTAATACACATCTCAGAGCCAACAAAAAAGGCAGAAACAACGAAACACATCCTCT
CCTAGAAAAA
>Contig4681
CACTCCTGCCGTCCCATCATCAGTAGCTCCTCGGGGGCGTAGGGCAACAGGGCGACGTTG
CGCAGGAAGAAGAGATAGGCGTCGCGGCCGACCGCGGTCTGCACCGGCATCGCGGCGACC
CGTGCGCTCAGCCACCCGCGGAAGCCTTCGAGCGCGGTCGCGGCACGATCGACCGCCGAG
TCGAGGCGCGGCTCTGCGTCGGCCGAGAGCCGCGGCTTCAGCTCGCGCGCCGTCTGCTTG
AGCCGCGGGCGGACGGTCTCGAGATCGGCGATGGCGAGCCGCGCAAATGAGCCGACCGCG
TCGGTGAGGTTGGCTTCCGCATGCTCCACCGTAATCGGGATGCGAGCGAGCTGTCTCTTA
TACACAACAC
>Contig4682
AGTCATGCTTGACGGTCGCTCTGTGGGTCAATTGGGGATATGCGCTCGTGCTCCTGGCTT
ATCCCCACGTTCTGCACAACACACGGCACGAGCAGTTCTCCGATGCGAAATTGCCCTACT
GCACGAGATGGATCTGACCTGCTACCGTTAACACATGGACACGCCCCTGACGCCGATGCC
ACCTGAAGCAGACGCGATTCGTGAAATCGCGCGCCTGCTCGTGGAGCAAGCCGAGGAAGC
GCTCCAGCGACACGACGCGCCTCTCCCGTAGCGAATCGCATTCGCGATCCCGGCCCTGTT
TTCTCGTTCTTTCAGAAAGGAGTCGACGTGTGTACGACAAAGAACTCCACGCGCGGAATC
GACTGCCCCG

I want to find out which contigs contain my degenerate primer sequences (and their reverse complementary sequences should also be considered) using python scripts, but don't know how. Any expert help me, please? Thank you so much!

I'd like some scripts that I can run like this:

Primer_finder.py -P1 GGRTCNCCIARYTGIGTICCIGTICCRTGIGC -P2 MGIGARGCIYTICARATGGAYCCICARCARMG -input contig.txt -output contigs_with_primer.txt

-P1 -P2: the degenerate primers. The output file should only contain contigs with both P1 and P2 (or their reverse complementary sequences).

Some explanation of the terms:

Degenerate primer sequences are the patterns to search. They are short DNA sequences, but with degeneracy. It means, normally DNA sequences contain A/G/C/T, but, for example, if there is an R, it means at this position, it can be either A or G. For example, AATRTGC means AATATGC or AATGTGC. Here's the table: R = A or G; Y = C or T; M = A or C; K = G or T; S = G or C; W = A or T; H = A or T or C; D = G or A or T; B = G or T or C; V = G or A or C; N = A or T or G or C.

Complementary sequences mean: A and T are complementary, G and C are complementary. Reverse complementary means first convert the sequences into complementary sequences and the reverse it, put the end to the front and put the front to the end. Example: ATTCCG reverse complementary: CGGAAT

python sequence assembly • 105 views
ADD COMMENTlink modified 11 days ago by shenwei3564.1k • written 11 days ago by xupbuy0

Partly match between PCR primer and target sequence is enough for PCR amplication. Do you still want only full match?

ADD REPLYlink written 11 days ago by shenwei3564.1k

(Only for full match)

# positive strand
$ cat seqs.fa | seqkit grep -s -d -p NACTGN | seqkit seq -n -i | tee 1.txt
Contig4679
Contig4682

# negative strand
$ cat seqs.fa | seqkit seq -r -p | seqkit grep -s -d -p TTAM | seqkit seq -n -i | tee 2.txt
Contig4681
Contig4682

# intersection
$ csvtk inter -H 1.txt 2.txt | tee id.txt
Contig4682

# retrieving sequences
$ seqkit grep -f id.txt seqs.fa > result.fa

ADD REPLYlink written 11 days ago by shenwei3564.1k

Thank you. What if I want to add the mismatch function? For example, '-mismatch 2' means it can tolerate 2 mismatches, continuous mismatch or non-continuous match. Thank you very much.

ADD REPLYlink written 10 days ago by xupbuy0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1542 users visited in the last hour