I have a txt file with DNA sequences. Each sequence begins with a header starting with ">Contig..." and is called a contig. The file looks like this:
>Contig4679
CGGCGACGCCGGTGAGCCCACCGTTCCAGCGCAATGACAACAGCTGTAGCCCGCCCGAGA
GCGCCGTGAGGAACACGGCGGCGGGCACGAGGATGATGCGGCGGCCCACGGTCATGAGCA
CGATGGATGCGAACGAGGACAGCGCGATGCCGGTGGCGCCGAACGCCAGCACGCGCATGG
CGTCCACGCTCGGGTCATATTTGGGCAACAGGAGGTGCACCATCTCCGGGGCCCACACCG
CGCACAGCCCGGCCACGAGCGGCAACCCCACGGCGACGCCACGCACCAGTCGGTCGACGC
GCTCGCGGATCGCCGCGGGGTCCTGCCCGGCCTCGCTGTAGCGCTTCACCAACTGCGGGT
AGCTCACGTA
>Contig4680
ACCACTCACCCTACCACCTAGTCCTACAGCGTTATGTGGTTGGGCGGGTTGAGATGTTTT
TTAGAGACAACTCGAACTTCTCGCGCTGCTGGGCGGCTAAGTCTGGCTCCGCGTCGGCGA
GTTCGAGAAGCGCCAACTCGATCCGGTCGGCCGACAGCACGAGCTCCCGGGTCGGAATGA
GCTGCACCCGGTCGAGCGGCCGGATCGACCGCTGGTTCATGACGTCGAACTCGCGCATCG
ACAGGATCACGTCGTCGTCGATTTCCACGCGGATCGGATTCGGCGCCGAGGGGGGATAGA
CGGCGCTAATACACATCTCAGAGCCAACAAAAAAGGCAGAAACAACGAAACACATCCTCT
CCTAGAAAAA
>Contig4681
CACTCCTGCCGTCCCATCATCAGTAGCTCCTCGGGGGCGTAGGGCAACAGGGCGACGTTG
CGCAGGAAGAAGAGATAGGCGTCGCGGCCGACCGCGGTCTGCACCGGCATCGCGGCGACC
CGTGCGCTCAGCCACCCGCGGAAGCCTTCGAGCGCGGTCGCGGCACGATCGACCGCCGAG
TCGAGGCGCGGCTCTGCGTCGGCCGAGAGCCGCGGCTTCAGCTCGCGCGCCGTCTGCTTG
AGCCGCGGGCGGACGGTCTCGAGATCGGCGATGGCGAGCCGCGCAAATGAGCCGACCGCG
TCGGTGAGGTTGGCTTCCGCATGCTCCACCGTAATCGGGATGCGAGCGAGCTGTCTCTTA
TACACAACAC
>Contig4682
AGTCATGCTTGACGGTCGCTCTGTGGGTCAATTGGGGATATGCGCTCGTGCTCCTGGCTT
ATCCCCACGTTCTGCACAACACACGGCACGAGCAGTTCTCCGATGCGAAATTGCCCTACT
GCACGAGATGGATCTGACCTGCTACCGTTAACACATGGACACGCCCCTGACGCCGATGCC
ACCTGAAGCAGACGCGATTCGTGAAATCGCGCGCCTGCTCGTGGAGCAAGCCGAGGAAGC
GCTCCAGCGACACGACGCGCCTCTCCCGTAGCGAATCGCATTCGCGATCCCGGCCCTGTT
TTCTCGTTCTTTCAGAAAGGAGTCGACGTGTGTACGACAAAGAACTCCACGCGCGGAATC
GACTGCCCCG
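To make the file layout concrete, here is a rough sketch of how such a file might be read into memory. The function name parse_contigs is made up for illustration, and the sketch assumes each header is a single word (such as ">Contig4679") with no spaces in it. It should work whether the sequence chunks are separated by newlines or by spaces.

```python
def parse_contigs(text):
    """Return {contig_name: sequence} from the contig text.

    Assumption: every record starts with '>' followed by a one-word
    name, and everything up to the next '>' is sequence.
    """
    contigs = {}
    for record in text.split(">"):
        parts = record.split()  # split on spaces and newlines alike
        if not parts:
            continue  # nothing before the first '>'
        name, chunks = parts[0], parts[1:]
        contigs[name] = "".join(chunks).upper()
    return contigs


sample = ">Contig1 ACGT\nTTGA >Contig2 ggcc"
print(parse_contigs(sample))  # {'Contig1': 'ACGTTTGA', 'Contig2': 'GGCC'}
```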
I want to find out which contigs contain my degenerate primer sequences (their reverse complements should also be considered) using a Python script, but I don't know how. Could an expert help me, please? Thank you so much!
I'd like a script that I can run like this:
Primer_finder.py -P1 GGRTCNCCIARYTGIGTICCIGTICCRTGIGC -P2 MGIGARGCIYTICARATGGAYCCICARCARMG -input contig.txt -output contigs_with_primer.txt
-P1 and -P2 are the degenerate primers. The output file should contain only the contigs that contain both P1 and P2 (each either as given or as its reverse complement).
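As a rough sketch of that interface, the flags could be declared with argparse, which accepts single-dash names like -P1 and -input. The helper names build_parser, contigs_with_both and exact_match are made up for illustration. The matching itself is passed in as a function, and exact_match is only a placeholder that ignores degeneracy and reverse complements.

```python
import argparse


def build_parser():
    # Flag names follow the command line in the question.
    parser = argparse.ArgumentParser(
        description="Keep contigs that contain both primers.")
    parser.add_argument("-P1", required=True, help="first degenerate primer")
    parser.add_argument("-P2", required=True, help="second degenerate primer")
    parser.add_argument("-input", required=True, help="file with contigs")
    parser.add_argument("-output", required=True, help="file for matching contigs")
    return parser


def contigs_with_both(contigs, p1, p2, found):
    """Keep only contigs in which both primers are found.

    contigs: dict mapping contig name to sequence.
    found(primer, sequence): caller-supplied test (hypothetical) that
    should return True if the primer or its reverse complement occurs.
    """
    return {name: seq for name, seq in contigs.items()
            if found(p1, seq) and found(p2, seq)}


def exact_match(primer, sequence):
    # Placeholder only: plain substring search, no degenerate codes.
    return primer in sequence


# Demo with an explicit argument list instead of the real command line.
args = build_parser().parse_args(
    ["-P1", "AATG", "-P2", "CCGT", "-input", "contig.txt", "-output", "out.txt"])
demo = {"c1": "TTAATGCCCGTA", "c2": "TTAATGCC"}
print(contigs_with_both(demo, args.P1, args.P2, exact_match))
# {'c1': 'TTAATGCCCGTA'}
```

In a real run, parse_args() would be called with no arguments so that it reads the actual command line.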
Some explanation of the terms:
Degenerate primer sequences are the patterns to search for. They are short DNA sequences that contain degenerate positions. A normal DNA sequence contains only A, G, C and T, but a degenerate letter stands for more than one base. For example, an R means the base at that position can be either A or G, so AATRTGC means AATATGC or AATGTGC. Here's the table:
R = A or G
Y = C or T
M = A or C
K = G or T
S = G or C
W = A or T
H = A or T or C
D = G or A or T
B = G or T or C
V = G or A or C
N = A or T or G or C
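One way the table above might be used is to turn each degenerate letter into a regular-expression character class, so that a primer becomes a pattern. This is only a sketch, and the name primer_to_regex is made up. The primers in my command also contain the letter I, which is not in the table. The sketch assumes I means inosine and lets it match any base, which is an assumption on top of the table.

```python
import re

# Codes from the table above, plus the plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[GC]", "W": "[AT]", "H": "[ATC]", "D": "[GAT]",
    "B": "[GTC]", "V": "[GAC]", "N": "[ATGC]",
    "I": "[ATGC]",  # assumption: inosine, treated as "any base"
}


def primer_to_regex(primer):
    """Compile a degenerate primer into a regex; unknown letters raise KeyError."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


pattern = primer_to_regex("AATRTGC")
print(bool(pattern.search("CCAATATGCC")))  # True  (R matched A)
print(bool(pattern.search("CCAATGTGCC")))  # True  (R matched G)
print(bool(pattern.search("CCAATCTGCC")))  # False (C is not allowed for R)
```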
Complementary sequences: A and T are complementary, and G and C are complementary. To get the reverse complement, first convert the sequence into its complementary sequence and then reverse it, so that the end comes first and the front comes last. Example: the reverse complement of ATTCCG is CGGAAT.
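A small sketch of the reverse complement in Python follows. The name reverse_complement is made up. Because the primers are degenerate, the translation table also swaps the degenerate letters using the standard pairings (R with Y, M with K, H with D, B with V; S, W and N stay the same), which goes beyond the A/T and G/C rule stated above. Letters not in the table, such as I, are left unchanged.

```python
# Each letter in the first string maps to the letter below it in the second.
COMPLEMENT = str.maketrans("ACGTRYMKSWHDBVN",
                           "TGCAYRKMSWDHVBN")


def reverse_complement(seq):
    """Complement every base, then reverse the whole sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATTCCG"))   # CGGAAT
print(reverse_complement("AATRTGC"))  # GCAYATT
```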