Motif search on a DNA sequence in a gff file (ipdSummary output)
1
0
Entering edit mode
15 months ago
gabri ▴ 50

Hi All,

I used ipdSummary to identify the methylated adenines in my genome assembly. I got a huge gff file. In the seventh column there is a string called "context=nucleotide sequence". In each of these nucleotide sequences, the twenty-first base is always the methylated adenine.

Example:

s1  Cl  64  64  -   .   cv=80;context=ACGTTCTATGAATCCAATTG**A**CCGGTAATGCTTTCGAAGTC;ip=2
s1  Cl  68  68  -   .   cv=87;context=ACGTTCTATGAATCCACCGG**A**TTGATAATGCTTTCGAAGTC;ip=1
s1  Cl  71  71  +   .   cv=91;context=ACGTTCTATGAATCCACCTC**A**AGTTTAATGCTTTCGAAGTC;ip=3
s1  Cl  71  71  -   .   cv=76;context=ACGTTCTATGAATCCACCCA**A**GGTTTAATGCTTTCGAAGTC;ip=2
s1  Cl  74  74  +   .   cv=96;context=AACTATTGGACCAACGATGG**A**GGCCGTAGGTCTTAGTGTGT;ip=2
s1  Cl  74  74  -   .   cv=83;context=AACTATTGGACCAACGATGG**A**CCTCGTAGGTCTTAGTGTGT;ip=2
s1  Cl  76  76  -   .   cv=89;context=AACTATTGGACCAACGAAAA**A**AAAAGTAGGTCTTAGTGTGT;ip=2

I need to search and extract specific motifs in which the methylated adenine is included, possibly adding the motif in a new column to the same gff.

Motif list:

Can be the first or the fifth A, not the middle ones
ATTGA
ATCGA
ATAGA
AATGA
AACGA
AAAGA
AGTGA
AGCGA
AGAGA
ACTGA
ACCGA
ACAGA
1---5

Can be the second or the third A
CAAG
-23-

Can be only the second A
GAGG
-2--

Can be only the first A
ACCT
1---

No motif found
NoMotif

Output:

s1  Cl  64  64  -   .   cv=80;context=ACGTTCTATGAATCCA**ATTGA**CCGGTAATGCTTTCGAAGTC;ip=2    ATTGA
s1  Cl  68  68  -   .   cv=87;context=ACGTTCTATGAATCCACCGG**ATTGA**TAATGCTTTCGAAGTC;ip=1    ATTGA
s1  Cl  71  71  +   .   cv=91;context=ACGTTCTATGAATCCACCT**CAAG**TTTAATGCTTTCGAAGTC;ip=3    CAAG
s1  Cl  71  71  -   .   cv=76;context=ACGTTCTATGAATCCACC**CAAG**GTTTAATGCTTTCGAAGTC;ip=2    CAAG
s1  Cl  74  74  +   .   cv=96;context=AACTATTGGACCAACGATG**GAGG**CCGTAGGTCTTAGTGTGT;ip=2    GAGG
s1  Cl  74  74  -   .   cv=83;context=AACTATTGGACCAACGATGT**ACCT**CGTAGGTCTTAGTGTGT;ip=2    ACCT
s1  Cl  76  76  -   .   cv=89;context=AACTATTGGACCAACG**AAAAAAAAA**GTAGGTCTTAGTGTGT;ip=2    NoMotif

Thank you!

gff motif ipdSummary bash python • 379 views
ADD COMMENT
3
Entering edit mode
15 months ago
Joe 19k

This was a pretty fun one, if a little messy.

Try this:

#!/usr/bin/env python

import sys
import re
from itertools import chain

# This is not strictly necessary, but helps break the regex up over multiple lines
# so it's a bit easier to read:
a___a = "((\*\*)?A(\*\*)?[ATGC][TAC]G(\*\*)?A(\*\*)?)"
_aa_ = "(C(\*\*)?A(\*\*)?(\*\*)?A(\*\*)?G)"
_a__ = "(G\*\*A\*\*GG)"
a___ = "((\*\*)A(\*\*)CCT)"
regex = re.compile("|".join([a___a, _aa_, _a__, a___]))

with open(sys.argv[1], 'r') as fh:
    for line in fh:
        match = [x for x in list(chain(*regex.findall(line))) if "**A**" in x]
        print(line.rstrip("\n") + "\t" + (str(match[0]).replace('*','') if match else "NoMotif"))

There's an annoying amount of filtering to do because findall returns as list of tuples with a load of nonsense in them, but it works. Due to making all the **A**s optional, the regex does also match a string in a few examples with no ** - hence the need for additional filtering, but I think this is probably worthwhile to keep the regexes a bit more simple. There's probably a more elegant solution within the regex, but this is already at my limit pretty much!

$ python methylmotifs.py input.txt

s1  Cl  64  64  -   .   cv=80;context=ACGTTCTATGAATCCAATTG**A**CCGGTAATGCTTTCGAAGTC;ip=2    ATTGA
s1  Cl  68  68  -   .   cv=87;context=ACGTTCTATGAATCCACCGG**A**TTGATAATGCTTTCGAAGTC;ip=1    ATTGA
s1  Cl  71  71  +   .   cv=91;context=ACGTTCTATGAATCCACCTC**A**AGTTTAATGCTTTCGAAGTC;ip=3    CAAG
s1  Cl  71  71  -   .   cv=76;context=ACGTTCTATGAATCCACCCA**A**GGTTTAATGCTTTCGAAGTC;ip=2    CAAG
s1  Cl  74  74  +   .   cv=96;context=AACTATTGGACCAACGATGG**A**GGCCGTAGGTCTTAGTGTGT;ip=2    GAGG
s1  Cl  74  74  -   .   cv=83;context=AACTATTGGACCAACGATGG**A**CCTCGTAGGTCTTAGTGTGT;ip=2    ACCT
s1  Cl  76  76  -   .   cv=89;context=AACTATTGGACCAACGAAAA**A**AAAAGTAGGTCTTAGTGTGT;ip=2    NoMotif

I haven't gone as far as substituting in the **STRING** back in to the sequence field, but that might be a fun little task for you to try too.

At the moment this code will break/ignore cases where there are more than one real matches in any given sequence though. I assume based of your input data that isn't likely to ever be the case, but be aware. Wouldnt be difficult to modify the print statement though.

ADD COMMENT
0
Entering edit mode

Wow thank you so much for the help! It works! You really made my day. I will test it on much larger datasets. Thank you again!

ADD REPLY
0
Entering edit mode

Please do, just be aware that designing robust regexes is a bit of a dark art, so scrutinise your output data very carefully, since I know that RegEx is not as optimal as it perhaps could be.

If the answer solves your question, don't forget to toggle it as Accepted, by the way.

ADD REPLY

Login before adding your answer.

Traffic: 2110 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6