Python Regular Expressions to extract the transcript ID from lines of GTF file
7.1 years ago
oars ▴ 200

I'm sure there is a smart and simple way to do this, but I'm stuck. I want to extract the transcript_id field from the lines of a GTF file (Homo sapiens) using the re module (re.findall). This is what I have so far:

import re
f = open ('Homo_sapiens.GRCh38.89.gtf', 'r')
# Feed the files into findall(); it returns a list of all the found strings
string = re.findall(r'transcript_id, f.read())
     print transcript_id

Where did I go wrong?

python transcript_id GTF • 5.8k views
7.1 years ago

It turns out to be really hard (and error-prone) to come up with a good regular expression to parse that out. In deepTools, I use the following process:

  1. Split each line into columns by tab (cols = line.strip().split("\t")).
  2. Use the csv module to parse the last column (s = next(csv.reader([cols[8]], delimiter=' '))).
  3. Get the column after transcript_id (s[s.index('transcript_id') + 1].rstrip(";"))

You can see an example of that here, which is the python module I wrote for deepTools to read BED/GTF files into a custom interval tree.
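The three steps above can be put together roughly as follows. This is only a minimal sketch, not the deepTools code itself; the function name `transcript_id_from_gtf_line` and the example line are made up for illustration, and it assumes the attributes sit in the ninth tab-separated column.

```python
import csv


def transcript_id_from_gtf_line(line):
    """Return the transcript_id of one GTF line, or None if it has none.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    # Step 1: split the line into its tab-separated columns.
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 9:
        return None
    # Step 2: let the csv module split the attribute column on spaces,
    # so that double-quoted values are kept together as one token.
    tokens = next(csv.reader([cols[8]], delimiter=" "))
    # Step 3: take the token that follows the transcript_id key.
    try:
        return tokens[tokens.index("transcript_id") + 1].rstrip(";")
    except (ValueError, IndexError):
        return None


example = (
    "1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1";'
)
print(transcript_id_from_gtf_line(example))
```

Because the csv reader honours the double quotes, a value containing spaces stays in one piece, which is the main reason for not splitting on spaces by hand.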


I actually like this approach; thank you for sharing, Devon. Your solution is a bit slower than a regex, but definitely safer. Thanks!


Hey Ryan, I want to parse the GRCh38 or GRCh37 transcripts (model transcripts) from chr22 only, to use them in a kallisto analysis for read quantification.

Which deepTools functions could I call to do this? Paulo


I'd use awk instead, it'd be easier for filtering.

7.1 years ago
oars ▴ 200

Many thanks Devon! Is it even possible to parse out transcript_id's with a regular expression (re.match, re.search, or re.findall)?


Please note the "ADD COMMENT" button.

To your question, perhaps? I never got one to work with all of my weird test cases, but I suspect that's simply because I didn't try hard enough. The method I posted works with every weird case I could come up with, so I just left it at that.

7.1 years ago
oars ▴ 200

I found this code on Stack Exchange:

if re.findall(r'transcript_id=[^\s]+',line):
    transcript = re.findall(r'transcript_id=[^\s]+',line)[0]
else:
    transcript = "NA"

I don't understand the [^\s]+',line): portion of the code - what does this section do?

Would this work?

f = open ('Homo_sapiens.GRCh38.89.gtf', 'r')
if re.findall(r'transcript_id=[^\s]+',line):
    transcript = re.findall(r'transcript_id=[^\s]+',line)[0]
else:
    transcript = "NA"

That won't work. transcript_id=[^\s]+ matches transcript_id= followed by one or more non-whitespace characters ([^\s] is equivalent to \S). It fails here because (1) transcript IDs can contain white space and (2) there are no transcript_id= instances in a GTF file. That's the annoying part about transcript IDs (and gene names and such): they can contain anything, including quotes and delimiters.
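A quick way to see what that pattern does is to run it on both attribute styles. The two strings below are invented for illustration (the first imitates GFF-style key=value pairs, the second GTF-style key "value" pairs), so treat this as a sketch rather than a test against real files.

```python
import re

# [^\s]+ means "one or more characters that are NOT whitespace".
pattern = re.compile(r"transcript_id=[^\s]+")

# Hypothetical GFF-style attributes (key=value).
gff_style = "ID=t1;transcript_id=ENST00000456328 other"
# Hypothetical GTF-style attributes (key "value";).
gtf_style = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'

gff_hits = pattern.findall(gff_style)  # the match stops at the first space
gtf_hits = pattern.findall(gtf_style)  # empty: GTF has no "transcript_id="

print(gff_hits)
print(gtf_hits)
```

The second call returns an empty list, so on a GTF file the if/else above would always end up with "NA".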

7.1 years ago
glihm ▴ 660

First, the @Devon Ryan solution is something you should consider, as he has already spent time writing and checking this code. He uses the csv module to ensure that the split is done correctly (taking into account whether values are double-quoted or not, etc.).

If you still want to go with a Python regex, it is certainly possible.

1) Review the format you want to parse: GTF (link to format description).

2) You want to extract information from the last column (the attribute column), which can hold several tags separated by semicolons (';').

3) Now that you know what you want, you can build the regex from pseudocode like "{transcript_id}{space}{id_value};", which gives the following Python regex:

"transcript_id\s([^;]+);?"

where:

transcript_id => the tag you want to find

\s => matches one whitespace character; in GTF format the attributes look like this: "TAG{SPACE}VALUE;"

([^;]+) => We want to extract ALL the characters EXCEPT the semicolon, as the semicolon is the attribute separator. The parentheses tell Python "I want to extract this part". In the version you pasted, they use [^\s] to say that spaces are not allowed in the ID.

;? => In a regex, '?' makes the preceding character optional. Sometimes the last attribute on a GTF line lacks its terminating semicolon (even though it is mandatory), so ";?" matches whether or not the ';' is present.

So, from this piece of information, you can write a python function to extract the transcript_id from a gff line:

#!/usr/bin/env python3

# Stdlib Python3
import re

# Constants
GFF_SEP = "\t"
# Raw string so that \s reaches the regex engine unchanged; compiled once.
TRANSCRIPT_ID_RE = re.compile(r"transcript_id\s([^;]+);?")

# Create a function to extract the transcript_id from a gff line.
# We assume that there is only ONE transcript_id per line.
def get_transcript_id(gffline):
    """ Returns the transcript_id (str) in the gffline.
    None if transcript_id not found.
    """
    # We first extract the attribute field (the last one) as you know that your tag is here
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    regex_results = TRANSCRIPT_ID_RE.search(attribute_field)

    # search() returns None when the tag is absent.
    if regex_results is None:
        return None
    # The captured value still carries its surrounding double quotes; drop them.
    return regex_results.group(1).strip('"')


def main():
    """ Extracts the transcript_id from one example line and prints it.
    """
    one_gff_line = """1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";"""

    tr_id = get_transcript_id(one_gff_line)
    if tr_id:
        print("transcript_id extracted = %s" % (tr_id))
    else:
        print("transcript_id tag not found.")

if __name__ == "__main__":
    main()

And as you can see, this function can easily be extended to match any tag you want by adding another parameter, "tag_name", to the function signature. ;)
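One possible shape for that extension is sketched below. The name `get_attribute` is hypothetical, and the extra `(?:^|;\s*)` prefix is my own addition so that a short tag cannot match in the middle of a longer one; like the regex above, it still assumes the value contains no semicolon.

```python
import re


def get_attribute(gtf_line, tag_name):
    """Return the value of tag_name in a GTF line, or None if it is absent.

    Hypothetical generalisation; assumes values never contain ';'.
    """
    # The attribute field is the last tab-separated column.
    attribute_field = gtf_line.rstrip("\r\n").split("\t")[-1]
    # The tag must start the field or follow a ';' separator.
    pattern = re.compile(
        r"(?:^|;\s*)" + re.escape(tag_name) + r"\s([^;]+);?"
    )
    match = pattern.search(attribute_field)
    if match is None:
        return None
    # Drop surrounding whitespace and the double quotes around the value.
    return match.group(1).strip().strip('"')


line = (
    "1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1";'
)
print(get_attribute(line, "gene_name"))
print(get_attribute(line, "transcript_id"))
```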

I hope this helps and feel free to ask if something is not clear.


I hope there are no transcript IDs (or anything else that you want to extract) that contain a semi-colon...
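To make the concern concrete, here is a small comparison on an invented attribute string whose transcript ID contains a semicolon. The ID "TX;1" is made up purely for illustration; I am not claiming such IDs occur in any particular annotation.

```python
import csv
import re

# Hypothetical attribute column with a ';' inside the quoted ID.
attrs = 'gene_id "G1"; transcript_id "TX;1"; gene_name "N";'

# Regex approach: [^;]+ stops at the first semicolon, inside the ID.
regex_value = re.search(r"transcript_id\s([^;]+);?", attrs).group(1)

# csv approach: the quoted value is kept together as one token.
tokens = next(csv.reader([attrs], delimiter=" "))
csv_value = tokens[tokens.index("transcript_id") + 1].rstrip(";")

print(repr(regex_value))
print(repr(csv_value))
```

The regex returns a truncated value that still has its opening quote, while the csv-based split recovers the whole ID.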


That's a good point. Usually I don't have any IDs with ';' inside them, but you're right. I was reviewing your code, and it's more general since you don't have to care about the ID content. :)

7.1 years ago

One comment only (I cannot add anything to what was already discussed here above):

GTF files SHOULD have, in field 9, "gene_id" first and then "transcript_id". If the file format is respected, which is a reasonable first assumption, a simple:

line.rstrip("\b\r\n").split("\t")[8].split("; ")[1].split(" ")[1].strip("\"")

will extract the transcript_id from each line, because it should always be the second item of the 9th field, after gene_id.

Of course, if you don't trust the format to be respected, using re.search might be a good choice.
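For illustration, the positional approach can be wrapped in a small function that also checks the key it finds, so a line whose second attribute is something else returns None instead of a wrong value. This is a sketch only; the function name and the None fallback are my assumptions, and it still relies on attributes being separated by "; ".

```python
def transcript_id_by_position(line):
    """Return the transcript_id if it is the second attribute, else None.

    Hypothetical helper built on the positional one-liner.
    """
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 9:
        return None
    attributes = cols[8].split("; ")
    if len(attributes) < 2:
        return None
    # Second attribute looks like: transcript_id "ENST..."
    key, _, value = attributes[1].partition(" ")
    if key != "transcript_id":
        return None
    # Remove a trailing ';' (if it was the last attribute) and the quotes.
    return value.rstrip(";").strip('"')


transcript_line = (
    "1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1";'
)
print(transcript_id_by_position(transcript_line))
```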


Note that the = is particular to GFF and isn't present in GTF.


Good catch. Edited.
