Question: Python Regular Expressions to extract the transcript ID from lines of GTF file
2.2 years ago by
oars160
oars160 wrote:

I'm sure there is a smart and simple way to do this, but I'm stuck. I simply want to extract the transcript_id field, using the re module (re.findall), from the lines of a GTF file (Homo sapiens). This is what I have so far:

import re
f = open ('Homo_sapiens.GRCh38.89.gtf', 'r')
# Feed the files into findall(); it returns a list of all the found strings
string = re.findall(r'transcript_id, f.read())
     print transcript_id

Where did I go wrong?

gtf python transcript_id • 2.2k views
ADD COMMENTlink modified 2.2 years ago by Macspider3.0k • written 2.2 years ago by oars160
2.2 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

It turns out to be really hard (and error-prone) to come up with a good regular expression to parse that out. In deepTools, I use the following process:

  1. Split each line into columns by tab (cols = line.strip().split("\t")).
  2. Use the csv module to parse the last column (s = next(csv.reader([cols[8]], delimiter=' '))).
  3. Get the column after transcript_id (s[s.index('transcript_id') + 1].rstrip(";"))
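
The three steps above can be sketched as one small function. This is only an illustration of the process as described, not the actual deepTools code; the function name and the handling of short or tag-less lines are assumptions made for the example.

```python
import csv


def transcript_id_from_gtf_line(line):
    """Return the transcript_id of one GTF line, or None if absent.

    Hypothetical helper: follows the split / csv / index steps listed above.
    """
    # Step 1: split the line into tab-separated columns.
    cols = line.strip().split("\t")
    if len(cols) < 9:
        # Comment lines, blank lines, or malformed records.
        return None
    # Step 2: let the csv module split the attribute column on spaces,
    # so that spaces inside double-quoted values are preserved.
    fields = next(csv.reader([cols[8]], delimiter=" "))
    # Step 3: the value is the field right after the transcript_id tag.
    try:
        return fields[fields.index("transcript_id") + 1].rstrip(";")
    except (ValueError, IndexError):
        return None


example = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";')
print(transcript_id_from_gtf_line(example))
```

Because the quoted value is parsed as a single csv field, an ID containing a space or a semicolon inside the quotes is still returned whole.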

You can see an example of that here, which is the python module I wrote for deepTools to read BED/GTF files into a custom interval tree.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Devon Ryan92k

Actually, I do like this approach; thank you for sharing, Devon. Your solution is a bit slower than a regex, but definitely safer. Thanks!

ADD REPLYlink written 2.2 years ago by glihm610

Hey Ryan, I want to parse the GRCh38 or GRCh37 transcripts (model transcripts) from chr22 only, to use them in a kallisto analysis for read quantification.

Which deepTools functions could I call for that? Paulo

ADD REPLYlink written 4 months ago by psschlogl20

I'd use awk instead, it'd be easier for filtering.
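
For anyone who would rather stay in Python than use awk, the same kind of filtering can be sketched as below. The function name is made up for the example, and the chromosome label is an assumption: some files name the first column "22" and others "chr22", so check your own file.

```python
def filter_gtf_by_chromosome(lines, chromosome="22"):
    """Yield only the GTF records whose first column equals `chromosome`.

    Hypothetical helper; header lines starting with '#' are skipped.
    """
    for line in lines:
        if line.startswith("#"):
            continue
        # Only the first tab-separated column (the sequence name) is checked.
        if line.split("\t", 1)[0] == chromosome:
            yield line


records = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "g1";\n',
    '22\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "g2";\n',
]
chr22_records = list(filter_gtf_by_chromosome(records))
print(len(chr22_records))
```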

ADD REPLYlink written 4 months ago by Devon Ryan92k
2.2 years ago by
oars160
oars160 wrote:

Many thanks Devon! Is it even possible to parse out transcript_id's with a regular expression (re.match, re.search, or re.findall)?

ADD COMMENTlink written 2.2 years ago by oars160

Please note the "ADD COMMENT" button.

To your question, perhaps? I never got one to work with all of my weird test cases, but I suspect that's simply because I didn't try hard enough. The method I posted works with every weird case I could come up with, so I just left it at that.

ADD REPLYlink written 2.2 years ago by Devon Ryan92k
2.2 years ago by
oars160
oars160 wrote:

I found this code on stack exchange;

> if re.findall(r'transcript_id=[^\s]+', line):
>     transcript = re.findall(r'transcript_id=[^\s]+', line)[0]
> else:
>     transcript = "NA"

I don't understand the [^\s]+',line): portion of the code - what does this section do?

Would this work?

f = open('Homo_sapiens.GRCh38.89.gtf', 'r')
if re.findall(r'transcript_id=[^\s]+', line):
    transcript = re.findall(r'transcript_id=[^\s]+', line)[0]
else:
    transcript = "NA"
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by oars160

That won't work. transcript_id=[^\s]+ will find all instances of transcript_id= followed by one or more non-whitespace characters ([^\s] is a negated character class, so it means the same as \S). Regardless, that won't work because (1) transcript IDs can contain white space and (2) there are no transcript_id= instances in a GTF file. That's the annoying part about transcript IDs (and gene names and such): they can contain anything, including quotes and delimiters.
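
A quick check makes both points concrete. The attribute strings below are made up for illustration: the first is GTF-style (tag, space, quoted value), the second is GFF-style (tag=value).

```python
import re

# GTF-style attributes: there is no '=' anywhere, so the pattern finds nothing.
gtf_attributes = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'
gtf_hits = re.findall(r'transcript_id=[^\s]+', gtf_attributes)
print(gtf_hits)  # []

# GFF-style attributes (made-up example): the pattern matches, but it keeps
# consuming non-whitespace characters, including the ';' separators.
gff_attributes = 'ID=tx1;transcript_id=ENST00000456328;biotype=x'
gff_hits = re.findall(r'transcript_id=[^\s]+', gff_attributes)
print(gff_hits)  # ['transcript_id=ENST00000456328;biotype=x']

# [^\s]+ and \S+ match the same text.
same = re.findall(r'[^\s]+', 'a b') == re.findall(r'\S+', 'a b')
print(same)  # True
```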

ADD REPLYlink written 2.2 years ago by Devon Ryan92k
2.2 years ago by
glihm610
France
glihm610 wrote:

First, @Devon Ryan's solution is something you should consider, as he has already spent time writing and checking this code. He uses the csv module to ensure that the split is done correctly (taking into account whether or not values are double-quoted, etc.).

If you still want to go with a Python regex, it is certainly possible.

1) Review the format you want to parse: GTF (link to format description).
2) You want to extract information from the last column (the attribute column), which can contain several tags separated by semicolons (';').
3) Now that you know what you want, you can build the extraction regex from pseudocode like "{transcript_id}{\space}{id_value};", which gives you the following regex in Python:

"transcript_id\s([^;]+);?"

where:

transcript_id => the tag you want to find.

\s => matches a single whitespace character; in the GTF format, the attributes look like this: TAG{SPACE}VALUE;

([^;]+) => extracts ALL characters EXCEPT the semicolon, as the semicolon is the attribute separator. The parentheses form a capture group, which tells Python "I want to extract this information". Note that the captured value still includes the surrounding double quotes. In the version you pasted, they use [^\s] to say that whitespace is not allowed in the ID.

;? => In a regex, '?' means the preceding character is optional. Sometimes the end of a GTF line lacks this semicolon (even though it is mandatory), so with ";?" the pattern matches whether or not the ';' is present (for the last field).

So, from this information, you can write a Python function to extract the transcript_id from a GTF line:

#!/usr/bin/env python3

# Stdlib Python3
import re

# Constants
GFF_SEP = "\t"
# Raw string so that \s reaches the regex engine unchanged.
TRANSCRIPT_ID_PATTERN = re.compile(r"transcript_id\s([^;]+);?")


# Create a function to extract the transcript_id from a GTF line.
# We assume that there is only ONE transcript_id per line.
def get_transcript_id(gffline):
    """Returns the transcript_id (str) in the gffline.
    None if transcript_id not found.
    """
    # We first extract the attribute field (the last one), as the tag is there.
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    regex_results = TRANSCRIPT_ID_PATTERN.search(attribute_field)

    # search() returns None when nothing matches, hence the AttributeError.
    try:
        # The captured value still includes the surrounding double quotes.
        return regex_results.group(1)
    except AttributeError:
        return None


def main():
    """Run the extraction on one example line."""
    one_gff_line = """1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";"""

    tr_id = get_transcript_id(one_gff_line)
    if tr_id:
        print("transcript_id extracted = %s" % (tr_id))
    else:
        print("transcript_id tag not found.")


if __name__ == "__main__":
    main()

And as you can see, this function can easily be extended to match any tag you want by adding another parameter, "tag_name", to the function signature. ;)
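
A minimal sketch of that extension could look like the following. The name get_attribute is made up for the example, and two details differ from the function above and are my own assumptions: a \b word boundary is added so that a tag such as "name" does not match inside "gene_name", and the surrounding double quotes are stripped from the result. It still breaks on values that contain a semicolon.

```python
import re


def get_attribute(gtf_line, tag_name):
    """Return the value of `tag_name` in the attribute column, or None.

    Hypothetical generalisation of the transcript_id-only function.
    """
    attribute_field = gtf_line.strip().split("\t")[-1]
    # re.escape guards against tag names containing regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(tag_name) + r"\s+([^;]+);?")
    match = pattern.search(attribute_field)
    if match is None:
        return None
    # Drop the double quotes that GTF puts around textual values.
    return match.group(1).strip().strip('"')


example = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
           'gene_name "DDX11L1";')
print(get_attribute(example, "transcript_id"))
print(get_attribute(example, "gene_name"))
```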

I hope this helps and feel free to ask if something is not clear.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by glihm610

I hope there are no transcript IDs (or anything else that you want to extract) that contain a semi-colon...
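
To make the concern concrete, here is a small check with a made-up attribute string whose ID contains a semicolon. The pattern is the one from the answer above; the ID itself is invented for the demonstration.

```python
import re

# Made-up attribute column: the transcript ID contains a ';'.
attributes = 'gene_id "G1"; transcript_id "tx;1"; gene_name "X";'

match = re.search(r"transcript_id\s([^;]+);?", attributes)
truncated = match.group(1)
# The capture stops at the first ';', so the ID is cut short.
print(truncated)  # "tx
```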

ADD REPLYlink written 2.2 years ago by Devon Ryan92k

That's a good point, and you're right. Usually I don't have any IDs with ';' in them, but I was reviewing your code and it is more general, as you don't have to care about the ID content. :)

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by glihm610
2.2 years ago by
Macspider3.0k
Vienna - BOKU
Macspider3.0k wrote:

One comment only (I cannot add anything to what was already discussed here above):

GTF files SHOULD have, in field 9, first "gene_id" and then "transcript_id". If the file format is respected, which you should assume in the first place, a simple:

line.rstrip("\b\r\n").split("\t")[8].split("; ")[1].split(" ")[1].strip("\"")

will ensure that you extract the transcript_id from each line, because it should always be the second item of the 9th field, after gene_id.

Of course, if you don't trust the format to be respected, using re.search might be a good choice.
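
The positional one-liner can be wrapped with a small sanity check so that it fails visibly rather than silently returning the wrong attribute. This is only a sketch; the function name is made up, and returning None when the second attribute is not transcript_id (for example on gene-level records, or if a file places another attribute such as a version tag second) is an assumption about the behaviour you would want.

```python
def transcript_id_by_position(line):
    """Return the second attribute's value if it is transcript_id, else None.

    Hypothetical wrapper around the positional split shown above.
    """
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 9:
        return None
    attributes = cols[8].split("; ")
    if len(attributes) < 2:
        return None
    # Split the second attribute into its tag and its value.
    key, _, value = attributes[1].partition(" ")
    if key != "transcript_id":
        return None
    return value.rstrip(";").strip('"')


example = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
           'gene_name "DDX11L1";')
print(transcript_id_by_position(example))
```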

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Macspider3.0k

Note that the = is particular to GFF and isn't present in GTF.

ADD REPLYlink written 2.2 years ago by Devon Ryan92k

Good catch. Edited.

ADD REPLYlink written 2.2 years ago by Macspider3.0k