Python Regular Expressions to extract the transcript ID from lines of GTF file
5
0
5.1 years ago
oars ▴ 180

I'm sure there is a smart and simple way to do this, but I'm stuck. I simply want to extract the transcript_id field from the lines of a GTF file (Homo_sapiens.GRCh38.89.gtf) using the re module (re.findall). This is what I have so far:

import re
f = open ('Homo_sapiens.GRCh38.89.gtf', 'r')
# Feed the files into findall(); it returns a list of all the found strings
print transcript_id


Where did I go wrong?

python transcript_id GTF • 4.1k views
2
5.1 years ago

It turns out to be really hard (and error-prone) to come up with a good regular expression to parse that out. In deepTools, I use the following process:

1. Split each line into columns by tab (cols = line.strip().split("\t"))
2. Use the csv module to parse the last column (s = next(csv.reader([cols[8]], delimiter=' '))).
3. Get the column after transcript_id (s[s.index('transcript_id') + 1].rstrip(";"))
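The three steps above can be put together roughly as follows. This is only a minimal sketch, not the actual deepTools code: the function name transcript_id_from_gtf_line and the example line are made up for illustration, and returning None for lines without a transcript_id is an assumption about what you would want.

```python
import csv

def transcript_id_from_gtf_line(line):
    """Return the transcript_id of one GTF line, or None if it has none.

    Hypothetical helper name; it follows the three steps described above.
    """
    # Step 1: split the line into tab-separated columns
    cols = line.strip().split("\t")
    if len(cols) < 9:
        # Header/comment lines or malformed lines have no attribute column
        return None
    # Step 2: split the attribute column on spaces while respecting double quotes
    fields = next(csv.reader([cols[8]], delimiter=" "))
    # Step 3: take the field that follows the transcript_id key
    try:
        value = fields[fields.index("transcript_id") + 1]
    except (ValueError, IndexError):
        # No transcript_id on this line (e.g. a gene-level record)
        return None
    return value.rstrip(";")

# Made-up example line in Ensembl-like GTF layout
example = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
           'gene_name "DDX11L1";')
print(transcript_id_from_gtf_line(example))
```

Because the csv module handles the quoting, a quoted ID that contains spaces or semicolons should come back intact.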

You can see an example of that here, which is the python module I wrote for deepTools to read BED/GTF files into a custom interval tree.

0

Actually, I do like this approach. Thank you for sharing, Devon. Your solution is a bit slower than a regex, but definitely safer. Thanks!

0

Hey Ryan, I want to parse the GRCh38 or GRCh37 transcripts (model transcripts) from chr22 only, to use them in a kallisto analysis for read quantification.

Which deepTools functions could I call to do that? Paulo

0

I'd use awk instead, it'd be easier for filtering.

0
5.1 years ago
oars ▴ 180

Many thanks Devon! Is it even possible to parse out transcript_id's with a regular expression (re.match, re.search, or re.findall)?

0

To your question, perhaps? I never got one to work with all of my weird test cases, but I suspect that's simply because I didn't try hard enough. The method I posted works with every weird case I could come up with, so I just left it at that.

0
5.1 years ago
oars ▴ 180

I found this code on Stack Exchange:

> if re.findall(r'transcript_id=[^\s]+', line):
>     transcript = re.findall(r'transcript_id=[^\s]+', line)[0]
> else:
>     transcript = "NA"


I don't understand the [^\s]+',line): portion of the code - what does this section do?

Would this work?

f = open('Homo_sapiens.GRCh38.89.gtf', 'r')
if re.findall(r'transcript_id=[^\s]+', line):
    transcript = re.findall(r'transcript_id=[^\s]+', line)[0]
else:
    transcript = "NA"

0

That won't work. transcript_id=[^\s]+ matches transcript_id= followed by one or more non-whitespace characters ([^\s] is equivalent to \S). It fails here because (1) transcript IDs can contain white space and (2) there are no transcript_id= instances in a GTF file. That's the annoying part about transcript IDs (and gene names and such): they can contain anything, including quotes and delimiters.
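A quick way to see the second point is to run the pattern on both attribute styles. This is just an illustrative sketch: both attribute strings are made up, and the GFF3-style one is only there to show what kind of text the pattern was presumably written for. Note that [^\s] is a negated class, so it matches non-whitespace characters, the same as \S.

```python
import re

pattern = re.compile(r'transcript_id=[^\s]+')

# Made-up attribute columns: GTF uses 'key "value";', GFF3 uses 'key=value'
gtf_attrs = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'
gff3_attrs = 'ID=transcript:ENST00000456328;transcript_id=ENST00000456328'

gtf_hits = pattern.findall(gtf_attrs)    # nothing: GTF has no 'transcript_id='
gff3_hits = pattern.findall(gff3_attrs)  # one hit on the key=value style

print(gtf_hits)
print(gff3_hits)
```

On a GTF file the snippet from the question would therefore set every transcript to "NA".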

0
5.1 years ago
glihm ▴ 650

First, the @Devon Ryan solution is something you should consider, as he has already spent time writing and checking this code. He uses the csv module to ensure that the split is done correctly (taking into account whether values are double-quoted or not, etc.).

If you still want to go with a Python regex, it is certainly possible.

1) Review the format you want to parse: GTF (link to format description).
2) You want to extract a piece of information from the last column (the attribute column), which can be composed of several tags separated by a semicolon (';').
3) Now that you know what you want, you can build the Python regex from pseudocode like "{transcript_id}{\space}{id_value};", which gives you the following regex in Python:

"transcript_id\s([^;]+);?"


where:

transcript_id => the tag you want to find

\s => represents a whitespace character; in GTF format the attributes look like this: "TAG{SPACE}VALUE;"

([^;]+) => We want to extract ALL the characters EXCEPT the semicolon, as the semicolon is the attribute separator. The parentheses tell Python "I want to extract this information" (a capture group). In the version you pasted, they use [^\s] to say that whitespace is not allowed in the ID.

;? => In a regex, '?' means the preceding character is optional. Sometimes the last attribute on a GTF line lacks its closing semicolon (even though it is mandatory), so with ";?" the pattern matches whether or not the ';' is present.

So, from this piece of information, you can write a python function to extract the transcript_id from a gff line:

#!/usr/bin/env python3

# Stdlib Python3
import re

# Constants
GFF_SEP = "\t"

# Create a function to extract the transcript_id from a gff line.
# We assume that there is only ONE transcript_id per line.
def get_transcript_id(gffline):
    """ Returns the transcript_id (str) in the gffline, or None if absent.
    """
    # We first extract the attribute field (the last one) as you know that your tag is here
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    # Raw string so that \s reaches the regex engine unchanged
    regex_pattern = re.compile(r"transcript_id\s([^;]+);?")
    regex_results = regex_pattern.search(attribute_field)

    try:
        # Note: the captured value keeps its surrounding double quotes
        return regex_results.group(1)
    except AttributeError:
        # search() returned None: no transcript_id on this line
        return None

def main():
    """ Run the extraction on one example line.
    """
    one_gff_line = """1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";"""

    tr_id = get_transcript_id(one_gff_line)
    if tr_id:
        print("transcript_id extracted = %s" % (tr_id))
    else:
        print("No transcript_id found")

if __name__ == "__main__":
    main()


And as you can see, this function can easily be extended to match any tag you want by adding another parameter, "tag_name", to the function signature. ;)
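For instance, the extension with a "tag_name" parameter could look something like the sketch below. The name get_attribute is hypothetical, and two details are my own assumptions rather than part of the function above: the tag must sit at the start of the attribute field or right after a ';' (so "gene_name" does not match inside a longer tag name), and the surrounding double quotes are stripped from the value.

```python
import re

GFF_SEP = "\t"

def get_attribute(gffline, tag_name):
    """Return the value of tag_name in the attribute field, or None if absent."""
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    # re.escape guards against regex metacharacters in the tag name
    pattern = re.compile(r'(?:^|;)\s*' + re.escape(tag_name) + r'\s+([^;]+)')
    match = pattern.search(attribute_field)
    if match is None:
        return None
    # Drop surrounding whitespace and double quotes from the captured value
    return match.group(1).strip().strip('"')

# Made-up example line
line = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
        'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
        'gene_name "DDX11L1"; transcript_name "DDX11L1-002";')
print(get_attribute(line, "transcript_id"))
print(get_attribute(line, "gene_name"))
```

Like the original pattern, this still relies on values never containing a semicolon.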

I hope this helps and feel free to ask if something is not clear.

0

I hope there are no transcript IDs (or anything else that you want to extract) that contain a semi-colon...
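To make the concern concrete, here is a small sketch with a made-up attribute string whose transcript ID contains a semicolon. The ID "tx;1" is purely hypothetical and chosen to trigger the problem.

```python
import re

# Hypothetical attribute column with a ';' inside the quoted transcript ID
attrs = 'gene_id "G1"; transcript_id "tx;1"; gene_name "weird";'

match = re.search(r'transcript_id\s([^;]+);?', attrs)
captured = match.group(1)

# The capture stops at the first ';', so the ID is cut short
print(captured)
```

The capture ends at the first semicolon, so only the part of the ID before it (plus the opening quote) is returned.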

0

That's a good point. Usually I don't have any IDs with ';' inside them. I was reviewing your code, and it is more general since you don't have to care about the ID content. :)

0
5.1 years ago
Macspider ★ 3.6k

One comment only (I cannot add anything to what was already discussed here above):

GTF format files SHOULD have, in field 9, first "gene_id" and then "transcript_id". If the file format is respected, which you should assume in the first place, doing a simple:

line.rstrip("\b\r\n").split("\t")[8].split("; ")[1].split(" ")[1].strip("\"")


will give you the transcript_id from each line, because it should always be the second item of the 9th field, after gene_id.

Of course, if you don't trust the format to be respected, using re.search might be a good choice.
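If you go the positional route, it may be worth checking the key name rather than trusting the position blindly. The sketch below is one hedged way to do that: the function name transcript_id_by_position is hypothetical, and returning None when the second attribute is not transcript_id is my own assumption about the desired behaviour.

```python
def transcript_id_by_position(line):
    """Return the transcript_id if it is the second attribute, else None."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 9:
        # Header/comment or malformed line
        return None
    attributes = cols[8].split("; ")
    if len(attributes) < 2:
        return None
    # Split 'key "value"' at the first space
    key, _, value = attributes[1].partition(" ")
    if key != "transcript_id":
        # The file does not follow the expected attribute order on this line
        return None
    return value.rstrip(";").strip('"')

# Made-up example lines
ok_line = ('1\thavana\texon\t11869\t12227\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";')
other_order = ('1\thavana\texon\t11869\t12227\t.\t+\t.\t'
               'gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328";')
print(transcript_id_by_position(ok_line))
print(transcript_id_by_position(other_order))
```

Getting None back for many lines is a hint that the file does not keep transcript_id in second position and that one of the other approaches in this thread is a better fit.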

1

Note that the = is particular to GFF and isn't present in GTF.

1

Good catch. Edited.