First, @Devon Ryan's solution is worth considering, as he has already spent time writing and checking this code. He uses the csv module to ensure that the split is done correctly (taking double quotes into account, etc.).
If you still want to stick with your idea of a Python regex, it is certainly possible.
1) Review the format you want to parse: GTF (link to format description)
2) You want to extract a piece of information from the last column (the attribute column), which can be composed of several tags separated by semicolons (';').
3) Now that you know what you want, you can build the Python regex from pseudocode like "{transcript_id}{\space}{id_value};", which gives you the following regex in Python:
"transcript_id\s([^;]+);?"
where:
transcript_id => the tag you want to find
\s => matches a whitespace character (here, the space between tag and value), as in the GTF format the attributes look like this: "TAG{SPACE}VALUE;"
([^;]+) => We want to extract ALL the characters EXCEPT the semicolon, as the semicolon is the attribute separator. We also use the parentheses to tell Python "I want to extract this information". The version you pasted uses [^\s] instead, which says that spaces are not allowed in the ID.
;? => In a regex, '?' marks the preceding character as optional. Sometimes the end of a GTF line lacks this semicolon (even though it is mandatory). So, with ";?", Python accepts the last field whether or not the ';' is present.
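The pattern described above can be tried on a bare attribute string before wrapping it in a function. This is only a small sketch: the attribute string is made up for illustration, and stripping the quotes afterwards is an optional extra step, not something the pattern does by itself.

```python
import re

# The pattern from above, written as a raw string so that "\s" reaches
# the regex engine unchanged.
pattern = re.compile(r"transcript_id\s([^;]+);?")

# Hypothetical attribute column, invented for this example.
attributes = 'gene_id "G1"; transcript_id "T1"; gene_name "N1"'

match = pattern.search(attributes)
captured = match.group(1)        # everything up to the next ';', quotes included
unquoted = captured.strip('"')   # drop the surrounding quotes if only the bare ID is wanted

print(captured)
print(unquoted)
```

Note that ([^;]+) keeps the double quotes around the value, since they are not semicolons.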
So, with this information, you can write a Python function to extract the transcript_id from a GFF line:
#!/usr/bin/env python3

# Stdlib Python3
import re

# Constants
GFF_SEP = "\t"


# Create a function to extract the transcript_id from a gff line.
# We assume that there is only ONE transcript_id per line.
def get_transcript_id(gffline):
    """ Returns the transcript_id (str) in the gffline.
    None if transcript_id not found.
    """
    # We first extract the attribute field (the last one) as you know that your tag is here
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    # Raw string, so that "\s" is passed to the regex engine unchanged
    regex_pattern = re.compile(r"transcript_id\s([^;]+);?")
    regex_results = regex_pattern.search(attribute_field)
    try:
        return regex_results.group(1)
    except AttributeError:
        # search() returned None: there is no transcript_id tag in this line
        return None


def main():
    """ Extract the transcript_id from one example line and print it.
    """
    one_gff_line = """1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";"""
    tr_id = get_transcript_id(one_gff_line)
    if tr_id:
        print("transcript_id extracted = %s" % (tr_id))
    else:
        print("transcript_id tag not found.")


if __name__ == "__main__":
    main()
And as you can see, this function can easily be extended to match any tag you want by adding another parameter, "tag_name", to the function signature. ;)
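One way that extension might look is sketched below. The name get_attribute and the use of re.escape and \b are my own choices for illustration, not part of the code above; the example line reuses a shortened version of the one from main().

```python
import re

GFF_SEP = "\t"


def get_attribute(gffline, tag_name):
    """Return the raw value of tag_name from the attribute column, or None.

    Hypothetical generalisation of get_transcript_id(); the value is
    returned as written in the file, surrounding quotes included.
    """
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    # re.escape protects against regex metacharacters in the tag name;
    # \b stops the tag from matching in the middle of a longer word.
    pattern = re.compile(r"\b" + re.escape(tag_name) + r"\s([^;]+);?")
    match = pattern.search(attribute_field)
    return match.group(1) if match else None


line = ("1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\t"
        'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1";')

print(get_attribute(line, "transcript_id"))
print(get_attribute(line, "gene_name"))
print(get_attribute(line, "exon_number"))
```

As with the original function, this assumes each tag appears at most once per line and only the first match is returned.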
I hope this helps and feel free to ask if something is not clear.
Actually, I do like this approach; thank you for sharing, Devon. Your solution is a bit slower than the regex, but definitely safer. Thanks!
Hey Ryan, I want to parse the GRCh38 or GRCh37 transcripts (model transcripts) from chr22 only, to use them in a kallisto analysis for read quantification.
Could I use your deepTools for this, and if so, which functions should I call? Paulo
I'd use awk instead, it'd be easier for filtering.
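For anyone who prefers to stay in Python, the same kind of chromosome filter might be sketched as below; awk would do it in a one-liner. The function name keep_chromosome and the sample lines are made up for illustration, and whether the first column reads "22" or "chr22" depends on where the annotation file comes from.

```python
def keep_chromosome(lines, chrom="22"):
    """Yield the annotation lines whose first column equals chrom."""
    for line in lines:
        if line.startswith("#"):
            # Skip header/comment lines
            continue
        if line.split("\t", 1)[0] == chrom:
            yield line


# Hypothetical, heavily shortened annotation lines
sample = ["#!genome-build\n", "22\thavana\tgene\n", "1\thavana\tgene\n", "22\thavana\texon\n"]
kept = list(keep_chromosome(sample))
print(len(kept))
```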