GTF file how do I search for duplicates of Gene ID and print them out
2
0
Entering edit mode
3 months ago
So ▴ 10

I am currently working on a GTF file that is tab-separated, and I need to use bash to find and extract any Gene IDs with more than one transcript, then print out these Gene IDs, how many transcripts was found (count), and the transcripts themselves.

GTF Bash • 1.0k views
2
Entering edit mode

What have you tried? This is pretty straightforward with core utils (grep/cut/uniq) and awk.

0
Entering edit mode

I haven't worked with bash before that's why I don't know the utilities, so I thought of asking here.

1
Entering edit mode

If you do not need to explicitly use bash you can have a look at the AGAT tool, I'm pretty sure it will have an option to achieve this.

AGAT - Another Gff/Gtf Analysis Toolkit

0
Entering edit mode

So : I second this. As a beginner in bash, you will run into a bunch of challenges that are great as a learning experience, but will result in delays in getting things done. For the moment, use this tool that already exists. The agat_sp_statistics script seems interesting - it might give you number of transcripts per gene statistic. If not, I'd recommend using R instead of bash as you're looking to do a group/aggregate operation which is done easier on R.

0
Entering edit mode

Ram you've provided alot of help, thank you so much. I will make sure to learn more R and try challenges abit far from Bash.

0
Entering edit mode

Unfortunately, I have to use bash, but I will still check AGAT for future use. Thank you!

0
Entering edit mode

Could you provide a sample of the data, please?

You can do the following to print out the top (head) 10 lines.

head -10 your_gtf_file.gtf

or print out the top (head) 5 lines:

head -5 your_gtf_file.gtf

and then past your output in your question. Remember to use the '010101' button to put it in code format so it's easier to visualize and copy and paste. This will help myself or someone else work with it to give you the output you need.

0
Entering edit mode
VFFH01000932.1  StringTie   transcript  716340  733649  1000    +   .   "gene_id ""STRG.14845""; transcript_id ""STRG.14845.6""; cov ""2.191051""; FPKM ""0.287366""; TPM ""0.383253"";"
VFFH01001189.1  StringTie   transcript  11702   12012   1000    .   .   "gene_id ""STRG.14846""; transcript_id ""STRG.14846.1""; cov ""4.823151""; FPKM ""0.632579""; TPM ""0.843654"";"
VFFH01000605.1  StringTie   transcript  72502   81013   1000    +   .   "gene_id ""STRG.14847""; transcript_id ""STRG.14847.1""; cov ""3.486188""; FPKM ""0.457230""; TPM ""0.609796"";"
VFFH01001300.1  StringTie   transcript  31857   39959   1000    +   .   "gene_id ""STRG.14848""; transcript_id ""STRG.14848.1""; cov ""54.615608""; FPKM ""7.163090""; TPM ""9.553233"";"
VFFH01001300.1  StringTie   transcript  558348  570819  1000    -   .   "gene_id ""STRG.14849""; transcript_id ""STRG.14849.1""; cov ""40.338367""; FPKM ""5.290564""; TPM ""7.055892"";"
VFFH01001300.1  StringTie   transcript  577718  578223  1000    -   .   "gene_id ""STRG.14850""; transcript_id ""STRG.14850.1""; cov ""41.142540""; FPKM ""5.396035""; TPM ""7.196556"";"
VFFH01001300.1  StringTie   transcript  578540  590271  1000    +   .   "gene_id ""STRG.14851""; transcript_id ""STRG.14851.1""; cov ""27.357616""; FPKM ""3.588078""; TPM ""4.785329"";"
VFFH01001300.1  StringTie   transcript  54115   551310  1000    -   .   "gene_id ""STRG.14852""; transcript_id ""STRG.14852.1""; cov ""76.314400""; FPKM ""10.008987""; TPM ""13.348734"";"
VFFH01001300.1  StringTie   transcript  54115   86777   1000    -   .   "gene_id ""STRG.14852""; transcript_id ""STRG.14852.2""; cov ""7.318105""; FPKM ""0.959803""; TPM ""1.280066"";"
VFFH01001300.1  StringTie   transcript  69240   551310  1000    -   .   "gene_id ""STRG.14852""; transcript_id ""STRG.14852.3""; cov ""8.781900""; FPKM ""1.151787""; TPM ""1.536109"";"


So this is the last few heads of the gtf file, there are over a thousand in there.

0
Entering edit mode

So basically what I want is to write bash code that checks if a gene id has more than one transcript. If so it collects the Gene Ids, how many transcripts map to it, and what are the ids of these transcripts "Transcript ID".

Please, provide me mainly with the sources to learn so I can handle such tasks on my own, I love having such helpful community, but I mainly love to rely on myself, so please do provide me with what I need to learn as well so I can do even more complex tasks on my own.

Thank you!

0
Entering edit mode

0
Entering edit mode

Okay, so I worked abit on it and here is what I have come to, and I still need help.

I will rephrase the whole problem real quick: I am currently working on a GTF file, here is an example of one header for the file.

VFFH01002606.1  StringTie   transcript  246006  246411  1000    .   .   "gene_id ""STRG.1""; transcript_id ""STRG.1.1""; cov ""6.849754""; FPKM ""0.898377""; TPM ""1.198143"";"


Every line looks like this with different values, and I simply wanted to Extract Gene ID (e.g. STRG.1) and then find if this gene ID is repeated, if it is repeated (duplicate) extract all the Transcript IDs that matches that gene ID (in this line only STRG.1.1) and put all of this in one row! and tell me how many times did you find that Gene ID.

What I tried to do was

awk -F " " '{ print $12,$10}'  Downloads/t1.gtf | tr -d \" |tr -d \; | sort -n | uniq -D -f1


Gave me this output ( a single header as an example )

3 STRG.9999.1 STRG.9999


It counted the occurrence correctly, but compressed all Transcript IDs to just one, didn't add them in a single row.

Then, I stumbled upon this piece of code from the user in stackoverflow (credits to the user) https://unix.stackexchange.com/users/65304/steeldriver?tab=profile answered it as follows

awk '
BEGIN{FS="\t"; OFS=FS};
{ arr[$1] = arr[$1] == ""? $2 : arr[$1] "," $2 } END {for (i in arr) print i, arr[i] }' data  I edited it to awk ' BEGIN{FS=" "; OFS=FS}; { arr[$10] = arr[$10] == ""?$12 : arr[$10] ","$12 } END {for (i in arr) print i, arr[i] }' Downloads/t1.gtf| tr -d \" | tr -d \; | sort -n


It gave me this

STRG.9983 STRG.9983.1,STRG.9983.2,STRG.9983.3,STRG.9983.4,STRG.9983.5


Now I am stuck with two things:

I don't really understand the code, could someone explain it? (Below is what I managed to understand)


I changed the separator to be white space instead of tabs to recognize the fields in column 3, I don't understand the use of OFS=FS. For the second part I understand that it creates an array for the row that will be concatenated to. don't understand a thing after the equal sign.

How can I count the repeated transcript IDs?


As for the last post, it has 5 different transcript IDs for the same gene ID. How do I print it out next to them in a different field?

I tried awk print NF, using comma separation for the file but it replaced the previous output with the count. I only added this to the previous code.

awk -F"," '{print NF}'

2
Entering edit mode
3 months ago
Pratik Mehta ▴ 570

Save the following script to duplicate.sh and place it in your directory where the gtf/gff file is:

#!/bin/bash

#this will set a variable for the duplicated gene ids
duplicated_gene_ids=$(cat yourgtffile.txt | awk -F " " '{ print$10,$12 }' | tr -d '"' | tr -d ';' | cut -d " " -f1 | uniq -d) #the following line removes the output file (extracted.txt) prior to start rm extracted.txt #the following for loop will cycle throug the variable$duplicated_gene_ids
#generating extracted.txt and print the counted occurrence of duplicates
for singlegeneid in ${duplicated_gene_ids} do grep${singlegeneid} yourgtffile.txt | awk -F " " '{ print $10,$12 }' | tr -d '"' | tr -d ';' >> extracted.txt
grep --count ${singlegeneid} yourgtffile.txt | awk '{ print "'"${singlegeneid}"'" " occurred " $1 " times" } ' done  Use it like this: pratik@pratik:~/Desktop/So$ bash duplicate.sh


Output:

STRG.14852 occurred 3 times


Output of extracted.txt:

STRG.14852 STRG.14852.1
STRG.14852 STRG.14852.2
STRG.14852 STRG.14852.3


I must comment that using "'"${singlegeneid}"'" is generally not a good idea as I've read that people can inject code into that somehow like I guess run entire scripts (ie. viruses etc) through the variable somehow. I guess the double "'" is what allows this. Therefore there probably is a better way to write this. but this accomplishes what you need, I think. Hope this helps! ADD COMMENT 1 Entering edit mode 3 months ago Pratik Mehta ▴ 570 Hi So, Here is your script in both bash and R, as this was easier for me. (I also used some shortcuts here), I will make a bash solution at some point... it will just take more time as I have a Master's project to tend to that I've been putting aside!!! I must say though Google is your friend. you can google things like: count duplicates in bash and the second link is someone asking the question on stackoverflow: https://stackoverflow.com/questions/6712437/find-duplicate-lines-in-a-file-and-count-how-many-time-each-line-was-duplicated and then piece that together with the next question you have to build your script. Anyways here's my try at this. So you would just have to change the file paths appropriately. Your current output is this: pratik@Pratiks-MacBook-Air So % cat transcripts.txt VFFH01000932.1 StringTie transcript 716340 733649 1000 + . "gene_id ""STRG.14845""; transcript_id ""STRG.14845.6""; cov ""2.191051""; FPKM ""0.287366""; TPM ""0.383253"";" VFFH01001189.1 StringTie transcript 11702 12012 1000 . . "gene_id ""STRG.14846""; transcript_id ""STRG.14846.1""; cov ""4.823151""; FPKM ""0.632579""; TPM ""0.843654"";" ...  Doing the following: cat transcripts.txt | tr -d '"' | tr -d ';' | awk -F " " '{ print$3,$4,$5,$10,$12 }' | tr "." ' ' | awk -F " " '{ print $1,$2,$3,$4 "." $5,$6 "." $7 " "$8 }' > transcripts_cleaned.txt


Will result in a text file "transcripts_cleaned.txt" that looks like this:

pratik@Pratiks-MacBook-Air So % cat transcripts_cleaned.txt
transcript 716340 733649 STRG.14845 STRG.14845 6
transcript 11702 12012 STRG.14846 STRG.14846 1
transcript 72502 81013 STRG.14847 STRG.14847 1
...


You can see here that I basically used the number after transcript id to provide a number for the counts of transcripts. The last column being the counts.

Now in R:

The following script or variations should get you what you want:

df <- read.table("~/Desktop/biostars/So/transcripts_cleaned.txt", sep = " ")


# this will remove geneIDS/trasncript id if they are duplicates - just to make sure they match

df2 <- df[!duplicated(as.list(df))]

install.packages('dplyr')
library(dplyr)
df.with.more.than.one.count <-  df2 %>% group_by(V4) %>% filter(n()>1)
df.with.gene.ids.transcripts <- df.with.more.than.one.count %>% group_by(V4) %>% filter(V6==max(V6))


df.with.gene.ids.transcripts$V4  # the following write just the gene ids with more than one transcript to the file path write.table(df.with.gene.ids.transcripts$V4, file = "~/Desktop/biostars/So/geneids_with_more_than_one_transcript.txt", sep= " ", col.names = F, row.names =  F)


# lastly, this writes the whole "df.with.gene.ids.transcripts" data frame which contains the information you requested to the file path

write.table(df.with.gene.ids.transcripts, file = "~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame.txt", sep= " ", col.names = F, row.names =  F)


if you want to remove quotes in the file you can use the tr for translate like this:

cat ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame.txt | tr -d '"' > ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame_no_quotes.txt


and

cat ~/Desktop/biostars/So/geneids_with_more_than_one_transcript.txt | tr -d '"' >  ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_without_quotes.txt


I hope this helps.

1
Entering edit mode

This was very detailed and helpful! Thank you so much!