How To Retain One Cds Out Of Multiple Isoforms In Human Genomic Data
2
0
Entering edit mode
8.3 years ago
qiyunzhu ▴ 430

Dear all,

Wish you all have a great weekend. I am trying to generate a non-redundant collection of human protein-coding genes from the human reference genome (e.g., ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_Y/hs_alt_HuRef_chrY.gbk.gz). As we know there can be several protein isoforms associated with one gene. This resulted in a total number of 60,000 proteins. I don't want this many. I just need any one (ideally the longest one) out of the several isoforms per gene. I need the NCBI accession numbers (NP_xxxxxx) of them. Is there anyway I can achieve that? Thank you!

Best,

genomics cds • 2.9k views
2
Entering edit mode
8.3 years ago
PoGibas 5.0k

So what is your question? "Is there anyway I can achieve that?" If so, the answer is: "Yes, you can" :)

Ok, but now serious stuff. As your link was only an example I wrote this "Extracting longest CDS" for the Gencode v18 data.

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_18/gencode.v18.annotation.gtf.gz" |
gunzip -c |
awk 'BEGIN {OFS="\t"}; ($3 ~/CDS/ &&$14 ~ /protein_coding/) {print $10,$12,$5-$4}' |
tee Transcript_CDS |          # gene_id transcript_id CDS_length
cut -f 1 |
sort -u \
> Gene_id_list                   # Total gene_id list

grep -w "$GENE" Transcript_CDS | tee Gene_data | # For every gene_id extract transcripts & CDS (makes it easier in the following steps) cut -f 2 | sort -u \ > Transcript_id_list # Total transcript_id list (for$GENE)
awk -v TRANSCRIPT="$TRANSCRIPT" '($2==TRANSCRIPT) {LENGTH+=$3} END {print LENGTH"\t"TRANSCRIPT}' Gene_data # For every$TRANSCRIPT sum CDS length
done < Transcript_id_list |
sort -nr -k1 |
cut -f 2 \
>> Longest_CDS_list           # Transcript_id's that have longest CDS
done < Gene_id_list


This will only get you transcript_id that has longest CDS (compared to other transcripts). It's not perfect (but you only have to run it once). I also want some feedback from the community on it as I am only starting writing this kind of things. What you can do with this "Longest_CDS_list"?:

Extract transcript info from the original gtf;
Extract translated sequence from here;
Convert them to NCBI accesion number (my answer doesn't cover this part).

Wish you have a great weekend too! :)

0
Entering edit mode

Dear Pgibas: Sorry for my late reply! I have had a very busy week. Your code works on my laptop without any problem. I attempted to understand your idea. Although I haven't used AWK... The code definitely looks great. Thanks for your effort!

1
Entering edit mode
8.2 years ago
qiyunzhu ▴ 430

Later I worked out a Perl script to do the job. It's a quite verbose piece of script. It works on the human chromosome Y as I posted in this thread. I recommend Pgibas's answer which is definitely easier. Here is mine:

#!/usr/bin/perl

use warnings;
use strict;

my $current_gene = ""; my$current_max_size;
my $current_max_protein; while (<>){ chomp; if (/^ gene /){ if ($current_gene and $current_max_size){ print "$current_max_protein\t$current_max_size\t$current_gene\n";
}
my $s = <>; chomp$s;
if ($s =~ /^\s+\/gene="(.+)"$/){
$current_gene =$1;
$current_max_size = 0;$current_max_protein = "";
next;
}
}
if (/^     CDS             (.+)$/){ my$region = $1; while (1){ my$s = <>;
last if $s =~ /\s+\//; chomp$s;
$region .=$s;
}
$region =~ s/[^0-9\.,]//g; my$size = 0;
my @a = split (/,/, $region); foreach (@a){ if (/(\d+)\.\.(\d+)/){$size += abs($1-$2)+1;
}
}
my $protein = ""; while (1){ my$s = <>;
chomp $s; if ($s =~ /^\s+\/protein_id="(.+)"$/){$protein = $1; last; } } print "$protein: $size\n"; if ($size > $current_max_size){$current_max_size = $size;$current_max_protein = $protein; } next; } } if ($current_gene and $current_max_size){ print "$current_max_protein\t$current_max_size\t$current_gene\n";
}