How To Retain One Cds Out Of Multiple Isoforms In Human Genomic Data
2
0
Entering edit mode
10.5 years ago
qiyunzhu ▴ 430

Dear all,

Wish you all have a great weekend. I am trying to generate a non-redundant collection of human protein-coding genes from the human reference genome (e.g., ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_Y/hs_alt_HuRef_chrY.gbk.gz). As we know there can be several protein isoforms associated with one gene. This resulted in a total number of 60,000 proteins. I don't want this many. I just need any one (ideally the longest one) out of the several isoforms per gene. I need the NCBI accession numbers (NP_xxxxxx) of them. Is there anyway I can achieve that? Thank you!

Best,

genomics cds • 3.5k views
ADD COMMENT
2
Entering edit mode
10.5 years ago
PoGibas 5.1k

So what is your question? "Is there anyway I can achieve that?" If so, the answer is: "Yes, you can" :)

Ok, but now serious stuff. As your link was only an example I wrote this "Extracting longest CDS" for the Gencode v18 data.

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_18/gencode.v18.annotation.gtf.gz" |
   gunzip -c |
   awk 'BEGIN {OFS="\t"}; ($3 ~/CDS/ && $14 ~ /protein_coding/) {print $10,$12,$5-$4}' |
   tee Transcript_CDS |          # gene_id transcript_id CDS_length
   cut -f 1 | 
   sort -u \
> Gene_id_list                   # Total gene_id list

while read GENE; do
   grep -w "$GENE" Transcript_CDS |  
      tee Gene_data |            # For every gene_id extract transcripts & CDS (makes it easier in the following steps) 
      cut -f 2 | 
      sort -u \
   > Transcript_id_list          # Total transcript_id list (for $GENE)
   while read TRANSCRIPT; do
      awk -v TRANSCRIPT="$TRANSCRIPT" '($2==TRANSCRIPT) {LENGTH+=$3} END {print LENGTH"\t"TRANSCRIPT}'  Gene_data          # For every $TRANSCRIPT sum CDS length 
   done < Transcript_id_list |
      sort -nr -k1 |
      head -n1 |
      cut -f 2 \
   >> Longest_CDS_list           # Transcript_id's that have longest CDS
done < Gene_id_list

This will only get you transcript_id that has longest CDS (compared to other transcripts). It's not perfect (but you only have to run it once). I also want some feedback from the community on it as I am only starting writing this kind of things. What you can do with this "Longest_CDS_list"?:

Extract transcript info from the original gtf;
Extract translated sequence from here;
Convert them to NCBI accesion number (my answer doesn't cover this part).

Wish you have a great weekend too! :)

ADD COMMENT
0
Entering edit mode

Dear Pgibas: Sorry for my late reply! I have had a very busy week. Your code works on my laptop without any problem. I attempted to understand your idea. Although I haven't used AWK... The code definitely looks great. Thanks for your effort!

ADD REPLY
1
Entering edit mode
10.5 years ago
qiyunzhu ▴ 430

Later I worked out a Perl script to do the job. It's a quite verbose piece of script. It works on the human chromosome Y as I posted in this thread. I recommend Pgibas's answer which is definitely easier. Here is mine:

#!/usr/bin/perl

use warnings;
use strict;

my $current_gene = "";
my $current_max_size;
my $current_max_protein;

while (<>){
    chomp;
    if (/^     gene            /){
        if ($current_gene and $current_max_size){
            print "$current_max_protein\t$current_max_size\t$current_gene\n";
        }
        my $s = <>;
        chomp $s;
        if ($s =~ /^\s+\/gene="(.+)"$/){
            $current_gene = $1;
            $current_max_size = 0;
            $current_max_protein = "";
            next;
        }
    }
    if (/^     CDS             (.+)$/){
        my $region = $1;
        while (1){
            my $s = <>;
            last if $s =~ /\s+\//;
            chomp $s;
            $region .= $s;
        }
        $region =~ s/[^0-9\.,]//g;
        my $size = 0;
        my @a = split (/,/, $region);
        foreach (@a){
            if (/(\d+)\.\.(\d+)/){
                $size += abs($1-$2)+1;
            }
        }
        my $protein = "";
        while (1){
            my $s = <>;
            chomp $s;
            if ($s =~ /^\s+\/protein_id="(.+)"$/){
                $protein = $1;
                last;
            }
        }
        print "  $protein: $size\n";
        if ($size > $current_max_size){
            $current_max_size = $size;
            $current_max_protein = $protein;
        }
        next;
    }
}
if ($current_gene and $current_max_size){
    print "$current_max_protein\t$current_max_size\t$current_gene\n";
}
ADD COMMENT

Login before adding your answer.

Traffic: 2367 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6