Extracting sequence and annotation of a particular gene
1
0
Entering edit mode
5.1 years ago
Adrian Pelin ★ 2.6k

Hello, I have the human and mouse GENCODE annotations and the genome that goes with it. I am interested to create sequence and annotation files for individual genes.

For instance, if I am interested in RSAD2, I can easily find all annotations in the GTF file. What I want is to first extract the entire gene sequence forward sense into a separate fasta file (the gene feature from GENCODE). Then I want to extract all RSAD2 annotations (UTR, exon, CDS, etc.) from the original GENCODE file but modify coordinates so it matches my newly extracted fasta file.

Is there any tool that does this? I am trying to write a bash script but it is a pain so far....

Adrian

Annotation GTF • 1.7k views
ADD COMMENT
1
Entering edit mode
5.1 years ago
ATpoint 81k

My crude suggestion:

function GetAnnotFromGene {

  GENE=$1
  GTF=$2
  FASTA=$3

  grep -w "${GENE}" "${GTF}" > "${GENE}"_grepped.tmp

  OFFSET=$(awk '{if($3 == "gene") print $4-1}' "${GENE}"_grepped.tmp)

  awk -v offset=$OFFSET \
    'OFS="\t" \
    { print $1, $2, $3, $4-offset, $5-offset, $6, $7, $8, $9 }' \
    "${GENE}"_grepped.tmp > "${GENE}"_annotations.txt

  awk '$1 ~ /^#/ {print $0;next} {if($3 == "gene") print}' "${GENE}"_grepped.tmp | bedtools getfasta -s -bed - -fi "${3}" -fo "${GENE}".fa

  rm "${GENE}"_grepped.tmp

}

GetAnnotFromGene Spi1 gencode_annotations.gtf hg38.fa

Will output the gene sequence (strand-specific) as GENE.fa and its annotations with coordinates shifted to have start as 1 named GENE_annotations.txt. It is slow because grep searches the entire GTF, one can or sure improve this.

ADD COMMENT
0
Entering edit mode

Very elegant, thanks! It seems your script may neglect strand info when it comes to annotation. Since getfasta gets a forward sequence for something that is minus stranded, the annotations likely needed to be reshaped as well. In your case, regardless of strand the coordinates are simply brought to the first column 4 coordinate of the gene feature. I was thinking, in case of negative strand the offset should be column 5 (last gene coordinate since it is actually the start of the feature). In that case, is it correct to subtract from that offset all other coordinates? i.e. offset - $4, offset - $5? Thanks again for the help! P.S. to speed grep up, is it faster to first awk the file for string gene in the third column and then grep for gene name? I guess if this needs to be speedy a separate file can be made of just gene features.

ADD REPLY

Login before adding your answer.

Traffic: 2683 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6