Choosing a single consensus promoter/transcript for each gene (Mouse)
1
0
Entering edit mode
13 months ago
srhic ▴ 40

Hello,

I am doing some analysis using a pipeline that links enhancers to their putative target promoters. The pipeline requires as input (among other data) a bed file containing transcript start and end coordinates for each gene. However, it expects each gene to be represented by a single transcript so it can link a 500bp region flanking that TSS to an enhancer.

I have downloaded refseq annotation from ucsc table browser but it obviously contains multiple transcripts for each gene. I am not sure what would be a reasonable criterion for choosing a single representative transcript for each gene. Are there any options in the ucsc table browser that I may use to narrow down the most 'representative' transcript for each gene? Or are there any other annotation databases that may be helpful for finding a consensus TSS for each gene?

Thanks

Edit: I noticed that ensembl biomart has a 'gene start' and 'gene end' attribute instead of tss and tts which gives the outermost tss and tts for each gene. I was wondering if it makes sense to use this and if there a way I can get this from the ucsc table browser? (don't want to use ensembl as i have been using refseq annotation for everything else)

annotation ucsc general • 1.3k views
0
Entering edit mode

If you are referring to human genome then take a look at the MANE project as a potential source of single representative (with caveats) transcript per gene.

0
Entering edit mode

Thanks. I am working with the mouse genome. Will edit my original post to clarify.

5
Entering edit mode
13 months ago
vkkodali_ncbi ★ 3.1k

For mouse, you should take a look at the RefSeq Select project. That will give you one-transcript-per-gene. To get to a BED file with transcript start and end positions, you can do something like this:

## download a list of Mouse RefSeq Select transcripts using Entrez Direct
$esearch -db nuccore -query 'refseq_select[sb] AND txid10090[orgn]' | efetch -format acc > accs.txt$ wc -l accs.txt
21099 accs.txt

## download RefSeq mouse annotation in GFF3 format
$curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/10090/109/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.gff.gz ## A hackish way to generate a BED file from the GFF3 with transcript start and end positions$ zgrep -v '^#' GCF_000001635.27_GRCm39_genomic.gff.gz \
| awk 'BEGIN{FS="\t";OFS="\t"}($3!="exon" &&$3!="CDS" && $9~/transcript_id/){print$1,$4-1,$5,$7,$9}' \
| sed -r 's/ID=.*transcript_id=([^;]*).*/\1/g' \
| awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,$5,"1000",$4}' > GRCm39_transcripts.bed$ head -n2 GRCm39_transcripts.bed
NC_000067.7     3172238 3172348 XR_004936710.1  1000    +
NC_000067.7     3269955 3741733 XM_006495550.5  1000    -

## Then filter the RefSeq Select rows
## grep for accession without version in case the accession was updated after the release of GFF3
$cut -f1 -d'.' accs.txt | grep -f - -w GRCm39_transcripts.bed > GRCm39_transcripts.RefSeq_Select.bed$ head -n2 GRCm39_transcripts.RefSeq_Select.bed
NC_000067.7     3284704 3741721 NM_001011874.1  1000    -
NC_000067.7     4190088 4430526 NM_001370921.1  1000    -

0
Entering edit mode

Thanks a lot for the detailed reply! This is exactly what I needed.

0
Entering edit mode

Hello, I was trying to use this code and had a couple of questions. Would appreciate if you can help.

1. When I run the last line of code to grep the accession it gets stuck and keeps running until I terminate it. It worked when I replaced 'grep -f' with 'grep -F' to treat the input as text but I wasn't sure if I should do it.
2. The final file had ~60,000 entries. Shouldn't it have 21,099 entries (same as accessions.txt)? Also how do I get standard chromosome names in the bed file in place of NC_000067.7 etc.

Since my data is mapped to GRCm38, I used the gff file for that but I assume that should be ok.

0
Entering edit mode

Are you running this on a Mac? The grep command there is supposed to be different from the ones that come with a linux distribution. I am running GNU grep 3.4 on a Linux machine. One alternative is to drop the version numbers from both accs.txt and GRCm38_transcripts.bed file and just the join command instead of grep.

The final file should not have that many transcripts. Something is not correct. My GRCm38_transcripts.bed file has 123654 rows. After filtering for RefSeq Select transcripts, I see 22560 rows in my GRCm38_transcripts.RefSeq_Select.bed file. Why do you see >21099 rows you ask? That's because several transcripts are annotated on multiple genomic accessions. If you are not interested in anything other than the named chromosomes chr1-chr19,X,Y (and may be mitochondria), you can drop all rows not starting with "NC_".

$cat GRCm38_transcripts.RefSeq_Select.bed | datamash count 4 countunique 4 22560 20965$ grep '^NC_' GRCm38_transcripts.RefSeq_Select.bed | datamash count 4 countunique 4
20940   20940


In general, you should expect to see <21099 rows because some of the RefSeq Select transcripts from accs.txt are expected to be absent in the GFF3. Note, GFF3 is from a specific point in time when the genome was annotated where as the RefSeq Select accessions you are retrieving using Entrez Direct are live at the time of your search; curation activities occurring since the annotation are reflected in the Entrez Direct results but not in the GFF3.

Check out https://github.com/vkkodali/cthreepo for a script that can help with swapping chromosomes names.

0
Entering edit mode

Thanks, I am using linux but it seems grep wasnt working as expected for some reason. Using join after dropping version numbers as you recommended gave me ~22000 lines as expected.

One final thing I was wondering about is if I really need to do this with a gff file. I downloaded a standard bedfile of the refseq track from ucsc table browser and just used the join command on that bed file and the accs.txt. It gave me similar results with around 20,000 transcripts. Is their anything wrong with this approach? It seemed simpler to do it this way as it didn't involve converting gff to bed and then converting chromosome names.

1
Entering edit mode

In general, I recommend getting data directly from the original source as much as possible. It is _probably_ fine if you download the "RefSeq All" table since UCSC too parses RefSeq GFF3 files as they say here:

Data files were downloaded from RefSeq in GFF file format and converted to the genePred and PSL table formats for display in the Genome Browser.

But the other issue to keep in mind is how up-to-date the data are. For example, the UCSC track currently shows data from 2017-08-04 for GRCm38 where as the latest version of annotation for GRCm38 is 108.20200622 released in 2020.

1
Entering edit mode

Compare your list to list of official mouse gene symbols from MGI (column 3 in this file). The Mouse Genome Informatics Database is the authoritative source of official names for mouse genes, alleles, and strains. At moment there appear to be 23,049 protein-coding genes.