Question

Find genome coordinates of nucleotide sequence within specific gene

0

Entering edit mode

4.0 years ago

Mr Locuace ▴ 160

I have a nucleotide sequence "CACTGCC" and I want to know the physical coordinates (start position and end position) of this sequence in a specific gene (all occurrences), let's say NOTCH1. Here there is a Biopython solution that scans the whole genome and retrieves all occurrences of the sequence:

Locating A Sequence In A Fasta File.

However, this takes very long and it seems unnecessary to scan the whole genome if I am only interested in a single gene. Thanks !!

Biopython FASTA Physical positions • 2.5k views

ADD COMMENT • link updated 4.0 years ago by Alex Reynolds 35k • written 4.0 years ago by Mr Locuace ▴ 160

score 2 · Accepted Answer · 2020-04-23

Your sequence is too short to use with BLAT or BLAST, I think.

Here's one grep-based approach you can parallelize on a per-chromosome basis.

This query ran in a second or two, for me. I think that's pretty fast.

We can look up NOTCH1 and we know it is found on chr9.

For demonstration purposes, I'll download the FASTA sequence for that chromosome for genome hg38, but the principle is the same for other chromosomes and genome assemblies:

$ wget -qO- https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr9.fa.gz | gunzip -c > chr9.fa

This will give you a list of all the 1-based byte offsets where your string of interest is found:

$ tail -n+2 chr9.fa | tr -d '\n' | grep -F -b -o "CACTGCC" | cut -d':' -f1 > forwardHitOffsets.txt

The -F option is a fixed-string search (exact match). The -b -o combination returns a 1-indexed byte offset. Because genomic sequence characters are single bytes, we can use those numbers directly as offsets or genomic positions.

Maybe your sequence is on the reverse strand, so you'd want to do a lookup for the reverse complement:

$ tail -n+2 chr9.fa | tr -d '\n' | grep -F -b -o "GGCAGTG" | cut -d':' -f1 > reverseHitOffsets.txt

To make a BED file from the forward-stranded hits:

$ awk -v SEQ="CACTGCC" -v OFS="\t" -v CHR="chr9" '{ print CHR, ($1 - 1), ($1 + length(SEQ) - 1); }' forwardHitOffsets.txt | sort-bed - > forwardHits.bed

For the reverse-stranded hits:

$ awk -v SEQ="GGCAGTG" -v OFS="\t" -v CHR="chr9" '{ print CHR, ($1 - 1), ($1 + length(SEQ) - 1); }' reverseHitOffsets.txt | sort-bed - > reverseHits.bed

To make one file:

$ bedops --everything forwardHits.bed reverseHits.bed > hits.bed

All the above steps can be worked into a parallelized pipeline, to eliminate intermediate files as well as to run queries in parallel, per-chromosome. I'll leave that as homework.

Once you have the file hits.bed, you can do set operations to get all hits overlapping a specific gene, e.g., NOTCH1:

$ bedmap --echo --echo-map genes.bed hits.bed | grep NOTCH1 > answer.txt

The file genes.bed would come from a BED file of refSeq genes from UCSC or GENCODE annotations, or other sources, where genes use HGNC symbol names.

That's a separate question, but one for which there are answers in other questions on biostars and Bioinformatics SE.

Nonetheless, to complete the demonstration:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz \
  | gunzip -c - \
  | awk -v OFS="\t" '{ if (!match($13, /.*-[0-9]+/)) { print $3, $5, $6, $13, ".", $4; } }' - \
  | sort-bed - \
  > refGene.hg38.sorted.bed

You'd replace genes.bed with refGene.hg38.sorted.bed in this demo. This would give the following result:

$ bedmap --echo --echo-map refGene.hg38.sorted.bed hits.bed | grep NOTCH1
chr9    136494432       136546048       NOTCH1  .       -|chr9  136495154       136495161;chr9  136495883       136495890;chr9  136500162       136500169;chr9  136500485       136500492;chr9   136500540       136500547;chr9  136501856       136501863;chr9  136503293       136503300;chr9  136506434       136506441;chr9  136506654       136506661;chr9  136507631        136507638;chr9  136508069       136508076;chr9  136510705       136510712;chr9  136515917       136515924;chr9  136517565       136517572;chr9  136521243       136521250;chr9   136521391       136521398;chr9  136523091       136523098;chr9  136525729       136525736;chr9  136528751       136528758;chr9  136530034       136530041;chr9  136532948        136532955;chr9  136536555       136536562;chr9  136538278       136538285;chr9  136540409       136540416

The answer is delimited in two pieces with a pipe symbol (|).

The first piece is the NOTCH1 annotation. The second piece is a semi-colon delimited list of all the locations where your string (or its reverse complement) is found, which overlap NOTCH1 by one or more bases.

You can add overlap constraints to bedmap. For instance, if you want to ensure that the entire query sequence is found inside the bounds of NOTCH1, add --fraction-map 1:

$ bedmap --echo --echo-map --fraction-map 1 refGene.hg38.sorted.bed hits.bed > answer.txt

Read bedmap --help for more options.