I recently used BLAT to generate a large number of alignments but the input files were on the chromosome level, not individual genes. I realize BLAT and filtering the subject by taxonomical ID is an option, but BLAT seems to be a stricter option. The alignments are 60bp with no gaps. What would be the best way to find the gene associated with each of these regions, if it exists?
As @AndreiR as said...
Download a bed file of all the genes of interest (from the ucsc table browser you can download ready-made bed files)
then make a bed file from your alignments:
then install bedtools and:
intersectBed -a alignments.bed -b refseq.bed -wo > intersections.txt
This should give you all the info you need.
One way to do this is to use BEDOPS psl2bed to convert UCSC BLAT output to UCSC BED, then use BEDOPS bedmap to list BED-formatted gene annotations associated with the BED coordinates of your BLAT search results.
$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_16/gencode.v16.annotation.gtf.gz \ | gunzip -c - \ | gtf2bed - \ | grep -w gene - \ > gencode.v16.genes.bed
Then, you could build a list of unique gene IDs that map to each BLAT search result with the following
bedmap command. Let's assume your BLAT results are in a file called
$ psl2bed < searchResults.psl \ | bedmap --echo --echo-map-id-uniq - gencode.v16.genes.bed \ > answer.viaBedmap.bed
answer.viaBedmap.bed is a sorted BED file with results in the following layout:
[ BLAT result 1 ] | [ gene 1-1 ] ; [ gene 1-2 ] ; ... ; [ gene 1-i ] ... [ BLAT result N ] | [ gene N-1 ] ; [ gene N-2 ] ; ... ; [ gene N-j ]
[ BLAT result n ] is a BED element from the BLAT result. The pipe character delimits this element from a semi-colon-delimited list of unique GENCODE gene IDs (
gene_name fields from the original GTF) that map to the BLAT element.
If you want the entire gene element, replace
--echo-map. There are other
--echo-map-* operators available; the documentation explains their usage.
The default mapping criterion is one or more bases of overlap between the BLAT and GENCODE elements. You can change this criteria to be more stringent with the appropriate
If you don't need to know how the gene IDs associate with each BLAT result — you just want the gene IDs that overlap the BLAT result population — a second way to do this is to instead use BEDOPS
$ psl2bed < searchResults.psl \ | bedops --element-of -1 gencode.v16.genes.bed - \ > answer.viaBedops.bed
This will just return BED-formatted GENCODE gene elements that overlap the BLAT coordinates by one or more bases. You don't get the per-element association information with a
bedops-based approach, but it will run faster than
Which tool to use between
bedops will likely depend on what question you are trying to answer.
answer.*.bed files calculated with either approach are sorted, which means that they are ready to use with other downstream BEDOPS-based analyses. This means that you can do more fast and memory efficient set or mapping operations with this result and BEDOPS tools, and no further sorting is necessary.