Question: Closest gene to set of intervals
0
gravatar for rbronste
22 months ago by
rbronste240
rbronste240 wrote:

Hi,

I know this question has been asked before, but have some additional variations:

I have a bed file of regions, want to see closest gene.

  1. Need a file containing one line for each gene instead of what table browser outputs, which is several lines for each refSeq gene. Not entirely sure how to get this.
  2. Want output to be the gene names themselves instead of a list of intervals, I guess a further awk operation to get there?
  3. Want closest gene irrespective of which direction it is from interval.

Any opinions appreciated, thanks!

bedops bedtools • 824 views
ADD COMMENTlink modified 22 months ago by Rohit1.3k • written 22 months ago by rbronste240
1

Use BEDOPS closest-features --closest --no-ref to get the closest element (ties go upstream) and leave out the reference element (i.e. reporting the nearest element only). Pipe to cut to get the ID from the fourth column:

$ closest-features --closest --no-ref regions.bed refseq.bed | cut -f4 - > nearest-refseq-IDs.txt

If your annotations are in a different format, you can convert directly:

$ closest-features --closest --no-ref regions.bed <(gff2bed < refseq.gff) | cut -f4 - > nearest-refseq-IDs.txt
ADD REPLYlink modified 22 months ago • written 22 months ago by Alex Reynolds28k

Very helpful thanks. Seems like I am constantly pulling refSeq files from table browser that have 35K+ lines in the BED, is there an easy way on there to just list every gene once? Seems like I am getting alternative transcript locations. Also does this command restrict the distance at which the element can exist from internal in my regions.bed file, I ran the refSeq file against my 500 intervals but only got one gene. Thanks again!

ADD REPLYlink written 22 months ago by rbronste240
1

You could pipe ids to sort | uniq to get a sorted and unique list. You'd lose any associations with coordinates but then you aren't getting those, anyways.

ADD REPLYlink written 22 months ago by Alex Reynolds28k

Great, that works well. Wondering if there would be any reason for the regions.bed file to be scanned only across first interval for nearby gene?

ADD REPLYlink written 22 months ago by rbronste240

I'm sorry, but I don't quite understand. Can you describe what you mean?

ADD REPLYlink written 22 months ago by Alex Reynolds28k

So I run the following command:

closest-features --closest --no-ref Atf2_sorted.bed refGene_062817_bedops_sorted.bed | cut -f4 - > nearest-refseq-IDs.txt

The Atf2_sorted.bed file has ~700 intervals however only the first one is considered when comparing to the refGene annotation - and I get an output of the closest gene to that first interval only.

ADD REPLYlink written 22 months ago by rbronste240

I think its actually the format of my BED regions file, is there a specific type of BED that bedops accepts?

ADD REPLYlink written 22 months ago by rbronste240
1

BEDOPS takes input that is a sort-bed-sorted, minimally three-column BED file, in most cases.

Are you working with files from Microsoft tools like Word, Excel or Windows? You'd need to strip any carriage return characters (\r) with tr or dos2unix etc. Having those characters can cause parsing problems.

You could use cat -te <file> to verify that your file is tab-delimited and has proper line endings.

ADD REPLYlink modified 22 months ago • written 22 months ago by Alex Reynolds28k

No there are not edited in MsOffice at all, learned my lesson with that a while back. However when I do:

cat -te <file>

I get a jumble that doesn't look organized. This is confirmed with:

wc -l <file>

Which shows me that I have zero lines in my file though I have a bunch there. It may have something to do that this is an export from R and may not be perfectly processed as a BED file. Is there a quick way to take such a file and format it for BED format? Or see what the issue is? It has the right number of columns and looks TAB delimited. Thanks.

ADD REPLYlink written 22 months ago by rbronste240

On re-reading, if wc -l returns no lines, that's definitely odd. Instead of posting the result of cat -te | head, which will likely just spit out the entire file, can you please post what you're using in R to export your data?

ADD REPLYlink written 22 months ago by Alex Reynolds28k

I am using outputting a BED file (or maybe now it seems as though its BED-file like) from DiffBind in the following way:

> report <- dba.report(tamoxifen, th=.05,                  DataType=DBA_DATA_FRAME,method=DBA_EDGER)

> scores <- -10*(log10(report$FDR))

> sites  <- cbind(report[,1:3],rownames(report),scores)

> gains  <- report$Fold > 2
> losses <- report$Fold < -2

​> write.table(sites[gains,],
              "DBsitesGains.bed", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)

> write.table(sites[losses,],
              "DBsitesLosses.bed", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)
ADD REPLYlink written 22 months ago by rbronste240

I don't see anything obvious that suggests a problem. If you want to post DBsitesGains.bed or the losses file somewhere I can download them, I can take a look at those directly.

ADD REPLYlink written 22 months ago by Alex Reynolds28k

Actually I just figured out its only running the first line of my regions.bed file against the refSeq.bed, and giving me that one closest gene. Strange.

ADD REPLYlink written 22 months ago by rbronste240

One quick question on this approach, if in my regions.bed there is a column indicating the FDR of these particular called regions, is there a good way to maintain that column in the final nearest-refseq-IDs.txt output here? Thanks!

ADD REPLYlink written 8 months ago by rbronste240
1

If you know how many columns are in regions.bed, and that number is consistent throughout the entire regions file, you can take out the --no-ref option and cut two columns.

For example, say your regions.bed contains five columns and the FDR is in the score column (fifth column, per BED specification). The refSeq IDs are still in the fourth column of the annotation file.

Here, you could cut from the output the fifth column (FDR) and the ninth column (5+4=9, refSeq ID):

$ closest-features --closest regions.bed <(gff2bed < refseq.gff) | cut -f5,9 - > nearest-FDRs-with-refseq-IDs.txt

The output nearest-FDRs-with-refseq-IDs.txt will contain the FDR score from regions.bed and the refSeq ID of the refSeq element closest to that region.

You can adjust this approach, depending on your regions.bed file's characteristics and what columns you want.

ADD REPLYlink modified 8 months ago • written 8 months ago by Alex Reynolds28k

Thank you! Very helpful!

ADD REPLYlink written 8 months ago by rbronste240

One other quick question, how can I maintain a header through sort-bed and closest-features - as it gives me an error indicating my first line in closest-features is not a chromosome start even when I have the --header flag on. Thanks!

ADD REPLYlink written 8 months ago by rbronste240

If you're using --header and it isn't skipping the header, then that might be a bug. Please file an issue on the Github site if that's the case and we'll look into it.

Otherwise, you could strip the header manually and reapply it at the end via standard input/output streams:

$ closest-features --closest <(tail -n+2 regions.bed) <(gff2bed < refseq.gff) | cut -f5,9 - | cat <(head -1 regions.bed) - > nearest-FDRs-with-refseq-IDs.txt

That's pretty ugly, I guess, but it should work. I'm not sure if the header columns would match up with the two-column output in this example, but perhaps tweak this, depending on what columns you want.

ADD REPLYlink modified 8 months ago • written 8 months ago by Alex Reynolds28k
0
gravatar for Rohit
22 months ago by
Rohit1.3k
California
Rohit1.3k wrote:

I am not sure if you have tried, bedtools closest already does everything (apart from name extraction) what you need. You can use the -D or -d options with -ref (with respect to reference) to check which is the closest with/without direction involved.

ADD COMMENTlink modified 22 months ago • written 22 months ago by Rohit1.3k

Does this negate the need for a -b file and pulls straight from a reference genome? Or you still need a BED of TSSs or whole genes etc?

ADD REPLYlink written 22 months ago by rbronste240

You would need a bed file of gene boundaries for this

ADD REPLYlink modified 22 months ago • written 22 months ago by Rohit1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 907 users visited in the last hour