trouble: getting gene names from chromosome location using intersect function in bedtools
3 months ago
Sky ▴ 10

Hello, I am having trouble with a process that I thought was going to be very simple. I performed a DiffBind analysis with my ChIP-seq datasets. The output gave me the chromosome location but now I would like to know the gene names.

I downloaded a reference annotation from GENCODE using the following commands:

wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
gunzip gencode.v48.annotation.gtf.gz

I then tried to use the intersect command in bedtools (I know bedtools can handle the GTF format as long as it is tab-separated):

bedtools intersect -a my_file_of_interest.bed -b gencode.v48.annotation.gtf > output_with_gene_names.bed

But the output BED file still lists only the chromosome locations, with no gene names. Can anyone provide some guidance? I have also tried manipulating the reference dataset so that the first column is the chromosome, the second column is the start, and the third column is the end.

My file of interest has the following format:

chr16   4936776 4937176
chr12   52147884    52148284
chr21   41507488    41507888
chr1    31413259    31413659
chr13   34348350    34348750
chr1    94875031    94875431
chr2    113157454   113157854

The reference file looks like:

##description: evidence-based annotation of the human genome (GRCh38), version 48 (Ensembl 114)                             
##provider: GENCODE                             
##contact: gencode-help@ebi.ac.uk                               
##format: gtf                               
##date: 2025-01-19                              
chr1    HAVANA  gene    11121   24894   .   +   .   gene_id "ENSG00000290825.2"; gene_type "lncRNA"; gene_name "DDX11L16"; level 2; tag "overlaps_pseudogene";
chr1    HAVANA  transcript  11121   14413   .   +   .   gene_id "ENSG00000290825.2"; transcript_id "ENST00000832824.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-260"; level 2; tag "TAGENE";
chr1    HAVANA  exon    11121   11211   .   +   .   gene_id "ENSG00000290825.2"; transcript_id "ENST00000832824.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-260"; exon_number 1; exon_id "ENSE00004248723.1"; level 2; tag "TAGENE";
chr1    HAVANA  exon    12010   12227   .   +   .   gene_id "ENSG00000290825.2"; transcript_id "ENST00000832824.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-260"; exon_number 2; exon_id "ENSE00004248735.1"; level 2; tag "TAGENE";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000290825.2"; transcript_id "ENST00000832824.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-260"; exon_number 3; exon_id "ENSE00003582793.1"; level 2; tag "TAGENE";
chr1    HAVANA  exon    13453   14413   .   +   .   gene_id "ENSG00000290825.2"; transcript_id "ENST00000832824.1"; gene_type "lncRNA"; gene_name "DDX11L16"; transcript_type "lncRNA"; transcript_name "DDX11L16-260"; exon_number 4; exon_id "ENSE00004248730.1"; level 2; tag "TAGENE";
3 months ago
GenoMax 154k

If you just need gene names added to your BED file, you will first need to convert the GTF to BED, keeping only the fields of interest, and then run the intersect. See (with bedops) the post "Filter a BED file based on genome coordinates for gene names".
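
The linked post is not reproduced here, so the following is only a minimal sketch of the general idea, using plain awk rather than bedops. It assumes a GENCODE-style GTF in which gene records carry a gene_name attribute in column 9. The function name gtf_genes_to_bed and the demo file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one BED line (chrom, start, end, gene_name)
# for every "gene" record in a GENCODE-style GTF given as $1.
gtf_genes_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                  # skip the ## header lines
         $3 == "gene" {
             name = "NA"                # fallback if no gene_name attribute
             if (match($9, /gene_name "[^"]+"/))
                 name = substr($9, RSTART + 11, RLENGTH - 12)
             # GTF is 1-based inclusive; BED is 0-based half-open
             print $1, $4 - 1, $5, name
         }' "$1"
}

# Tiny demo using the first gene record shown in the question.
printf '##format: gtf\nchr1\tHAVANA\tgene\t11121\t24894\t.\t+\t.\tgene_id "ENSG00000290825.2"; gene_type "lncRNA"; gene_name "DDX11L16"; level 2;\n' > demo.gtf
gtf_genes_to_bed demo.gtf > demo_genes.bed
cat demo_genes.bed
```

On the real annotation this would be run as gtf_genes_to_bed gencode.v48.annotation.gtf > genes.bed (genes.bed being a placeholder name), after which bedtools intersect -a my_file_of_interest.bed -b genes.bed -wa -wb should report each peak followed by the overlapping gene and its name. Without -wa/-wb, bedtools intersect reports only the intervals from the -a file, which would explain output that contains coordinates but no gene names.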

If you want the entire matching lines from the GTF, then the approach in the post "bedtools intersect mistakes" will work.
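
As a rough illustration of keeping the whole GTF line (not necessarily what the linked post does), the sketch below runs bedtools intersect with -wa and -wb on two one-line stand-in files. The demo file names are placeholders, and the step is skipped if bedtools is not installed.

```shell
#!/usr/bin/env bash
# One-line stand-ins for the real peak file and annotation; names are placeholders.
printf 'chr1\t12000\t12400\n' > demo_peaks.bed
printf 'chr1\tHAVANA\tgene\t11121\t24894\t.\t+\t.\tgene_id "ENSG00000290825.2"; gene_name "DDX11L16";\n' > demo_annot.gtf

if command -v bedtools >/dev/null 2>&1; then
    # -wa writes the original A (peak) line; -wb appends the overlapping B (GTF) line
    bedtools intersect -a demo_peaks.bed -b demo_annot.gtf -wa -wb > demo_overlaps.txt
    cat demo_overlaps.txt
else
    echo "bedtools not found on PATH; skipping the intersect step" >&2
fi
```

With the full GENCODE file, a single peak will usually match several records (gene, transcript, exon) for the same gene, so it may help to filter the GTF to rows whose third column is "gene" before intersecting.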
