all coding regions .bed file hg38 Whole Genome Sequencing
2.1 years ago
cocchi.e89 ▴ 190

Quick question: is there a .bed (or similar) file out there that spans all coding regions in hg38 coordinates? I need to analyze some WGS samples.

Thank you very much in advance for any help!

wgs coding bed • 4.5k views

You could download the hg38 GTF file from GENCODE and extract relevant columns and records from it.

2.1 years ago
wget -O - "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz" |\
gunzip -c | grep 'transcript_type "protein_coding"' |\
awk '($3=="exon") {printf("%s\t%s\t%s\n",$1,int($4)-1,$5);}' |\
sort -T . -t $'\t' -k1,1 -k2,2n | bedtools merge

?


Thanks so much. I was wondering: what are the regions that are filtered out between line 2 and line 3 of the command (regions that are flagged as protein_coding but are not exons)?


You should download and take a look at the GTF file. Explore it so you understand what the above command does with real data.


I did, and I de-piped the command; mine was more of a theoretical question.


Right, then what are the unique values you see in $3 when transcript_type is protein_coding? That should tell you what non-exonic regions in protein coding transcripts are.
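One way to check this, as a minimal sketch: the helper name `feature_types` is hypothetical, and it assumes the GTF has already been decompressed and is piped in on stdin. It counts how often each value of column 3 occurs among records tagged as protein-coding transcripts.

```shell
# Hypothetical helper (not from the thread): reads an uncompressed GTF on stdin
# and counts each feature type (column 3) among protein-coding transcript records.
feature_types() {
  grep 'transcript_type "protein_coding"' |  # keep protein-coding transcript records
  cut -f 3 |                                 # column 3 holds the feature type
  sort |
  uniq -c                                    # one line per feature type, with its count
}

# Usage (the local file name is an assumption):
# gunzip -c gencode.v35.annotation.gtf.gz | feature_types
```

Anything listed besides exon is what the `$3=="exon"` test in the command drops.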


Don't the exon entries contain non-coding UTRs?
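They do: in the GENCODE GTF, the exon records of protein-coding transcripts cover the UTR portions as well, while the CDS records cover only the translated part. A minimal sketch of a stricter variant of the command above, keeping CDS records instead of exon records, could look like this. The helper name `gtf_cds_to_bed` is hypothetical, and it assumes an uncompressed GTF on stdin.

```shell
# Hypothetical helper (not from the thread): reads an uncompressed GTF on stdin and
# prints sorted BED intervals for CDS records of protein-coding transcripts.
gtf_cds_to_bed() {
  grep 'transcript_type "protein_coding"' |
  # GTF is 1-based and end-inclusive; BED is 0-based and end-exclusive, hence start-1
  awk -F '\t' '$3 == "CDS" { printf("%s\t%d\t%s\n", $1, $4 - 1, $5) }' |
  sort -k1,1 -k2,2n
}

# Usage, merging overlapping intervals as in the original command
# (the local file name is an assumption):
# gunzip -c gencode.v35.annotation.gtf.gz | gtf_cds_to_bed | bedtools merge > coding.bed
```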


For the future bioinformaticians who land on this page: please note that release 35 is no longer the latest release, so you may want to update the link to the gtf.gz file in the script above. You can find the latest release by inspecting http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/. For example, the current release (as of May 2022) is release 40, so the link is http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz.
By the way, the coordinates are on the human reference genome hg38 (i.e. GRCh38).
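To avoid editing the link by hand each time, the release number can be turned into a parameter. This is only a sketch: the helper name `gencode_gtf_url` is hypothetical, and it assumes other releases follow the same naming pattern as the release 35 and release 40 links quoted in this thread, which is worth checking in the directory listing.

```shell
# Hypothetical helper (not from the thread): prints the GENCODE human annotation
# GTF URL for a given release number, following the pattern of the v35 and v40 links.
gencode_gtf_url() {
  release="$1"
  printf 'http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_%s/gencode.v%s.annotation.gtf.gz\n' \
    "$release" "$release"
}

# Usage: feed the URL to the download step of the original command
# wget -O - "$(gencode_gtf_url 40)" | gunzip -c | ...
```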


Thank you for this. Can I ask, what is the best method to filter/extract only variants that appear in the exome (i.e. in the generated .bed file) from an annotated WGS vcf?


unrelated to the original question.


Apologies, I thought it was related as I imagined this is exactly what @cocchi.e89 was trying to do with the .bed file he was requesting.
