Convert GFF3 to splice junction BED file, for filtering variants multi-sample (RNA-seq) VCF
1
1
Entering edit mode
3.0 years ago
William ★ 5.3k

I am searching for a way to filter/label variants from a RNAseq VCF file that are within 10bp to splice junctions defined in a GFF3 gene model.

This is similar to how bcbio filters variants that are close to the splice junctions that were identified by STAR.

https://github.com/bcbio/bcbio-nextgen/blob/3cc2fa286903662a14baf411706dc8d9590355dc/bcbio/ngsalign/star.py#L291

https://github.com/bcbio/bcbio-nextgen/blob/3cc2fa286903662a14baf411706dc8d9590355dc/bcbio/rnaseq/variation.py#L186

But I just have the BAM files and the multi-sample VCF file. Not the (splice junction) sj.out intermediate files created by STAR.

Is there a good way to convert a GFF3 to a splice site BED file? So I can use that for filtering / labeling variants close to splice sites?

I do realize that this bed file would not contain "new/alternative" splice junctions found by STAR.

rna splice-junction vcf • 1.2k views
ADD COMMENT
1
Entering edit mode
3.0 years ago
William ★ 5.3k

Found what I needed in an R pacakge and in another Biostars question/answer.

How To Write Data In A Granges Object To A Bed File.

https://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html

Resulting script.

Somehow the start positions in the granges object already where 0 based, so did not subtract 1bp extra for the start positions.

# Use GenomicFeatures to get the introns from a GFF3 or GTF
library(GenomicFeatures)
txdb <- makeTxDbFromGFF('input.gtf')
all.introns <- intronicParts(txdb)

gr <- all.introns

# Take the start of the introns and add 10bp slop   
df_start_splice_slop_10 <- data.frame(seqnames=seqnames(gr),
                 starts=start(gr)-10,
                 ends=start(gr)+10,
                 names=c(rep(".", length(gr))),
                 scores=c(rep(".", length(gr))),
                 strands=strand(gr))

# Take the ends of the introns and add 10bp slop 
df_stop_splice_slop_10 <- data.frame(seqnames=seqnames(gr),
                                      starts=end(gr)-10,
                                      ends=end(gr)+10,
                                      names=c(rep(".", length(gr))),
                                      scores=c(rep(".", length(gr))),
                                      strands=strand(gr))

# Concat the start and stop splice sites dataframes
df_splice_slop_10 <- rbind(df_start_splice_slop_10, df_stop_splice_slop_10)

# Format non-scientific numbers, without leading/trailing whitespace
df_splice_slop_10$starts <- format(df_splice_slop_10$starts, scientific = FALSE, trim = TRUE)
df_splice_slop_10$ends <- format(df_splice_slop_10$ends, scientific = FALSE, trim = TRUE)

# Write to bed file
write.table(df_splice_slop_10, file="input_gene_model_splice_slop_10.bed", quote=F, sep="\t", row.names=F, col.names=F)

Use bedtools sort and merge to merge the overlapping slop regions in the bed file caused by very small introns.

Then use bcftools filter to label variants within 10bp of splice junctions using this bed file.

ADD COMMENT

Login before adding your answer.

Traffic: 2026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6