Question: How can I count read that falls into exon-intron and exon-exon junction?
1
gravatar for biplab
16 months ago by
biplab100
University of California, Davis
biplab100 wrote:

I have gtf file as follows

VI  ensembl gene    53260   54696   .   -   .   gene_id "YFL039C"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding";
VI  ensembl transcript  53260   54696   .   -   .   gene_id "YFL039C"; transcript_id "YFL039C"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ACT1"; transcript_source "ensembl"; transcript_biotype "protein_coding";
VI  ensembl exon    54687   54696   .   -   .   gene_id "YFL039C"; transcript_id "YFL039C"; exon_number "1"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ACT1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "YFL039C.1";
VI  ensembl CDS 54687   54696   .   -   0   gene_id "YFL039C"; transcript_id "YFL039C"; exon_number "1"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ACT1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YFL039C";
VI  ensembl start_codon 54694   54696   .   -   0   gene_id "YFL039C"; transcript_id "YFL039C"; exon_number "1"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ACT1"; transcript_source "ensembl"; transcript_biotype "protein_coding";
VI  ensembl exon    53260   54377   .   -   .   gene_id "YFL039C"; transcript_id "YFL039C"; exon_number "2"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ACT1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "YFL039C.2";
VI  ensembl CDS 53260   54377   .   -   1   gene_id "YFL039C"; transcript_id "YFL039C"; exon_number "2"; gene_name "ACT1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ACT1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YFL039C";

How can I use this gtf file to make a bed file which I will be able to use for counting reads that falls into exon-exon and exon-intron junction?

sequencing rna-seq next-gen • 755 views
ADD COMMENTlink modified 16 months ago by Alex Reynolds28k • written 16 months ago by biplab100

Since it looks like your mapping to genomic sequences, how (and why?) would you count reads to fall into exon-exon? you will need some pretty long reads and/or very small introns to get into such a situation. Moreover any read that spans a exon-exon junction will be definition also span (2) exon-intron junctions.

I can understand the exon-intron part and then Alex Reynolds will work nicely

ADD REPLYlink written 16 months ago by lieven.sterck5.8k
5
gravatar for Alex Reynolds
16 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Steps via awk and BEDOPS.

Build a BED file of exons:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz \
    | gunzip -c - \
    | awk '($3=="exon")' - \
    | gtf2bed - \
    | cut -f1-6 - \
    > gencode.v28.annotation.exons.c1t6.bed

Convert transcripts to merged exons:

$ awk -f transcripts2mergedExons.awk gencode.v28.annotation.exons.c1t6.bed > gencode.v28.annotation.mergedExons.c1t6.bed

Make an exon-intron list:

$ awk -f mergedExons2exonIntronList.awk gencode.v28.annotation.mergedExons.c1t6.bed > gencode.v28.annotation.exonsAndIntrons.c1t6.bed

Convert these to junctions:

$ awk -f exonIntronList2JunctionList.awk gencode.v28.annotation.exonsAndIntrons.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.c1t6.bed

Pad them, e.g. by 25 nt around the junction (adjust as needed):

$ bedops --everything --range 25 gencode.v28.annotation.exonsIntronJunctions.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed

Map reads to the padded junctions:

$ bedmap --echo --count --delim '\t' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) > answer.bed

The file answer.bed will contain the junction and the number of reads that map to — overlap with — the junction, by one or more bases.

Some links to Github Gists with listed awk scripts:

  • (transcripts2mergedExons.awk)
  • (mergedExons2exonIntronList.awk)
  • (exonIntronList2JunctionList.awk)
ADD COMMENTlink written 16 months ago by Alex Reynolds28k
2

Thank you so much. I will try this.

ADD REPLYlink written 14 months ago by biplab100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1111 users visited in the last hour