Question

Finding Ribo-seq reads in annotated UTRs

0

Entering edit mode

6.1 years ago

jpcchoy • 0

Hi, everyone.

I would like to work out which of my Ribo-seq reads map to UTRs, but haven't been able to find any annotation files (GTF/GFF3) including UTRs or ORFs.

I was hoping to align my FASTQ reads against a Gencode.v30 reference, then take the SAM file and use htseq-count on it with a Gencode.v30 GTF/GFF3, but it looks as if those annotations don't include UTRs/ORFs. Have I misunderstood anything?

I would love if anyone could provide me with some pointers! I'm so sorry if I haven't found a piece of germane documentation. I've searched around a lot and can't find any tips.

Thanks so much.

Ribo-Seq UTR Alignment HTSeq Annotation • 1.6k views

ADD COMMENT • link 6.1 years ago by jpcchoy • 0

1

Entering edit mode

Are you sure Gencode annotation doesn't include UTRs?

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz
zcat gencode.v30.annotation.gtf.gz | grep "UTR" | head -n5

chr1  HAVANA  UTR 65419   65433   .   +   .   gene_id "ENSG00000186092.6"; transcript_id "ENST00000641515.2"; gene_type "protein_coding"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_name "OR4F5-202"; exon_number 1; exon_id "ENSE00003812156.1"; level 2; protein_id "ENSP00000493376.2"; tag "RNA_Seq_supported_partial"; tag "basic"; havana_gene "OTTHUMG00000001094.4"; havana_transcript "OTTHUMT00000003223.4";
chr1  HAVANA  UTR 65520   65564   .   +   .   gene_id "ENSG00000186092.6"; transcript_id "ENST00000641515.2"; gene_type "protein_coding"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_name "OR4F5-202"; exon_number 2; exon_id "ENSE00003813641.1"; level 2; protein_id "ENSP00000493376.2"; tag "RNA_Seq_supported_partial"; tag "basic"; havana_gene "OTTHUMG00000001094.4"; havana_transcript "OTTHUMT00000003223.4";
chr1  HAVANA  UTR 70006   71585   .   +   .   gene_id "ENSG00000186092.6"; transcript_id "ENST00000641515.2"; gene_type "protein_coding"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_name "OR4F5-202"; exon_number 3; exon_id "ENSE00003813949.1"; level 2; protein_id "ENSP00000493376.2"; tag "RNA_Seq_supported_partial"; tag "basic"; havana_gene "OTTHUMG00000001094.4"; havana_transcript "OTTHUMT00000003223.4";
chr1  ENSEMBL UTR 69055   69090   .   +   .   gene_id "ENSG00000186092.6"; transcript_id "ENST00000335137.4"; gene_type "protein_coding"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_name "OR4F5-201"; exon_number 1; exon_id "ENSE00002319515.2"; level 3; protein_id "ENSP00000334393.3"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.4";
chr1  ENSEMBL UTR 70006   70108   .   +   .   gene_id "ENSG00000186092.6"; transcript_id "ENST00000335137.4"; gene_type "protein_coding"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_name "OR4F5-201"; exon_number 1; exon_id "ENSE00002319515.2"; level 3; protein_id "ENSP00000334393.3"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.4";

ADD REPLY • link 6.1 years ago by h.mon 35k

0

Entering edit mode

Yes! I had not noticed that h.mon, but you're absolutely right. Unfortunately, I've run htseq-count in the following way and all of my sequences map to 0 features...

htseq-count -f bam -r pos sorted.bam gencode.v30.annotation.gtf

All I get is this:

__no_feature 1993081 
__ambiguous 0 
__too_low_aQual 31427753 
__not_aligned 2662011 
__alignment_not_unique 0

I got my sorted.bam from rsemoutput.transcript.bam, which I sorted with samtools sort. Any idea what might be going wrong? I haven't found any hint of what's going on in RSEM/SAMtools/HTSeq documentation, and would hugely appreciate any help.

ADD REPLY • link 6.1 years ago by jpcchoy • 0

1

Entering edit mode

If you want to count reads over utr region, try featureCounts -t UTR -f. I don't know if it will work, but its worth a shot.

ADD REPLY • link 6.1 years ago by h.mon 35k

0

Entering edit mode

I just tried this a few times with different parameters, and every time I get a

Segmentation fault (core dumped)

Do you know why this might be?

ADD REPLY • link 6.1 years ago by jpcchoy • 0

0

Entering edit mode

Context:

Process BAM file rsemout.transcript.sorted.bam...    
Single-end reads are included.  
Assign alignments to features...        
Segmentation fault (core dumped)