Question

A question for using data in ENCODE polyA plus RNA-seq for exon level expression count and never get a normal resultsrmal results

0

Entering edit mode

20 months ago

pxd • 0

Dear all, I use the data in ENCODE Reference genome and want to calculate the expression level for every exons to do the research about alternative splicing, but during the data processing for polyA plus RNA-seq, I met many problems and I guess the major problems is that I use the wrong gtf or fasta files so I checked the ENCODE bulk-RNA seq pipeline and try their ref but still don't work, the data I chose and methods are as follows: Data:ENCODE A569 cell lines polyA plus rnaseq, the link is https://www.encodeproject.org/experiments/ENCSR000CON/, to save time I chose the ENCFF123MIB processed bam data, for it can not be directly used for samtools index, I first sort it and the commands are as follows:

samtools sort -@ 8 ENCFF123MIB.bam > ENCFF123MIB.sort.bam

samtools index -@ 8 ENCFF123MIB.sort.bam

htseq-count -i gene_id -i exon_number --additional-attr gene_name -t exon --nonunique all ENCFF123MIB.bam gencode.v29.primary_assembly.annotation_UCSC_names.gtf >count_exon.tsv

And unfortunately these commands will end at at the time 2619444 GFF lines processed and then do not work.

So I think maybe I was wrong from the start so I download the raw data ENCFF000EJV.fastq and ENCFF000EJJ.fastq(two reads for one replicate) and start from the alignment, and the command I used are:

hisat2_extract_exons.py ../../gencoderef/gencode.v41.annotation.gtf > gencode.v41.annotation.exon 
hisat2_extract_splice_sites.py ../../gencoderef/gencode.v41.annotation.gtf > gencode.v41.annotation.ss
hisat2-build -p 6 --exon gencode.v41.annotation.exon --ss gencode.v41.annotation.ss gencode.v41.transcripts.fa index.gtf > hisat2_build.log 2>&1 &
hisat2 -p 8 -x index.gtf -1 ENCFF000EJJ.fastq -2 ENCFF000EJV.fastq -S ./A549_rep_1.sam > hisat2_rep1_sam.log 2>&1 &

And then I got the sam files but the accuracy is not so good and I use the same way for exon level analysis using htseq and samtools, but the results were so bad. The accuracy are as follows 113588758 reads; of these: 113588758 (100.00%) were paired; of these: 37999564 (33.45%) aligned concordantly 0 times 21929846 (19.31%) aligned concordantly exactly 1 time

53659348 (47.24%) aligned concordantly >1 times
----
37999564 pairs aligned concordantly 0 times; of these:
  102625 (0.27%) aligned discordantly 1 time
----
37896939 pairs aligned 0 times concordantly or discordantly; of these:
  75793878 mates make up the pairs; of these:
    63955134 (84.38%) aligned 0 times
    3339152 (4.41%) aligned exactly 1 time
    8499592 (11.21%) aligned >1 times

71.85% overall alignment rate

and the rusults for exon expression are Warning: xxxxx reads with missing mate encountered. almost most of the reads are missing. __no_feature 41053418 __ambiguous 0 __too_low_aQual 109403140 __not_aligned 19140617 __alignment_not_unique 106185767 I choose many data to do but every time I failed and want some help.

exon reference analysis htseq genome ENCODE • 579 views

ADD COMMENT • link updated 20 months ago by i.sudbery 19k • written 20 months ago by pxd • 0

score 0 · Answer 1 · 2022-08-05

I think the problem you are having here is that your are either selecting alignments to the transcriptome (in the case you downloaded the BAM file) or generating alignments to the transcriptome (in the case where you used hisat on the RAW data).

You are then using HTSeq to count the alignments to exons. However, HTSeq expects alignments to be to the genome, and in particular, I'd put money on the GFF you are passing to HTSeq having coordinates relative to the transcriptome, rather than to the genome.

So, to make your first command work, use ENCFF687OMX rather than ENCFF123MIB.

Or to go from the raw data, build your hisat index on hg38.fa, rather than gencode.v41.transcripts.fa.