Question

Should I use the GENCODE annotation for RNA-seq alignment, considering the pseudoautosomal regions in the genome annotation file?

2.1 years ago

ccfpwll • 0

Update: My question now becomes: How will unlocalized sequences in the genome annotation file affect RNA-seq analysis?

I have read more about genome annotation and checked the gene counts output from featureCounts. I found that the .gtf file (GENCODE primary assembly) contained 120 more lines than the count matrix. I used gene_name to group counts, so reads assigned to different gene IDs that share the same gene symbol were probably summed together.
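To see which gene symbols are affected, one option is to list every gene_name that is attached to more than one gene_id in the GTF. The following is only a minimal sketch: the function names are my own, the toy records use made-up IDs, contig names, and coordinates, and it assumes GENCODE-style `key "value";` attributes in column 9.

```python
import re
from collections import defaultdict

def gene_ids_by_name(gtf_lines):
    """Map each gene_name to the set of (gene_id, contig) pairs found in 'gene' records."""
    by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        # Parse quoted attributes such as: gene_id "X"; gene_name "Y";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            by_name[attrs["gene_name"]].add((attrs["gene_id"], fields[0]))
    return by_name

def duplicated_names(by_name):
    """Keep only the gene names that correspond to more than one gene_id."""
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}

# Toy records (hypothetical IDs, contigs, and coordinates, for illustration only)
toy_gtf = [
    'chr4\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A.1"; gene_type "protein_coding"; gene_name "Ccl27a";',
    'contig_unloc_1\tENSEMBL\tgene\t50\t700\t.\t-\t.\tgene_id "GENE_B.1"; gene_type "protein_coding"; gene_name "Ccl27a";',
    'chr1\tHAVANA\tgene\t10\t500\t.\t+\t.\tgene_id "GENE_C.1"; gene_type "protein_coding"; gene_name "Xkr4";',
]

dups = duplicated_names(gene_ids_by_name(toy_gtf))
print(dups)
```

On a real annotation, the lines would come from the opened .gtf file instead of `toy_gtf`. Reporting the contig next to each gene_id makes it easy to see whether the extra copies sit on unlocalized sequences.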

This leads me to another question: for genes that exist on both an assembled chromosome and an unlocalized sequence (not haplotypes), will the read mapping and quantification be accurate? Can I trust the results? (I am doing RNA-seq analysis, using STAR as the aligner and featureCounts to assign reads.)

One such example gene is Ccl27a (using GENCODE vM25, GRCm38/mm10).

In the STAR manual, it's recommended to use the GENCODE annotation.

Examples of acceptable genome sequence files:
• ENSEMBL: files marked with .dna.primary.assembly, such as: ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
• GENCODE: files marked with PRI (primary). Strongly recommended for mouse and human: http://www.gencodegenes.org/.

However, I noticed that the GENCODE FAQ page says:

The only exception is that the genes which are common to the human chromosome X and Y PAR regions can be found twice in the GENCODE GTF, while they are shown only for chromosome X in the Ensembl file.

Does this mean that the X and Y PAR regions are now "repeats" in the GTF annotation file, and that reads mapped to these regions by STAR will be considered multi-mapping (and then discarded by tools like featureCounts during gene quantification)?
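One way to check this empirically is to look at the NH tag in the alignments: STAR writes NH (the number of reported loci for a read) by default, and featureCounts skips reads flagged as multi-mapping unless the -M option is given. Below is a minimal sketch that splits read names into unique and multi-mapping based on NH. It works on plain SAM text lines; the helper names and the toy alignments (including the contig name) are made up for illustration.

```python
def nh_value(sam_line):
    """Return the NH:i value of a SAM alignment line, or None if the tag is absent."""
    # Optional tags start after the 11 mandatory SAM columns
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NH:i:"):
            return int(field[5:])
    return None

def split_by_multimapping(sam_lines):
    """Return (unique_read_names, multi_read_names) as sorted lists."""
    unique, multi = set(), set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        name = line.split("\t", 1)[0]
        nh = nh_value(line)
        if nh is not None and nh > 1:
            multi.add(name)   # read reported at more than one locus
        else:
            unique.add(name)  # NH == 1 (or no NH tag present)
    return sorted(unique), sorted(multi)

# Toy alignments (hypothetical reads and contig name, for illustration only)
toy_sam = [
    "@HD\tVN:1.4\tSO:coordinate",
    "read1\t0\tchr4\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\tHI:i:1",
    "read2\t0\tchr4\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2\tHI:i:1",
    "read2\t256\tcontig_unloc_1\t50\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2\tHI:i:2",
]

unique_reads, multi_reads = split_by_multimapping(toy_sam)
print(unique_reads, multi_reads)
```

For a real BAM, the lines could be streamed from `samtools view` restricted to the region of the gene in question. A high fraction of NH > 1 reads over a gene would suggest that its default featureCounts count is an underestimate.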

There are probably not many genes in that region, but I do feel puzzled and would appreciate any clarification.

Thanks!

RNA-seq GenomeAnnotation unlocalizedSequence • 857 views

ADD COMMENT • link 2.1 years ago by ccfpwll • 0

I am just thinking aloud, but in the reference genome it will appear twice anyway regardless of the GTF, right? So it is a multimapper anyway, no?

ADD REPLY • link 2.1 years ago by ATpoint 81k

Yes, it will. For the mouse genome, I found a little over 100 such genes. Most of them still receive raw counts, although very few. However, the raw counts of some of these genes are not small at all (compared to the mean value across samples).

I checked a few of the gene symbols that correspond to multiple gene IDs in a genome browser. Some have very low genome mappability (for example, Ccl27a has pseudogenes, and some have sequences that BLAST to other genes), but others do have good mappability.

I am still unsure whether, in general, I can trust the counts for these genes.

(Unfortunately, I've deleted the files that could be used to visualize the mapping. I will check the mapping next time, when they are available.)

ADD REPLY • link 2.1 years ago by ccfpwll • 0