Question

RNA-seq of 3'-UTRs

4.2 years ago

ntsopoul ▴ 60

Hi, I would like to align trimmed and filtered FASTQ files to the mm9 (Mus musculus) reference genome with the STAR aligner in order to analyze 3'-UTRs.

My downstream applications did not work and I wonder whether I have used the right genome and the right annotation file.

I downloaded and used the following files from ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/:

Mus_musculus.GRCm38.dna_sm.primary_assembly.fa (DNA, FASTA)
Mus_musculus.NCBIM37.67.gtf

Would it be better to use the cDNA instead, i.e. Mus_musculus.NCBIM37.67.cdna.all.fa.gz (cDNA, FASTA)?

Thanks for helping

RNA-Seq Assembly rna-seq alignment • 1.4k views

4.2 years ago by ntsopoul ▴ 60

What downstream application did you use and what error did it produce? The cDNA would also contain 5'-UTRs and coding regions. I think it would be better to extract the 3'-UTR regions from the genome using the GTF file and run salmon on them to count the reads mapping to these 3'-UTRs.

You can also try featureCounts and specify "three_prime_utr" as the feature to count on the SAM/BAM files generated by STAR.
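A minimal sketch of the extraction step in Python, assuming the GTF really contains rows whose third column is "three_prime_utr" (feature types can differ between annotation releases, so it is worth checking the file first). The function name `utr_rows_to_bed` and the example rows are made up for illustration and are not taken from the real annotation.

```python
def utr_rows_to_bed(gtf_lines, feature="three_prime_utr"):
    """Collect GTF rows of one feature type as BED-like tuples.

    GTF coordinates are 1-based and inclusive, BED coordinates are
    0-based and half-open, so only the start is shifted by one.
    """
    bed = []
    for line in gtf_lines:
        # Skip header/comment lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A valid GTF row has 9 tab-separated columns; column 3 is the feature type
        if len(fields) < 9 or fields[2] != feature:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        bed.append((chrom, start - 1, end, strand))
    return bed


# Hypothetical example rows (assumption: tab-separated, Ensembl-style names)
example_gtf = [
    "#!genome-build example\n",
    '1\tsrc\texon\t100\t500\t.\t+\t.\tgene_id "G1";\n',
    '1\tsrc\tthree_prime_utr\t401\t500\t.\t+\t.\tgene_id "G1";\n',
    '2\tsrc\tthree_prime_utr\t50\t80\t.\t-\t.\tgene_id "G2";\n',
]

utr_bed = utr_rows_to_bed(example_gtf)
print(utr_bed)
```

If the function returns an empty list for the real GTF, the annotation does not label 3'-UTRs with that feature name, and counting on "three_prime_utr" would not find anything either.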

4.2 years ago by ashish ▴ 680

Hi, and thanks for the quick answer! The downstream application was APAlyzer (https://bioconductor.org/packages/release/bioc/html/APAlyzer.html), a program that determines which polyadenylation site of a gene is used (many genes have several polyadenylation sites). It takes your BAM files and compares them against a list of known APA sites to find out which polyA site was used by which gene. However, I get back a data frame full of NA and 0 values. The pipeline worked before with BAM files that I downloaded from GEO, so I wanted to repeat it with other files that I have in FASTQ format. I trimmed and filtered them with Trimmomatic, checked the quality, and aligned the reads with STAR. Up to this point it seems to work, but I was not sure whether I used the right genome for the alignment. Can't I use the whole genome for alignment?

4.2 years ago by ntsopoul ▴ 60

I think I found the problem. The list of known APA sites has chromosome names in UCSC format, e.g. "chr1", whereas my alignment produced chromosome names in Ensembl format, e.g. "1". Is there anything else I have to be aware of when using one format or the other?
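A small Python sketch of how such a naming mismatch can be detected and bridged. It only covers the primary chromosomes: besides the "chr" prefix, the mitochondrial genome is "MT" in Ensembl and "chrM" in UCSC, and unplaced scaffolds do not follow a simple prefix rule, so they are better handled with a proper lookup table. The function names and the name lists are hypothetical.

```python
def ensembl_to_ucsc(name):
    """Map an Ensembl-style primary chromosome name to UCSC style."""
    if name.startswith("chr"):
        return name        # already UCSC style, leave untouched
    if name == "MT":
        return "chrM"      # the mitochondrial genome is named differently
    return "chr" + name


def shared_names(names_a, names_b):
    """Return the sequence names present in both collections, sorted."""
    return sorted(set(names_a) & set(names_b))


# Hypothetical name lists: one from the alignment, one from the APA site list
bam_names = ["1", "2", "X", "MT"]
site_names = ["chr1", "chr2", "chrX", "chrM"]

before = shared_names(bam_names, site_names)
after = shared_names([ensembl_to_ucsc(n) for n in bam_names], site_names)
print(before)
print(after)
```

An empty overlap before conversion is the kind of situation in which a tool finds no reads at any known site and reports only NA or 0.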

Thanks!

4.2 years ago by ntsopoul ▴ 60

You have already used the whole genome for alignment, but you used Mus_musculus.GRCm38.dna_sm.primary_assembly.fa, which is a soft-masked genome assembly. If there is no particular reason to use a masked assembly, you should use Mus_musculus.GRCm38.dna.primary_assembly.fa, which has no masking.

There are some nomenclature differences between the files offered on different websites, but everything else is pretty much the same. Also, please use ADD COMMENT/ADD REPLY to post updates.

4.2 years ago by ashish ▴ 680

Thanks a lot! Where do you think the masking could cause problems?

4.2 years ago by ntsopoul ▴ 60

I was wrong above in suggesting to use the unmasked genome. See this.

STAR does not discriminate between a soft-masked (lowercase) and an unmasked genome, so it should not make any difference. I am not sure about other aligners.
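To illustrate what soft masking means, here is a short Python sketch with a made-up sequence: masked (repeat) bases are written in lowercase but are otherwise unchanged, so uppercasing the sequence gives back the unmasked one. This differs from hard masking, where the bases are replaced by N and the information is lost. The helper `lowercase_fraction` is hypothetical.

```python
def lowercase_fraction(seq):
    """Fraction of bases written in lowercase, i.e. soft-masked."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


# Made-up example: four soft-masked bases in the middle of the sequence
soft_masked = "ACGTacgtACNN"

# Uppercasing removes the soft mask without changing any base
unmasked = soft_masked.upper()

print(unmasked)
print(lowercase_fraction(soft_masked))
```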

4.2 years ago by ashish ▴ 680

ntsopoul, please stop using the answer field for discussions. Use ADD REPLY and ADD COMMENT instead; that keeps the thread logically organized.

4.2 years ago by ATpoint 84k