Some questions about 3'UTR regions from rat 6.0 fasta and gtf files
4 months ago

I have run into a problem with miRNA-seq analysis and miRNA target gene prediction.

I know the basic workflow and I tried the first method:

I downloaded the 3'UTR FASTA files (version: rat 6.0) from Ensembl BioMart and from UCSC.

I ran miranda on Linux with each of these files and got a different number of predicted target genes each time.

However, I was not satisfied with the number of target genes in any of the results (using the parameters below):

miranda rno_DEGs.fasta "GCF_000001895.5_Rnor_6.0_3'UTR.fa" -sc 150 -en -30 -strict | grep ">>" > rno_VS_NCBI_END.txt


Can I relax the two thresholds, -sc 150 and -en -30? The defaults are -sc 140 and -en 1.
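One way to explore this is to run miranda once with permissive thresholds and then count the unique targets that survive several stricter cut-offs afterwards. The sketch below assumes the `>>` summary lines are tab-separated, with the miRNA, the target, total score, total energy, maximum score and maximum energy in the first six columns. That layout is an assumption to check against your own output, and the function names are hypothetical.

```python
def parse_hits(lines):
    """Parse miranda '>>' summary lines into (mirna, target, max_score, max_energy).

    Assumed column layout (verify against your own file):
    >>miRNA, target, total score, total energy, max score, max energy, ...
    """
    hits = []
    for line in lines:
        if not line.startswith(">>"):
            continue
        fields = line[2:].rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        try:
            max_score = float(fields[4])
            max_energy = float(fields[5])
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        hits.append((fields[0], fields[1], max_score, max_energy))
    return hits


def count_targets(hits, min_score, max_energy):
    """Count unique targets with score >= min_score and energy <= max_energy."""
    return len({target for _, target, score, energy in hits
                if score >= min_score and energy <= max_energy})


# Example usage (the file name comes from the command above):
# with open("rno_VS_NCBI_END.txt") as handle:
#     hits = parse_hits(handle)
# for sc, en in [(150, -30), (150, -20), (140, -20)]:
#     print(sc, en, count_targets(hits, sc, en))
```

Comparing the counts at a few cut-offs shows how sensitive the result is to each threshold before you commit to one setting.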

Here are the links to the genome FASTA and annotation files: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/895/GCF_000001895.5_Rnor_6.0/GCF_000001895.5_Rnor_6.0_genomic.fna.gz

## annotation file:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/895/GCF_000001895.5_Rnor_6.0/GCF_000001895.5_Rnor_6.0_genomic.gff.gz

So I tried a second method:

I used the rat 7.0 3'UTR FASTA file from Ensembl BioMart and ran the same workflow. To my surprise, this gave the largest number of target genes, which looked like a good result.

My question is this: my mRNA counts were generated against the rat 6.0 FASTA described above, so I don't know whether it is appropriate to use a different assembly version to predict target genes.

If not, I have another method in mind:

I think I can extract the 3'UTR sequences from the rat 6.0 FASTA and GTF files. However, I could not find 3'UTR coordinates, or any other 3'UTR information, in the 6.0 GTF file, and I don't know why. Because of this, I have no idea how to extract the 3'UTR sequences from the FASTA and GTF files.
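When an annotation has no explicit 3'UTR features, one possible workaround is to derive them: for each transcript, the 3'UTR is the exonic sequence downstream of the last coding base. The sketch below is a minimal illustration, assuming a GTF whose ninth column carries `transcript_id "..."` and whose feature types include `exon`, `CDS` and optionally `stop_codon`. The function names are hypothetical, and the output should be spot-checked against a few known genes.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def collect_features(gtf_lines):
    """Group exon and coding intervals (1-based, inclusive) by transcript."""
    transcripts = defaultdict(
        lambda: {"chrom": None, "strand": None, "exon": [], "cds": []})
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in ("exon", "CDS", "stop_codon"):
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        record = transcripts[match.group(1)]
        record["chrom"], record["strand"] = fields[0], fields[6]
        # The stop codon is counted as coding so it is excluded from the UTR.
        key = "exon" if fields[2] == "exon" else "cds"
        record[key].append((int(fields[3]), int(fields[4])))
    return transcripts


def three_prime_utr(exons, cds, strand):
    """Return exon pieces downstream of the coding region (1-based, inclusive)."""
    if not cds:
        return []  # non-coding transcript: no UTR by this definition
    pieces = []
    if strand == "+":
        cds_end = max(end for _, end in cds)
        for start, end in sorted(exons):
            if end > cds_end:
                pieces.append((max(start, cds_end + 1), end))
    elif strand == "-":
        cds_start = min(start for start, _ in cds)
        for start, end in sorted(exons):
            if start < cds_start:
                pieces.append((start, min(end, cds_start - 1)))
    return pieces


def utr_bed_rows(gtf_lines):
    """Build 6-column BED rows (0-based start) for every derived 3'UTR piece."""
    rows = []
    for tid, rec in collect_features(gtf_lines).items():
        for start, end in three_prime_utr(rec["exon"], rec["cds"], rec["strand"]):
            rows.append((rec["chrom"], start - 1, end, tid, 0, rec["strand"]))
    return rows


# Example usage (the file name is a placeholder):
# with open("your_annotation.gtf") as handle:
#     for row in utr_bed_rows(handle):
#         print(*row, sep="\t")
```

A 3'UTR that spans several exons comes out as several rows sharing one transcript ID, so the pieces need to be joined in transcript order before being used as a single target sequence.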

And here is the last method:

I contacted the sequencing company. They told me they use the whole genome FASTA file in place of the 3'UTR sequences to predict target genes. I still don't understand why they do it this way. Can I do the same?

I looked into many approaches, including biomaRt in R, but most of them did not work for me.

So I really hope somebody can give me some advice or suggest a method. Many thanks.

UTR miRNAseq
4 months ago
Shred ▴ 870

A quick script to extract just the 3' UTRs in BED format:

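import sys

# Print 3' UTR features from a GTF as 6-column BED:
# chrom, start, end, name, score, strand
with open(sys.argv[1], 'r') as gtf_file:
    for line in gtf_file:
        if line.startswith('#'):
            continue  # skip header/comment lines
        fields = line.rstrip().split('\t')
        if len(fields) > 6 and fields[2] == "three_prime_utr":
            start = int(fields[3]) - 1  # GTF is 1-based, BED starts are 0-based
            # BED keeps the strand in column 6; name and score are placeholders
            print(f"{fields[0]}\t{start}\t{fields[4]}\t.\t0\t{fields[6]}")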


Launch this with

python3 script.py your_annotation.gtf > three_prime_utr.BED


Then use bedtools getfasta to extract the FASTA sequences of those regions:

bedtools getfasta [OPTIONS] -fi <input FASTA> -bed three_prime_utr.BED
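To make the coordinate conventions concrete, here is a minimal pure-Python sketch of what the extraction step does. It assumes 0-based, half-open BED intervals and a FASTA small enough to hold in memory. The helper names `read_fasta` and `extract` are hypothetical and are not part of bedtools; for a whole genome, bedtools remains the more practical choice.

```python
# Complement table for reverse-complementing minus-strand intervals.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Read FASTA lines into a dict keyed by the first word of each header."""
    sequences = {}
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def extract(sequences, chrom, start, end, strand="+"):
    """Return the sequence of a BED interval (0-based start, end exclusive)."""
    subsequence = sequences[chrom][start:end]
    if strand == "-":
        # Minus-strand features are read from the reverse complement.
        subsequence = subsequence.translate(_COMPLEMENT)[::-1]
    return subsequence


# Example usage (file name and interval are placeholders):
# with open("genome.fa") as handle:
#     genome = read_fasta(handle)
# print(extract(genome, "chr1", 350, 400, "+"))
```

The strand matters for miRNA target prediction: a 3'UTR on the minus strand has to be reverse-complemented, otherwise the target sequence is the wrong one.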

Thanks, sir. I am not familiar with Python, but I saw the string "three_prime_utr" in the script. Does that mean I should look for "three_prime_utr" in the GTF?

That's how the 3'UTR regions are encoded inside the GTF file. https://www.ensembl.org/info/website/upload/gff.html
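A quick way to check whether a given GTF contains that label is to tally the values in its third column. This is a minimal sketch; `feature_counts` is a hypothetical helper name, and it assumes a tab-separated GTF/GFF with `#` comment lines. Annotations from other providers may use a different label or omit UTR features entirely, in which case the script above prints nothing.

```python
from collections import Counter


def feature_counts(lines):
    """Count how often each feature type (column 3) occurs in a GTF/GFF."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Example usage (the file name is a placeholder):
# with open("your_annotation.gtf") as handle:
#     for feature, n in feature_counts(handle).most_common():
#         print(feature, n)
```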