I want to quantify repeat expression in my RNA-seq data. I've obtained BAM files using TopHat, and now I need a GTF file for repeats to calculate the counts. Where can I download one?
Remember to get a GTF file that matches your genome. If your genome came from Ensembl, then you need to get the GTF from Ensembl as well; otherwise the chromosome identifiers may not match.
BTW: Repeat tracks are under the "Variation and Repeats" group in the UCSC Table Browser.
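To make the chromosome-identifier point concrete, here is a minimal Python sketch that compares the sequence names in a GTF with the reference names from your alignments (for example, the SN fields of the @SQ lines in the BAM header). The function names and the two example GTF records are made up for illustration; the only fact relied on is that UCSC uses names like "chr1" while Ensembl uses names like "1".

```python
def gtf_seqnames(gtf_lines):
    """Collect the sequence names (column 1) from GTF lines."""
    names = set()
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        names.add(line.split("\t")[0])
    return names


def compare_seqnames(gtf_lines, reference_names):
    """Report which GTF sequence names are present in, or missing from, the reference."""
    gtf_names = gtf_seqnames(gtf_lines)
    ref = set(reference_names)
    return {
        "shared": sorted(gtf_names & ref),
        "only_in_gtf": sorted(gtf_names - ref),
    }


# Hypothetical records: one UCSC-style name, one Ensembl-style name.
example_gtf = [
    'chr1\tmm10_rmsk\texon\t3000001\t3000097\t.\t-\t.\tgene_id "L1_Mus3"; transcript_id "L1_Mus3";',
    '1\texample\texon\t3000001\t3000097\t.\t-\t.\tgene_id "L1_Mus3"; transcript_id "L1_Mus3";',
]

# Hypothetical reference names, as they might appear in a UCSC-based BAM header.
result = compare_seqnames(example_gtf, ["chr1", "chr2"])
print(result)
```

If "only_in_gtf" is not empty, features on those sequences cannot match any alignment, so they would end up with zero counts.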
Go to UCSC and, under Tools > Table Browser, choose your genome and track (ideally RefSeq genes), then select "Output format: GTF".
I need repeats, so why RefSeq genes?
Is it correct to choose 'Variation and Repeats' as group and 'RepeatMasker' as track?
Yes. See my comment above.
Thank you! I used the mouse mm10 genome for TopHat and will use mm10 here again.
Did the genome come from UCSC or Ensembl or someplace else? Also keep in mind the "multi-hits" setting for TopHat. Since you are interested in repeats that setting may affect your results significantly.
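In TopHat the multi-hit limit is set with -g/--max-multihits. To get a feel for how much it matters for your data, you could look at what fraction of alignment records are multi-hits. The sketch below assumes the aligner writes the NH tag (number of reported alignments for the read), which as far as I know TopHat does, and that you have SAM text, e.g. from samtools view -h. The function name and the example records are made up.

```python
def multihit_fraction(sam_lines):
    """Return the fraction of mapped alignment records whose NH tag exceeds 1."""
    total = 0
    multi = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4: read is unmapped
        total += 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                if int(tag[5:]) > 1:
                    multi += 1
                break
    return multi / total if total else 0.0


# Hypothetical SAM records: four mapped (two with NH > 1) and one unmapped.
example_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r3\t16\tchr2\t300\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r4\t0\tchr2\t400\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

print(multihit_fraction(example_sam))  # 0.5
```

Note that this counts alignment records, not reads: a read reported at three locations contributes three records.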
Actually I downloaded an archive with genome, Bowtie2 indexes and other files here: ftp://ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/
So it is UCSC, as far as I understand.
That is correct. So you are fine with getting the repeats GTF from UCSC.
Thank you for your help!
I have a similar project, working on SSR repeats. Could you please tell me what your workflow is?
I simply use tophat2 to map the reads to the reference genome. Then I sort the reads using samtools and use htseq-count to obtain counts from the BAM file. At this stage I needed the GTF file we discussed here. You can then apply any normalization to the counts; I prefer DESeq. Let me know if you still have questions.
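Roughly, the mapping, sorting and counting steps above could be written out like this. It is only a sketch that builds and prints the command lines without running anything. All file names are placeholders, "-s no" assumes an unstranded library, "-g 20" mirrors what I understand to be TopHat's default, the samtools sort syntax is the one used by samtools 1.3 and later, and "-t exon -i gene_id" assumes the repeats GTF uses "exon" features with a gene_id attribute (check the third and ninth columns of your file).

```python
import shlex


def build_commands(index_prefix, fastq, repeats_gtf,
                   out_dir="tophat_out", max_multihits=20):
    """Build the tophat2 -> samtools sort -> htseq-count command lines."""
    hits = f"{out_dir}/accepted_hits.bam"  # TopHat's alignment output file
    sorted_bam = f"{out_dir}/accepted_hits.namesorted.bam"  # placeholder name
    return [
        # Map reads; -g caps the number of alignments reported per read.
        ["tophat2", "-o", out_dir, "-g", str(max_multihits), index_prefix, fastq],
        # Sort by read name so the order matches "-r name" below.
        ["samtools", "sort", "-n", "-o", sorted_bam, hits],
        # Count reads per feature; htseq-count writes the table to stdout.
        ["htseq-count", "-f", "bam", "-r", "name", "-s", "no",
         "-t", "exon", "-i", "gene_id", sorted_bam, repeats_gtf],
    ]


# Placeholder inputs: a Bowtie2 index prefix, a FASTQ file and the repeats GTF.
commands = build_commands("genome", "reads.fastq", "mm10_rmsk.gtf")
for cmd in commands:
    print(shlex.join(cmd))
```

One thing to keep in mind for repeats: as far as I know, htseq-count by default does not count reads it considers non-uniquely aligned, so check how it treats multi-hits before interpreting the counts. The normalization step (DESeq) is done in R and is not shown here.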
Thank you very much for your explanation. Since you mentioned "repeat" in the title of your question, I thought you had a specific method for surveying these regions. Now I see that you follow the standard approach.