6.3 years ago
baunruh
I downloaded the "cDNA" FASTA and the GFF3 for Mus musculus from the Ensembl website: https://www.ensembl.org/info/data/ftp/index.html
I built a Bowtie 2 index using:
bowtie2-build Mus_musculus.GRCm38.90.fa Mus_musculus.GRCm38.90
Then I tried to run tophat2:
tophat -p 2 --b2-L 15 -G Mus_musculus.GRCm38.90.gff3 -o testmap_gtf Mus_musculus.GRCm38.90.fa testmap.fastq
This runs perfectly fine without the GFF3 file but fails with it. I did a bit of research, and most people say this happens when the annotation does not match the reference. However, I downloaded both from the exact same source, so I don't see why they would differ or how I could check. Could somebody walk me through this, please?
This is the error I got:
[2017-11-17 14:01:37] Beginning TopHat run (v2.0.9)
-----------------------------------------------
[2017-11-17 14:01:37] Checking for Bowtie
Bowtie version: 2.1.0.0
[2017-11-17 14:01:37] Checking for Samtools
Samtools version: 0.1.19.0
[2017-11-17 14:01:37] Checking for Bowtie index files (genome)..
[2017-11-17 14:01:37] Checking for reference FASTA file
Warning: Could not find FASTA file /home/baunruh/RibosomeTEData/cDNAReference/Mus_musculus.GRCm38.90.fa.fa
[2017-11-17 14:01:37] Reconstituting reference FASTA file from Bowtie index
Executing: /apps/packages/bio/bowtie2/2.1.0/bowtie2-inspect /home/baunruh/RibosomeTEData/cDNAReference/Mus_musculus.GRCm38.90.fa > /home/baunruh/RibosomeTEData/testmap_gtf/tmp/Mus_musculus.GRCm38.90.fa.fa
[2017-11-17 14:01:41] Generating SAM header for /home/baunruh/RibosomeTEData/cDNAReference/Mus_musculus.GRCm38.90.fa
format: fastq
quality scale: phred33 (default)
[2017-11-17 14:01:49] Reading known junctions from GTF file
[2017-11-17 14:02:03] Preparing reads
left reads: min. length=25, max. length=34, 23328414 kept reads (32420 discarded)
[2017-11-17 14:04:26] Building transcriptome data files..
[2017-11-17 14:04:40] Building Bowtie index from Mus_musculus.GRCm38.90.fa
[FAILED]
Error: Couldn't build bowtie index with err = 1
Please do not use TopHat for new projects unless you have an absolute need to. Use STAR, BBMap, or HISAT2, which are newer, recommended programs.
If you are going to use an annotation file, then you should not use just the cDNA sequences. You should get the sequence of the full genome, because the coordinates in your GTF/GFF3 file refer to the entire genome.
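One way to check whether an annotation matches a reference is to compare the sequence names in the FASTA headers with the names in column 1 of the annotation. The sketch below is only an illustration: missing_seqnames is a hypothetical helper name, and it assumes the sequence name is the first whitespace-delimited word of each FASTA header and that the annotation is tab-delimited.

```shell
# Hypothetical helper (not part of TopHat or Bowtie): print every sequence
# name used in the annotation that has no matching FASTA header.
missing_seqnames() {
    fasta="$1"
    annotation="$2"
    fasta_ids=$(mktemp)
    annot_ids=$(mktemp)

    # FASTA side: first word of each header line, without the leading '>'.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > "$fasta_ids"

    # Annotation side: column 1 of every non-comment, tab-delimited line.
    awk -F'\t' '!/^#/ && NF > 1 { print $1 }' "$annotation" | sort -u > "$annot_ids"

    # comm -13 keeps only the lines unique to the second file,
    # i.e. annotation names that are absent from the FASTA.
    comm -13 "$fasta_ids" "$annot_ids"

    rm -f "$fasta_ids" "$annot_ids"
}
```

Calling it as missing_seqnames Mus_musculus.GRCm38.90.fa Mus_musculus.GRCm38.90.gff3 would list the names found only in the GFF3. With a cDNA FASTA the headers are transcript IDs while the GFF3 uses chromosome names, so most or all annotation names would be expected to show up as missing.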
Sorry! I am new to this. Apparently I was supposed to align the reads to the cDNA reference without the GTF, then align the accepted_hits to the entire genome, because I only want reads aligned to the cDNA. Does this sound right?
You could do it two ways. Either just align to the cDNA without using the GTF, or use the whole genome and GTF with a special initial TopHat run that builds the transcriptome-specific sequence index. Read about the second method on the TopHat manual page (the --transcriptome-index <dir/prefix> part of the "Using TopHat" section).
In the tophat2 command, try leaving off the .fa extension.
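The two-step --transcriptome-index approach mentioned above could look roughly like the sketch below. It prints the two commands instead of running them, so they can be inspected first. The function name tophat_two_step, the transcriptome_data/known prefix, and the genome index name in the usage note are placeholders chosen for illustration, and the genome index is assumed to have been built from the full genome rather than from the cDNA FASTA.

```shell
# Hypothetical wrapper: print the two TopHat invocations for the
# --transcriptome-index workflow without executing anything.
tophat_two_step() {
    annotation="$1"    # GTF/GFF3 with genome coordinates
    genome_index="$2"  # Bowtie 2 index prefix of the full genome (no .fa)
    reads="$3"         # FASTQ file to align
    tx_index="transcriptome_data/known"  # placeholder <dir/prefix>

    # Run 1: build the transcriptome index from the annotation (no reads given).
    echo "tophat -G $annotation --transcriptome-index=$tx_index $genome_index"

    # Run 2: align the reads, reusing the prebuilt transcriptome index.
    echo "tophat -p 2 -o testmap_gtf --transcriptome-index=$tx_index $genome_index $reads"
}
```

For example, tophat_two_step Mus_musculus.GRCm38.90.gff3 genome_index testmap.fastq prints both command lines, where genome_index stands for whatever prefix was passed to bowtie2-build for the genome FASTA.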