Cuffmerge Is Looking For Contig Fasta Files That I Do Not Have!
10.9 years ago
jobinv ★ 1.1k

I downloaded the full Homo_sapiens_Ensembl_GRCh37.tar.gz archive from iGenomes (a huge 17 GB file, but it contains everything else I've needed for the Tuxedo suite, from genomes to Bowtie indexes) to use with my RNA-Seq pipeline. I aligned the reads with TopHat, then ran Cufflinks to assemble the transcripts and estimate expression values. I am now trying to run cuffmerge on the results as described, using the following command from Python:

import os
# genes and refsequence_folder are paths defined earlier in the script
command = "cuffmerge -p 8 -o merged -g %s -s %s assembly_GTF_list.txt" % (genes, refsequence_folder)
os.system(command)

Here, "refsequence_folder" points to the "Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes" folder, which contains the following .fa files:

10.fa 11.fa 12.fa 13.fa 14.fa 15.fa 16.fa 17.fa 18.fa 19.fa 1.fa 20.fa 21.fa 22.fa 2.fa 3.fa 4.fa 5.fa 6.fa 7.fa 8.fa 9.fa MT.fa X.fa Y.fa
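For reference, the same cuffmerge call can be sketched with an argument list passed to subprocess instead of a shell string, which avoids quoting problems if a path contains spaces. The helper names build_cuffmerge_command and run_cuffmerge are hypothetical; the flags are the ones used in the command above.

```python
import subprocess


def build_cuffmerge_command(genes_gtf, ref_sequence,
                            gtf_list="assembly_GTF_list.txt",
                            threads=8, out_dir="merged"):
    """Return the cuffmerge call from the post as an argument list."""
    return [
        "cuffmerge",
        "-p", str(threads),   # number of threads
        "-o", out_dir,        # output directory
        "-g", genes_gtf,      # reference annotation GTF
        "-s", ref_sequence,   # reference sequence (folder or FASTA file)
        gtf_list,             # text file listing the assembled GTFs
    ]


def run_cuffmerge(genes_gtf, ref_sequence, **kwargs):
    """Run cuffmerge and raise CalledProcessError on a non-zero exit code."""
    cmd = build_cuffmerge_command(genes_gtf, ref_sequence, **kwargs)
    subprocess.run(cmd, check=True)
```

Unlike os.system, subprocess.run with check=True makes a failed cuffmerge run stop the pipeline instead of passing silently.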

My problem is that cuffmerge runs fine until it suddenly starts looking for .fa files that are not in this folder. Here is an excerpt from the warnings that I get:

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000191.1{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000192.1{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000193.1{.fa,.fasta}
[...]
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG1007_PATCH{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG1032_PATCH{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG104_HG975_PATCH{.fa,.fasta}
[...]
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR10_1_CTG2{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR10_1_CTG5{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR12_1_CTG1{.fa,.fasta}

My question: where do I find all these missing files, if even iGenomes does not provide them? Alternatively, how do I get cuffmerge to stop looking for them?
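To see in advance which files would be reported missing, a small check along these lines might help. It is only a sketch and rests on an assumption: that the names in the warnings come from the first (seqname) column of the GTF files given to cuffmerge, and that a sequence counts as present when the folder holds a file called name.fa or name.fasta. The function names are hypothetical.

```python
import os


def gtf_seqnames(gtf_path):
    """Collect the distinct sequence names from column 1 of a GTF file."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            # Skip comment lines and blank lines
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def missing_sequences(gtf_path, fasta_dir, extensions=(".fa", ".fasta")):
    """Return GTF sequence names with no matching FASTA file in fasta_dir."""
    missing = []
    for name in sorted(gtf_seqnames(gtf_path)):
        found = any(
            os.path.isfile(os.path.join(fasta_dir, name + ext))
            for ext in extensions
        )
        if not found:
            missing.append(name)
    return missing
```

Calling missing_sequences(genes, refsequence_folder) with the two paths from the command above would then list the contigs that still need a FASTA file.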

cuffmerge fasta • 4.7k views

I can add that I also attempted this by concatenating all the .fa files into a single hg19.fa file, and then providing cuffmerge with that file instead of the full folder. It wasn't quite that easy to fool cuffmerge :)


Which reference file did you use for the TopHat alignment?


I used the Bowtie2 index files for Ensembl from http://cufflinks.cbcb.umd.edu/igenomes.html...

10.9 years ago
JC 13k

You will find your sequences at Ensembl: ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/


Oh, perfect! Thanks!

10.9 years ago
jobinv ★ 1.1k

I'm aware that this is not actually an answer, but I did not want to start a new question for this purpose. Apologies if this is the wrong way to do this.

I tried the Ensembl sequences that JC linked to, which contain many of the sequences that cuffmerge was looking for. However, it still does not seem to find all the sequences it needs:

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000191.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000192.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000193.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000194.1{.fa,.fasta}

[...]

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000241.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000242.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000243.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000247.1{.fa,.fasta}

Any tips?


You need to split the sequences in ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.71.dna.nonchromosomal.fa.gz so that each one sits in its own *.fasta file under your directory /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/
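A minimal sketch of that split in Python, using only the standard library. It assumes that the file name cuffmerge expects is the first whitespace-delimited token of each FASTA header (e.g. GL000191.1); the function name split_fasta and its parameters are hypothetical.

```python
import gzip
import os


def split_fasta(multi_fasta, out_dir, extension=".fasta"):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Files are named after the first token of the header line.
    Plain and gzip-compressed (.gz) input are both accepted.
    Returns the list of paths written.
    """
    opener = gzip.open if multi_fasta.endswith(".gz") else open
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    try:
        with opener(multi_fasta, "rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    # A new record starts: close the previous output file
                    if out is not None:
                        out.close()
                    name = line[1:].split()[0]
                    path = os.path.join(out_dir, name + extension)
                    out = open(path, "w")
                    written.append(path)
                # Copy header and sequence lines unchanged
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

For the case in this thread, the call would be split_fasta("Homo_sapiens.GRCh37.71.dna.nonchromosomal.fa.gz", "/data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/") after downloading the file from the Ensembl FTP site.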
