Question: Cuffmerge Is Looking For Contig Fasta Files That I Do Not Have!
jobinv (Bergen, Norway) wrote 6.0 years ago:

I downloaded the full Homo_sapiens_Ensembl_GRCh37.tar.gz file from iGenomes (a huge file at 17 GB, but it contains everything else I have needed for the Tuxedo suite, from genomes to Bowtie indexes) to use with my RNA-Seq pipeline. I aligned the reads with TopHat, then followed up with Cufflinks to assemble the transcripts and estimate expression values. I am now trying to run cuffmerge on the results as described, using the following command from Python:

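import os
# genes: reference annotation GTF; refsequence_folder: folder of per-sequence FASTA files
command = "cuffmerge -p 8 -o merged -g %s -s %s assembly_GTF_list.txt" % (genes, refsequence_folder)
os.system(command)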

Here, "refsequence_folder" points to the "Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes" folder, which contains the following .fa files:

10.fa 11.fa 12.fa 13.fa 14.fa 15.fa 16.fa 17.fa 18.fa 19.fa 1.fa 20.fa 21.fa 22.fa 2.fa 3.fa 4.fa 5.fa 6.fa 7.fa 8.fa 9.fa MT.fa X.fa Y.fa
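
For reference, the same call can be written with an argument list instead of a formatted shell string, which avoids quoting problems if a path contains spaces. This is only a minimal sketch: the function names build_cuffmerge_command and run_cuffmerge are hypothetical, and the flags are simply the ones already used in the command above.

```python
import subprocess


def build_cuffmerge_command(genes, ref_sequence,
                            gtf_list="assembly_GTF_list.txt",
                            threads=8, out_dir="merged"):
    """Return the cuffmerge invocation as an argument list.

    Uses the same flags as the command in the question (-p, -o, -g, -s).
    """
    return [
        "cuffmerge",
        "-p", str(threads),
        "-o", out_dir,
        "-g", genes,
        "-s", ref_sequence,
        gtf_list,
    ]


def run_cuffmerge(genes, ref_sequence, **kwargs):
    """Run cuffmerge; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_cuffmerge_command(genes, ref_sequence, **kwargs),
                   check=True)
```

Calling run_cuffmerge(genes, refsequence_folder) would then be roughly equivalent to the os.system call, with the added benefit that a failed run raises an exception instead of passing silently.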

My problem is that cuffmerge runs fine until it suddenly starts looking for .fa files that are not in this folder. Here is an excerpt from the warnings that I get:

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000191.1{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000192.1{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/GL000193.1{.fa,.fasta}
[...]

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG1007_PATCH{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG1032_PATCH{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HG104_HG975_PATCH{.fa,.fasta}
[...]

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR10_1_CTG2{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR10_1_CTG5{.fa,.fasta}
Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/HSCHR12_1_CTG1{.fa,.fasta}
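
To get a compact list of the sequence names being requested, warnings of this form can be parsed from the captured output. The sketch below assumes the log text has already been saved or captured as a string; the names WARNING_RE and missing_sequence_names are hypothetical, and the pattern is based only on the warning format quoted here.

```python
import re

# Matches the warning format quoted in the question; the capture group is
# the sequence name, i.e. the last path component before "{.fa,.fasta}".
WARNING_RE = re.compile(
    r"cannot find genomic sequence file \S*/([^/\s{]+)\{\.fa,\.fasta\}"
)


def missing_sequence_names(log_text):
    """Return the sorted, de-duplicated sequence names found in the warnings."""
    return sorted(set(WARNING_RE.findall(log_text)))
```

The resulting list shows at a glance which groups of sequences (GL contigs, patches, alternate haplotype contigs) still need FASTA files.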

My question: where do I find all these missing files, if even iGenomes does not provide them? Alternatively, how do I get cuffmerge to stop looking for them?

Tags: fasta, cuffmerge

I can add that I also attempted this by concatenating all the .fa files into a single hg19.fa file, and then providing cuffmerge with that file instead of the full folder. It wasn't quite that easy to fool cuffmerge :)

Reply written 6.0 years ago by jobinv

Which reference file did you use for the TopHat alignment?

Reply written 6.0 years ago by Fedor Gusev

I used the Bowtie2 index files for Ensembl from http://cufflinks.cbcb.umd.edu/igenomes.html...

Reply written 6.0 years ago by jobinv
JC (Mexico) wrote 6.0 years ago:

You will find your sequences at Ensembl: ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/


Oh, perfect! Thanks!

Reply written 6.0 years ago by jobinv
jobinv (Bergen, Norway) wrote 6.0 years ago:

I'm aware that this is not actually an answer, but I did not want to start a new question for this purpose. Apologies if this is the wrong way to do this.

I tried using the Ensembl sequences that JC linked to, which contain many of the sequences that cuffmerge was looking for. However, it seems that cuffmerge still cannot find all the sequences it needs:

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000191.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000192.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000193.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000194.1{.fa,.fasta}

[...]

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000241.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000242.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000243.1{.fa,.fasta}

Warning: cannot find genomic sequence file /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/GL000247.1{.fa,.fasta}

Any tips?
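
One way to check this before rerunning cuffmerge is to compare the sequence names in a GTF file against the FASTA files actually present in the folder. The sketch below assumes the names cuffmerge asks for come from the first column (seqname) of the GTF, and that each sequence must exist as <name>.fa or <name>.fasta, as the warnings suggest; the function names are hypothetical.

```python
import os


def gtf_seqnames(gtf_path):
    """Collect the distinct values of column 1 (seqname) from a GTF file."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            # Skip comment/header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def sequences_without_fasta(gtf_path, fasta_dir, extensions=(".fa", ".fasta")):
    """Return GTF seqnames that have no <name>.fa or <name>.fasta in fasta_dir."""
    present = set()
    for filename in os.listdir(fasta_dir):
        for ext in extensions:
            if filename.endswith(ext):
                present.add(filename[: -len(ext)])
    return sorted(gtf_seqnames(gtf_path) - present)
```

An empty result for both the reference annotation and the assembled GTFs would suggest the folder is complete, at least under the assumptions above.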


You need to split the sequences in ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.71.dna.nonchromosomal.fa.gz so that each one is in its own *.fasta file under your directory /data/reference/Homo_sapiens/Ensembl/GRCh37/Sequence/Chromosomes/extended/

Reply written 6.0 years ago by JC
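
The split described in the reply above could be sketched in Python roughly as follows. This is not JC's code, just one possible approach: the function name split_fasta is hypothetical, and it assumes that the first whitespace-delimited token of each FASTA header is the name cuffmerge looks for (for example, a header beginning ">GL000191.1 " would produce GL000191.1.fasta).

```python
import gzip
import os


def split_fasta(multi_fasta_gz, out_dir, extension=".fasta"):
    """Write each record of a gzipped multi-FASTA to out_dir/<name><extension>.

    <name> is taken as the first whitespace-delimited token of the header
    line (an assumption about the header format). Returns the names written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    try:
        with gzip.open(multi_fasta_gz, "rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    # A new record starts: close the previous output file.
                    if out is not None:
                        out.close()
                    name = line[1:].split()[0]
                    out = open(os.path.join(out_dir, name + extension), "w")
                    written.append(name)
                # Lines before the first header (if any) are ignored.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

It would be called with the downloaded Homo_sapiens.GRCh37.71.dna.nonchromosomal.fa.gz as the first argument and the extended/ folder as the second. The file is read line by line, so memory use stays low even for large inputs.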