Question

Forum: Rfam v12 newly released. Doubts: how do you approach it?

9.6 years ago

margxenscienculo ▴ 50

Hi everyone. The new version of Rfam, v12, was released last Friday.

ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/

http://rfam.xfam.org/help#tabview=tab0

I have read the README, but I still have several doubts. Could someone briefly explain how they carry out the different steps? How do you approach the problem?

For example, how do you start? Do you use the latest script from Rfam v11 (rfam_scan.pl)? Or do you download the whole rfamseq.txt? And how do you obtain Rfam.fasta now that it is no longer available? I would appreciate any advice, big or small.

Below I quote the relevant parts of the README.

(...)

(3) As of Rfam 12.0, we no longer provide FULL alignments for each family. As
the size of our full alignments grew, the overheads involved in creating,
storing and manipulating them became too great to support. Instead, we provide
full region lists, which contain the ENA sequence accession, start/end
coordinates and bitscore for each hit to a family. If you wish to build a FULL
alignment equivalent to those supplied in previous releases of Rfam, you may do
so by downloading the CM for a given family and the Rfam sequence database,
RFAMSEQ (or indeed you may choose to use your own set of sequences). This means
that some sections of the website are no longer available, such as the option to
download or view the full alignment for a given family.

(...)

7) Due to the increasing size of the nucleotide sequence databases
    and the resulting increase in the size of our alignments we are
    now unable to provide complete sequence alignments and trees for
    our 5 largest families tRNA (RF00005), SSU (RF00177, RF01959,
    RF01960) and ultra conserved element uc_338 (RF02271). For these
    families we have provided a full alignment that is composed of
    SEED and genome sequences only. The entries for these families in
    the files: Rfam.fasta, Rfam.full and Rfam_full.tree are based on
    these reduced genome alignments.  We do however provide a fasta
    file containing the complete WGS+STD annotations for each family
    on our ftp site (see below for release files). The number of
    sequences annotated in the reduced genome alignments and complete
    WGS_STD alignments:

               genome_alignment    WGS_STD_alignment
    RF00005    298470              2106268
    RF00177    7429                744528
    RF01959    7394                881056
    RF01960    425                 65901
    RF02271    857                 229907

(...)
4. FILES
As of Rfam 12.0
---------------
README                 - this file
COPYING                - some legal things
USERMAN                - a description of the Rfam flatfile formats
Rfam.tar.gz            - a concatenated set of Rfam covariance models in ascii INFERNAL 1.1 format
Rfam.seed.gz           - annotated seed alignments in STOCKHOLM format
Rfam.full_region.gz    - list of sequences which make up the full family
                         membership for each family. Fields are as follows:
                         1. RF00001 is the Rfam accession
                         2. EU093378.1 is the EMBL accession and version number
                         3. Start coordinate of match on sequence
                         4. End coordinate of match on sequence
                         5. Bitscore
                         6. E-value
                         7. CM start position
                         8. CM end position
                         9. If match is a truncated match to CM, this field is 1
                         10. Type is either seed or full
Rfam.seed_tree.tar.gz  - annotated tree files for each seed alignment [tarbomb]
Rfam.pdb.gz            - tab delimited mappings of pdb seqs to Rfam families.

database_files:
        alignment_and_tree.txt.gz
        clan.txt.gz
        clan_database_link.txt.gz
        clan_literature_reference.txt.gz
        clan_membership.txt.gz
        database_link.txt.gz
        db_version.txt.gz
        dead_clan.txt.gz
        dead_family.txt.gz
        family.txt.gz
        family_literature_reference.txt.gz
        family_ncbi.txt.gz
        features.txt.gz
        full_region.txt.gz
        html_alignment.txt.gz
        keywords.txt.gz
        literature_reference.txt.gz
        matches_and_fasta.txt.gz
        motif.txt.gz
        motif_database_link.txt.gz
        motif_family_stats.txt.gz
        motif_file.txt.gz
        motif_literature.txt.gz
        motif_matches.txt.gz
        motif_pdb.txt.gz
        motif_ss_image.txt.gz
        pdb_full_region.txt.gz
        rfamseq.txt.gz
        secondary_structure_image.txt.gz
        seed_region.txt.gz
        sunburst.txt.gz
        tables.sql
        taxonomy.txt.gz
        taxonomy_websearch.txt.gz
        version.txt.gz

(...)
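
For anyone who wants to work with the full region list directly, the ten fields listed in the README excerpt above could be read with something like the following. This is only a minimal sketch, not an official Rfam tool: it assumes the decompressed file is tab-delimited with one hit per line, the names `FullRegion` and `parse_full_region_line` are hypothetical, and the sample line uses made-up values. The truncation flag is kept as a string because the README only says it is 1 for a truncated match.

```python
from typing import NamedTuple


class FullRegion(NamedTuple):
    """One hit from the full region list (field order as given in the README)."""
    rfam_acc: str      # 1. Rfam accession, e.g. RF00001
    seq_acc: str       # 2. EMBL accession and version, e.g. EU093378.1
    seq_start: int     # 3. start coordinate of match on sequence
    seq_end: int       # 4. end coordinate of match on sequence
    bit_score: float   # 5. bitscore
    evalue: float      # 6. E-value
    cm_start: int      # 7. CM start position
    cm_end: int        # 8. CM end position
    truncated: str     # 9. truncation flag, kept as text (assumption)
    region_type: str   # 10. "seed" or "full"


def parse_full_region_line(line: str) -> FullRegion:
    """Parse one line, assuming tab-delimited fields (an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return FullRegion(
        rfam_acc=fields[0],
        seq_acc=fields[1],
        seq_start=int(fields[2]),
        seq_end=int(fields[3]),
        bit_score=float(fields[4]),
        evalue=float(fields[5]),
        cm_start=int(fields[6]),
        cm_end=int(fields[7]),
        truncated=fields[8],
        region_type=fields[9],
    )


if __name__ == "__main__":
    # Hypothetical example line; the numeric values are invented.
    example = "RF00001\tEU093378.1\t100\t218\t85.3\t1.2e-15\t1\t119\t0\tfull\n"
    print(parse_full_region_line(example))
```

Filtering the parsed records by `rfam_acc` would give the accessions and coordinates needed to pull the matching regions for one family out of a sequence file.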

Infernal microRNA ncRNA Rfam RNA • 5.0k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by margxenscienculo ▴ 50

Hello margxenscienculo!

It appears that your post has been cross-posted to another site: SEQanswers.

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Devon Ryan 104k

OK. Better to annoy only one, then. :-/

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by margxenscienculo ▴ 50

I think you misunderstand the problem. It's not that the question itself is annoying, but that posting it twice may double the workload of people that are trying to help you. See Rule 8. Be Courteous to Other Forum Members, most relevantly:

One of the most impolite behaviors toward an online community is asking a question in multiple places at the same time. "Cross-posting", as this practice is called, can make two distinct online communities work through a solution for you when only one is needed; this is an abuse of forum members' time. If you have not received an answer and you believe that asking it in another place would get you one, provide a link back to the original discussion. Similarly, if you receive an answer in a different forum, report the answer to the original forum. Then, the people who helped you will know what the correct solution is and that you are no longer looking for it.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by matted 7.8k

score 0 · Answer 1 · 2017-01-04

FYI: For the Rfam.fasta.gz:

ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/README

This directory replaces the Rfam.fasta.gz file from previous releases. For each family there is an unaligned gzipped fasta file with all matching sequence regions from the Rfamseq database, which is derived from a WGS subset of ENA release 110.
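
If it helps, one of those per-family files could be read without decompressing it first. The following is a rough sketch under stated assumptions: the file name `RF00005.fa.gz` is a guess at the naming scheme (check the directory listing), and `read_fasta_gz` is a hypothetical helper, not part of any Rfam or Infernal tooling.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzipped FASTA file."""
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            else:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    import sys

    # File name is an assumption; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "RF00005.fa.gz"
    count = sum(1 for _ in read_fasta_gz(path))
    print(f"{path}: {count} sequences")
```

Reading the records lazily like this keeps memory use low, which matters for the largest families.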

score 0 · Answer 2 · 2017-05-16

6.9 years ago

gkuffel22 ▴ 100

I am also having some trouble using Rfam. I am on their FTP site, setting up miRanalyzer, and I would like to filter my mouse miRNA-Seq reads for ncRNAs and snoRNAs, but I am not sure which file I should download. Does anyone know which file I should grab? I see the fasta files, but there are so many.

ADD COMMENT • link 6.9 years ago by gkuffel22 ▴ 100

You should post this as a new question. Adding it as an answer on an old thread is not likely to get any traction.

ADD REPLY • link 6.9 years ago by GenoMax 141k