Forum: What are the most popular NGS tools?
7
5
6.1 years ago
Emily 23k

We're developing a tool in Ensembl that will allow you to get GTF/GFF/FASTA files of Ensembl data in different formats, such that you can use them for different kinds of NGS analysis. We know that all of these tools have slightly different requirements for the file, even though they claim to be a standard format!

To do this, we're looking to generate a list of the most popular tools that people use – we'll then look into their requirements and make sure our tool can provide what's needed. Please could you reply with the tools you use, one tool per post, and upvote any posts which already mention the tools you use.

ensembl format NGS Forum • 2.4k views
2

One can't really post a single tool name, as the post would be too short.

In general I have observed that scientists like to use meaningful gene names. One of the most common needs is "how do I go from an Ensembl gene/transcript ID to a descriptive name?"

A second important need is high-quality, reliable information rather than comprehensive, all-inclusive information. At the same time, it is important to understand the rules by which the data were created. For example, the annotation tracks in IGV are very informative, but I don't know on what basis they were selected.
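A minimal sketch of the ID-to-name lookup mentioned above, assuming an Ensembl-style GTF whose `gene` rows carry `gene_id` and `gene_name` attributes. The function name and the example row are illustrative, not part of any Ensembl tool.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_names_from_gtf(lines):
    """Map gene_id -> gene_name using the 'gene' rows of a GTF."""
    names = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id:
            # Fall back to the ID itself when no name is given.
            names[gene_id] = attrs.get("gene_name", gene_id)
    return names

# Illustrative row only; real files come from the Ensembl FTP site.
example = [
    "#!genome-build GRCh38\n",
    '17\tensembl_havana\tgene\t43044295\t43170245\t.\t-\t.\t'
    'gene_id "ENSG00000012048"; gene_name "BRCA1"; gene_biotype "protein_coding";\n',
]
print(gene_names_from_gtf(example))
```

The same attribute parsing works for `transcript` rows if `transcript_id` and `transcript_name` are read instead.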

1

It would be nice to be able to download (via FTP) mapping files for converting between common identifiers. I realize there is BioMart, but I've not had the best experience with it, and I ended up using the Ensembl MySQL database when trying to get the whole gene/transcript/protein set for a species.

2

@Emily: You should clarify your question to indicate if you are looking to get a list of all NGS tools or just those which have a strict dependency on reference data (which will be provided by Ensembl).

1

I thought that was self-evident.

2

Perhaps an edit to the title to clarify, "popular tools that require reference data"? Current title sounds like any NGS tool is acceptable.

2

At a minimum, Ensembl should provide a concatenated reference genome. I know there are "toplevel" FASTAs, but for human they contain ALT contigs, which most users wouldn't want to use for general mapping. It is interesting that no official database (I am talking about UCSC/NCBI/Ensembl) provides a concatenated GRCh37, which is partly why we see so many variants of GRCh37. GRC now provides a concatenated GRCh38. I hope Ensembl can do the same, as Ensembl and GRC/UCSC use different sequence naming.

1

Would these be offered as bundles (like what iGenomes provides)?
I don't understand the "GTF/GFF/FASTA in different formats" part.

2

Not bundles. You would choose whether you need FASTA, GFF or GTF (potentially expanding to more file types if there is a need), then which tool you intend to use, and it would produce a GFF (or whatever) file with the chromosome names formatted how you need them, the info fields filled in how you need them, and so on. Even though these are standard formats, it seems that everybody implements them differently, so we're trying to let you get them in the style your tool needs, hence wanting to know which tools people use.
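As one illustration of "chromosome names formatted how you need them", here is a rough sketch that rewrites column 1 of a GFF/GTF from Ensembl-style names (`1`, `X`, `MT`) to UCSC-style names (`chr1`, `chrX`, `chrM`). It is not how the Ensembl tool works; the function names are hypothetical, and the rule only covers primary chromosomes.

```python
def to_ucsc_name(seqid):
    """Convert an Ensembl-style primary chromosome name to UCSC style."""
    if seqid == "MT":
        return "chrM"
    if seqid in {"X", "Y"} or seqid.isdigit():
        return "chr" + seqid
    # Scaffolds and patches have no simple rule; leave them unchanged.
    return seqid

def rename_seqids(lines, rename=to_ucsc_name):
    """Yield GFF/GTF lines with column 1 rewritten.

    Comment and directive lines (starting with '#') pass through untouched,
    so directives that mention sequence names are NOT updated by this sketch.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        seqid, sep, rest = line.partition("\t")
        yield rename(seqid) + sep + rest

rows = ["#comment\n", "1\tsrc\tgene\t1\t10\n", "MT\tsrc\tgene\t5\t9\n"]
print("".join(rename_seqids(rows)), end="")
```

Passing a different `rename` callable (for example a dictionary's `get`-based lookup) would cover other naming schemes.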

5

In order to save a bunch of separate answers let me get some common tools out of the way (in no particular order).

BWA
Bowtie 1/2
BBMap
GSNAP
HISAT2
STAR
TopHat
Salmon
Kallisto
Bedtools
Bedops

  • It would also be nice if you could provide pre-formatted indexes for the aligners (since people seem to have trouble generating those). Effectively a bundle could be made (if these were also available).
  • "Known_transcriptome.fa" file for genomes (for those who want it, others can ignore it). A corresponding BED file to go with this.
  • "Known_transcriptome.fa" file with just the longest transcript sequence (another common request). A corresponding BED file to go with this.
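The "longest transcript per gene" request in the list above could be sketched roughly as follows, assuming a GTF with `exon` rows that carry `gene_id` and `transcript_id` attributes. "Longest" is taken here to mean the most exonic bases, which is one possible definition among several; the IDs in the example are made up.

```python
import re
from collections import defaultdict

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def longest_transcripts(gtf_lines):
    """Return {gene_id: transcript_id} for the transcript with the most exonic bases."""
    lengths = defaultdict(int)  # (gene_id, transcript_id) -> summed exon length
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(f[8]))
        key = (attrs["gene_id"], attrs["transcript_id"])
        # GTF coordinates are 1-based and inclusive, hence the +1.
        lengths[key] += int(f[4]) - int(f[3]) + 1
    best = {}
    for (gene, tx), length in lengths.items():
        # On ties, the transcript seen first is kept.
        if gene not in best or length > lengths[(gene, best[gene])]:
            best[gene] = tx
    return best

def exon(gene, tx, start, end):
    """Build a made-up exon row for the demonstration below."""
    return f'1\tdemo\texon\t{start}\t{end}\t.\t+\t.\tgene_id "{gene}"; transcript_id "{tx}";\n'

demo = [exon("G1", "T1", 1, 100), exon("G1", "T1", 201, 300),
        exon("G1", "T2", 1, 150), exon("G2", "T3", 10, 20)]
print(longest_transcripts(demo))
```

The resulting transcript IDs could then be used to filter a transcriptome FASTA and its BED file.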
1

Can it be set up so that the FASTA and GFF files use the same names for entries? NCBI will sometimes give only the RefSeq ID in the GFF, but in the FASTA file it will give the whole title (gbk/refseq/name/etc). Not a huge issue, but it is a little annoying.
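A quick way to spot this kind of mismatch is to compare the sequence IDs in the FASTA headers with column 1 of the GFF. The sketch below assumes the ID is the header text up to the first whitespace, which is how most aligners read it; the function names and example lines are illustrative.

```python
def fasta_ids(lines):
    """Collect FASTA IDs: header text after '>' up to the first whitespace."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids

def gff_seqids(lines):
    """Collect column-1 sequence names from GFF/GTF feature rows."""
    ids = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.rstrip("\n").split("\t", 1)[0])
    return ids

def compare_ids(fasta_lines, gff_lines):
    """Return (names only in the GFF, names only in the FASTA), both sorted."""
    fa, gff = fasta_ids(fasta_lines), gff_seqids(gff_lines)
    return sorted(gff - fa), sorted(fa - gff)

fasta = [">NC_000001.11 Homo sapiens chromosome 1\n", "ACGT\n",
         ">NC_000002.12 Homo sapiens chromosome 2\n", "ACGT\n"]
gff = ["##gff-version 3\n",
       "NC_000001.11\tRefSeq\tgene\t1\t4\t.\t+\t.\tID=g1\n",
       "NC_000003.12\tRefSeq\tgene\t1\t4\t.\t+\t.\tID=g2\n"]
print(compare_ids(fasta, gff))
```

Two empty lists mean every annotated sequence has a matching FASTA entry and vice versa.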

4
6.1 years ago

Nobody mentioned FastQC yet? I bet it's one of the most downloaded...

1

But FastQC does not have any external dependencies for reference data :-)

3

You are right, it was unclear to me whether the OP was restricting to tools with dependencies on reference data format (anyway, one should define what "reference data" is...)

3
6.1 years ago

FastQC, fastx_toolkit, Bowtie, samtools

2
6.1 years ago
jotan ★ 1.2k

Seqmonk

(Random text to pass character threshold)

1
6.1 years ago
John 13k

I think the goal is clear and good: to abstract away the problem of different file formatting into something that users understand. I want --> FASTA for --> bowtie. I want --> bedGraph for --> bedtools. And so on.

However, I can see this abstraction having three possibly difficult issues to resolve:

1) Tools obviously change. Right now STAR takes only pair-split FASTQ files, not a single interleaved FASTQ file. This might change in the future, meaning that today's "--> STAR" format might not be tomorrow's "--> STAR" format.

2) Where two programs support the same format (e.g., in the future perhaps both STAR and TopHat support interleaved FASTQ), but "--> STAR" actually means read-pair-split and "--> TopHat" means interleaved for legacy reasons, you'll get people downloading 2x as much data from your site. It isn't a 1:1 mapping.

3) "My boss was very specific and told me to get him a half-open half-closed 0-based bedgraph format with integers not floats, binned in 250bp regions -- is that bedgraph or bedops formatting?" 😵🔫
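To make the jargon in example 3) concrete, here is a small sketch of the two pieces involved: converting between 1-based inclusive coordinates (as in GFF/GTF) and 0-based half-open coordinates (as in BED/bedGraph), and summarising intervals into fixed 250 bp bins with integer values. The function names are hypothetical, and the binning (covered bases per bin) is just one plausible reading of the request.

```python
def gff_to_bed_interval(start, end):
    """1-based inclusive (GFF/GTF) -> 0-based half-open (BED/bedGraph)."""
    return start - 1, end

def bed_to_gff_interval(start, end):
    """0-based half-open (BED/bedGraph) -> 1-based inclusive (GFF/GTF)."""
    return start + 1, end

def bin_counts(intervals, bin_size=250):
    """Count covered bases per fixed-width bin.

    `intervals` are 0-based half-open (start, end) pairs on one sequence.
    Returns sorted (bin_start, bin_end, covered_bases) tuples with integer
    values; overlapping intervals are counted once each.
    """
    counts = {}
    for start, end in intervals:
        pos = start
        while pos < end:
            bin_start = (pos // bin_size) * bin_size
            bin_end = bin_start + bin_size
            covered = min(end, bin_end) - pos
            counts[bin_start] = counts.get(bin_start, 0) + covered
            pos = bin_end
    return [(s, s + bin_size, counts[s]) for s in sorted(counts)]

print(gff_to_bed_interval(101, 200))
print(bin_counts([(200, 600)]))
```

The half-open convention is why the length of a BED interval is simply `end - start`, while a GFF feature spans `end - start + 1` bases.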

Mapping formats to the tools that support them is a fantastic idea. However, it would be nice if Ensembl also gave you the option to choose your data format very specifically, as in example 3), and, if you don't know what you want, took you to a handy look-up page that can be kept up to date -- perhaps a grid of tools and the formats they currently support. Clicking on a tick mark in such a table could autofill the more detailed form for you, as per example 3).

It's not easy balancing the highly technical desires of some users with the ease and simplicity that others expect from non-technical software, but I'm really happy to see that Ensembl is making efforts in this area :)

2

We're going to try to stay on top of format changes in other tools, although sometimes an email reminder from a user is necessary! There will be customisation options, but for anything beyond those, if you really need something custom, we'll have to point you to the APIs.

1
6.1 years ago
#### ▴ 220

BED format with 6 and 12 columns, and GFF/GFF3 format with intron information, are also often required.
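For reference, intron coordinates can be derived from a 12-column BED record, since introns are the gaps between consecutive blocks. A minimal sketch, assuming a well-formed BED12 line with sorted, non-overlapping blocks (the function name and example record are made up):

```python
def introns_from_bed12(line):
    """Return introns as (chrom, start, end, name, strand), 0-based half-open."""
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, name, strand = f[0], int(f[1]), f[3], f[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    introns = []
    for i in range(len(starts) - 1):
        intron_start = chrom_start + starts[i] + sizes[i]  # end of block i
        intron_end = chrom_start + starts[i + 1]           # start of block i+1
        introns.append((chrom, intron_start, intron_end, name, strand))
    return introns

record = "chr1\t1000\t2000\ttx1\t0\t+\t1000\t2000\t0\t3\t100,200,300,\t0,400,700,\n"
for intron in introns_from_bed12(record):
    print(intron)
```

Each returned tuple can be written out directly as a BED6-style line if a score column is added.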

1
6.0 years ago
Samuel Lampa ★ 1.3k

We tried to create an approximate list of this kind at an HPC center that mainly runs NGS workloads. It was based on "module" loads (via the GNU module system), which record every time a script loads a module for software installed on the system (it had lots).

The list is from a few years ago (late 2012), but maybe someone else could do a similar list today? The GNU module system is pretty widely used on HPC systems. You just need to get access to some central system-level logs from a sysadmin, for a representative period of time.

0
6.1 years ago
Sandeep ▴ 260

To check for alternative splicing: SplAdder
