Forum: What are the most popular NGS tools?
5
3.4 years ago by
Emily_Ensembl19k
EMBL-EBI
Emily_Ensembl19k wrote:

We're developing a tool in Ensembl that will allow you to get GTF/GFF/FASTA files of Ensembl data in different formats, such that you can use them for different kinds of NGS analysis. We know that all of these tools have slightly different requirements for the file, even though they claim to be a standard format!

To do this, we're looking to generate a list of the most popular tools that people use; we'll then look into their requirements and make sure our tool can provide what's needed. Please could you reply with the tools you use, one tool per post, and upvote any posts which already mention the tools you use.

forum format ngs ensembl • 1.7k views
modified 3.3 years ago by Samuel Lampa1.2k • written 3.4 years ago by Emily_Ensembl19k
2

One can't really post a single tool name, as it would be too short to count as content.

In general I have observed that scientists like to use meaningful gene names. One of the most common needs is "how do I go from an Ensembl gene/transcript ID to a descriptive name?".
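As a rough illustration of that ID-to-name lookup, here is a minimal Python sketch that builds a gene ID to gene name table from GTF lines. It assumes the attribute column carries `gene_id` and `gene_name` keys in the usual `key "value";` GTF style; the function name `gene_names_from_gtf` is made up for this example and the sample record is illustrative only.

```python
import re

# Matches the key "value"; pairs found in the 9th (attribute) column of a GTF line.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def gene_names_from_gtf(lines):
    """Return {gene_id: gene_name} for 'gene' features.

    Falls back to the gene ID itself when no gene_name attribute is present.
    """
    names = {}
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id:
            names[gene_id] = attrs.get("gene_name", gene_id)
    return names


# Illustrative record only; coordinates are not meant to be authoritative.
example = [
    "#!genome-build GRCh38\n",
    '17\tensembl_havana\tgene\t7661779\t7687550\t.\t-\t.\t'
    'gene_id "ENSG00000141510"; gene_version "18"; gene_name "TP53"; '
    'gene_biotype "protein_coding";\n',
]
print(gene_names_from_gtf(example))
```

The same idea extends to transcripts by filtering on `transcript` features and reading `transcript_id` and `transcript_name` instead, assuming the file provides them.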

A second important need is for high-quality, reliable information rather than a comprehensive, all-inclusive set. At the same time it is important to understand the rules by which these were created. For example, the annotation tracks in IGV are very informative, but I don't know on what basis they were selected.

modified 3.4 years ago • written 3.4 years ago by Istvan Albert ♦♦ 81k
1

It would be nice to be able to download (via FTP) files for mapping between common identifiers. I realize there is BioMart, but I've not had the best experience with it and have ended up using the Ensembl MySQL database when trying to get the whole gene/transcript/protein set for a species.

written 3.4 years ago by pld4.8k
2

@Emily: You should clarify your question to indicate if you are looking to get a list of all NGS tools or just those which have a strict dependency on reference data (which will be provided by Ensembl).

modified 3.4 years ago • written 3.4 years ago by genomax71k
1

I thought that was self-evident.

written 3.4 years ago by Emily_Ensembl19k
2

Perhaps an edit to the title to clarify, "popular tools that require reference data"? Current title sounds like any NGS tool is acceptable.

modified 3.4 years ago • written 3.4 years ago by genomax71k
2

At the minimum, Ensembl should provide a concatenated reference genome. I know there are "toplevel" FASTAs, but for human they contain ALT contigs, which most users wouldn't want to use for general mapping. It is interesting that no official databases (I am talking about UCSC/NCBI/Ensembl) provide a concatenated GRCh37, which is partly why we see so many variants of GRCh37. GRC now provides a concatenated GRCh38. I hope Ensembl can do the same, as Ensembl and GRC/UCSC use different naming.

written 3.4 years ago by lh331k
1

Would these be offered as bundles (like what iGenomes provides)?
I don't understand the "GTF/GFF/FASTA in different formats" part.

modified 3.4 years ago • written 3.4 years ago by genomax71k
2

Not bundles. You would choose whether you need FASTA, GFF or GTF (potentially expanding to more file types if there is a need), then which tool you intend to use, and it would spit out a GFF (or whatever) file with the chromosome names formatted how you need them, info fields filled in how you need them, etc. Even though these are standard formats, it seems that everybody actually makes them differently, so we're trying to make it so that you can get them in the style you need for the tool you're using; hence wanting to know what tools people use.
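To make the chromosome-naming point concrete, here is a minimal Python sketch of one such conversion: rewriting Ensembl-style human sequence names (`1`, `X`, `MT`) to UCSC-style ones (`chr1`, `chrX`, `chrM`) in a GFF/GTF stream. This is only an illustration, not how the planned Ensembl tool works; the function names are made up, the mapping is human-specific, and leaving scaffolds untouched is an assumption.

```python
# Human primary chromosomes in Ensembl-style naming (assumption: human only).
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}


def to_ucsc_style(seqid):
    """Convert an Ensembl-style sequence name to a UCSC-style one."""
    if seqid == "MT":
        return "chrM"
    if seqid in PRIMARY:
        return "chr" + seqid
    # Scaffolds, patches and anything unrecognised are left as-is (assumption).
    return seqid


def rename_seqids(lines, rename=to_ucsc_style):
    """Yield GFF/GTF lines with the first column renamed.

    Comment and directive lines (starting with '#') pass through unchanged,
    so directives that mention sequence names are NOT rewritten here.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        seqid, sep, rest = line.partition("\t")
        yield rename(seqid) + sep + rest


gff = [
    "##gff-version 3\n",
    "1\tensembl\tgene\t100\t200\t.\t+\t.\tID=gene:G1\n",
    "MT\tensembl\tgene\t1\t10\t.\t+\t.\tID=gene:G2\n",
]
for out_line in rename_seqids(gff):
    print(out_line, end="")
```

A per-tool profile could then just be a different `rename` function plus whatever attribute rewriting that tool expects.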

written 3.4 years ago by Emily_Ensembl19k
5

In order to save a bunch of separate answers let me get some common tools out of the way (in no particular order).

BWA
Bowtie 1/2
BBMap
GSNAP
HISAT2
STAR
TopHat
Salmon
Kallisto
Bedtools
Bedops

  • It would also be nice if you could provide pre-formatted indexes for the aligners (since people seem to have trouble generating those). Effectively a bundle could be made (if these were also available).
  • "Known_transcriptome.fa" file for genomes (for those who want it, others can ignore it). A corresponding BED file to go with this.
  • "Known_transcriptome.fa" file with just the longest transcript sequence (another common request). A corresponding BED file to go with this.
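
For the longest-transcript request in the last bullet, a minimal Python sketch of the selection step might look like the following. It assumes each FASTA header carries a `gene:<ID>` token linking the transcript to its gene (a hedge: check the headers of the file you actually have); the function names and the IDs in the example are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records):
    """Keep the longest transcript sequence for each gene.

    Assumes the header contains a whitespace-separated 'gene:<ID>' token;
    records without one are skipped. Ties keep the first transcript seen.
    """
    best = {}
    for header, seq in records:
        gene = next(
            (tok[len("gene:"):] for tok in header.split() if tok.startswith("gene:")),
            None,
        )
        if gene is None:
            continue
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


# Made-up identifiers and sequences, for illustration only.
fasta = [
    ">ENST0001.1 cdna gene:ENSG0001.1",
    "ACGT",
    ">ENST0002.1 cdna gene:ENSG0001.1",
    "ACGTAC",
    "GT",
    ">ENST0003.1 cdna gene:ENSG0002.1",
    "AC",
]
for gene, (header, seq) in longest_per_gene(read_fasta(fasta)).items():
    print(gene, header.split()[0], len(seq))
```

"Longest" here means longest sequence in the FASTA; picking by CDS length or a canonical flag would need annotation beyond the headers.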
modified 3.4 years ago • written 3.4 years ago by genomax71k
1

Can it be set up so that the FASTA and GFF files have the same names for entries? NCBI will sometimes give only the RefSeq ID in the GFF, but in the FASTA file it will give the whole title (gbk/refseq/name/etc). Not a huge issue, but it is a little annoying.

written 3.4 years ago by pld4.8k
4
3.4 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Nobody mentioned FastQC yet? I bet it's one of the most downloaded...

modified 3.4 years ago • written 3.4 years ago by dariober10k
1

But FastQC does not have any external dependencies for reference data :-)

modified 3.4 years ago • written 3.4 years ago by genomax71k
3

You are right, it was unclear to me whether the OP was restricting to tools with dependencies on reference data format (anyway, one should define what "reference data" is...)

written 3.4 years ago by dariober10k
3
3.4 years ago by
Student ,School of life sciences, Manipal University, Manipal, India
kapil.joshi03680 wrote:

fastqc fastx_toolkit bowtie samtools

written 3.4 years ago by kapil.joshi03680
2
3.4 years ago by
jotan1.2k
Australia
jotan1.2k wrote:

Seqmonk

(Random text to pass character threshold)

written 3.4 years ago by jotan1.2k
1
3.4 years ago by
John12k
Germany
John12k wrote:

I think the goal is clear and good - to abstract away the problem of different file formatting to something that users understand: I want --> FASTA for --> bowtie. I want --> Bedgraph for --> bedtools. etc.

However, I can see this abstraction having three possibly difficult issues to resolve:

1) Tools obviously change, so right now STAR takes only pair-split FASTQ files, not a single interleaved FASTQ file. This might change in the future, meaning that today's "--> STAR" format might not be tomorrow's "--> STAR" format.

2) Where two programs support the same format (e.g., in the future perhaps both STAR and TopHat support an interleaved FASTQ), but "--> STAR" actually means read-pair-split and "--> TopHat" means interleaved for legacy reasons, you'll get people downloading 2x as much data from your site. It isn't a 1:1 mapping.

3) "My boss was very specific and told me to get him a half-open half-closed 0-based bedgraph format with integers not floats, binned in 250bp regions -- is that bedgraph or bedops formatting?" 😵🔫
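
The coordinate-convention part of example 3) is the piece that most often trips people up, so here is a minimal Python sketch of it: converting a GFF/GTF feature (1-based, both ends inclusive) to a BED6 line (0-based, half-open). The function name is made up, and using the feature type as the BED name and `0` for a missing score are arbitrary choices made for this illustration.

```python
def gff_to_bed6(line):
    """Convert one GFF/GTF feature line to a BED6 line.

    GFF/GTF: 1-based, start and end both inclusive.
    BED:     0-based start, end exclusive (half-open).
    So only the start shifts by one; the end value stays the same.
    """
    fields = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, score, strand = fields[:7]
    bed_start = int(start) - 1
    bed_score = "0" if score == "." else score  # placeholder when no score is given
    # The feature type is used as the BED name here (an arbitrary choice).
    return "\t".join([seqid, str(bed_start), end, ftype, bed_score, strand])


gff_line = "1\tensembl\texon\t101\t200\t.\t+\t.\tID=exon:E1\n"
print(gff_to_bed6(gff_line))
# The feature is 100 bp long in both conventions:
# GFF: 200 - 101 + 1 = 100; BED: 200 - 100 = 100.
```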

The idea of mapping formats to tools that support them is a fantastic idea -- however, it would be nice if Ensembl gave you the option to choose your data format very specifically like in example 3), but if you don't know what you want, take you to a handy look-up page that can stay updated - perhaps a grid of tools and the formats they currently support. Clicking on a tick mark in such a table could autofill the more detailed form out for you as per example 3).

It's not easy balancing the highly technical desires of some with the ease and simplicity of the non-technical software others are used to, but I'm really happy to see that Ensembl is making efforts in this area :)

modified 3.4 years ago by Devon Ryan91k • written 3.4 years ago by John12k
2

We're going to try to stay on top of format changes in other tools, although sometimes an email reminder from a user is necessary! There will be customisation options, but if you really need something beyond those, we're going to have to point you to the APIs.

written 3.4 years ago by Emily_Ensembl19k
1
3.4 years ago by
####190
####190 wrote:

BED format with 6 and 12 columns, and GFF/GFF3 format with intron information, are also required many times.

written 3.4 years ago by ####190
1
3.3 years ago by
Samuel Lampa1.2k
Stockholm
Samuel Lampa1.2k wrote:

We tried to create an approximate list of this kind at one HPC center mainly running NGS workloads, based on "module" loads (via the GNU module system), which record every time a script loads a module for software installed on the system (it had lots).

The list is from a few years ago (late 2012), but maybe someone else could do a similar list today? The GNU module system is pretty widely used on HPC systems. You just need to get access to some central system-level logs from a sysadmin, for a representative period of time.

modified 3.3 years ago • written 3.3 years ago by Samuel Lampa1.2k
0
3.4 years ago by
Sandeep250
Manipal, India
Sandeep250 wrote:

To check for alternative splicing: SplAdder

written 3.4 years ago by Sandeep250