Question

I Am Preparing A Course On NGS: Any Suggestions?

81

14.7 years ago

Pierre Lindenbaum 166k

Hi all, I am preparing a course on NGS: there will be seven students for 4 hours and I want them to play with some NGS data. No programming skill is required here.

Here is what I plan to do:

short intro to NGS
structure of a (small) FASTQ file
map it with BWA on public Galaxy http://main.g2.bx.psu.edu/
index the genome and map the fastqs with MAQ
index the genome and map the fastqs with BWA
structure of a SAM file
GATK recalibration (?)
call the SNPs with samtools pileup and generate a VCF
explore a BAM file with samtools tview
find the rs## at UCSC (table browser or mysql )
predict the consequences of a set of SNPs with polyphen2 (btw is there a way to generate a random fastq file with a set of 'forced' mutations ?)
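On the side question about a FASTQ file with "forced" mutations: a minimal Python sketch, assuming error-free single-end reads and a random toy reference. The function name `simulate_fastq`, the read length and the SNP position are all made up for illustration; a dedicated simulator such as wgsim (mentioned in one of the answers) would be the realistic choice. The sketch also shows the four-line structure of a FASTQ record.

```python
import random


def simulate_fastq(reference, forced_snps, n_reads=100, read_len=36, seed=0):
    """Return FASTQ text for error-free reads drawn from a mutated reference.

    forced_snps maps a 0-based position to the alternate base to force there.
    """
    rng = random.Random(seed)
    mutated = list(reference)
    for pos, alt in forced_snps.items():
        mutated[pos] = alt
    mutated = "".join(mutated)

    records = []
    for i in range(n_reads):
        start = rng.randrange(len(mutated) - read_len + 1)
        seq = mutated[start:start + read_len]
        # 'I' encodes Phred 40 in the Sanger (Phred+33) convention
        qual = "I" * read_len
        # One FASTQ record = four lines: @name, sequence, '+', qualities
        records.append(f"@read{i}_pos{start}\n{seq}\n+\n{qual}\n")
    return "".join(records)


# Toy 500 bp reference (random, purely illustrative)
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice("ACGT") for _ in range(500))

# Force one SNP at 0-based position 250
alt_base = "A" if reference[250] != "A" else "C"
fastq = simulate_fastq(reference, {250: alt_base})
print(fastq[:120])
```

Every read overlapping position 250 carries the forced base, so after mapping and SNP calling the students can check whether it is recovered.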

my other ideas:

running something in the cloud: do you know if there is a way to run something for free on Amazon ? what kind of analysis could I run ?
storing something (the VCF ?) in a database (mysql ? sqlite3 ?) and using rails to display the data
generating the tool for using a webservice (SOAP/REST...): what service could I use for this course ?

Any other suggestion ? What would you like to see during this course ?

I'll validate the answer with highest number of votes next week.

Thanks,

Pierre

EDIT: the course should give them the opportunity see what would look the work of someone working with NGS and to have an experience with some real data. I don't know their skill but AFAIK, there are supposed to have some programming courses later.

My only experience is the analysis of "exome capture" data = SNP.

Update: I posted my slides on slideshare: http://www.slideshare.net/lindenb/20101210-ngscourse

next-gen sequencing galaxy • 20k views

ADD COMMENT • link updated 11.9 years ago by sarahhunter ▴ 600 • written 14.7 years ago by Pierre Lindenbaum 166k

5

would you accept attendees? ;)

ADD REPLY • link 14.7 years ago by Jorge Amigo 14k

2

You could consider applying to AWS for an education credit http://aws.amazon.com/education ... no guarantees, but it might help cover AWS costs for the workshop

ADD REPLY • link 14.7 years ago by Mndoci ★ 1.2k

0

same question as Jorge

ADD REPLY • link 14.7 years ago by Fred Fleche 4.3k

0

@jorge @Fred, unfortunately I'm far from being an expert with the NGS data :-)

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

@Pierre: If you can't accept attendees, you may stream this class online :).

ADD REPLY • link 14.7 years ago by Khader Shameer 18k

0

@Khader , do you speak French ? ;-)

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Course is in French ! So if I attend the course, I can improve my French as well. How about the course materials, slides etc ?

ADD REPLY • link 14.7 years ago by Khader Shameer 18k

0

@Pierre and @Khader : I am starting to play with the youtube API so I would be happy to display the course on Bioinformatics.fr ;-)

ADD REPLY • link 14.7 years ago by Fred Fleche 4.3k

0

That's cool Fred. So Pierre, you will have enough audience via Bioinformatics.fr. Now the question is will you stream (preferably, an English) version of your course or not ?

ADD REPLY • link 14.7 years ago by Khader Shameer 18k

score 30 · Answer 1 · 2010-11-05

30

14.7 years ago

Istvan Albert 102k

Titus Brown at Michigan State University has run a course on Analyzing Next Generation Sequencing Data and as the link shows he has built an amazing resource around it.

His tutorials might give you a good sense on what topics to include and what level of detail may be appropriate.

ADD COMMENT • link 14.7 years ago by Istvan Albert 102k

8

and of course drop me a note if you discover bugs, problems, etc. And re-use the material as much as you want -- CC-BY-SA.

ADD REPLY • link 14.0 years ago by Titus Brown ▴ 80

0

Istvan, you're my hero

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

well, the credit should go to Titus - I just happened to know about it

ADD REPLY • link 14.7 years ago by Istvan Albert 102k

score 20 · Answer 2 · 2010-11-05

Drop MAQ, as it is no longer as widely used as it was.
Choose between Galaxy and the command line, depending on which suits them best. It should not be hard to find a consensus among 7 students. The same goes for cloud computing.
I do not work with raw data now. I would guess Illumina base quality should be better than before. In that case, I am not sure if recalibration is absolutely necessary.
Introduce IGV instead of samtools' tview. Although tview is useful in a few scenarios, IGV is in general more powerful and user friendly. IGV works with VCF. No need to set up database or services.
I know SQL well, but except for setting up a serious web server, I never use it. SQL is overkill. Those graphic viewers and the UCSC custom track are much more convenient.
Mention Picard and GATK, which are both great packages.
I do not know how seriously duplicates affect results, but you should mention this as a potential concern.
As others suggested, it would be good to introduce ChIP/RNA-seq and the discovery of structural variations even if these are not the main purposes.

It is important to let students play with real or simulated data.

EDIT: a further comment:

I used to give a two-hour course on variant discovery. I gave each attendee a tar-ball which includes a bacterial genome (S. suis), a variant/read simulator (wgsim), a mapper (bwa), a SNP caller (samtools) and a few scripts. It is only a couple of MB in size, suitable for email. With these, one can do simulation, mapping, SNP calling, visualization and evaluation, nearly the entire pipeline. I have lost the tar-ball, though.
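The "evaluation" step at the end of such a pipeline can be sketched in a few lines of Python: compare the SNPs that were called against the SNPs that were simulated. This is only an illustration under stated assumptions, not the lost scripts. The helper names `read_vcf_snps` and `evaluate_calls` and the toy VCF lines are hypothetical; the only thing relied on is the standard VCF column order (CHROM, POS, ID, REF, ALT, ...).

```python
def read_vcf_snps(lines):
    """Collect (chrom, pos, alt) for every data line of a VCF."""
    snps = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, _ref, alt = line.rstrip("\n").split("\t")[:5]
        snps.add((chrom, int(pos), alt))
    return snps


def evaluate_calls(truth, calls):
    """Count true/false positives and false negatives between two SNP sets."""
    tp = len(truth & calls)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy data: three simulated SNPs, three called SNPs, two in common
truth_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tG\tA",
    "chr1\t250\t.\tC\tT",
    "chr1\t400\t.\tA\tG",
]
calls_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tG\tA",
    "chr1\t250\t.\tC\tT",
    "chr1\t777\t.\tT\tC",
]

result = evaluate_calls(read_vcf_snps(truth_vcf), read_vcf_snps(calls_vcf))
print(result)
```

With simulated data the truth set is known exactly, which is what makes this kind of check possible in a short class.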

score 9 · Answer 3 · 2010-11-05

9

14.7 years ago

Michael 55k

I think this could be a bit too much for 4 hours, but it depends on the skill of the students. If this is a hands-on course, which it seems to be, it might be OK to use only one alignment tool and then focus more on the possible downstream analyses.

Some points to consider:

What is the aim of the course, can you give a more descriptive title than playing with NGS, what should they take home?
What is the skill level of your students?
Which applications do you want to present, e.g. re-sequencing, RNA-seq, ChIP-seq, SNP calling, copy-number variation? I guess you could at least mention all these approaches and provide a core toolset that is useful in all or most of applications. Or do you want to focus on SNPs?
I wouldn't do recalibration, it's maybe too specific.
Samtools is definitely a good toolkit
alternatively, I like the Bioconductor short-read tools
presenting the various file-formats (FASTQ, SAM, BAM) is definitely interesting
do you want to mention filtering of reads and alignments based on qualities?

It's all your decision and I think it depends mostly on the first 3 points.

Regarding the EC2 cloud: we are using it via a grant (so someone is paying for us). I also remember a cloud application for RNA-seq (called Myrna). I don't know if you can use EC2 for free, but then, it's just like a remote-login computer, so I don't see the benefit for your course. And your students couldn't use it at home.

Then:

storing something (the VCF ?) in a database (mysql ? sqlite3 ?) and using rails to display the data
generating the tool for using a webservice (SOAP/REST...): what service could I use for this course ?

These sound more programming-heavy, and you said that wasn't a requirement, so I think that for 4 hours an introduction to existing tools and toolkits would best serve most students who are new to NGS.

Edit: Oh, yeah, before I forget: prepare your computer lab in good time and have the software installed on the computers. If you let the students bring their own laptops, you will spend two hours installing stuff on Windows, figuring out where this damn terminal application was in Windows, or working out why they cannot compile the alignment tool using gcc....

ADD COMMENT • link 14.7 years ago by Michael 55k

0

Michael, I've never played with Bioconductor, what would you suggest to do with it ?

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

As far as I remember it is possible to quickly setup a webserver by just creating a SQL table and invoking RAILS. So I thought it would be awesome to visualize a simple SQL table (e.g: chrom,position,SNP, quality)
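For reference, the table part of this idea (without the Rails front end) might look like the following sketch. It uses Python's built-in `sqlite3` module rather than MySQL, and the table name `snp`, the column set and the VCF lines are invented for illustration.

```python
import sqlite3

# Toy VCF content (made up); columns: CHROM POS ID REF ALT QUAL FILTER INFO
vcf_lines = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1042\t.\tA\tG\t48.0\t.\t.",
    "chr1\t20311\t.\tC\tT\t12.5\t.\t.",
    "chr2\t77\t.\tG\tA\t99.0\t.\t.",
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing to install
conn.execute(
    "CREATE TABLE snp (chrom TEXT, position INTEGER, "
    "ref TEXT, alt TEXT, quality REAL)"
)
for line in vcf_lines:
    if line.startswith("#"):
        continue  # skip meta-information and header lines
    chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
    conn.execute(
        "INSERT INTO snp VALUES (?, ?, ?, ?, ?)",
        (chrom, int(pos), ref, alt, float(qual)),
    )
conn.commit()

# Example query: SNPs with quality of at least 30
high_quality = conn.execute(
    "SELECT chrom, position, ref, alt FROM snp "
    "WHERE quality >= ? ORDER BY chrom, position",
    (30,),
).fetchall()
print(high_quality)
```

Whether this is worth the class time is a separate question; the point is only that the storage side is a handful of lines.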

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

It might be difficult to teach BioConductor if one never used it, on the other hand, for you, maybe it's easier. I would show how to read in aligned reads, filter them, compute the coverage and detect peaks, do gene binning, and differential expression analysis. Again a bit too much maybe, but you can choose, most of it can also be done with other tools. A visualization option is nice, and if this rails SQL stuff is really so simple, why not? On the other hand something like IGV might be easier to reproduce by the students.

ADD REPLY • link 14.7 years ago by Michael 55k

0

About Bioconductor, I was thinking about a simple example, an illustration of a simple script (for example, I found this: http://www.bioconductor.org/help/workflows/high-throughput-sequencing/)

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 14.7 years ago by Pierre Lindenbaum 166k

0

Something like it, but that example workflow is so boring. Maybe look at http://www.bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf, the documentation of the ShortRead package. On the other hand, this might be better in a separate course.

ADD REPLY • link 14.7 years ago by Michael 55k

score 8 · Answer 4 · 2010-11-05

8

14.7 years ago

brentp 24k

I agree with @Michael, that's a lot for 4 hours! I think that unless they are comfortable at the command line, it's a good idea to stick with galaxy. Use BFast or bowtie instead of bwa as those seem to be more current. For 4 hours, I think a reasonable workflow would be:

get a fastq from SRA (or provided by you)
look at the quality distribution using fastqc or fastx-toolkit (or whatever galaxy supports)
do some read trimming/filtering and look at quality again
map the reads to a reference
use samtools to sort, index
look at the alignment graphically: either igv, samtools tview, or galaxy

then if there's time, you could do something with Picard like MarkDuplicates
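To make the quality and trimming steps of this workflow concrete, here is a small Python sketch of what tools like fastqc or fastx-toolkit compute, not a replacement for them. The function names are hypothetical, and the default `offset=33` assumes Sanger-style (Phred+33) quality encoding; some Illumina pipelines of that period used an offset of 64 instead.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in quality_string]


def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, score in enumerate(phred_scores(qual, offset)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(qual, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy reads: 'I' = Q40, '5' = Q20, '#' = Q2 under Phred+33
print(mean_quality_per_position(["II#", "I5#"]))
print(trim_3prime("ACGTACGT", "IIIII5##"))
```

Plotting the per-position means before and after trimming gives the students the "look at quality again" comparison from the list above.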

ADD COMMENT • link 14.7 years ago by brentp 24k

5

The BFAST author recommends using BWA for Illumina mapping. Bowtie is fine for ChIP/RNA-seq, but it is not a good choice for SNP calling as it does not do gapped alignment. BWA has possibly mapped most of the Illumina data in the world, because 1) the 1000 Genomes project is using BWA; 2) Sanger is using BWA exclusively for mammalian resequencing projects; 3) Broad, another major sequencing center, is using bwa exclusively; 4) so far as I know, WashU and UWash are also using bwa for resequencing. These major sequencing centers have all carefully evaluated popular mappers. There are reasons.

ADD REPLY • link 14.7 years ago by lh3 33k

1

I would guess bwa arrived right at the time when people needed a mapper balanced between openness, speed, memory, features, support and, in particular, accuracy. Now bwa is not the only one that achieves the balance, but it is arguably the first.

ADD REPLY • link 14.7 years ago by lh3 33k

0

Most large resequencing projects at Sanger are also based exclusively on bwa alignment.

ADD REPLY • link 14.7 years ago by lh3 33k

0

good to know. i will look into bwa more carefully. do you know the reasons you mention in your last sentence?

ADD REPLY • link 14.7 years ago by brentp 24k

score 7 · Answer 5 · 2010-11-08

In my view, a NextGen Sequencing course should have three main data types:

Discovery of variants in human genomes, possibly linking or associating with a disease trait (phenotype). An example of this is the sequencing of "cancer" genomes from a patient's tumor and his/her normal genome (skin, e.g.).
RNAseq - as a means to assess gene expression. Read depth correlates with gene expression levels and detection of variants. This is especially cool when comparing to the genome and one sees imbalance in heterozygotes: The A allele has 78% of the reads while the G allele has 22% of the RNAseq reads. Hmmm.
Environmental genomics samples. What kind of biological diversity do you see in the human gut, in a hot spring, in a brown field site (contaminated site), in the ocean, etc. etc.

These are the three areas where I see a lot of work being done in research labs and in companies. Make the course as relevant to real-world situations as you can. Your students will thank you for it!
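The allelic-imbalance example in the RNAseq item (78% A vs. 22% G at a heterozygous site) can be turned into a short exercise: is that split compatible with a 50/50 expectation? Below is a sketch of an exact two-sided binomial test in plain Python. The read depth of 100 is an assumption made here, since only percentages are given, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than observing k successes out of n.
    """
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for float ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


# Assumed depth: 100 reads at the site, 78 carrying the A allele
p_value = binomial_two_sided_p(78, 100)
print(f"p = {p_value:.3g}")
```

At this assumed depth the p-value is far below 0.05, so a 78/22 split would be hard to explain by sampling noise alone; at much lower depth the same percentages would be far less convincing.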

score 6 · Answer 6 · 2010-11-06

Amazon does have a free "usage tier", where someone who is new to EC2 can run a "micro" instance free for a year, with 10 gigabytes of storage and a certain amount of IO and external internet transfer. CPU and RAM on micro instances are pretty sharply limited, though; RAM is ~600MB, and CPU is only allowed to be fully used in short bursts (3 minutes if I recall correctly, after which your instance gets throttled).

So, if your data size is small enough, and your students don't already have AWS accounts, then you might be able to get away with it.

If you need more CPU or RAM, though, then it's not too expensive to use EC2 if you only run instances for short periods of time ($0.095 to $2.28 per hour in the EU region, depending on how much CPU and RAM you need). If all of the tools can be used from the command line, or over the web, or with X forwarding, then I think it's a good choice; Titus Brown's course materials have a lot of detail about how it could be done.

If you use an instance with an EBS root volume, then you can set up the disk image ahead of time with all the tools installed. The students could each start an instance based on that. And then, at the end of the course, they could shut down their instance (and not have to pay to keep it running), while still being able to spin it up again later to use or to refer back to. While the instance isn't running, the only cost would be for the EBS storage, which is $0.10 per gigabyte per month.

So I'd expect that you could use EC2; it might cost something, but I think if it's done right then the cost would be quite low.
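As a back-of-the-envelope check using the prices quoted above (which are the 2010 figures from this answer and will have changed), here is the arithmetic for 7 students and a 4-hour session. The function name is hypothetical and the 10 GB EBS volume per student is an assumption.

```python
def workshop_cost(students, hours, hourly_rate,
                  ebs_gb=0, ebs_rate_per_gb_month=0.10, months=0):
    """Total cost in USD: instance hours plus EBS storage kept afterwards."""
    compute = students * hours * hourly_rate
    storage = students * ebs_gb * ebs_rate_per_gb_month * months
    return compute + storage


# 7 students x 4 hours at the cheapest and the most expensive quoted rates
low = workshop_cost(7, 4, 0.095)
high = workshop_cost(7, 4, 2.28)

# Keeping an assumed 10 GB EBS volume per student for one month, instances stopped
keep_one_month = workshop_cost(7, 0, 0.0, ebs_gb=10, months=1)

print(f"compute: ${low:.2f} to ${high:.2f}")
print(f"storage for one month: ${keep_one_month:.2f}")
```

So the session itself comes to roughly $2.66 to $63.84 in compute, plus about $7 per month if every student keeps a 10 GB volume.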

score 6 · Answer 7 · 2012-04-05

You might also explore the past lectures from the Canadian Bioinformatics Workshop series (scroll down to Past Courses at bottom). They have complete slide sets (pdf|ppt: CC-BY-SA) and even videos of the lectures. Topics in 2011 included "Informatics on High Throughput Sequencing Data" as well as "Exploratory Data Analysis and Essential Statistics using R", "Pathway and Network Analysis on -omics Data", "Informatics and Statistics for Metabolomics", "Microarray Data Analysis", and "Bioinformatics for Cancer Genomics".

score 4 · Answer 8 · 2010-11-05

4

14.7 years ago

Khader Shameer 18k

I recently saw this post by Jeff, "Servers for Nothing, Bits for Free". I think you could get free access to the AWS cloud using a non-paid Linux AMI.

Disclaimer: I haven't tried this yet.

ADD COMMENT • link 14.7 years ago by Khader Shameer 18k

score 4 · Answer 9 · 2010-11-06

4

14.7 years ago

Thaman ★ 3.3k

If it's just a 4-hour class, then I think the OpenHelix tutorial on Galaxy is enough; it covers uploading, preparing, filtering and analyzing data on different servers

http://www.openhelix.com//cgi/tutorialInfo.cgi?id=82

ADD COMMENT • link 14.7 years ago by Thaman ★ 3.3k

score 4 · Answer 10 · 2010-11-09

4

14.7 years ago

Jeremy Leipzig 23k

The EC2 thing has definite pros and cons. We had a fellow at work take the MSU course and he said everything ran very smoothly at the course, but when he got home he couldn't replicate the setup because he didn't know how to do things like alter his PATH, or interpret errors related to missing dependencies.

If you go the opposite route and force students to install everything and encounter real problems, they will resent you during the course but will benefit in the long run.

Last month I taught a session in NGS for CSHL's Programming for Biology. These slides might be helpful to you: http://gorgonzola.cshl.edu/pfb/2010/LectureNotes/ngs2/ngs2.pdf

ADD COMMENT • link 14.7 years ago by Jeremy Leipzig 23k

0

Just a quick comment in defense of the MSU course here -- that's what we did have them do :). But there is a limit to what you can effectively teach. This year we gave them pre-packaged AMIs. We'll see how that shakes out over time.

ADD REPLY • link 14.0 years ago by Titus Brown ▴ 80

score 4 · Answer 11 · 2012-05-11

4

13.2 years ago

Bioinfosm ▴ 620

Too late for Pierre's question, but I found this presentation by John McPherson really extensive and informative. The first half is almost all intro to NGS for first timers:

http://bioinformatics.ca/workshops/2011/course-content

Module 1: Introduction to cancer genomics (Faculty: John McPherson)

PDF | PPT | MP4 (VIDEO)

ADD COMMENT • link 13.2 years ago by Bioinfosm ▴ 620

2

never too late ;-) - thanks for pointing out this great resource

ADD REPLY • link 13.2 years ago by Istvan Albert 102k

score 3 · Answer 12 · 2010-11-05

I would agree with the suggestion to use only one alignment tool (maybe discuss another briefly and give usage examples in some take-home notes, but don't spend too much time on it). Perhaps that time could be invested in discussing and using NGS visualization tools (IGV or Tablet, for example). The "no programming skills" scientists I know all have a healthy amount of skepticism for bioinformatics software and like to visually analyze any results coming out the tail end of an analysis. Sometimes these visualization tools are useful in identifying issues that are hard to read using less or vim!

score 3 · Answer 13 · 2010-11-07

3

14.7 years ago

jvijai ★ 1.2k

I attended the CSH Adv Sequencing Course and the structure looks similar. The hardest part I thought was for split read aligners and deciphering StrVars. Something to stay away from for now. Do you plan on explaining SM-hashing and BWA? 4hrs is tough; no? I have to agree with previous posters that in 4 hrs, your best bet is to get them to install a local copy of Galaxy and run through 2-3 workflows, save workflows, modify and run again on larger dataset as a homework assignment. ~JVJ

ADD COMMENT • link 14.7 years ago by jvijai ★ 1.2k

1

You can learn more about the Advanced Sequencing Technologies and Applications course here: http://meetings.cshl.edu/courses/c-seqtech12.shtml

ADD REPLY • link 13.2 years ago by Malachi Griffith 20k

0

ok what is SM-hashing?

ADD REPLY • link 14.7 years ago by Jeremy Leipzig 23k

0

Typo, I meant Smith-Waterman, hashing and Burrows-Wheeler. Sorry.

ADD REPLY • link 14.6 years ago by jvijai ★ 1.2k

score 2 · Answer 14 · 2011-09-25

2

13.8 years ago

Niek De Klein ★ 2.6k

Maybe include some normalisation of the NGS data. It can be done easily in Excel without programming skills, and it gives them an idea of the different biases, where the data comes from and how the final output is generated.
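The answer does not say which normalisation is meant. As one possible illustration, here is RPKM (reads per kilobase of transcript per million mapped reads), chosen for this sketch because it is a single formula that works equally well in a spreadsheet cell. The read counts and gene lengths are made up.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Two hypothetical genes with the same raw count but different lengths
total = 10_000_000
short_gene = rpkm(500, 2000, total)  # 2 kb gene
long_gene = rpkm(500, 4000, total)   # 4 kb gene
print(short_gene, long_gene)
```

The two genes have identical raw counts, but the longer one gets half the normalised value, which is a quick way to show students the gene-length bias in raw read counts.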

ADD COMMENT • link 13.8 years ago by Niek De Klein ★ 2.6k

score 2 · Answer 15 · 2013-07-23

2

12.0 years ago

rob234king ▴ 610

Again, maybe the thread's a bit old now, but hopefully this may be of some use. I’ve put together a tutorial website to share PDF bioinformatics tutorials. There are no tutorials online at present.

This website was created to share bioinformatics tutorials and create a dynamic learning environment that does not become dated by allowing the user community to upload and review tutorials, PDF contributions welcome. I would be interested to get some feedback.

http://elvis.ccc.cranfield.ac.uk/CUBELP/faces/login-page.xhtml

ADD COMMENT • link 11.9 years ago by rob234king ▴ 610

0

In this other question, "A: New free community tutorial website for bioinformatics learning", you mention that you removed all the tutorials from your site. Should you delete this answer redirecting to it?

ADD REPLY • link 12.0 years ago by Eric Normandeau 11k

0

Yes, thanks. Corrected now.

ADD REPLY • link 11.9 years ago by rob234king ▴ 610

score 1 · Answer 16 · 2013-07-26

EBI offers a number of (in person and online) courses for NGS analysis (one is currently underway as I type!).

see:

26-27 July 2013 http://www.ebi.ac.uk/training/course/introduction-next-generation-sequencing-cambridge-uk
14-17 October 2013 http://www.ebi.ac.uk/training/course/next-generation-sequencing-workshop-0 (registration deadline - 16th August)
Online http://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course