I am preparing a course on NGS: any suggestions?
16
81
12.3 years ago

Hi all, I am preparing a course on NGS: there will be seven students for 4 hours and I want them to play with some NGS data. No programming skills are required.

Here is what I plan to do:

• short intro to NGS
• structure of a (small) FASTQ file
• map it with BWA on public Galaxy http://main.g2.bx.psu.edu/
• index the genome and map the fastqs with MAQ
• index the genome and map the fastqs with BWA
• structure of a SAM file
• GATK recalibration (?)
• call the SNPs with samtools pileup and generate a VCF
• explore a BAM file with samtools tview
• find the rs## at UCSC (table browser or mysql )
• predict the consequences of a set of SNPs with polyphen2 (btw is there a way to generate a random fastq file with a set of 'forced' mutations ?)

my other ideas:

• running something in the cloud: do you know if there is a way to run something for free on Amazon ? what kind of analysis could I run ?
• storing something (the VCF ?) in a database (mysql ? sqlite3 ?) and using rails to display the data
• generating the tool for using a webservice (SOAP/REST...): what service could I use for this course ?

Any other suggestion ? What would you like to see during this course ?

Thanks,

Pierre

EDIT: the course should give them the opportunity to see what the work of someone working with NGS looks like and to get some experience with real data. I don't know their skill level but, AFAIK, they are supposed to have some programming courses later.

My only experience is the analysis of "exome capture" data = SNP.

Update: I posted my slides on slideshare: http://www.slideshare.net/lindenb/20101210-ngscourse

next-gen sequencing galaxy • 15k views
5

would you accept attendees? ;)

2

You could consider applying to AWS for an education credit http://aws.amazon.com/education ... no guarantees, but it might help cover AWS costs for the workshop.

0

same question as Jorge

0

@jorge @Fred, unfortunately I'm far from being an expert with the NGS data :-)

0

@Pierre: If you can't accept attendees, you may stream this class online :).

0

@Khader , do you speak French ? ;-)

0

Course is in French ! So if I attend the course, I can improve my French as well. How about the course materials, slides etc ?

0

@Pierre and @Kader : I am starting to play with the youtube API so I would be happy to display the course on Bioinformatics.fr ;-)

0

That's cool Fred. So Pierre, you will have enough audience via Bioinformatics.fr. Now the question is will you stream (preferably, an English) version of your course or not ?

30
12.3 years ago

Titus Brown at Michigan State University has run a course on Analyzing Next Generation Sequencing Data and as the link shows he has built an amazing resource around it.

His tutorials might give you a good sense on what topics to include and what level of detail may be appropriate.

8

and of course drop me a note if you discover bugs, problems, etc. And re-use the material as much as you want -- CC-BY-SA.

0

Istvan, you're my hero

0

Well, the credit should go to Titus - I just happened to know about it.

20
12.3 years ago
lh3 33k
1. Drop maq as it is no longer as widely used as before.
2. Choose between Galaxy and command line, depending on which suits them best. It should not be hard to find the consensus of 7 students. The same goes for cloud computing.
3. I do not work with raw data now. I would guess Illumina base quality should be better than before. In that case, I am not sure if recalibration is absolutely necessary.
4. Introduce IGV instead of samtools' tview. Although tview is useful in a few scenarios, IGV is in general more powerful and user friendly. IGV works with VCF. No need to set up database or services.
5. I know SQL well, but except for setting up a serious web server, I never use it. SQL is overkill here. Those graphical viewers and the UCSC custom track are much more convenient.
6. Mention Picard and GATK, which are both great packages.
7. I do not know how seriously duplicates affect results, but you should mention this as a potential concern.
8. As others suggested, it would be good to introduce ChIP/RNA-seq and the discovery of structural variations even if these are not the main purposes.

It is important to let students play with real or simulated data.

EDIT: a further comment:

I used to give a two-hour course on variant discovery. I gave each attendee a tar-ball which includes a bacterial genome (S. suis), a variant/read simulator (wgsim), a mapper (bwa), a SNP caller (samtools) and a few scripts. It is only a couple of MB in size, suitable for email. With these, one can do simulation, mapping, SNP calling, visualization and evaluation, nearly the entire pipeline. I have lost the tar-ball, though.
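
A rough idea of what the simulation step of such a pipeline does, as a toy Python sketch. This is not wgsim and makes no attempt to model sequencing errors or paired ends. The reference string, the function names `mutate` and `simulate_fastq`, and the constant quality character are all made up for illustration. It plants "forced" substitutions in a reference and samples error-free reads from the mutated sequence in FASTQ format.

```python
import random

def mutate(reference, forced_snps):
    """Return a copy of `reference` with {0-based position: base} substitutions applied."""
    seq = list(reference)
    for pos, base in forced_snps.items():
        seq[pos] = base
    return "".join(seq)

def simulate_fastq(genome, n_reads, read_len, seed=0, qual_char="I"):
    """Yield FASTQ records for error-free reads sampled uniformly from `genome`."""
    rng = random.Random(seed)  # fixed seed so the toy data set is reproducible
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        # Four lines per record: @name, sequence, '+', per-base qualities.
        yield f"@read{i}_pos{start + 1}\n{read}\n+\n{qual_char * read_len}\n"

# Hypothetical 33 bp "genome"; a real exercise would use a small bacterial genome.
reference = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC"
sample = mutate(reference, {10: "T"})  # force a G>T SNP at 1-based position 11
records = list(simulate_fastq(sample, n_reads=5, read_len=12))
print("".join(records), end="")
```

Because the planted positions are known, students can check afterwards whether the SNP caller recovered them, which is the "evaluation" part of the pipeline.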

0

UCSC custom track is another nice idea. Thanks

0

There are also a number of existing tracks--especially in the ENCODE data--that show NGS data types. A survey of a few of those might help people to grasp some of the aspects and challenges. But I think we've already gone way over 4 hours....

9
12.3 years ago

I think this could be a bit too much for 4 hours, but it depends on the skill of the students. If I were doing a hands-on course, which this seems to be, it might be OK to use only one alignment tool and then focus more on the possible downstream analyses.

Some points to consider:

• What is the aim of the course, can you give a more descriptive title than playing with NGS, what should they take home?
• What is the skill level of your students?

• Which applications do you want to present, e.g. re-sequencing, RNA-seq, ChIP-seq, SNP calling, copy-number variation? I guess you could at least mention all these approaches and provide a core toolset that is useful in all or most of applications. Or do you want to focus on SNPs?

• I wouldn't do recalibration, it's maybe too specific.
• Samtools is definitely a good toolkit
• alternatively, I like the Bioconductor short-read tools
• presenting the various file-formats (FASTQ, SAM, BAM) is definitely interesting
• do you want to mention filtering of reads and alignments based on qualities?

It's all your decision and I think it depends mostly on the first 3 points.
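
To make the file-format and filtering bullets a bit more concrete, here is a small Python sketch that splits SAM alignment lines into their mandatory columns and filters on mapping quality. The three alignment lines are invented, the names `SamRecord`, `parse_sam_line` and `keep` are hypothetical, and the MAPQ cutoff of 20 is just an assumed example value, not a recommendation.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str
    qual: str

def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line (not a '@' header line)."""
    f = line.rstrip("\n").split("\t")
    # Columns 7-9 (RNEXT, PNEXT, TLEN) describe the mate and are skipped here.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])

def keep(record, min_mapq=20):
    """Keep mapped reads (FLAG bit 0x4 unset) at or above a mapping-quality cutoff."""
    return not (record.flag & 0x4) and record.mapq >= min_mapq

# Invented example alignments: one confident, one ambiguous, one unmapped.
lines = [
    "r1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t0\tchr1\t150\t3\t8M\t*\t0\t0\tTTGACCGA\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]
records = [parse_sam_line(line) for line in lines]
kept = [r.qname for r in records if keep(r)]
print(kept)
```

In a real course the same filter would more likely be done with samtools itself; the point here is only to show what the columns mean.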

Regarding the EC2 cloud: we are using it via a grant (so someone is paying for us). I also remember a cloud application for RNA-seq (called Myrna). I don't know if you can use EC2 for free, but then, it's just like a remote-login computer, so I don't see the benefit for your course. And your students couldn't use it at home.

Then:

• storing something (the VCF ?) in a database (mysql ? sqlite3 ?) and using rails to display the data
• generating the tool for using a webservice (SOAP/REST...): what service could I use for this course ?

These sound more programming-heavy, and you said that wasn't a requirement, so I think for 4 hours an introduction to existing tools and toolkits would best serve most students who are new to NGS.

Edit: Oh, yeah, before I forget: prepare your computer lab in good time and have the software installed on the computers. If you let the students bring their own laptops, you will spend two hours installing stuff on Windows, figuring out where this damn terminal application was in Windows, or why they cannot compile the alignment tool using gcc....

0

Michael, I've never played with Bioconductor, what would you suggest to do with it ?

0

As far as I remember it is possible to quickly setup a webserver by just creating a SQL table and invoking RAILS. So I thought it would be awesome to visualize a simple SQL table (e.g: chrom,position,SNP, quality)
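
For what it's worth, the table part of that idea could look roughly like this. It is a sketch using Python's built-in sqlite3 module rather than Rails and MySQL (a swapped-in choice, purely so the example is self-contained). The table name `variants`, the three SNP rows and the quality cutoff of 30 are invented for illustration.

```python
import sqlite3

# Hypothetical calls; in practice these would be parsed from the VCF.
snps = [
    ("chr1", 10177, "A>C", 52.0),
    ("chr1", 10352, "T>A", 18.5),
    ("chr2", 45895, "G>A", 99.0),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing to install or clean up
conn.execute(
    "CREATE TABLE variants (chrom TEXT, position INTEGER, snp TEXT, quality REAL)"
)
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", snps)

# Example query: confident calls on chr1, ordered by position.
rows = conn.execute(
    "SELECT chrom, position, snp, quality FROM variants "
    "WHERE chrom = ? AND quality >= ? ORDER BY position",
    ("chr1", 30),
).fetchall()
print(rows)
```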

0

It might be difficult to teach BioConductor if one never used it, on the other hand, for you, maybe it's easier. I would show how to read in aligned reads, filter them, compute the coverage and detect peaks, do gene binning, and differential expression analysis. Again a bit too much maybe, but you can choose, most of it can also be done with other tools. A visualization option is nice, and if this rails SQL stuff is really so simple, why not? On the other hand something like IGV might be easier to reproduce by the students.

0

About Bioconductor, I was thinking about a simple example, an illustration of a simple script (for example, I found this: http://www.bioconductor.org/help/workflows/high-throughput-sequencing/)

0

Something like it, but that example workflow is so boring. Maybe look at http://www.bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf, the documentation of the short read package. On the other hand this might be better in a separate course.

8
12.3 years ago
brentp 24k

I agree with @Michael, that's a lot for 4 hours! I think that unless they are comfortable at the command line, it's a good idea to stick with galaxy. Use BFast or bowtie instead of bwa as those seem to be more current. For 4 hours, I think a reasonable workflow would be:

• get a fastq from SRA (or provided by you)
• look at the quality distribution using fastqc or fastx-toolkit (or whatever galaxy supports)
• do some read trimming/filtering and look at quality again
• map the reads to a reference
• use samtools to sort, index
• look at the alignment graphically: either igv, samtools tview, or galaxy

then if there's time, you could do something with Picard like MarkDuplicates
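
To show what the quality and trimming steps in that workflow actually compute, here is a minimal Python sketch. It is not how fastqc or fastx-toolkit are implemented. It assumes Phred+33 quality encoding (older Illumina files used an offset of 64), a hypothetical cutoff of Q20, and two invented reads. The function names are made up for the example.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(c) - offset for c in qual]

def trim_3prime(seq, qual, min_q=20):
    """Cut bases off the 3' end until a base with score >= min_q is reached."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

def read_fastq(lines):
    """Yield (name, seq, qual) from an iterable of FASTQ lines, four lines per record."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip(), qual.strip()

# Two invented reads: 'I' encodes Q40, '#' encodes Q2.
fastq = """@read1
ACGTACGTAC
+
IIIIIIII##
@read2
TTGCA
+
#####
""".splitlines()

for name, seq, qual in read_fastq(fastq):
    scores = phred_scores(qual)
    trimmed_seq, _ = trim_3prime(seq, qual)
    mean_q = sum(scores) / len(scores)
    print(name, f"mean Q = {mean_q:.1f}", f"kept {len(trimmed_seq)}/{len(seq)} bases")
```

Reads that end up empty or very short after trimming (like read2 here) would normally be dropped before mapping.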

5

The Bfast author recommends using BWA for Illumina mapping. Bowtie is fine for ChIP/RNA-seq, but it is not a good choice for SNP calling as it does not do gapped alignment. BWA has possibly mapped most of the Illumina data in the world, because 1) the 1000 Genomes project is using BWA; 2) Sanger is using BWA exclusively for mammalian resequencing projects; 3) Broad, another major sequencing center, is using bwa exclusively; 4) so far as I know, WashU and UWash are also using bwa for resequencing. These major sequencing centers have all carefully evaluated popular mappers. There are reasons.

1

I would guess bwa arrived right at the time when people needed a mapper balanced between openness, speed, memory, features, support and, in particular, accuracy. Now bwa is not the only one that achieves the balance, but it is arguably the first.

0

Most of the large resequencing projects at Sanger are also based exclusively on bwa alignment.

0

good to know. i will look into bwa more carefully. do you know the reasons you mention in your last sentence?

7
12.2 years ago

In my view, a NextGen Sequencing course should have three main data types:

1. Discovery of variants in human genomes, possibly linking or associating with a disease trait (phenotype). An example of this is the sequencing of "cancer" genomes from a patient's tumor and his/her normal genome (skin, e.g.).

2. RNAseq - as a means to assess gene expression. Read depth correlates with gene expression levels and detection of variants. This is especially cool when comparing to the genome and one sees imbalance in heterozygotes: The A allele has 78% of the reads while the G allele has 22% of the RNAseq reads. Hmmm.

3. Environmental genomics samples. What kind of biological diversity do you see in the human gut, in a hot spring, in a brown field site (contaminated site), in the ocean, etc. etc.

These are the three areas where I see a lot of work being done in research labs and in companies. Make the course as relevant to real-world situations as you can. Your students will thank you for it!
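
The allelic imbalance in point 2 lends itself to a small worked example. The sketch below asks how surprising a 78%/22% split would be if both alleles were expressed equally, using an exact two-sided binomial test written in plain Python. The read depth of 100 is an assumption (the percentages alone do not give the depth), and a simple binomial model ignores things like reference-allele mapping bias, so treat the number as optimistic.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(x for x in pmf if x <= pmf[k] * (1 + 1e-9)))

depth = 100    # assumed read depth at the heterozygous site
a_reads = 78   # 78% of reads carry the A allele, 22% the G allele
p_value = binom_two_sided_p(a_reads, depth)
print(f"A fraction = {a_reads / depth:.2f}, two-sided p = {p_value:.2e}")
```

At a much lower depth the same 78%/22% split would be far less convincing, which is a nice way to introduce why coverage matters.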

6
12.3 years ago
Mitch Skinner ▴ 660

Amazon does have a free "usage tier", where someone who is new to EC2 can run a "micro" instance free for a year, with 10 gigabytes of storage and a certain amount of IO and external internet transfer. CPU and RAM on micro instances are pretty sharply limited, though; RAM is ~600MB, and CPU is only allowed to be fully used in short bursts (3 minutes if I recall correctly, after which your instance gets throttled).

So, if your data size is small enough, and your students don't already have AWS accounts, then you might be able to get away with it.

If you need more CPU or RAM, though, then it's not too expensive to use EC2 if you only run instances for short periods of time ($0.095 to $2.28 per hour in the EU region, depending on how much CPU and RAM you need). If all of the tools can be used from the command line, or over the web, or with X forwarding, then I think it's a good choice; Titus Brown's course materials have a lot of detail about how it could be done.

If you use an instance with an EBS root volume, then you can set up the disk image ahead of time with all the tools installed. The students could each start an instance based on that. And then, at the end of the course, they could shut down their instance (and not have to pay to keep it running), while still being able to spin it up again later to use or to refer back to. While the instance isn't running, the only cost would be for the EBS storage, which is $0.10 per gigabyte per month.
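
As a back-of-the-envelope check, the arithmetic for a 7-student, 4-hour session can be sketched as follows. It uses the hourly and EBS prices quoted in this answer (which will have changed since), and assumes one instance per student, a 10 GB EBS volume each, and one month of storage after the course. Those last three are my assumptions, and `workshop_cost` is a made-up helper name.

```python
def workshop_cost(students, hours, hourly_rate, ebs_gb=10, ebs_rate_gb_month=0.10, months=1):
    """Estimated cost in USD: running instances plus EBS storage kept for `months`."""
    compute = students * hours * hourly_rate
    storage = students * ebs_gb * ebs_rate_gb_month * months
    return compute + storage

# 7 students for 4 hours, at the cheapest and most expensive hourly rates quoted above.
low = workshop_cost(7, 4, 0.095)
high = workshop_cost(7, 4, 2.28)
print(f"${low:.2f} to ${high:.2f}")
```

Under these assumptions the total comes to roughly $10 at the low end and about $71 at the high end, most of the low-end figure being the month of storage rather than the compute time.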

So I'd expect that you could use EC2; it might cost something, but I think if it's done right then the cost would be quite low.

0

nice point with setting up a unified image containing all tools. It adds a bit more complexity though, you will have to install the cloud tool on the local computers, fire up the instances and so on. It might divert from the main focus a bit, such that in the end it's a bit more like a cloud course.

0

The challenge with the micros is that they are not good for computing, but rather for serving web requests. I'd recommend applying for education credits via the link I posted above. Even without them, always a good idea for coursework.

0

@Michael: I usually just use the web console to manage my instances; I installed the EC2 command-line tools for launch/shutdown/etc., but I don't use them very often. If I had to manage instances programmatically, then it would be different, but for a course like this I think the web tool is fine. The main thing you need on the local machines is SSH.

@mndoci: I agree that micro instances are probably not appropriate; I didn't know about the education credits, thanks for the link.

6
10.8 years ago

You might also explore the past lectures from the Canadian Bioinformatics Workshop series (scroll down to Past Courses at bottom). They have complete slide sets (pdf|ppt: CC-BY-SA) and even videos of the lectures. Topics in 2011 included "Informatics on High Throughput Sequencing Data" as well as "Exploratory Data Analysis and Essential Statistics using R", "Pathway and Network Analysis on – omics Data", "Informatics and Statistics for Metabolomics", "Microarray Data Analysis", and "Bioinformatics for Cancer Genomics".

4
12.3 years ago

I recently saw this post by Jeff: "Servers for Nothing, Bits for Free". I think you could get free access to the AWS cloud using a non-paid Linux AMI.

Disclaimer: I haven't tried this yet.

4
12.3 years ago
Thaman ★ 3.3k

If it's just a 4-hour class, then I think the OpenHelix tutorial on Galaxy is enough. It covers uploading, preparing, filtering and analyzing data on different servers.

http://www.openhelix.com//cgi/tutorialInfo.cgi?id=82

4
12.2 years ago

The EC2 thing has definite pros and cons. We had a fellow at work take the MSU course and he said everything ran very smoothly at the course, but when he got home he couldn't replicate the setup because he didn't know how to do things like alter his PATH, or interpret errors related to missing dependencies.

If you go the opposite route and force students to install everything and encounter real problems, they will resent you during the course but will benefit in the long run.

Last month I taught a session in NGS for CSHL's Programming for Biology. These slides might be helpful to you: http://gorgonzola.cshl.edu/pfb/2010/LectureNotes/ngs2/ngs2.pdf

0

Just a quick comment in defense of the MSU course here -- that's what we did have them do :). But there is a limit to what you can effectively teach. This year we gave them pre-packaged AMIs. We'll see how that shakes out over time.

4
10.7 years ago
Bioinfosm ▴ 620

Too late for Pierre's question, but I found this presentation by John McPherson really extensive and informative. The first half is almost all intro to NGS for first timers:

http://bioinformatics.ca/workshops/2011/course-content

Module 1: Introduction to cancer genomics (Faculty: John McPherson)

PDF | PPT | MP4 (VIDEO)

2

never too late ;-) - thanks for pointing out this great resource

3
12.3 years ago

I would agree with the suggestion to use only one alignment tool (maybe discuss another briefly and give usage examples in some take-home notes, but don't spend too much time on it). Perhaps that time could be invested in discussing and using NGS visualization tools (IGV or Tablet, for example). The "no programming skills" scientists I know all have a healthy amount of skepticism for bioinformatics software and like to visually analyze any results coming out the tail end of an analysis. Sometimes these visualization tools are useful in identifying issues that are hard to read using less or vim!

3
12.3 years ago
jvijai ★ 1.2k

I attended the CSH Adv Sequencing Course and the structure looks similar. The hardest part I thought was for split read aligners and deciphering StrVars. Something to stay away from for now. Do you plan on explaining SM-hashing and BWA? 4hrs is tough; no? I have to agree with previous posters that in 4 hrs, your best bet is to get them to install a local copy of Galaxy and run through 2-3 workflows, save workflows, modify and run again on larger dataset as a homework assignment. ~JVJ

0

ok what is SM-hashing?

0

Typo, I meant Smith-Waterman, hashing and Burrows-Wheeler. Sorry.

2
11.4 years ago
Niek De Klein ★ 2.6k

Maybe some normalisation of the NGS data. It can be done easily with Excel without programming skills, and gives them an idea of the different biases, where the data comes from and how the final output is generated.

2
9.5 years ago
rob234king ▴ 610

Again, maybe the thread's a bit old now, but hopefully this may be of some use. I've put together a tutorial website to share PDF bioinformatics tutorials. There are no tutorials online at present.

This website was created to share bioinformatics tutorials and create a dynamic learning environment that does not become dated by allowing the user community to upload and review tutorials, PDF contributions welcome. I would be interested to get some feedback.

0

In this other question, "A: New free community tutorial website for bioinformatics learning", you mention that you removed all the tutorials from your site. Should you delete this answer redirecting to it?

0

Yes, thanks. Corrected now.

1
9.5 years ago
sarahhunter ▴ 600

EBI offers a number of (in person and online) courses for NGS analysis (one is currently underway as I type!).

see:
