Question: Starting With Illumina Paired-End Reads Manipulation
3
8.8 years ago by
Kamila70
Kamila70 wrote:

Hello,

I am new to Illumina sequencing and not an advanced user of the programs required to analyse a large sequencing dataset. However, I have ~6 million reads and I need to "do" something with them to complete my PhD. I would be very grateful if someone could help me and give me some advice.

I have ~6 million 76-bp paired-end reads: ~3 million in read1 and ~3 million in read2. The first thing I did was check the quality of the reads. I ran FastQC on read1 and read2, and the quality report showed that the reads are of good quality, except for a high sequence duplication level (60%!). I tried to remove the duplicated sequences using the FASTX-collapse tool in Galaxy, but Galaxy changes the original read names and drops the /1 and /2 suffixes (indicating the paired ends) that will be needed later for assembly and for MEGAN.

Can anyone help me please?

Kamila

Edit, copied from your answer: OK, thank you all for your interest in my topic. Yes, it is true that I poorly understand what I am doing, but I am a molecular biologist and I don't have a degree in bioinformatics, statistics, or any computer-related field. I don't want to describe my situation with my supervisor here. I have two ways out of my situation: give up on my PhD, or do everything I can to finish.

Sorry Michael that I didn't give all of this information; I didn't know it was so important. Here are my answers:

* Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.

The DNA was isolated from bacteriophages isolated from a sputum sample of the hospital patient.

* Is the sample from a single organism, or is it a metagenome/transcriptome?

It is a metagenome; it will contain all phages/viruses present in that sample.

* What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?

Metagenomic, DNA.

* Protocols of nucleotide extraction

DNA was extracted using proteinaseK/CTAB protocol and amplified using MDA technique (this could be the reason why there are so many duplicates).

* Is there a reference genome to align the reads to?

My idea is that the reads could be aligned to a reference genome chosen on the basis of the BLAST results, e.g. if most reads hit Streptococcus phage Dp-1, it could be used as the reference genome.

* Or is it a de-novo assembly of the genomic sequence that is required?

De novo; I have already learned how to use the Velvet assembler.

Also, I apologise for my poor English.

ADD COMMENTlink modified 8.8 years ago by Mike10 • written 8.8 years ago by Kamila70
8

I pity you, really, because this gives us a disastrous impression of your supervision situation. "Do something with this random data I throw at you" doesn't sound like a good understanding of the field. On the other hand, aligning some reads with the help of this forum would not constitute a PhD. I suggest you reformulate your question by answering all the items in my answer below.

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k

What is your application? FastQC will report high sequence duplication for certain applications; this is not necessarily a problem.

ADD REPLYlink written 8.8 years ago by Daniel Swan13k

It is important to remove duplicates to obtain good-quality contigs; it will also reduce file sizes and the time needed for BLAST runs.

ADD REPLYlink written 8.8 years ago by Kamila70

This is really a question for your supervisor. In any event, without knowing the source of the reads, there's not much anyone can do to help you.

ADD REPLYlink written 8.8 years ago by User 308410

I wish my supervisor could help me with this..

ADD REPLYlink written 8.8 years ago by Kamila70

This should be a comment rather than an answer.

ADD REPLYlink written 8.8 years ago by Michael Schubert6.9k

It will take too long to BLAST even 3 million of your reads. You need to use a short-read aligner, such as BWA or Maq.

ADD REPLYlink written 8.8 years ago by User 308410

Uh, virus metagenomics is not really my field; I hope some experts are around. I will retag for now.

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k
3
8.8 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

We need to know the following info to help effectively:

  • Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.
  • Is the sample from a single organism, or is it a metagenome/transcriptome?
  • What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?
  • Protocols of nucleotide extraction
  • Is there a reference genome to align the reads to?
  • Or is it a de-novo assembly of the genomic sequence that is required?

You see, there are so many possibilities....

ADD COMMENTlink written 8.8 years ago by Michael Dondrup47k
2
8.8 years ago by
Markp40
Markp40 wrote:

This seems to be an appropriate response to your supervisor http://www.youtube.com/watch?v=Fl4L4M8m4d0

ADD COMMENTlink written 8.8 years ago by Markp40

+1 because I really had fun. Lots of my colleagues will feel understood elsewhere, even if it does not correspond to the policy of this site.

ADD REPLYlink written 8.8 years ago by toni2.1k

No, it doesn't, but I hope it helps to keep the mood up.

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k
2
8.8 years ago by
Kamila70
Kamila70 wrote:

I just want to say that I solved the problem with duplicates using a program called cd-hit-454. The program is very fast and doesn't change the read names.
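For anyone who wants to see what exact-duplicate removal looks like without renaming reads, here is a minimal Python sketch. It is not what cd-hit-454 or the Galaxy tool does internally; it simply assumes plain four-line FASTQ records, read1 and read2 files in the same order, and that a "duplicate" means an identical (read1, read2) sequence pair. The function names `read_fastq` and `dedupe_pairs` are made up for illustration. Headers, including the /1 and /2 suffixes, are written back unchanged.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an unexpected blank line)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def dedupe_pairs(r1_handle, r2_handle, out1, out2):
    """Keep the first occurrence of each (read1, read2) sequence pair.

    Assumes both inputs list the mates in the same order.
    Returns (pairs_kept, pairs_seen).
    """
    seen = set()
    kept = total = 0
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        total += 1
        key = (rec1[1], rec2[1])  # sequences only; names and qualities are ignored
        if key in seen:
            continue
        seen.add(key)
        # Records are written back exactly as read, so names keep /1 and /2.
        out1.write("\n".join(rec1) + "\n")
        out2.write("\n".join(rec2) + "\n")
        kept += 1
    return kept, total


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open file handles instead.
    r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n")
    r2 = io.StringIO("@a/2\nGGCC\n+\nIIII\n@b/2\nGGCC\n+\nIIII\n")
    out1, out2 = io.StringIO(), io.StringIO()
    print(dedupe_pairs(r1, r2, out1, out2))
    print(out1.getvalue(), end="")
```

Note that this keeps every distinct pair in memory, so for a few million pairs it may need a fair amount of RAM, and it only catches exact duplicates, not near-identical reads.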

ADD COMMENTlink written 8.8 years ago by Kamila70

That appears to be specific to 454-generated data, and you stated that you are using Illumina.

ADD REPLYlink written 6.6 years ago by xapple230
1
8.8 years ago by
Kamila70
Kamila70 wrote:

Why should I have a control? There are many metagenomics publications done without any controls, e.g. in "Metagenomic Analysis of Human Diarrhea: Viral Detection and Discovery" they analyse viruses in the faeces of hospital patients with diarrhoea and they don't analyse healthy individuals' faeces as a control.

Funny, but I think I just found the answer to my question in this article! They use this program to remove duplicate sequences: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

ADD COMMENTlink written 8.8 years ago by Kamila70

Well, if removing duplicates is the only thing required maybe, but I had the impression you were asking for more general advice on how to analyse your data?

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k

Yes Michael, I will be very grateful for any ideas.

ADD REPLYlink written 8.8 years ago by Kamila70
1
8.8 years ago by
Marina Manrique1.3k
Granada
Marina Manrique1.3k wrote:

Since it's a metagenomic project I would try first to characterize the viral populations you have in the sample.

One "simple" thing I would try is to BLAST all those reads against the NCBI nt database to annotate them. This kind of massive analysis can be achieved with cloud computing.

Once you have all your reads annotated with known sequences, you could study the distribution of the populations you have.
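As a rough illustration of that last step, the Python sketch below tallies the best hit per read from BLAST tabular output. It assumes the standard 12-column tabular format (the one produced by `-outfmt 6` in BLAST+, where column 1 is the query, column 2 the subject and column 12 the bit score). The function name `best_hit_counts` and the subject IDs in the demo are hypothetical. Treating the highest bit score as "the" annotation is a simplification, not an established method from this thread.

```python
import io
from collections import Counter


def best_hit_counts(blast_tab_handle):
    """Count, per subject sequence, how many reads have it as their best hit.

    Expects 12-column BLAST tabular lines; the best hit for a read is the
    one with the highest bit score (first one wins on ties).
    """
    best = {}  # query id -> (bit score, subject id)
    for line in blast_tab_handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return Counter(subject for _, subject in best.values())


if __name__ == "__main__":
    # Made-up hits for three reads against two hypothetical subjects.
    rows = [
        ("r1/1", "phage_A", "98.0", "76", "1", "0", "1", "76", "100", "175", "1e-30", "130"),
        ("r1/1", "phage_B", "90.0", "76", "7", "0", "1", "76", "5", "80", "1e-20", "100"),
        ("r2/1", "phage_A", "97.0", "76", "2", "0", "1", "76", "300", "375", "1e-28", "120"),
        ("r3/1", "phage_B", "95.0", "76", "4", "0", "1", "76", "10", "85", "1e-25", "110"),
    ]
    table = io.StringIO("".join("\t".join(r) + "\n" for r in rows))
    for subject, n_reads in best_hit_counts(table).most_common():
        print(subject, n_reads)
```

The subject with the most best hits is one possible candidate when looking for a reference genome to align against, though reads with no hit at all are not counted here and may be the more interesting part of a viral metagenome.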

If you don't feel capable of doing this kind of analysis, you could try to collaborate with someone. I'm pretty sure that some people would like to analyze your data.

Another kind of analysis would be to try to assemble all the reads and identify viral genomes. I don't know how difficult this can be (I don't know much about viral metagenomics), but a de novo assembly of metagenomes with Illumina reads sounds more complicated to me.

Searching quickly in PubMed, I found this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2919852/?tool=pubmed. Maybe it helps you.

ADD COMMENTlink written 8.8 years ago by Marina Manrique1.3k
1
8.8 years ago by
Mike10
Mike10 wrote:

Forrest Rohwer has done lots of work analysing phage metagenomic samples, including some in a healthcare setting with Cystic Fibrosis. There are likely to be plenty of analysis approaches that would be appropriate to your work in his publications. http://coralandphage.org/

Good luck ;)

ADD COMMENTlink written 8.8 years ago by Mike10
0
8.8 years ago by
Sarah20
Sarah20 wrote:

Uhhhh... you don't seem to have an experiment there, in that you have a test condition (phlegm from a sick guy) but you don't seem to have a control (phlegm from a healthy guy).

I agree that the best bet might be a survey approach, but on its own it might not have much chance of a high-level publication without a control, so don't spend more time on it than it is worth.

Having a bad supervisor prepares one for the real world better than being coddled.

ADD COMMENTlink written 8.8 years ago by Sarah20

Why should I have a control? There are many metagenomics publications without any controls, e.g. in "Metagenomic Analysis of Human Diarrhea: Viral Detection and Discovery" they analyse viruses in the faeces of hospital patients with diarrhoea and they don't use healthy individuals' faeces as a control.

Funny, but I think I just found the answer to my question in this article! They use this program to remove duplicate sequences: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

ADD REPLYlink written 8.8 years ago by Kamila70
0
8.8 years ago by
Kamila70
Kamila70 wrote:

Marina Manrique, thank you for your answer. With the help of a friend (who is not a scientist) I learned how to run BLAST on the university server and I have got preliminary results. I also learned how to use the Velvet assembler. Basically, 'everything that seems impossible today becomes possible tomorrow'. Now I want to do everything exactly as it should be done, in order to make it publishable and complete my PhD. My supervisor doesn't want to collaborate with anyone; he says that "this is easy". Instead of getting to the end of my PhD, I find myself fighting with my supervisor and seeking help on a forum. All I dream of is to finish and find a job in a group doing 'real' metagenomics. But without any publications I have no chance of that.

ADD COMMENTlink written 8.8 years ago by Kamila70

Good luck then. When you don't know how to go on (and your supervisor doesn't seem to help), look for papers related to your problem and check their materials and methods; sometimes they're useful :)

ADD REPLYlink written 8.8 years ago by Marina Manrique1.3k