I am new to Illumina sequencing and I am not an advanced user of the programs required to analyse a large sequencing dataset. However, I have ~6 million reads and I need to "do" something with them to complete my PhD, so I would be very grateful if someone could help me and give me some advice.
I have ~6 million 76-bp paired-end reads: ~3 million in read1 and ~3 million in read2. The first thing I did was check the quality of the reads. I ran FastQC on read1 and read2, and the quality report showed that the reads are of good quality, except for a high sequence duplication level (60%!). I tried to remove the duplicated sequences using the Galaxy web tool FASTX-collapse; however, Galaxy changes the original names of the reads and drops the /1 and /2 suffixes (indicating the paired ends), which will be needed later by the assembly and MEGAN programs.
Can anyone help me please?
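To make the goal concrete, the kind of duplicate removal described above could look roughly like the following Python sketch. It assumes plain 4-line FASTQ records and that read1 and read2 list the pairs in the same order; the function names `read_fastq` and `dedup_pairs` are made up for illustration and are not part of FASTX or Galaxy. A pair is treated as a duplicate only when both mate sequences have been seen together before, and the original read names (including /1 and /2) are written back unchanged.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def dedup_pairs(r1_in, r2_in, r1_out, r2_out):
    """Keep the first occurrence of each (read1, read2) sequence pair.

    Headers are written exactly as read, so the /1 and /2 suffixes survive.
    Returns (pairs_kept, pairs_dropped).
    """
    seen = set()
    kept = dropped = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        key = (rec1[1], rec2[1])  # both mate sequences define a duplicate
        if key in seen:
            dropped += 1
            continue
        seen.add(key)
        kept += 1
        for rec, out in ((rec1, r1_out), (rec2, r2_out)):
            out.write(f"{rec[0]}\n{rec[1]}\n+\n{rec[2]}\n")
    return kept, dropped

# Hypothetical usage (file names are placeholders):
# with open("read1.fastq") as a, open("read2.fastq") as b, \
#      open("read1.dedup.fastq", "w") as oa, open("read2.dedup.fastq", "w") as ob:
#     print(dedup_pairs(a, b, oa, ob))
```

The set of seen pairs is held in memory, so memory use grows with the number of unique pairs. Only exact sequence matches are collapsed; reads that differ by a single sequencing error are kept as distinct.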
Edit, copied from your answer: OK, thank you all for your interest in my topic. Yes, it is true that I poorly understand what I am doing, but I am a molecular biologist and I don't have a degree in bioinformatics, statistics, or any computer-related field. I don't want to describe my situation with my supervisor here; I now have two ways out: give up on my PhD or do everything I can to finish.
Sorry, Michael, that I didn't give all of this information; I didn't know it was so important. Here are my answers:
* Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.
The DNA was isolated from bacteriophages obtained from a sputum sample of a hospital patient.
* Is it a single organism that the sample is coming from, or a metagenome/transcriptome?
It is a metagenome; it will contain all phages/viruses present in that sample.
* What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?
Metagenomic DNA.
* Protocols of nucleotide extraction
DNA was extracted using a proteinase K/CTAB protocol and amplified using the MDA technique (this could be the reason why there are so many duplicates).
* Is there a reference genome to align the reads to?
My idea is that the reads could be aligned to a reference genome chosen on the basis of the BLAST results, e.g. if most reads hit Streptococcus phage Dp-1, it could be used as the reference genome.
* Or is it a de-novo assembly of the genomic sequence that is required?
De novo; I have already learned how to use the Velvet assembler.
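The reference-selection idea in the answers above (use the genome that most reads hit) could be sketched as follows. This is only an illustration under assumptions: it expects BLAST tabular output (`-outfmt 6`, where column 1 is the query, column 2 the subject, column 11 the e-value and column 12 the bit score), and the function name `rank_references` and the e-value cutoff of 1e-5 are arbitrary choices made here, not anything prescribed by BLAST.

```python
from collections import Counter


def rank_references(blast_lines, max_evalue=1e-5):
    """Rank subject sequences by how many reads have them as their best hit.

    blast_lines: iterable of tab-separated BLAST outfmt 6 lines.
    max_evalue: assumed cutoff; hits with a larger e-value are ignored.
    Returns a list of (subject_id, read_count), most frequent first.
    """
    best = {}  # query id -> (bit score, subject id) of its best hit so far
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return Counter(subject for _, subject in best.values()).most_common()

# Hypothetical usage (file name is a placeholder):
# with open("reads_vs_phages.blast.tsv") as handle:
#     for subject, count in rank_references(handle)[:5]:
#         print(subject, count)
```

Each read is counted once, for its highest-scoring hit only, so a read that matches several related phages does not inflate every one of them. With a mixed phage community there may be no single dominant genome, in which case the ranking simply shows that.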
Also, I apologise for my poor English.