Question

How to prepare NGS data before running further analysis

0

6.8 years ago

nusrat.bot • 0

Hello Everybody,

I am trying to process my NGS data before further analysis, but there is a problem with the data that I do not understand. I downloaded public paired-end Illumina HiSeq 2000 data in FASTQ format. In the FASTQ directory I have two files, DRR003655_1.fastq.bz2 and DRR003655_2.fastq.bz2.

I assume that DRR003655_1.fastq.bz2 holds the forward reads and DRR003655_2.fastq.bz2 the reverse reads of the Illumina paired-end data. After quality control with FastQC, both files looked good: the reads show no adapter contamination and all reads have the same length. I therefore did not apply any further processing such as adapter removal or trimming with Trimmomatic.

I converted both files directly to FASTA and uploaded them to Galaxy for further analysis.

My problem arises when I try to interlace the two FASTA files with the Galaxy tool FASTA interlacer, selecting DRR003655_1.fasta as the left-hand mates and DRR003655_2.fasta as the right-hand mates. When I execute the tool, Galaxy reports an error in my data.

It warns that the program could not find the paired mates and that there is a problem in my data. I checked the read names: in DRR003655_1.fastq.bz2 the first read is named DRR003655.1 FCD0RCJACXX:5:1101:1180:2119, and in DRR003655_2.fastq.bz2 the first read is also named DRR003655.1 FCD0RCJACXX:5:1101:1180:2119.
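Comparing only the first read name does not show whether the two files stay in sync all the way through. The following is a minimal sketch of a fuller check, assuming plain 4-line FASTQ records (no wrapped sequence lines) and bzip2 compression as in the downloaded files. The function names `read_names`, `mate_key` and `first_mismatch` are made up for this illustration. Identical names in both files, as reported above, are consistent with mates being paired by position; some tools instead expect a /1 and /2 suffix, which `mate_key` tolerates.

```python
import bz2
from itertools import zip_longest


def read_names(path):
    """Yield the read identifier (first word of each header) from a .fastq.bz2 file.

    Assumes strict 4-line FASTQ records; stops at the first blank header line.
    """
    with bz2.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                return
            # Skip the sequence, '+' and quality lines of this record.
            handle.readline()
            handle.readline()
            handle.readline()
            yield header[1:].split()[0]


def mate_key(name):
    """Strip a trailing /1 or /2 so both mates of a pair share one key."""
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name


def first_mismatch(path1, path2):
    """Return (record_index, name1, name2) for the first record that does not
    pair up, or None if every record in both files has a matching mate."""
    pairs = zip_longest(read_names(path1), read_names(path2))
    for index, (name1, name2) in enumerate(pairs):
        if name1 is None or name2 is None:
            return index, name1, name2  # one file is longer than the other
        if mate_key(name1) != mate_key(name2):
            return index, name1, name2
    return None
```

Calling `first_mismatch("DRR003655_1.fastq.bz2", "DRR003655_2.fastq.bz2")` would return `None` if the files are properly paired, or the position and names of the first record where they diverge.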

I then tried the rename sequences tool on both FASTA files and interlaced them again, but the same problem appeared. Interestingly, when I used the FASTA joiner tool to join the two files, the program gave me a significant number of joined reads along with some single reads. I really don't understand what is going on with my reads. I should mention that I am quite new to bioinformatics and am just trying to learn some basic tools using Galaxy.

Can anyone tell me what the problem is and how I can solve it? My ultimate goal is simply to interlace the paired-end reads, then use sequence sampling to reduce the number of reads and do some further analysis.

All comments and help are highly appreciated :))

next-gen • 2.2k views

ADD COMMENT • link updated 6.8 years ago by WouterDeCoster 47k • written 6.8 years ago by nusrat.bot • 0

2

"Can anyone tell me what the problem is and how I can solve it?"

The first problem is that you converted your fastq files to fasta format. This is rarely required, and you lost the quality scores for the bases in the process. Analyze the data as fastq where possible.
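To make the information loss concrete, here is a small sketch of what a fastq-to-fasta conversion does to a single record. The read name is the one quoted in the question; the sequence and quality strings are invented for the example, and `fastq_to_fasta` is a hypothetical helper, not the converter that was actually used.

```python
def fastq_to_fasta(record):
    """Convert one 4-line FASTQ record (a string) to a 2-line FASTA record.

    The '+' separator and the quality line are simply dropped.
    """
    header, sequence, _plus, _quality = record.strip().split("\n")
    return ">" + header[1:] + "\n" + sequence + "\n"


# Invented sequence and quality values, for illustration only.
record = "@DRR003655.1 FCD0RCJACXX:5:1101:1180:2119\nACGTN\n+\nIIII#\n"
fasta = fastq_to_fasta(record)
print(fasta)
```

The per-base quality string (`IIII#` here) has no place in the FASTA output, so any later step that needs quality values, such as quality trimming, can no longer be applied.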

"My ultimate goal is simply to interlace the paired-end reads, then use sequence sampling to reduce the number of reads and do some further analysis."

The following assumes that you are able to use the command line (on any OS, with Java available). While it is not completely clear what you are trying to do eventually, you can use the reformat.sh program from the BBMap suite to interleave your paired-end reads like this: reformat.sh in1=DRR003655_1.fastq.bz2 in2=DRR003655_2.fastq.bz2 out=DRR003655_int.fastq.bz2 verifypaired=t (I think the verifypaired flag should test whether the reads are in the proper order in your two files). You may be able to sample reads in the same step (take a look at the reformat.sh help for the sampling parameters).
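For readers without BBMap at hand, the idea of sampling pairs while interleaving can be sketched in plain Python. This is only an illustration of the principle, not a reimplementation of reformat.sh: it assumes 4-line FASTQ records, bzip2-compressed inputs, and mates stored in the same order in both files. The names `fastq_records` and `sample_pairs` and the `fraction` and `seed` parameters are made up for this sketch.

```python
import bz2
import random
from itertools import islice


def fastq_records(handle):
    """Yield 4-line FASTQ records (lists of lines) from an open text handle.

    A trailing partial record (fewer than 4 lines) is ignored.
    """
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        # Normalise line endings so records can be concatenated safely.
        yield [line.rstrip("\r\n") + "\n" for line in record]


def sample_pairs(in1, in2, out, fraction, seed=0):
    """Write a random subset of read pairs, interleaved, to a .fastq.bz2 file.

    One random draw is made per pair, so both mates are always kept or
    dropped together. Returns the number of pairs written. Note that zip()
    stops silently at the end of the shorter input.
    """
    rng = random.Random(seed)
    kept = 0
    with bz2.open(in1, "rt") as h1, bz2.open(in2, "rt") as h2, \
            bz2.open(out, "wt") as dest:
        for rec1, rec2 in zip(fastq_records(h1), fastq_records(h2)):
            if rng.random() < fraction:
                dest.writelines(rec1)
                dest.writelines(rec2)
                kept += 1
    return kept
```

The design point is the single random draw per pair: sampling each file independently would break the pairing, which is exactly the kind of error the interlacer complained about.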

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

"The first problem is that you converted your fastq files to fasta format. This is rarely required, and you lost the quality scores for the bases in the process. Analyze the data as fastq where possible."

Yes, I also thought that converting my reads changed something, but I simply wanted to reduce the size of my files. There is a problem with my FTP file transfer program: I could not connect with it. As far as I remember, I also tried a FASTQ file with fewer reads, after extracting a portion of my read data, but it showed the same problem. This time I may try the SRA Toolkit to extract the reads.

I will try the BBMap suite today. Could you please tell me whether it is possible to use it on Windows? I don't have any experience with the BBMap suite. Is it difficult to install?

Many thanks for your reply and comment :))

ADD REPLY • link updated 6.8 years ago by WouterDeCoster 47k • written 6.8 years ago by nusrat.bot • 0

0

"Could you please tell me whether it is possible to use it on Windows? I don't have any experience with the BBMap suite. Is it difficult to install?"

You can use BBMap on Windows as long as you install Java. No installation is needed for BBMap itself: download the software, uncompress it, and use it. Take a look at this SEQanswers thread for lots of help with BBMap (execution on Windows requires a slightly different syntax from the one I included above). Ask if you run into any issues.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

What's your actual end goal? There's rarely a reason to convert fastq->fasta, for example.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Dear Devon,

My end goal is to run the data through the RepeatExplorer pipeline :))

ADD REPLY • link 6.8 years ago by nusrat.bot • 0

0

Does the Galaxy server you're using offer Jupyter as an interactive environment? The error you're getting is caused by the read names in the two files, which for some unknown reason the interlacer tool doesn't seem to handle properly. The easiest solution, then, is to not use that tool and use something else instead. It's pretty trivial to write a read interlacer, so if your Galaxy instance supports Jupyter as an interactive environment I can hack together some code.
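As an illustration of how small such an interlacer can be, here is a minimal sketch. It is not the code offered in this reply, which was never posted in the thread. It is written for FASTA because that is what was uploaded to Galaxy; the same pattern applies to 4-line FASTQ records. The names `fasta_records`, `read_id` and `interlace` are hypothetical, and appending /1 and /2 to the read names is an assumed convention that some downstream tools expect; remove the suffixes if yours does not.

```python
from itertools import zip_longest


def fasta_records(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines.

    Wrapped (multi-line) sequences are joined into one string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_id(header):
    """Return the first whitespace-delimited word of a header ('' if empty)."""
    words = header.split()
    return words[0] if words else ""


def interlace(left_lines, right_lines):
    """Yield FASTA text alternating left mate, right mate.

    Raises ValueError if the files differ in length or the names disagree.
    The /1 and /2 suffixes are an assumption, not a requirement of FASTA.
    """
    pairs = zip_longest(fasta_records(left_lines), fasta_records(right_lines))
    for index, (left, right) in enumerate(pairs, start=1):
        if left is None or right is None:
            raise ValueError(f"pair {index}: one file has more reads than the other")
        left_id, right_id = read_id(left[0]), read_id(right[0])
        if left_id != right_id:
            raise ValueError(f"pair {index}: {left_id!r} does not match {right_id!r}")
        yield f">{left_id}/1\n{left[1]}\n"
        yield f">{right_id}/2\n{right[1]}\n"


# Example use with files (the output file name is hypothetical):
# with open("DRR003655_1.fasta") as l, open("DRR003655_2.fasta") as r, \
#         open("DRR003655_interlaced.fasta", "w") as out:
#     out.writelines(interlace(l, r))
```

Because the function stops with an error at the first pair whose names disagree, it also reports where the two files fall out of sync instead of failing without detail.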

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Dear Devon,

I really don't know what environment it uses, because I am not as good a bioinformatician as you all are. I am also not sure which approach would solve the problem :((

ADD REPLY • link 6.8 years ago by nusrat.bot • 0