Question: Is the content of FASTQ files truly random?
yonis wrote (4 months ago):

Hi,

I'm building a system to generate fastq files as they are being output from HiSeq 3000 and HiSeq 4000. This will let us test our internal systems against some gold standards.

As we work with microbiome data, a standard sample may contain around 1500 different species. Is there a bias in the output regarding how 'close' the records from each assembly appear to one another, or is it truly random?

If record 1,1 is from assembly 2913, what are the odds that record 1,2 is of that assembly as well (assuming 1500 species with the same strain length)?

I couldn't find any paper on that, so any help from experience would be great.

Thanks

Yoni

Tags: hiseq, fastq, illumina
written 4 months ago by yonis, modified 4 months ago by Istvan Albert

Assemblies do not come in FASTQ format; you probably have your terminology mixed up somehow.

written 4 months ago by WouterDeCoster

You might find the Flux-Simulator software useful, as it handles simulation of sequencing data: http://confluence.sammeth.net/display/SIM/Demo+-+Create+Fastq+file . It's a bit involved to get started with, but it'll save you a lot of work in the end.

If you're concerned about the distribution in the FASTQ file (I'm not sure why that would matter, since it wouldn't/shouldn't change how well the reads align), you could download a few datasets from GEO/SRA, align the reads, see where each read maps, and investigate the distribution of mapping locations and mapping qualities to see if there are any patterns. Again, I don't think any of it would matter for any practical purposes.
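As a rough sketch of that check: once each read has been assigned to a source genome (the alignment step is not shown here), one could compare how often neighbouring records share a source with what independent sampling would predict. The function name and the `labels` list are hypothetical; `labels` is assumed to hold one source label per read, in file order.

```python
from collections import Counter


def adjacent_match_summary(labels):
    """Return (observed, expected) fractions of adjacent same-source pairs.

    labels: one source label per read, in FASTQ file order (hypothetical
    input, e.g. derived from an alignment that is not shown here).
    expected assumes reads are drawn independently with replacement.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two reads")
    # Fraction of neighbouring records that come from the same source.
    observed = sum(a == b for a, b in zip(labels, labels[1:])) / (n - 1)
    # Under independence, P(same source) is the sum of squared proportions.
    counts = Counter(labels)
    expected = sum((c / n) ** 2 for c in counts.values())
    return observed, expected


observed, expected = adjacent_match_summary(["A", "A", "B", "B"])
print(observed, expected)
```

On real data, an observed fraction well above the expected one would hint that reads from the same source cluster together in the file; a toy list this short only demonstrates the mechanics.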

written 4 months ago by manuel.belmadani

I'm building a tool, further down the pipeline, that should only receive a sample of a FASTQ file to produce certain analyses. It matters a lot.

Thank you for the link, but as far as I can tell it doesn't simulate all the artifacts from a sequencing machine.

written 4 months ago by yonis

all the artifacts from a sequencing machine.

And what would those be?

written 4 months ago by genomax
Istvan Albert wrote (4 months ago):

Reads from the different sources will be randomly distributed within a FASTQ file.

But remember that the randomness is governed by the relative abundances of the different DNA sources.

The odds of getting two reads from the same genome next to one another are analogous to drawing balls of different colors from a bag that contains each color in a different proportion. Since the number of elements is large but also unknown, the most appropriate approach would be to model it as sampling with replacement.

The order of the reads will follow the physical layout of the flow cell. Sometimes that matters, as some regions of the flow cell may produce better quality data.
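A minimal sketch of that bag-of-balls model, assuming reads are drawn independently with replacement: the chance that two neighbouring reads share a source is then the sum of the squared proportions, which for 1500 equally abundant sources is 1/1500. The function names here are illustrative only.

```python
import random


def same_source_probability(proportions):
    """P(two independent draws share a source) = sum of squared proportions."""
    total = sum(proportions)
    return sum((p / total) ** 2 for p in proportions)


def simulate_adjacent_matches(proportions, n_reads, seed=0):
    """Draw n_reads source labels with replacement and return the fraction
    of adjacent pairs that come from the same source."""
    rng = random.Random(seed)
    labels = rng.choices(range(len(proportions)), weights=proportions, k=n_reads)
    matches = sum(a == b for a, b in zip(labels, labels[1:]))
    return matches / (n_reads - 1)


equal = [1.0] * 1500  # 1500 sources with identical proportions
print(same_source_probability(equal))           # analytic value, 1/1500
print(simulate_adjacent_matches(equal, 200_000))  # empirical estimate
```

With unequal proportions the same formula applies, and the probability rises as a few sources come to dominate the sample.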

written 4 months ago by Istvan Albert

Sure, I should have added to the question that they all have the same relative abundance.

written 4 months ago by yonis

Having exactly the same abundance is uncommon, unless the samples are created artificially, but do note that even if you had the same number of DNA molecules for each organism, the size of each genome is most likely different. A longer genome will produce more reads. Thus the resulting reads will not be in equal proportion.

PS. I have now noticed that you are generating these as a simulation. The second part still applies: genomes will produce reads in proportion to their lengths.
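For a simulation, one way to turn this into read proportions is to weight each genome by copy number times genome length. This is a simplified sketch that ignores other effects such as GC or library-preparation bias; the function name and the example numbers are made up for illustration.

```python
def expected_read_fractions(copy_numbers, genome_lengths):
    """Expected fraction of reads per genome, assuming reads are produced
    in proportion to (number of genome copies) x (genome length)."""
    if len(copy_numbers) != len(genome_lengths):
        raise ValueError("inputs must have the same length")
    weights = [c * length for c, length in zip(copy_numbers, genome_lengths)]
    total = sum(weights)
    return [w / total for w in weights]


# Two organisms with equal copy numbers but genomes of 2 Mb and 6 Mb.
fractions = expected_read_fractions([1, 1], [2_000_000, 6_000_000])
print(fractions)  # [0.25, 0.75]
```

These fractions can then serve as the sampling proportions when drawing simulated reads with replacement.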

written 4 months ago by Istvan Albert, modified 4 months ago
genomax wrote (4 months ago):

I'm building a system to generate fastq files as they are being output from HiSeq 3000 and HiSeq 4000.

AND

If record 1,1 is from assembly 2913, what are the odds that record 1,2 is

What do both of those sentences mean? You are trying to simulate fastq data?

Note: FASTQ data as generated by a sequencer (raw sequence) is completely random until you do something to change that. If you are doing paired-end sequencing, then a matching pair of R1 and R2 reads represents sequence from the two ends of the DNA fragment being sequenced.

written 4 months ago by genomax, modified 4 months ago

Yes. I'm simulating a FASTQ file as it would be generated by a HiSeq 3000 or HiSeq 4000. I wanted to know about the randomness of the data. Thanks.

written 4 months ago by yonis