Substitute for human whole genome fastq data.
5.9 years ago
shuksi1984 ▴ 60

Is there a substitute for a human whole genome fastq file? Whole genome data is >100 GB, which requires a lot of RAM and disk space. The idea is to develop a pipeline for analyzing real whole genome data on our current, limited infrastructure. Once the pipeline is developed, we can expand the infrastructure.

next-gen sequencing • 1.7k views
ADD COMMENT

Hello shuksi1984,

Can you please elaborate a bit more on that? What do you mean by "get substitute"?

fin swimmer

ADD REPLY

Hello Finswimmer,

A whole genome fastq file is too large and requires heavy computational infrastructure, so I was looking for an alternative that can be used to develop a pipeline. Simply put, I am looking for a dataset (fastq file) that mimics a human whole genome dataset.

ADD REPLY
5.9 years ago
Benn 8.3k

You can randomly select a subset of reads from the original fastq file, see here for different approaches.
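One of those approaches, reservoir sampling, can be sketched in a few lines of Python. This is only an illustration, not necessarily what the linked answers use. It assumes plain (uncompressed) 4-line FASTQ records and two mate files with reads in the same order. The function names and the default seed are made up for this example.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield one FASTQ record (a list of 4 lines) at a time."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_handle, r2_handle, k, seed=42):
    """Reservoir-sample up to k read pairs, keeping mates together.

    Only k pairs are held in memory, however large the input is.
    """
    rng = random.Random(seed)
    reservoir = []
    # zip stops at the shorter file, so check read counts beforehand
    pairs = zip(read_fastq(r1_handle), read_fastq(r2_handle))
    for i, pair in enumerate(pairs):
        if i < k:
            reservoir.append(pair)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = pair
    return reservoir
```

Open both mate files with `open()` (or `gzip.open(..., "rt")` for compressed input), pass the handles in, and write the returned records back out with `writelines`. Because pairs are sampled together, R1 and R2 stay in sync.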

ADD COMMENT
5.9 years ago

Get a bam with reads mapped to the human genome, and pick a chromosome. Pull out only the reads aligning to that chromosome and the unmapped reads. Use that chromosome as your reference.

ADD COMMENT

Some chromosomes are smaller than others :-D

ADD REPLY

Thank you for your suggestion. Can you tell me where I can get a BAM mapped to the human genome?

Also, how do I pull out the reads aligned to that chromosome and the unmapped reads? Among the autosomes, I suppose chromosome 22 is the smallest, so it would be the sensible one to choose.

ADD REPLY

You can take a bam file from the 1000 Genomes Project, e.g. NA12878.

Use samtools view to create a bam file for only one chromosome. Afterwards you can use bedtools bamtofastq to create the fastq files. samtools can work on remote files, so there is no need to download the whole bam file first. The parameter -M, introduced in v1.8, can speed up the process a lot.
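As a rough sketch of that workflow, the Python helper below only builds the command lines and does not run them. It swaps in samtools fastq for the FASTQ step instead of bedtools, and it assumes a samtools 1.x install and an indexed, paired-end BAM. The function name and the output file naming are made up for this example.

```python
def extraction_commands(bam, region, prefix):
    """Return samtools argument lists for: slice region -> name-sort -> FASTQ."""
    region_bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.namesorted.bam"
    return [
        # Fetch only the requested region from an indexed local or remote BAM.
        ["samtools", "view", "-b", "-o", region_bam, bam, region],
        # Name-sort so mates sit next to each other before writing paired FASTQ.
        ["samtools", "sort", "-n", "-o", sorted_bam, region_bam],
        # -1/-2 receive the two mates; -0 and -s discard unpaired leftovers.
        ["samtools", "fastq",
         "-1", f"{prefix}_R1.fastq", "-2", f"{prefix}_R2.fastq",
         "-0", "/dev/null", "-s", "/dev/null", "-n", sorted_bam],
    ]
```

To run the steps, pass each list to `subprocess.run(cmd, check=True)` in order. Passing the arguments as a list avoids shell quoting problems with long FTP URLs.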

BTW: Chromosome 21 is smaller than 22.

fin swimmer

ADD REPLY

Thank you so much. Will try with this alternative.

ADD REPLY

I ran the following command:

samtools view -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam chr21 > chr21.bam

And my output is:

4.0K chr21.bam and 9.9M NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam.bai

I believe I have got the correct output.

Now, my next step is to convert this bam file to fastq using a tool such as SamToFastq (picard) or bedtools bamtofastq.

Other steps would remain same as whole genome pipeline?

ADD REPLY

Hello shuksi1984,

I don't believe you have the right output. This bam file is mapped against NCBI Build 37, and therefore the chromosome names don't have the prefix chr. The correct command should be:

samtools view -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam 21 > chr21.bam

I get a file of 3.0 GB, and it contains 9036561 reads (checked with samtools flagstat).
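A quick way to avoid this kind of naming mismatch is to look at the @SQ lines of the BAM header (printed by `samtools view -H`) before choosing the region name. Below is a minimal Python sketch that takes the header text as a string. The function names are made up for this example, and it only handles the plain chr prefix difference, not cases like chrM versus MT.

```python
def contig_names(header_text):
    """Collect the SN: values from the @SQ lines of a SAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def resolve_contig(requested, names):
    """Return the header's spelling of `requested`, with or without 'chr'."""
    candidates = [requested]
    if requested.startswith("chr"):
        candidates.append(requested[3:])
    else:
        candidates.append("chr" + requested)
    for name in candidates:
        if name in names:
            return name
    raise KeyError(f"{requested!r} not found in header")
```

For a Build 37 style header, `resolve_contig("chr21", names)` returns `"21"`, which is the name to pass to samtools view.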

fin swimmer

ADD REPLY

With the command given above, I received a file of 2.8G with 8856340 reads. I checked with fastqc: both fastq files have an equal number of reads.
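The same check can be done without fastqc. Here is a minimal Python sketch that assumes plain 4-line FASTQ records and takes open text handles. The function names are made up for this example.

```python
def count_fastq_records(handle):
    """Count records in an open FASTQ text handle, assuming 4 lines each."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


def mates_match(r1_handle, r2_handle):
    """True if both mate files hold the same number of records."""
    return count_fastq_records(r1_handle) == count_fastq_records(r2_handle)
```

A mismatch between the two files usually points to a truncated conversion or to reads whose mates were lost when the region was extracted.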

ADD REPLY

Hi Finswimmer,

Is the 3.0GB BAM file that contains 9036561 reads the reference genome for the randomly selected subset of reads from the original fastq file? I am totally confused.

ADD REPLY

What version of samtools are you using?

ADD REPLY

The samtools version is 0.1.19-96b5f2294a

ADD REPLY

Updating would be a good idea. Not that it necessarily would make a difference here (it might) but try to stay current with important tools like samtools.

ADD REPLY

I checked it with this very, very old version, but I get the same result as above.

So I think the connection was interrupted in some way during your attempt. Maybe a timeout, a lost internet connection, running out of disk space, ... Just retry (and please update samtools first).

Concerning your other question:

Other steps would remain same as whole genome pipeline?

For now: yes. There are a lot of tweaks you can/must/should make as the datasets become larger. But while developing a pipeline it's fine to add those improvements later.

fin swimmer

ADD REPLY

The BAM that I receive from the above command can be used as reference genome? The BAM file should be converted to other file formats? Do I need to get input fastq files from the given strategy?

You can randomly select a subset of reads from the original fastq file, see here for different approaches.

The comment above is from the same thread. Kindly see above.

ADD REPLY

The BAM that I receive from the above command can be used as reference genome? The BAM file should be converted to other file formats? Do I need to get input fastq files from the given strategy?

A bam is not a reference genome; the reference genome is in fasta format. You can convert a bam file to fastq files using samtools fastq.

ADD REPLY
