Question: How does FASTQ format represent a diploid sequence?
Evoevo wrote:

I've used a samtools/bcftools pipeline to generate a diploid consensus sequence. The consensus sequences are in fastq format. I expected to get two sequences in each fastq file, one for each homologous chromosome, but when I open them I can only see a single sequence identifier. How does FASTQ encode which base belongs to which homologous chromosome? I can see a long string of n's in the middle of the sequence. Is that where they're separated?

Thanks!

Edited for clarity

sequence assembly • 284 views
modified 12 weeks ago • written 12 weeks ago by Evoevo

What?

"I've used samtools to assemble some diploid genomes."

samtools is not a genome assembler; you can't possibly have assembled a genome with it.

"The assembled sequences are in fastq format."

Not impossible, but genome assemblies are almost always given in fasta format, not fastq.

"How does FASTQ encode which base belongs to which homologous chromosome?"

It doesn't. Variation may be encoded in vcf format, or maybe in fastg or gfa; the latter two are uncommon now but will probably become prevalent in the near future. Currently, diploid genome assemblies are generally represented as haploid fasta files.
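To make the vcf point concrete, here is a minimal Python sketch of how a diploid genotype appears in the GT field of a vcf record. In vcf, `/` separates unphased alleles and `|` separates phased ones, so only a phased genotype says which homolog carries which base. The function names and the example records are made up for illustration, and real files are better read with a proper vcf parser.

```python
def parse_genotype(gt):
    """Split a VCF GT string such as '0/1' or '1|0' into allele indices.

    Returns (alleles, phased). '|' means phased, '/' means unphased.
    Missing alleles ('.') are returned as None.
    """
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt.split(separator)]
    return alleles, phased


def describe_site(ref, alts, gt):
    """Return the bases called at one site and whether their order is meaningful."""
    bases = [ref] + alts  # index 0 is REF, 1.. are the ALT alleles
    alleles, phased = parse_genotype(gt)
    called = [None if a is None else bases[a] for a in alleles]
    return called, phased


# Hypothetical records: (REF, ALT list, GT)
print(describe_site("A", ["G"], "0/1"))  # (['A', 'G'], False)
print(describe_site("A", ["G"], "1|0"))  # (['G', 'A'], True)
```

In the unphased case the two bases are known but their assignment to homologs is not; in the phased case the first allele belongs to one haplotype and the second to the other.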

"I can see a long string of n's in the middle of the sequence. Is that where they're separated?"

If your files do indeed represent a genome assembly, these runs of Ns probably mark gaps within scaffolds: adjacent contigs are joined across a stretch of undetermined sequence, and that stretch is filled in with Ns.
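Either way, N is the placeholder for an undetermined base, not a separator between homologs. If it helps to see where those stretches sit and how long they are, a small Python sketch along these lines could be used. The function name and the toy sequence are made up for illustration.

```python
import re


def find_n_runs(sequence, min_length=1):
    """Return (start, end) pairs, 0-based and end-exclusive, for runs of N or n."""
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


toy = "ACGTnnnnnACGTNNacgt"
print(find_n_runs(toy))                # [(4, 9), (13, 15)]
print(find_n_runs(toy, min_length=3))  # [(4, 9)]
```

Raising `min_length` filters out isolated uncalled bases so that only the long gaps are reported.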

written 12 weeks ago by h.mon

Sorry, I might be misusing the word 'assembled'. I'm just starting to play around with bioinformatics and I'm still wrapping my head around the jargon.

For clarity, I took a mapped .bam file from the 1000 Genomes Project and ran it through samtools mpileup, passed that to bcftools call, then used the vcfutils.pl vcf2fq command to convert the resulting bcf file to fastq format (along with some additional filtering steps). What word should I have used rather than assembled? Variant called?
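For what it's worth, a consensus built this way holds a single sequence per reference sequence. As far as I know, vcf2fq writes heterozygous calls as IUPAC ambiguity codes (for example R for A/G) rather than as a second sequence. The Python sketch below lists such sites in a consensus string; the function name and the toy sequence are made up for illustration, and it only handles the two-base codes.

```python
# IUPAC codes that stand for exactly two possible bases
IUPAC_HET = {
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def heterozygous_sites(consensus):
    """List (position, code, bases) for each two-base IUPAC code; positions are 0-based."""
    sites = []
    for i, base in enumerate(consensus):
        pair = IUPAC_HET.get(base.upper())
        if pair is not None:
            sites.append((i, base, pair))
    return sites


toy = "ACGTRACnnnYtt"  # invented example, not real data
print(heterozygous_sites(toy))
# [(4, 'R', ('A', 'G')), (10, 'Y', ('C', 'T'))]
```

A code like R says that both A and G are present at that position, but not which homolog carries which base, so the phase cannot be recovered from the consensus alone.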

Thanks for your patience

written 12 weeks ago by Evoevo

Doesn't this require haplotype phasing?

written 12 weeks ago by Michael Dondrup