Question: How does FASTQ format represent a diploid sequence?
Evoevo wrote:

I've used a samtools/bcftools pipeline to generate a diploid consensus sequence. The consensus sequences are in FASTQ format. I expected to get two sequences in each FASTQ file, one for each homologous chromosome. But when I open them, I can see only a single sequence identifier. How does FASTQ encode which base belongs to which homologous chromosome? I can also see a long string of n's in the middle of the sequence. Is that where they're separated?

Thanks!

Edited for clarity

sequence assembly • 529 views
modified 3 months ago by DragonDNA • written 9 months ago by Evoevo

What?

I've used samtools to assemble some diploid genomes.

samtools is not a genome assembler; you can't possibly have assembled a genome with it.

The assembled sequences are in fastq format.

Not impossible, but genome assemblies are almost always distributed in FASTA format, not FASTQ.

How does FASTQ encode which base belongs to which homologous chromosomes?

It doesn't. Variation may be encoded in VCF format, or maybe FASTG or GFA. The latter two are uncommon now but will probably become prevalent in the near future. Currently, diploid genome assemblies are generally represented as haploid FASTA files.

I can see a long string of n's in the middle of the sequence. Is that where they're separated?

If your files do indeed represent a genome assembly, these runs of Ns probably mark gaps within scaffolds. A scaffold is a set of adjacent contigs joined with some undetermined sequence between them, and that undetermined sequence is filled in with Ns.
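If you want to see where those runs of Ns sit and how long they are, something like the following might help. It is only a rough sketch: the function name `n_runs` and the `min_len` threshold of 10 are made up for illustration, and it assumes you have already read the sequence into a plain Python string.

```python
import re


def n_runs(seq, min_len=10):
    """Return (start, end) pairs for runs of N/n at least min_len long.

    Coordinates are 0-based and half-open, like Python slices.
    min_len is an arbitrary illustrative threshold, not a standard value.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_len:
            runs.append((match.start(), match.end()))
    return runs


if __name__ == "__main__":
    example = "ACGT" + "n" * 5 + "ACGT" + "N" * 12 + "AC"
    for start, end in n_runs(example):
        print(f"N run from {start} to {end} ({end - start} bases)")
```

Short runs are skipped on purpose, so isolated uncalled bases do not swamp the output. Lower `min_len` to 1 if you want every N.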

written 9 months ago by h.mon

Sorry, I might be misusing the word 'assembled'. I'm just starting to play around with bioinformatics and I'm still wrapping my head around the jargon.

For clarity, I took a mapped .bam file from the 1000 Genomes Project, ran it through samtools mpileup, passed the output to bcftools call, then used the vcf2fq command of vcfutils.pl to convert the resulting BCF file to FASTQ format (along with some additional filtering steps). What word should I have used instead of 'assembled'? 'Variant called'?

Thanks for your patience

written 9 months ago by Evoevo

Doesn't this require haplotype phasing?

written 9 months ago by Michael Dondrup
DragonDNA wrote:

It uses the IUPAC ambiguity codes. For example, the "nucleotide" Y means that the position is heterozygous for C and T. If there is a Y in a sliding window, PSMC treats that window as heterozygous. In the prior step, SNPs are converted into these IUPAC codes so that diploid information can be represented in a single, haploid-like sequence.
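To make the encoding concrete, here is a small sketch of how the two-base IUPAC codes could be decoded and how windows containing a heterozygous site could be flagged. The function names and the default window size of 100 are assumptions for illustration only; this is not PSMC's own code. Note that a code only says which two alleles are present at a site. It does not say which homologous chromosome carries which allele, because the consensus is unphased.

```python
# IUPAC two-base ambiguity codes and the pair of alleles each one stands for.
IUPAC_HET = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
}

PLAIN_BASES = {"A", "C", "G", "T"}


def genotype(base):
    """Return the unordered pair of alleles encoded by one consensus base.

    Homozygous sites give the same base twice; N and any other
    symbol give None because no genotype can be recovered.
    """
    b = base.upper()
    if b in PLAIN_BASES:
        return (b, b)
    return IUPAC_HET.get(b)


def window_has_het(seq, window=100):
    """Flag each non-overlapping window that contains a heterozygous code.

    The window size of 100 is an illustrative default chosen for this sketch.
    """
    flags = []
    for i in range(0, len(seq), window):
        chunk = seq[i:i + window].upper()
        flags.append(any(c in IUPAC_HET for c in chunk))
    return flags


if __name__ == "__main__":
    print(genotype("Y"))
    print(window_has_het("ACGTYACG", window=4))
```

Lowercase bases are upper-cased before lookup, since low-quality positions in the consensus may be written in lowercase.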

written 3 months ago by DragonDNA