How do I make a fasta reference file from a set of bam files in order to maximize the number of heterozygotes in my reference genome?
1
0
Entering edit mode
18 months ago

Through samtools, I merged all the .bam files that I wanted to be part of my reference genome.

samtools merge F1.bam *F1*_sorted.bam


I ordered and created the index of my bam file.

samtools sort -m 768M F1.bam -o F1_sorted.bam
samtools index F1_sorted.bam


After that I created a fasta file from my bam.

  samtools mpileup -uf genome.fa F1.bam | bcftools call -c | vcfutils.pl vcf2fq > F1.fastq
seqtk seq -a F1.fastq  > F1.fasta


And finally I extracted the desired sequences, which corresponds to 2 bases of each side of an detected SNP t in my study population. For that, I used SNPregionfile.txt as the coordinate map of the SNP list.

samtools faidx F1.fasta --region-file SNPregionfile.txt > F1_seq.txt
cat F1_seq.txt |grep -v ">" > F1_seq_final.txt


Then I performed the same process with subgroups of the F1 population. My idea was to compare the references of the main group with the sub-groups, to see which alleles were fixed in each one of the subgroups, but variable in the entire population. The problem is that I realized that for the same allele, the reference genome considering all individuals was often homozygous and considering a subset of the population, it was heterozygous. Then I discovered that the heterozygous call algorithm is more complex than I imagined and it is clear that it is not enough to have some heterozygous individuals for that allele to be considered as such, but a considerable proportion of the counts.

So, my question is, how to create a reference genome fasta file using the pipeline presented here but considering any possibility of my data ending up as heterozygous in any of the individuals used to create the merged bam file.

samtools mpileup bcftools vcfutils seqtk • 1.3k views
1
Entering edit mode

it is bit unclear what you are after. A reference genome cannot be heterozygous since what it typically means is a single sequence, ( a single FASTA record) per chromosome. You could of course have a diploid reference, but even then you'd have two FASTA sequences, each having all the variants in phase as a haplotype.

What you seem to be doing up above is subselecting some reads based on some properties (variants) that they carry. I would not call that a reference.

If you want to create a consensus sequence that say contains the major alleles that is a different thing altogether and should be pursued differently.

0
Entering edit mode

Thanks for the help, but my concern is not exactly to find the correct name for the file I am currently calling "reference genome" of my bam files. I called like it to be easer to understand, sorry if it caused more confusion. And you are right, I am referring to a fasta consensus. My question is, how to make a this fasta from my bam files being that: if at a given position of the bam file there is only reads containing a cytosine (C) in that position, the reference position will contain a letter C, if there is only T, the reference will be a T and if there is any ratio between C and T, the reference will be Y (according to the iupac dna code). Off course this would be a perfect scenario, since as the bam is a merge of many individual, I can have other alleles, or indels etc... In other words, I wondering to know how to change the criteria of calling heterozygous genotype in my data for a diploid organism.

1
Entering edit mode

so it sounds like you would like to determine a consensus sequence based on variants, but, in addition, you'd want the consensus to use ambiguity codes.

as h.mon pointed out bctools consensus can do that.

0
Entering edit mode

I agree with Istvan Albert , It is unclear what you are trying to achieve. Maybe if you explain in detail your end goal, someone could provide more to the point advice.

And finally I extracted the desired sequences, which corresponds to 2 bases of each side of an detected SNP

I really don't get this. If there were variants around the variant of interest, they would be present in the vcf. Why all this trouble to find information which should already be available?

Anyway, based on your question (but with the caveat I don't really understand what you are trying to accomplish), it seems to me you can try two different approaches:

1. using graph genomes, see, for example, vg.

2. call the variants for each sample separately, merge the vcfs, and then subsample the vcfs according to the desired subpopulations.

As final comment: it seems bcftools consensus is preferred over vcftools.pl vcf2fq.

0
Entering edit mode

I am interested in the neighboring bases, I want to know if it is a CpG or CCpGG - SNP, this is why I have the coordinate of the SNPs neighbor bases.

Looks you are providing a good tip, thank you, maybe with the consensus function I can define a criterion for calling heterozygous positions. However, after I get the consensus, I need a command to find specific coordinates. That would be the neighboring bases of some SNPs I am interested in. At the moment, I know how to do this with fasta, and this is the reason why I created the fasta and I was trying to figure it out through this way.

0
Entering edit mode
18 months ago

I figured out how to solve my problem. I was trying to generate a consensus from the merged file instead of considering all the individual genotypes separately to create the consensus. To figure out you can use F1_sorted.bam when I you create the vcf file as follow:

samtools mpileup -uf genome.fa *F1*_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > F1.fastq
seqtk seq -a F1.fastq  > F1.fasta