Question: Substituting Snps On Reference Genome Assemblies
1
gravatar for Sakti
6.4 years ago by
Sakti390
United States
Sakti390 wrote:

Hi again guys!

So this time with a new question. I have a list of variant calls (basically, SNPs) for a mouse genome sample. SNP positions are reported using the reference mouse genome sequence. What I would like to do is to take this list and substitute each reported SNP position on the corresponding reference mouse genome sequence as to obtain my own "genome" with the corresponding SNPs. However, I don't know how to do this task in an efficient manner just using my current programming skills.

I was thinking on storing each bp as part of an array, and then reading my SNPs file and then changing, but wouldn't storing a whole chromosome sequence be too much?

I'd like to know if you know of any other better way :)

Thanks!!

genome mouse snp perl • 5.4k views
ADD COMMENTlink modified 15 months ago by nataliagru130 • written 6.4 years ago by Sakti390
4
gravatar for Ashutosh Pandey
6.4 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

This link provides the tool to create strain specific reference genome for effective mapping of the reads.

http://alleleseq.gersteinlab.org/tools.html

Download personnel genome constructor. This script will take vcf file and reference genome as input and will create enhanced reference genome as output. As this script is meant for human, two versions including maternal and paternal for each chromosome will be created. But as reference genome of mouse is an inbred organism, it has the same allele on both the chromosomes. so you can consider any copy either paternal or maternal. But make sure your vcf file has the homozygous variant tagged as "1/1" for GT tag and you should remove any heterozygous variants if any in the vcf file to skip the unnecessary complications.

I also have this script that I wrote a long ago. If you know very basic python you should be able to modify it to work for your case. It exactly does what you need. Here is the link: https://github.com/ashutoshkpandey/SimplePrograms/blob/master/psuedogenome_SNPs.py

ADD COMMENTlink modified 4.9 years ago • written 6.4 years ago by Ashutosh Pandey11k

This sounds great, will check it out.

ADD REPLYlink written 6.4 years ago by Sakti390

couldn't access the python script, can you make it available again? thanks

ADD REPLYlink written 5.3 years ago by eva10
1
gravatar for JC
6.4 years ago by
JC8.7k
Mexico
JC8.7k wrote:

Loading a full chromosome as an array is never a good idea, you can load the full chromosome as a "string" and modify the positions with "substr", something like this:

#!/usr/bin/perl

my %pos = loadPositions($file); # returns a hash like this $pos{'chr1'}{121726} = 'G'
my %seq = loadSequences($fasta); # returns a hash like this $seq{'chr1'} = 'ATCGATCATCTACTA....';

while (my ($chr, $seq) = each %seq) {
    foreach my $pos (keys %{ $pos{$chr} }) {
        substr($seq, $pos - 1, 1) = $pos{$chr}{$pos}; # be careful with 0-based and 1-based coordinates
    }
    printFasta($seq);
}

In this case, I'm loading the full genome, each chromosome as a "string", but you can optimize the code to do it chromosome by chromosome.

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by JC8.7k

I think sorting the positions in reverse would be prudent, or indels will throw off your numbering. If you work from the back to the front, your numbering will stay intact.

ADD REPLYlink written 6.4 years ago by swbarnes26.7k

Thank you very much JC!! Will give it a try, but where is the loadSequences function? Is that part of a library in perl?

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Sakti390

that was a simple example for use "substr" to change a position in a string, it is not the complete script

ADD REPLYlink written 6.4 years ago by JC8.7k
0
gravatar for FatihSarigol
2.9 years ago by
FatihSarigol130
Durham
FatihSarigol130 wrote:

Anyone still coming to this page, if you have a fasta reference genome and a bam file that you want to turn into the reference file by changing SNP's and N's, you may try this one-liner using samtools, bcftools and vcfutils.pl (ps for beginners: both samtools and bcftools can be compiled in a computing cluster or in Linux, if so just add the locations of each before the software name; vcfutils is already a perl script from bcftools)

samtools mpileup -d8000 -q 20 -Q 10 -uf  REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq  > OUTPUT.fastq

-d, --max-depth == -q, -min-MQ Minimum mapping quality for an alignment to be used == -Q, --min-BQ Minimum base quality for a base to be considered == (You can use different values of course, see http://www.htslib.org/doc/samtools.html)

Which generates a weird format that looks like fastq but isn't, so you can't convert it using a converter, but you can use the following sed command, which I wrote specific for this output:

sed -i -e '/^+/,/^\@/{/^+/!{/^\@/!d}}; /^+/ d; s/@/>/g' OUTPUT.fastq

In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.

EDIT BE CAREFUL WITH THE SED COMMAND IT MAY DELETE SOME OF YOUR READS IN DIFFERENT CASES OF QUALITY SCORING THAN I HAD

ADD COMMENTlink modified 20 months ago • written 2.9 years ago by FatihSarigol130

After running mpileup and the sed command, the output.fastq can then be successfully converted to .fasta?

Thanks for touching on this process, it is very helpful for my work.

ADD REPLYlink written 2.3 years ago by TrentGenomics30

My version of samtools 1.4 doesn't have vcfutils.pl. It does have samtools.pl, which has pileup2fq, so you might want to try that.

You can make your own pileup2fa pretty easily, by editing the samtools.pl script. perl scripts are plain texts, so you can look at them easily.

Find the the code below, and change this

print "\@$chr\n"; &p2q_print_str($seq);
  print "+\n"; &p2q_print_str($qual);

To

print "\>$chr\n"; &p2q_print_str($seq);
  #print "+\n"; &p2q_print_str($qual);
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by swbarnes26.7k

Would 'bcftools consensus' be a similar function to the above commands?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by TrentGenomics30

michbrown, sure you can convert fastq to fasta afterwards, try this simple awk script for example:

awk 'NR%4==1{print ">"substr($0,2)}NR%4==2{print $0}' file.fastq > file.fasta

I never tried bcftools consensus, looks similar to bcftools -c for output of samtools mpileup, did you try consensus? I also see now on bcftools webpage this

"Users are now required to choose between the old samtools calling model (-c/--consensus-caller) and the new multiallelic calling model (-m/--multiallelic-caller). The multiallelic calling model is recommended for most tasks."

so one should give that a try, too.

ADD REPLYlink written 2.2 years ago by FatihSarigol130

swbarnes2, looks like a good solution, I also like changing scripts of softwares written in python sometimes. vcfutils.pl comes with bcftools in 2 versions I have used, check the installation folders: (( bcftools-1.4/vcfutils.pl )) (( bcftools-1.3/bin/vcfutils.pl )) (but I did hear that samtools also had a similar script before)

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by FatihSarigol130
0
gravatar for nataliagru1
15 months ago by
nataliagru130
nataliagru130 wrote:

why did you set -d to 8000 dont you want a minimum reads of 10 or max 250? 8000 seems a bit excessive right?

I am still learning what is the best way to execute mpileup. So I am wondering why you set read depth to 8000. Thank you

ADD COMMENTlink written 15 months ago by nataliagru130

Hi Natalia, I guess this is a question for my code up there. That 8000 was the default value of maximum depth selected by samtools mpileup (I just checked it still seems to be the same in the manual) I guess I had written that to indicate an available option that is sometimes important to change, no reason behind setting it to default. It is "max per-file depth" that "avoids excessive memory usage"

EDIT update for samtools 1.9 version, quoting from github page: "Samtools mpileup now handles the '-d' max_depth option differently. There is no longer an enforced minimum, and '-d 0' is interpreted as limitless (no maximum - warning this may be slow). The default per-file depth is now 8000, which matches the value mpileup used to use when processing a single sample. To get the previous default behaviour use the higher of 8000 divided by the number of samples across all input files, or 250."

ADD REPLYlink modified 15 months ago • written 15 months ago by FatihSarigol130
1

Thank you so much!!!

Greatly appreciated!!

ADD REPLYlink written 14 months ago by nataliagru130

Do not add answers unless you're answering the top level question. Use the Add Comment/Add Reply options instead. Read posts under https://www.biostars.org/t/how-to to understand how to use the forum.

ADD REPLYlink written 14 months ago by RamRS24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1898 users visited in the last hour