Question: bam file assembly (1000 genomes)
2
gravatar for eyb
4.5 years ago by
eyb180
Russian Federation
eyb180 wrote:

My goal is to find two 19bp long sequences (primers) in some samples from 1000 genomes project. In order to see if there is a deletion of interest. I have downloaded bam file and extracted a region of interest using samtools. I believe reads in bam file are already aligned. How do I assemble reads into a single sequence so I can search for primers on it?

After some googling I found a software called velvet. But it says in the description that it is for very short reads. Are there any alternatives which take bam files? Is velvet a good tool for my task?

ADD COMMENTlink modified 4.5 years ago by Jorge Amigo11k • written 4.5 years ago by eyb180

how about if you convert your bam into sam and look into it ! it must have the reads information eventhough its aligned.

ADD REPLYlink written 4.5 years ago by Quak300
Wouldn't it contain the same 70bp reads in a table format? It seems that searching through sam file wouldn't be much different from searching through bam file using samtools view input.bam | grep "pattern" which I don't think is right, because in my understanding command will not match a pattern if pattern is spread apart on different lines. I already did this, and it found nothing. At the moment I just checked one sample though, but I want to make sure that there isn't such sequence before marking that a sample contains deletion.
ADD REPLYlink written 4.5 years ago by eyb180
2
gravatar for zam.iqbal.genome
4.5 years ago by
United Kingdom
zam.iqbal.genome1.7k wrote:

One of the things the 1000 Genomes (led by Richard Durbin's group) is doing is building a "read server", which effectively allows you to make web queries for primers (or kmers or haplotypes) and it will tell you which sample has them. Right now it's not yet publicly available, but it will be soon.

Right now I think your best option is to check the VCFs more carefully and see if your variant is there - you don't need to download all the VCFs (which are enormous) - you can just get the file that lists all the sites, and see if there is anything resembling yours

ADD COMMENTlink written 4.5 years ago by zam.iqbal.genome1.7k
1

Ooo, now that'll be a nice feature!

ADD REPLYlink written 4.5 years ago by Devon Ryan89k

That would be great feature.

 

PS I replied to your comment below.

ADD REPLYlink written 4.5 years ago by eyb180
2
gravatar for Jorge Amigo
4.5 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

so what you want is to end up with just a single fasta sequence from a bam file, and then try to find your primers in it? if that is the case, you "only" need to convert the bam file into a fasta sequence and then perform some simple pattern matching.

first, get the variants out of the bam file and index them:

samtools mpileup -uf hg19.fa in.bam \
| bcftools call -vm -Oz - > out.vcf.gz
tabix -p vcf out.vcf.gz

then, extract the reference section you're interested in, build the consensus applying the variants, join all the fasta sequence lines into a single one per each fasta entry (for later searching purposes), and compress it in case it's expected to be a big file:

samtools faidx hg19.fa chr22:29000000-29200000 \
| vcf-consensus out.vcf.gz \
| perl -pe '/^>/ ? print "\n" : chomp' \
| tail -n +2 \
| bgzip > out.fasta.gz

finally, count the number of lines that match your primers:

zgrep -cf primers.txt out.fasta.gz

I used regions chr22:29000000-29200000 for this example, but any other region/s can be selected. have in mind that if you don't specify any region/s, the samtools faidx command will output the entire genome.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Jorge Amigo11k

Thanks. I am downloading reference genome at the moment. Will tell you if it worked later.

ADD REPLYlink written 4.5 years ago by eyb180
1
gravatar for Devon Ryan
4.5 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

You might just convert the reads near that area back to fastq (e.g. with samtools bam2fq)and assemble that. If 1000 genomes found that deletion, then it might be easier to just look at the sample genotypes in the area neighboring the deletion and modify the reference sequence to match (you'd then design primers against that).

ADD COMMENTlink written 4.5 years ago by Devon Ryan89k

The primers are already Identified. My goal is to find them in samples. For example if I find a primer in a sample I can tell that there is no deletion. How do I get a sequence string from fastq?

ADD REPLYlink written 4.5 years ago by eyb180
1

You would assemble the fastq files with velvet or something similar. Having said that, it'd seem vastly easier to just convert the section of the reference genome according to the VCF file from 1000 genomes. Also, if your primers are shorter than the read length (pretty likely), then you could just directly search the bam file (samtools view foo.bam some_region | grep your_primer_sequence).

ADD REPLYlink written 4.5 years ago by Devon Ryan89k

I already searched VCF from 1000 genomes project, but I did not found my deletion there. The longest deletion there was about 300bp and mine is more than 10kbp. I also have ran the command you mention, but it did not found anything (and yes my primer is much shorter than a read). Will it still work if primer is spread across several reads? For example 3 letters on read one, anther 3 in in read 2 and so on...

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by eyb180
1

If the longest thing you saw was 300bp then you looked in the wrong VCF file - they 1000g has called stuff that is many tens of kb long. I'll try and find a path for the latest VCFs

ADD REPLYlink written 4.5 years ago by zam.iqbal.genome1.7k

I have used chr1 file from this directory (deletion I am looking for is located on this chromosome). I have also used VCFtools to extract indels and then I used following options to get information about indels with following options:

--vcf CEU_indels.recode.vcf
--hist-indel-len
--out indels_hist

The output was table which looked like this:

LENGTH    COUNT    PRCT
-129    1    0.000388286

I believe the first column is the length of in/del. So the longest insertion was 389 and longest deletion was -129.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by eyb180

Ah, a deletion that large likely wouldn't have been called. You'll need to either reassemble that region or run the BAM file through a deletion caller designed to find events of that size.

No, grep won't find sequences split between reads, but if any read contained it then it would have been found.

ADD REPLYlink written 4.5 years ago by Devon Ryan89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 831 users visited in the last hour