bam file assembly (1000 genomes)
3
2
Entering edit mode
7.8 years ago
eyb ▴ 230

My goal is to find two 19bp long sequences (primers) in some samples from 1000 genomes project. In order to see if there is a deletion of interest. I have downloaded bam file and extracted a region of interest using samtools. I believe reads in bam file are already aligned. How do I assemble reads into a single sequence so I can search for primers on it?

After some googling I found a software called velvet. But it says in the description that it is for very short reads. Are there any alternatives which take bam files? Is velvet a good tool for my task?

1000 genomes Assembly alignment primer • 3.9k views
0
Entering edit mode

how about if you convert your bam into sam and look into it ! it must have the reads information eventhough its aligned.

0
Entering edit mode
Wouldn't it contain the same 70bp reads in a table format? It seems that searching through sam file wouldn't be much different from searching through bam file using samtools view input.bam | grep "pattern" which I don't think is right, because in my understanding command will not match a pattern if pattern is spread apart on different lines. I already did this, and it found nothing. At the moment I just checked one sample though, but I want to make sure that there isn't such sequence before marking that a sample contains deletion.
2
Entering edit mode
7.8 years ago

One of the things the 1000 Genomes (led by Richard Durbin's group) is doing is building a "read server", which effectively allows you to make web queries for primers (or kmers or haplotypes) and it will tell you which sample has them. Right now it's not yet publicly available, but it will be soon.

Right now I think your best option is to check the VCFs more carefully and see if your variant is there - you don't need to download all the VCFs (which are enormous) - you can just get the file that lists all the sites, and see if there is anything resembling yours

1
Entering edit mode

Ooo, now that'll be a nice feature!

0
Entering edit mode

That would be great feature.

PS I replied to your comment below.

2
Entering edit mode
7.8 years ago

so what you want is to end up with just a single fasta sequence from a bam file, and then try to find your primers in it? if that is the case, you "only" need to convert the bam file into a fasta sequence and then perform some simple pattern matching.

first, get the variants out of the bam file and index them:

samtools mpileup -uf hg19.fa in.bam \
| bcftools call -vm -Oz - > out.vcf.gz
tabix -p vcf out.vcf.gz

then, extract the reference section you're interested in, build the consensus applying the variants, join all the fasta sequence lines into a single one per each fasta entry (for later searching purposes), and compress it in case it's expected to be a big file:

samtools faidx hg19.fa chr22:29000000-29200000 \
| vcf-consensus out.vcf.gz \
| perl -pe '/^>/ ? print "\n" : chomp' \
| tail -n +2 \
| bgzip > out.fasta.gz

finally, count the number of lines that match your primers:

zgrep -cf primers.txt out.fasta.gz

I used regions chr22:29000000-29200000 for this example, but any other region/s can be selected. have in mind that if you don't specify any region/s, the samtools faidx command will output the entire genome.

0
Entering edit mode

1
Entering edit mode
7.8 years ago

You might just convert the reads near that area back to fastq (e.g. with samtools bam2fq)and assemble that. If 1000 genomes found that deletion, then it might be easier to just look at the sample genotypes in the area neighboring the deletion and modify the reference sequence to match (you'd then design primers against that).

0
Entering edit mode

The primers are already Identified. My goal is to find them in samples. For example if I find a primer in a sample I can tell that there is no deletion. How do I get a sequence string from fastq?

1
Entering edit mode

You would assemble the fastq files with velvet or something similar. Having said that, it'd seem vastly easier to just convert the section of the reference genome according to the VCF file from 1000 genomes. Also, if your primers are shorter than the read length (pretty likely), then you could just directly search the bam file (samtools view foo.bam some_region | grep your_primer_sequence).

0
Entering edit mode

1
Entering edit mode

If the longest thing you saw was 300bp then you looked in the wrong VCF file - they 1000g has called stuff that is many tens of kb long. I'll try and find a path for the latest VCFs

0
Entering edit mode

I have used chr1 file from this directory (deletion I am looking for is located on this chromosome). I have also used VCFtools to extract indels and then I used following options to get information about indels with following options:

--vcf CEU_indels.recode.vcf
--hist-indel-len
--out indels_hist


The output was table which looked like this:

LENGTH    COUNT    PRCT
-129    1    0.000388286


I believe the first column is the length of in/del. So the longest insertion was 389 and longest deletion was -129.

0
Entering edit mode

Ah, a deletion that large likely wouldn't have been called. You'll need to either reassemble that region or run the BAM file through a deletion caller designed to find events of that size.

No, grep won't find sequences split between reads, but if any read contained it then it would have been found.