Question

Extract neighbourhoods of mutations

0

Entering edit mode

6.1 years ago

Gene_MMP8 ▴ 240

I want to download the neighborhood of mutations in a cancer genome. Say, I have a mutation at location 'x' in a cancer sample. Now I want to extract 3 bases to the left and right of location 'x'. Now I have some download speed issues, due to which it's taking a lot of time to download those huge bam files. Is there any faster and easier way to get the neighbourhood data? How can I do that without downloading the entire cancer dataset? Is there any querying tool that allows me to extract specific location in a cancer genome, without actually downloading the entire dataset?

next-gen sequencing • 1.6k views

ADD COMMENT • link 6.0 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Now I want to extract 3 bases to the left and right of location 'x'. (...) a lot of time to download those huge bam

3 bases of what ? the reference sequence ? the bam (which reads would you use ) ?

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

3 bases from the cancer bam file.

ADD REPLY • link 6.1 years ago by Gene_MMP8 ▴ 240

1

Entering edit mode

but what happens if one base is clipped, another read is REF, another read is a variation another read is a insertion etc ??? an why would you want to do that ?

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Yeah makes sense. I am really confused! Can you suggest any other way to extract the neighborhood of the mutations? Previously I used to make a consensus sequence of the processed bam file and then write a python script to extract the neighborhood. But that requires downloading of the bam files. Is there a way around that?

ADD REPLY • link 6.1 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

What do I do if there are large INDELS present ? Eg:
Ref: A T G G C G C A
Tumor bam: A C G TTA C T C A
Now I am using the first ref file to extract two bases to the left and right of the 6th variant. So in the ref file I will get GC and CA, but in the tumour file it is AC and CA. So how to proceed here?

ADD REPLY • link 6.0 years ago by Gene_MMP8 ▴ 240

1

Entering edit mode

Hello banerjeeshayantan,

could you please tell a little bit more about what you are trying to do and why. This will help a lot to give proper assistance.

fin swimmer

ADD REPLY • link 6.0 years ago by finswimmer 16k

0

Entering edit mode

Apologies for not making this clear. Actually, I wanted to extract 3 bases to the left and right of a given mutation. Initially, I thought of using the tumor bam file to achieve my required bases. But in another thread someone pointed out that this is unnecessary and just the Reference file and VCF file will be enough to fetch me the neighborhood bases. There is no such extra information that is present in the tumor.bam that is not already present in the Reference file. Also it is computationally expensive to query a bam file. So use the reference file instead.
Now I reasoned like my above reply:
Ref: A T G G C G C A
Tumor bam: A C G TTA C T C A
Now I am using the first ref file to extract two bases to the left and right of the 6th variant. So in the ref file, I will get GC and CA, but in the tumor file, it is AC and CA. So how to proceed here?

ADD REPLY • link 6.0 years ago by Gene_MMP8 ▴ 240

1

Entering edit mode

Take a VCF (mutations in variant call format)
Extract position of variation
Create two columns: variant position - 3, variant position + 3. This is near to bed format.

Extract sequences in tabular format using reference fasta and bed file. IMO, following is the way to take bases around variations:

For SNVs and insertions, use above logic.
For deletions, use ref allele. Calculate the genomic coordinates of ref allele (chromosomal start- chromosomal end). Subtract 3 from start poisiton of ref allele and add 3 to the end position. Those would be 3 bases before and 3 bases after variant
For delins/indels/DIVs, if ref allele is larger than alt allele, use logic 2. If ref allele is smaller than alt allele, use logic 1.

ADD REPLY • link 6.0 years ago by cpad0112 21k

0

Entering edit mode

Hello,

do you have information about whether the variants next to your variants are on the same strand? Is this information important? If you are sure that those variants are co-located or it doesn't matter you could build a consensus sequence for the region.

From the bcftools manual:

# Create consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to".
samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa

fin swimmer

ADD REPLY • link 6.0 years ago by finswimmer 16k