Question

How to get nucleotide sequence from BAM file

0

Entering edit mode

8.1 years ago

davonairos • 0

Hi everyone! I apologize in advance for my ignorance. I'm just beginning in in a new bioinformatics project. I have a BAM file generated from a fastq file from a whole genome sequencing (C. elegans). I want to find in which site (or sites) a construct has been integrated in the genome, comparing my sequence with a published genome. I've been analyzing the SNPs and INDELs with the IGV software using VCF files, but there are no tools to search for specific sequences in a sequenced genome VCF file. Because of this, I'm interested in generate a FASTA (or any other format) file from the BAM alignment file, to be able to blast the sequence of the construct against my whole genome sequence. Does anyone know any software, or another strategy to achieve my objective? I don't know if I explained well, I hope so. Thank you so much!

David

sequence alignment blast Assembly • 4.2k views

ADD COMMENT • link updated 8.1 years ago by WouterDeCoster 48k • written 8.1 years ago by davonairos • 0

0

Entering edit mode

bam to fastq here

ADD REPLY • link 8.1 years ago by cpad0112 21k

0

Entering edit mode

I'm not sure if your approach is the best for this. I would start with the fastq file and search your construct in there, and see where the rest of the 'positive' reads map.

Essentially you are looking for a big structural variant (an insertion), right?

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

0

Entering edit mode

Yes, that's correct. I'm looking for an insertion, a big one (and probably integrated more than once in the genome). Do you mean I should to convert my BAM file into fastq? Because the original fastq from the sequencing service need to be trimmed and aligned before, right?

ADD REPLY • link 8.1 years ago by davonairos • 0

0

Entering edit mode

Due to the huge insertion, it's possible that alignment wasn't that successful. I would suggest looking into structural variant calling tools.

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

0

Entering edit mode

That could be useful! thank you

ADD REPLY • link 8.1 years ago by davonairos • 0

score 0 · Answer 1 · 2017-06-14

0

Entering edit mode

8.1 years ago

Brian Bushnell 20k

C.elegans is not very big. Depending on the length of the sequence you wanted to add/delete, I would typically open the VCF in a text editor and CTRL-F search for it. Or grep for it. That is, if it's an indel. If it's a SNP or complex change, that's harder; you have to look for the position. But it's difficult for me to imagine an experiment where you are trying to do so many changes to a genome simultaneously, that the resulting VCF is too big to deal with manually; if there is a complex change, I'd just look at the region in IGV. You can easily filter out the variations already present in the base strain to leave only those present in the mutant strain, to simplify analysis of the VCF.

You can, if you want, use GATK's alternate reference maker for this purpose. But I think you might be approaching it from the wrong angle. So, perhaps you could explain in more detail what you're doing; as far as I can tell, BLAST should not be involved with this process at all.

ADD COMMENT • link 8.1 years ago by Brian Bushnell 20k

0

Entering edit mode

If OP is looking for the integration site(s) of a construct he's looking for a big insertion, right? So this boils down to a structural variant identification, of which I'm not sure it will be present in a VCF generated by "standard SNP calling software". It may or may not, right?

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

0

Entering edit mode

Yes, I'm sorry. I should explain better what I am doing. I had a C. elegans strain with a construct integrated in the genome. I took this strain, which has a given phenotype, to perform a forward genetic screen by EMS mutagenesis, in order to find mutants that annul the parental phenotype. By using workflows from CloudMap, available on Galaxy web platform, I got VCF files from the original FASTQ files, and analyzed all the possible phenotype-generating variants. I did this part without problems, but my interest right now is focused on finding where (and how many times) the construct is integrated in the parental (and in the mutagenized strains as well) genome taking advantage of having the whole genome sequenced. I thought, based on the small file size, that the VCF files don't really contain the sequence, but only information about the position of variants regarding a reference published genome. Am I right? So, in this case, I wouldn't be able to find any sequence in a VCF file.

ADD REPLY • link 8.1 years ago by davonairos • 0

score 0 · Answer 2 · 2017-06-15

0

Entering edit mode

8.1 years ago

Benn 8.4k

Check here, the option suggested by Istvan Albert.

You can also get smaller pieces if you know the coordinates in the genome of your target of interest. This is also done with samtools view (see manual).

ADD COMMENT • link 8.1 years ago by Benn 8.4k

score 0 · Answer 3 · 2017-06-17

0

Entering edit mode

8.1 years ago

WouterDeCoster 48k

I just saw this article which might be of interest for you: Ub-ISAP: a streamlined UNIX pipeline for mining unique viral vector integration sites from next generation sequencing data