Question

Identification of the sequence insertion site in the genome

3

Entering edit mode

7.4 years ago

valerie ▴ 100

Hi guys,

I have an insertion in my genome and I want to identify the insertion site. As far as I understand I should find the read pairs where one read is mapped to my insertion sequence, but mate is unmapped. I found these reads using samtools:

samtools view -F 4 -f 8 alignment.sam > alignment_filtered.sam

Now I need to find the unmapped mates. How can I do that? As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.

Thank you in advance

sequencing genome next-gen • 5.1k views

ADD COMMENT • link updated 3.0 years ago by LacquerHed ▴ 30 • written 7.4 years ago by valerie ▴ 100

Brian Bushnell · Answer 1 · 2016-12-04

1

Entering edit mode

7.4 years ago

Brian Bushnell 20k

You can do that with the BBMap package like this:

filterbyname.sh in=alignment.sam out=both.sam names=alignment_filtered.sam

That will give you all half-mapped pairs in the same file. For just the unmapped mates, you could filter it a second time with samtools, or with Reformat, like this:

reformat.sh in=both.sam out=unmapped.sam unmappedonly

ADD COMMENT • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you very much Brian! Do you know how could I build the insertion cite sequences using the unmapped reads? Have you ever worked with any de novo alignment tools? Or maybe it is better to map these reads to the reference genome? Thanks in advance.

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

I would do this:

bbmap.sh in1=r1.fq in2=r2.fq outm=eitherMapped.fq ref=insertion.fasta
tadpole.sh in=eitherMapped.fq out=assembled.fa

That will give you the insertion and its genomic context assembled together. Alternatively, rather than assembling, you could re-map the insert-mapped reads to the reference:

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam bs=bs.sh

Running the generated shellscript bs.sh will generate a sorted, indexed bam file that you can observe in a genome browser (as suggested by Genomax) such as IGV, which will clearly show the exact site of the insertion.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you very much, Brian! I just realized that this is you, who developed this aligner! Amazing! :) May I ask you one more question? As far as I understand bbmap is developed for short reads and my reads are length of 100. Is it ok?

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

bbmap will align reads of any length up to 6kbp (you will need to use mapPacbio.sh variant for reads over 600bp).

ADD REPLY • link updated 7.4 years ago by Brian Bushnell 20k • written 7.4 years ago by GenoMax 141k

0

Entering edit mode

Hi guys. I finally finished the computations and generated both assembled.fa and genomeMapped.sam for two of my insertion sequences. Now I want to look at genomeMapped.sam using IGV. I've opened the sam file and noticed that RNAME there is the name of my insertion sequence. How can I see the insertion site on my reference genome then?

One more question is related to assembled.fa. It is a file in a format of:

@contig_13,length=122,cov=24.9,gc=0.352 TCTGGTAGCTGCAGATGCTGTGCAAAACAGATTAAATTCTATCGCATGAGACTTCATTGCAAATCTTCTAAGAAGTATCAGGATGAATATGATGTTTCTTCTATCATTCCAAACTTTCAGAT + ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @contig_18,length=120,cov=10.2,gc=0.658 TTGAGCACCTCCTGACAGACGGGGCAGTAGGATGTGGCCGCGGAGAGGTCCTCAGCCATGGCGGGGGATGGGGGAACAAGGCGATGGTGACAGCGGCAGCGGCCCCGATGGTGACATGGT + ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

I want to compare two insertion cites and find out whether they are different or not. Can I do this by comparing two assembled.fa somehow or I should better use sam file and IGV for that?

Thank you in advance!

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

Am I right that it should be

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam ref = reference_genome.fasta bs=bs.sh

instead of your script (without ref=reference_genome.fasta )?

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

There can be no spaces between options and = sign.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

assembled.fa (inspite of what the name says) appears to be in fastq format. How did that happen? I think you may have typed the name in as assembled.fq when you ran the tadpole.sh command.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

You mean use it as reference for the next stage? Like

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam ref=assembled.fa bs=bs.sh

where assembled.fa is converted from fastq to fa? I thought about using reference genome, not assembled sequence as I want to find out on what chromosome my insertion is located, what genes are nearby, etc. If I use assembled.fa, as far as I understand, I will not be able to get this information.

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

C: Identification of the sequence insertion site in the genome says that you do one of those two (or both if you want to). They are independent options though. That is the way I read it.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

Sorry for that. May I ask you whether it will work the way I want it to work if I use whole reference genome sequence and reference for bbmap.sh?

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

If you use the whole genome as a reference for BBMap, then you will be able to look at the sorted, indexed bam in IGV and see the exact genomic context of the insertion event. Basically, there will be a point that reads don't span. This will be much easier to find if you only use pairs where at least one read maps to the insertion sequence, because then the only part of the genome that will have any coverage will be the part around the insertion.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you so much @Brian and @genomax2! I very much appreciate your help!

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

@Brian would hopefully be along in 2-3 hours to answer that.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

You are right! It was my typo!

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

If the BAM file represents a pretty small insertion, how can it be found once opened in IGV? Thanks.

ADD REPLY • link 3.0 years ago by LacquerHed ▴ 30

0

Entering edit mode

You could visually check with the alignment_filtered.sam file above using a genome browser. The site should be easily apparent. What kind of genome is this?

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

Human genome, if I understood your question correctly

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

Then you would need to locate reads (by name, you can search in IGV) that correspond to the ones present in unmapped.sam file. Those would be the ones mapped at/near the insertion site.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Entering edit mode

I will try this! Thank you!

ADD REPLY • link 7.4 years ago by valerie ▴ 100

score 1 · Answer 2 · 2016-12-04

As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.

If you sequence is known, but you need to validate/find the insertion site then you have a number of options. If you augment your reference genome with your inserted sequence as an additional chromosome then identify your insertion site become the equivalent of identifying an interchromosomal translocation. You have a number of options and if you're looking for a tool that will do everything including the assembly then I would recommend manta (performs breakpoint assembly), or GRIDSS (performs breakend assembly).

If you're looking for an insertion site and you don't know the sequence being inserted (eg attempting to identify a novel virus) then it'a a significantly more difficult problem. In this case, you don't have many options and your best bet is probably NovelSeq as that's what it's designed for. GRIDSS could do a better job assembling the breakpoints but it doesn't yet include novel insertions in the output VCF, nor does it do de novo assembly to reconstruct the full length of the inserted sequence as NovelSeq does.

Disclaimer: I develop GRIDSS.

score 0 · Answer 3 · 2019-01-14

0

Entering edit mode

5.3 years ago

Vitis ★ 2.5k

In addition to the excellent tools recommended by the developer, I'll add that soft clipped reads can be queried and analyzed for insertions. Simply use a tool such as pysam to find the clipped reads.

https://pysam.readthedocs.io/en/latest/

ADD COMMENT • link 5.3 years ago by Vitis ★ 2.5k

0

Entering edit mode

I know this post is already vey old, but can you pls point to me where i can find out how to identify clipped reads using pysam. I can't find it in the docs.

ADD REPLY • link 5.2 years ago by Assa Yeroslaviz ★ 1.8k

1

Entering edit mode

Pysam helps you fetch the CIGAR strings for reads. CIGAR strings would give you information about clipping by "S" and "H" operations. For this specific problem, insertion would generate junctions between inserts and genome. Reads spanning the junctions would have clipping because for a junction read, half would match genome, and the other half would match the insert. By screening for these reads, you will be able to find the exact junction between insert and genome.

ADD REPLY • link 5.2 years ago by Vitis ★ 2.5k