Identification of the sequence insertion site in the genome
3
3
Entering edit mode
4.4 years ago
valerie ▴ 90

Hi guys,

I have an insertion in my genome and I want to identify the insertion site. As far as I understand I should find the read pairs where one read is mapped to my insertion sequence, but mate is unmapped. I found these reads using samtools:

samtools view -F 4 -f 8 alignment.sam > alignment_filtered.sam

Now I need to find the unmapped mates. How can I do that? As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.

sequencing genome next-gen • 2.5k views
1
Entering edit mode
4.4 years ago

You can do that with the BBMap package like this:

filterbyname.sh in=alignment.sam out=both.sam names=alignment_filtered.sam


That will give you all half-mapped pairs in the same file. For just the unmapped mates, you could filter it a second time with samtools, or with Reformat, like this:

reformat.sh in=both.sam out=unmapped.sam unmappedonly

0
Entering edit mode

Thank you very much Brian! Do you know how could I build the insertion cite sequences using the unmapped reads? Have you ever worked with any de novo alignment tools? Or maybe it is better to map these reads to the reference genome? Thanks in advance.

1
Entering edit mode

I would do this:

bbmap.sh in1=r1.fq in2=r2.fq outm=eitherMapped.fq ref=insertion.fasta


That will give you the insertion and its genomic context assembled together. Alternatively, rather than assembling, you could re-map the insert-mapped reads to the reference:

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam bs=bs.sh


Running the generated shellscript bs.sh will generate a sorted, indexed bam file that you can observe in a genome browser (as suggested by Genomax) such as IGV, which will clearly show the exact site of the insertion.

0
Entering edit mode

Thank you very much, Brian! I just realized that this is you, who developed this aligner! Amazing! :) May I ask you one more question? As far as I understand bbmap is developed for short reads and my reads are length of 100. Is it ok?

1
Entering edit mode

bbmap will align reads of any length up to 6kbp (you will need to use mapPacbio.sh variant for reads over 600bp).

0
Entering edit mode

Hi guys. I finally finished the computations and generated both assembled.fa and genomeMapped.sam for two of my insertion sequences. Now I want to look at genomeMapped.sam using IGV. I've opened the sam file and noticed that RNAME there is the name of my insertion sequence. How can I see the insertion site on my reference genome then?

One more question is related to assembled.fa. It is a file in a format of:

@contig_13,length=122,cov=24.9,gc=0.352 TCTGGTAGCTGCAGATGCTGTGCAAAACAGATTAAATTCTATCGCATGAGACTTCATTGCAAATCTTCTAAGAAGTATCAGGATGAATATGATGTTTCTTCTATCATTCCAAACTTTCAGAT + ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @contig_18,length=120,cov=10.2,gc=0.658 TTGAGCACCTCCTGACAGACGGGGCAGTAGGATGTGGCCGCGGAGAGGTCCTCAGCCATGGCGGGGGATGGGGGAACAAGGCGATGGTGACAGCGGCAGCGGCCCCGATGGTGACATGGT + ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

I want to compare two insertion cites and find out whether they are different or not. Can I do this by comparing two assembled.fa somehow or I should better use sam file and IGV for that?

0
Entering edit mode

Am I right that it should be

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam ref = reference_genome.fasta bs=bs.sh


1
Entering edit mode

There can be no spaces between options and = sign.

0
Entering edit mode

assembled.fa (inspite of what the name says) appears to be in fastq format. How did that happen? I think you may have typed the name in as assembled.fq when you ran the tadpole.sh command.

0
Entering edit mode

You mean use it as reference for the next stage? Like

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam ref=assembled.fa bs=bs.sh


where assembled.fa is converted from fastq to fa? I thought about using reference genome, not assembled sequence as I want to find out on what chromosome my insertion is located, what genes are nearby, etc. If I use assembled.fa, as far as I understand, I will not be able to get this information.

0
Entering edit mode

C: Identification of the sequence insertion site in the genome says that you do one of those two (or both if you want to). They are independent options though. That is the way I read it.

0
Entering edit mode

Sorry for that. May I ask you whether it will work the way I want it to work if I use whole reference genome sequence and reference for bbmap.sh?

1
Entering edit mode

If you use the whole genome as a reference for BBMap, then you will be able to look at the sorted, indexed bam in IGV and see the exact genomic context of the insertion event. Basically, there will be a point that reads don't span. This will be much easier to find if you only use pairs where at least one read maps to the insertion sequence, because then the only part of the genome that will have any coverage will be the part around the insertion.

0
Entering edit mode

Thank you so much @Brian and @genomax2! I very much appreciate your help!

0
Entering edit mode

@Brian would hopefully be along in 2-3 hours to answer that.

0
Entering edit mode

You are right! It was my typo!

0
Entering edit mode

If the BAM file represents a pretty small insertion, how can it be found once opened in IGV? Thanks.

0
Entering edit mode

You could visually check with the alignment_filtered.sam file above using a genome browser. The site should be easily apparent. What kind of genome is this?

0
Entering edit mode

Human genome, if I understood your question correctly

1
Entering edit mode

Then you would need to locate reads (by name, you can search in IGV) that correspond to the ones present in unmapped.sam file. Those would be the ones mapped at/near the insertion site.

0
Entering edit mode

I will try this! Thank you!

1
Entering edit mode
4.4 years ago
d-cameron ★ 2.3k

As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.

If you sequence is known, but you need to validate/find the insertion site then you have a number of options. If you augment your reference genome with your inserted sequence as an additional chromosome then identify your insertion site become the equivalent of identifying an interchromosomal translocation. You have a number of options and if you're looking for a tool that will do everything including the assembly then I would recommend manta (performs breakpoint assembly), or GRIDSS (performs breakend assembly).

If you're looking for an insertion site and you don't know the sequence being inserted (eg attempting to identify a novel virus) then it'a a significantly more difficult problem. In this case, you don't have many options and your best bet is probably NovelSeq as that's what it's designed for. GRIDSS could do a better job assembling the breakpoints but it doesn't yet include novel insertions in the output VCF, nor does it do de novo assembly to reconstruct the full length of the inserted sequence as NovelSeq does.

Disclaimer: I develop GRIDSS.

0
Entering edit mode
2.3 years ago
Vitis ★ 2.4k

In addition to the excellent tools recommended by the developer, I'll add that soft clipped reads can be queried and analyzed for insertions. Simply use a tool such as pysam to find the clipped reads.

0
Entering edit mode

I know this post is already vey old, but can you pls point to me where i can find out how to identify clipped reads using pysam. I can't find it in the docs.

1
Entering edit mode

Pysam helps you fetch the CIGAR strings for reads. CIGAR strings would give you information about clipping by "S" and "H" operations. For this specific problem, insertion would generate junctions between inserts and genome. Reads spanning the junctions would have clipping because for a junction read, half would match genome, and the other half would match the insert. By screening for these reads, you will be able to find the exact junction between insert and genome.