Question: Identification of the sequence insertion site in the genome
2
gravatar for valerie
2.2 years ago by
valerie60
valerie60 wrote:

Hi guys,

I have an insertion in my genome and I want to identify the insertion site. As far as I understand I should find the read pairs where one read is mapped to my insertion sequence, but mate is unmapped. I found these reads using samtools:

samtools view -F 4 -f 8 alignment.sam > alignment_filtered.sam

Now I need to find the unmapped mates. How can I do that? As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.

Thank you in advance

sequencing next-gen genome • 1.3k views
ADD COMMENTlink modified 5 weeks ago by Vitis1.9k • written 2.2 years ago by valerie60
1
gravatar for Brian Bushnell
2.2 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can do that with the BBMap package like this:

filterbyname.sh in=alignment.sam out=both.sam names=alignment_filtered.sam

That will give you all half-mapped pairs in the same file. For just the unmapped mates, you could filter it a second time with samtools, or with Reformat, like this:

reformat.sh in=both.sam out=unmapped.sam unmappedonly
ADD COMMENTlink written 2.2 years ago by Brian Bushnell16k

Thank you very much Brian! Do you know how could I build the insertion cite sequences using the unmapped reads? Have you ever worked with any de novo alignment tools? Or maybe it is better to map these reads to the reference genome? Thanks in advance.

ADD REPLYlink written 2.2 years ago by valerie60
1

I would do this:

bbmap.sh in1=r1.fq in2=r2.fq outm=eitherMapped.fq ref=insertion.fasta
tadpole.sh in=eitherMapped.fq out=assembled.fa

That will give you the insertion and its genomic context assembled together. Alternatively, rather than assembling, you could re-map the insert-mapped reads to the reference:

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam bs=bs.sh

Running the generated shellscript bs.sh will generate a sorted, indexed bam file that you can observe in a genome browser (as suggested by Genomax) such as IGV, which will clearly show the exact site of the insertion.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Brian Bushnell16k

Thank you very much, Brian! I just realized that this is you, who developed this aligner! Amazing! :) May I ask you one more question? As far as I understand bbmap is developed for short reads and my reads are length of 100. Is it ok?

ADD REPLYlink written 2.2 years ago by valerie60
1

bbmap will align reads of any length up to 6kbp (you will need to use mapPacbio.sh variant for reads over 600bp).

ADD REPLYlink modified 2.2 years ago by Brian Bushnell16k • written 2.2 years ago by genomax62k

Hi guys. I finally finished the computations and generated both assembled.fa and genomeMapped.sam for two of my insertion sequences. Now I want to look at genomeMapped.sam using IGV. I've opened the sam file and noticed that RNAME there is the name of my insertion sequence. How can I see the insertion site on my reference genome then?

One more question is related to assembled.fa. It is a file in a format of:

@contig_13,length=122,cov=24.9,gc=0.352 TCTGGTAGCTGCAGATGCTGTGCAAAACAGATTAAATTCTATCGCATGAGACTTCATTGCAAATCTTCTAAGAAGTATCAGGATGAATATGATGTTTCTTCTATCATTCCAAACTTTCAGAT + ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @contig_18,length=120,cov=10.2,gc=0.658 TTGAGCACCTCCTGACAGACGGGGCAGTAGGATGTGGCCGCGGAGAGGTCCTCAGCCATGGCGGGGGATGGGGGAACAAGGCGATGGTGACAGCGGCAGCGGCCCCGATGGTGACATGGT + ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

I want to compare two insertion cites and find out whether they are different or not. Can I do this by comparing two assembled.fa somehow or I should better use sam file and IGV for that?

Thank you in advance!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by valerie60

Am I right that it should be

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam ref = reference_genome.fasta bs=bs.sh

instead of your script (without ref=reference_genome.fasta )?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by valerie60
1

There can be no spaces between options and = sign.

ADD REPLYlink written 2.2 years ago by genomax62k

assembled.fa (inspite of what the name says) appears to be in fastq format. How did that happen? I think you may have typed the name in as assembled.fq when you ran the tadpole.sh command.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by genomax62k

You mean use it as reference for the next stage? Like

bbmap.sh in=eitherMapped.fq out=genomeMapped.sam ref=assembled.fa bs=bs.sh

where assembled.fa is converted from fastq to fa? I thought about using reference genome, not assembled sequence as I want to find out on what chromosome my insertion is located, what genes are nearby, etc. If I use assembled.fa, as far as I understand, I will not be able to get this information.

ADD REPLYlink written 2.2 years ago by valerie60

C: Identification of the sequence insertion site in the genome says that you do one of those two (or both if you want to). They are independent options though. That is the way I read it.

ADD REPLYlink written 2.2 years ago by genomax62k

Sorry for that. May I ask you whether it will work the way I want it to work if I use whole reference genome sequence and reference for bbmap.sh?

ADD REPLYlink written 2.2 years ago by valerie60
1

If you use the whole genome as a reference for BBMap, then you will be able to look at the sorted, indexed bam in IGV and see the exact genomic context of the insertion event. Basically, there will be a point that reads don't span. This will be much easier to find if you only use pairs where at least one read maps to the insertion sequence, because then the only part of the genome that will have any coverage will be the part around the insertion.

ADD REPLYlink written 2.2 years ago by Brian Bushnell16k

Thank you so much @Brian and @genomax2! I very much appreciate your help!

ADD REPLYlink written 2.2 years ago by valerie60

@Brian would hopefully be along in 2-3 hours to answer that.

ADD REPLYlink written 2.2 years ago by genomax62k

You are right! It was my typo!

ADD REPLYlink written 2.2 years ago by valerie60

You could visually check with the alignment_filtered.sam file above using a genome browser. The site should be easily apparent. What kind of genome is this?

ADD REPLYlink written 2.2 years ago by genomax62k

Human genome, if I understood your question correctly

ADD REPLYlink written 2.2 years ago by valerie60
1

Then you would need to locate reads (by name, you can search in IGV) that correspond to the ones present in unmapped.sam file. Those would be the ones mapped at/near the insertion site.

ADD REPLYlink written 2.2 years ago by genomax62k

I will try this! Thank you!

ADD REPLYlink written 2.2 years ago by valerie60
1
gravatar for d-cameron
2.2 years ago by
d-cameron2.0k
Australia
d-cameron2.0k wrote:

As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.

If you sequence is known, but you need to validate/find the insertion site then you have a number of options. If you augment your reference genome with your inserted sequence as an additional chromosome then identify your insertion site become the equivalent of identifying an interchromosomal translocation. You have a number of options and if you're looking for a tool that will do everything including the assembly then I would recommend manta (performs breakpoint assembly), or GRIDSS (performs breakend assembly).

If you're looking for an insertion site and you don't know the sequence being inserted (eg attempting to identify a novel virus) then it'a a significantly more difficult problem. In this case, you don't have many options and your best bet is probably NovelSeq as that's what it's designed for. GRIDSS could do a better job assembling the breakpoints but it doesn't yet include novel insertions in the output VCF, nor does it do de novo assembly to reconstruct the full length of the inserted sequence as NovelSeq does.

Disclaimer: I develop GRIDSS.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by d-cameron2.0k
0
gravatar for Vitis
5 weeks ago by
Vitis1.9k
New York
Vitis1.9k wrote:

In addition to the excellent tools recommended by the developer, I'll add that soft clipped reads can be queried and analyzed for insertions. Simply use a tool such as pysam to find the clipped reads.

https://pysam.readthedocs.io/en/latest/

ADD COMMENTlink written 5 weeks ago by Vitis1.9k

I know this post is already vey old, but can you pls point to me where i can find out how to identify clipped reads using pysam. I can't find it in the docs.

ADD REPLYlink written 7 days ago by Assa Yeroslaviz1.1k
1

Pysam helps you fetch the CIGAR strings for reads. CIGAR strings would give you information about clipping by "S" and "H" operations. For this specific problem, insertion would generate junctions between inserts and genome. Reads spanning the junctions would have clipping because for a junction read, half would match genome, and the other half would match the insert. By screening for these reads, you will be able to find the exact junction between insert and genome.

ADD REPLYlink written 7 days ago by Vitis1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1453 users visited in the last hour