Question: Identification of short insertions in genome
valerie60 wrote (22 months ago):

Hi guys,

I need to identify insertions of specific sequences in my genomes. I have WGS data with a read length of 250 nt. Previously I worked on identifying insertions of large sequences, and for that I simply looked for discordantly mapped reads using bowtie2 and BBMap. Now I am looking for sequences that are shorter than the read length. Since I want these data to be comparable to my previous computations, I want to use bowtie2 for alignment again. As far as I understand, I should use local mode. I've tried this, but got alignments of very short parts of my reads. Is there a way to require the alignment length to be equal to the reference sequence length? Once I identify the reads that map to the reference sequences, I will find where their mates map in the reference genome to locate the insertion site. Is that a correct strategy?

Thanks in advance!

Tags: alignment, wgs
timstuart30 wrote (22 months ago):

I'd recommend using lumpy or delly2 for small indels, rather than trying to reinvent your own method:

https://github.com/dellytools/delly

https://github.com/arq5x/lumpy-sv


Thank you! Actually they are almost the size of a read (~230 nt); is that a small indel?

written 22 months ago by valerie60

I think delly and lumpy should both be able to detect indels of 200-300 bp without much trouble.

written 22 months ago by timstuart30

Great! Thank you, Tim!

written 22 months ago by valerie60

230 bp! That looks more like a tandem duplication than an indel. Just a guess.

written 22 months ago by poisonAlien
Brian Bushnell (Walnut Creek, USA) wrote (22 months ago):

You can call insertions near read length with BBMap like this (assuming paired interleaved reads that mostly overlap):

bbmerge.sh in=reads.fq out=merged.fq
reformat.sh in=merged.fq out=filtered.fq minlen=350
bbmap.sh in=filtered.fq out=mapped.sam ref=ref.fa slow
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa ploidy=2

That's a variant of a longer and slower pipeline that I used to successfully call insertions from 1bp to 219bp using 2x100bp reads, on E. coli. If you have sufficient time, coverage, and memory, you can use this method instead (adjusted for 2x250bp reads) for even longer insertions; it's great on small genomes but would take a lot of memory for human-sized ones:

filterbytile.sh in=reads.fq.gz out=filtered_by_tile.fq.gz
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=220 ref=adapters.fa
clumpify.sh in=trimmed.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc prefilter=2
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62 prefilter=2
bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=93 extend2=120 prefilter=2
bbmap.sh in=merged.fq.gz out=merged.sam.gz slow ref=ref.fa
callvariants.sh in=merged.sam.gz out=vars.vcf ref=ref.fa ploidy=2

Dear Brian, could you please explain in a few words what is going on here? Should I add my sequences of interest to the reference genome in ref.fa to use the first option? Thank you!

written 22 months ago by valerie60

Nope, definitely do not add your sequences to the genome! The genome should just be the reference.

The way this works is...

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa slow
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa

That maps reads to the reference and calls variants (including insertions) based on the CIGAR strings of the mapped reads. BBMap is very good at mapping reads with indels, and with the "slow" flag it will allow insertions up to roughly 50% of the read length. So in this case you'd call insertions up to ~125bp using your 250bp reads. However, 230bp is too long for this approach; to call such insertions you would need longer reads, at least ~460bp. Depending on your insert size distribution, 2x250bp reads can be merged to produce merged reads of up to ~490bp. So the commands merge the overlapping reads and filter out the short ones (under 350bp; possibly 400bp would be a better cutoff). By mapping the remaining long merged reads, it is possible to call 230bp insertions, though that's near the limit.
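To make the CIGAR-based idea concrete, here is a minimal Python sketch of how an insertion can be read off a mapped read's CIGAR string. This is only an illustration of the concept; it is not BBMap or CallVariants code, and the function name is made up for this example.

```python
import re

def insertions_from_cigar(cigar, ref_start):
    """Return (reference_position, length) for each insertion ('I') op.

    Simplified model: M/=/X/D/N operations consume reference positions,
    while I and S consume only the read.
    """
    pos = ref_start
    found = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "I":
            found.append((pos, length))
        if op in "M=XDN":
            pos += length
    return found

# A merged read mapped at position 1000, carrying a 230 bp insertion
# after 120 aligned bases:
print(insertions_from_cigar("120M230I130M", 1000))  # [(1120, 230)]
```

A variant caller essentially aggregates such events across many reads and reports the well-supported ones.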

The second set of commands does a better job by extending the reads using kmers. This allows merging of non-overlapping reads using information from other reads that span the gap, and additionally lengthens both ends of all the reads, potentially resulting in reads over 800bp long, which could allow detection of ~400bp insertions, though that would require a library with quite large insert sizes. Several of the lines in the second block (such as filterbytile and clumpify) are devoted to error removal and error correction, because the process of read extension is very sensitive to errors: reads with errors near the ends can't be extended.
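The kmer-extension idea can be sketched in a few lines of Python. This is a toy greedy extender, not Tadpole's actual algorithm: it records which base follows each (k-1)-mer across all reads and extends a read to the right while exactly one well-supported continuation exists. Real tools use much larger k (e.g. k=62 above), proper error models, and extend in both directions.

```python
from collections import Counter, defaultdict

K = 5  # toy k-mer size for illustration only

def kmer_table(reads, k=K):
    """Map each (k-1)-mer prefix to counts of the base that follows it."""
    table = defaultdict(Counter)
    for r in reads:
        for i in range(len(r) - k + 1):
            table[r[i:i + k - 1]][r[i + k - 1]] += 1
    return table

def extend_right(read, table, max_ext, k=K):
    """Greedily extend a read rightward while one unambiguous base dominates."""
    for _ in range(max_ext):
        votes = table.get(read[-(k - 1):])
        if not votes:
            break
        base, count = votes.most_common(1)[0]
        if count < 2 or len(votes) > 1:  # stop at low coverage or a branch
            break
        read += base
    return read

# Three overlapping reads; the first can be extended using the others:
reads = ["ACGTACGGTT", "TACGGTTCAA", "GGTTCAAGGC"]
print(extend_right("ACGTACGGTT", kmer_table(reads), max_ext=8))
# ACGTACGGTTCAA
```

The stopping conditions mirror why extension is coverage-sensitive: a position covered by too few agreeing reads, or by conflicting reads, halts the walk.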

written 22 months ago by Brian Bushnell

Brian, thank you so much for your explanation! May I ask one more question? I am actually looking for new copies of repetitive elements in cancer genomes. I suppose that tumors are mosaic, and I want to "measure" the activity of the transposition process rather than identify each insertion precisely. This means I want to take into account every read pair that corresponds to a transposition event, not only well-covered transpositions that are present in every cell. Can I somehow play with coverage when calling variants and also look for variants with minimal coverage?

written 22 months ago by valerie60

Sure. CallVariants has coverage and allele-frequency flags that allow low-depth variants to be called. For example, I'd suggest adding the flag "rarity=0.05" if you want to call events down to 5% allele frequency without penalizing their score.

However, you may not be able to do a good job of extending reads with kmers for low-depth variants, since that process relies on coverage (it may not work for a variant covered by only 3 reads, for example). Simply merging the reads without extension does not rely on coverage, though.
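As a toy illustration of what an allele-frequency threshold like rarity=0.05 means, the sketch below computes the allele frequency of an event from its supporting reads and total depth. The function is hypothetical and only shows the arithmetic, not CallVariants' actual scoring model.

```python
def passes_rarity(alt_reads, total_depth, rarity=0.05):
    """Keep a variant whose allele frequency (alt/depth) meets the threshold.

    Hypothetical helper illustrating the rarity idea, not BBTools code.
    """
    if total_depth == 0:
        return False
    return alt_reads / total_depth >= rarity

# A mosaic event supported by 3 of 40 reads (7.5% allele frequency):
print(passes_rarity(3, 40))   # True at the 5% threshold
print(passes_rarity(1, 40))   # 2.5% -> False
```

This is why lowering the rarity threshold surfaces subclonal transposition events that a default, germline-oriented caller would discard.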

written 22 months ago by Brian Bushnell

Thank you very much for your help, Brian!

written 22 months ago by valerie60