Question: detecting putative t-dna insertion from genome sequencing data from Arabidopsis
0
gravatar for andreabarghetti
3.1 years ago by
andreabarghetti0 wrote:

Hi

Problem:

I have genome seq PE reads of an Arabidopsis mutant. What I know of this mutant is that: a) it has a t-DNA insertion somewhere, containing a GUS transgene. b) it has at least another mutation, causing a phenotype I am interested in.

What could be a possible strategy to identify where the GUS transgene is placed, and also to find other possible insertions or deletions in the genome?

What I tried I mapped reads to the genome, and assembled contigs from unmapped reads. This way I could assemble a contig containing part of the t-dna insertion, but couldn't find the flanking sequence (so I don't know where it is). I could find this contig just by searching for the GUS sequence. But I also would like to find a way to detect other possible insertions of which I don't know any sequence, and I don't know how.

I also tried to look at all the reads that were mapped and had their mate unmapped. I thought I could detect this way the insertion, but I end up with lots of reads everywhere (I don't know if it is normal to have so many unmapped reads, or if I used wrong mapping parameters).

Any suggestion?

ADD COMMENTlink modified 2.7 years ago by Chris Saunders170 • written 3.1 years ago by andreabarghetti0
1
gravatar for Chris Saunders
2.7 years ago by
Illumina
Chris Saunders170 wrote:

You might try detecting insertions with manta: https://github.com/Illumina/manta (full disclosure: I'm an author). This is a general purpose SV and indel caller which includes insertion detection -- it will report both fully assembled insertions and "imprecise" insertion points where a putative insertion start and end point are found but the insert sequence cannot be fully assembled. Although it is used predominantly on human samples it requires no species-specific data or MIE sequences, just a reference and PE read alignments, so it may work reasonably well for Arabidopsis.

ADD COMMENTlink written 2.7 years ago by Chris Saunders170
0
gravatar for nasim.zeeshan25
2.7 years ago by
Korea, Republic Of
nasim.zeeshan250 wrote:

Hello andreabarghetti, Have you find any solution how to find the position of the transgene from the genome sequencing data?

ADD COMMENTlink written 2.7 years ago by nasim.zeeshan250
0
gravatar for Carlo Yague
2.7 years ago by
Carlo Yague4.4k
Belgium
Carlo Yague4.4k wrote:

I also tried to look at all the reads that were mapped and had their mate unmapped. I thought I could detect this way the insertion, but I end up with lots of reads everywhere (I don't know if it is normal to have so many unmapped reads, or if I used wrong mapping parameters).

Yes it can be normal, there are always some unmapped reads (you can check the % of mapping with ‘samtools flagstats')...

To reduce the background, you can try to filter those reads further by taking only the mapped mates of the unmapped reads that specifically blast on your transgene sequence.

ADD COMMENTlink written 2.7 years ago by Carlo Yague4.4k
0
gravatar for nasim.zeeshan25
2.7 years ago by
Korea, Republic Of
nasim.zeeshan250 wrote:

Thanks @Carlo Yague, Can you explain a bit in detail. First I need to map to know how much % of reads are un-mapped (as my gene is not from Arabidopsis, so it should be in un-mapped reads. Right?). Then I will blast those un-mapped reads to the sequence of my gene? (What tool would you recommend for this?)

ADD COMMENTlink written 2.7 years ago by nasim.zeeshan250
  • To know how many reads are unmapped you need to check the flags : samtools flagstats
  • To extract the unmapped reads whose mate are mapped : samtools view with correct flag filters
  • To blast them on your sequence, you can use... blast+ !

-> At this point you have the unmapped reads that come from your gene of insertion. To know where it is inserted in your genome, you can look at the mapping position of their mapped mates

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Carlo Yague4.4k

I used samtools flagstats, it gave : 45809554 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 33530536 + 0 mapped (73.20% : N/A) 45809554 + 0 paired in sequencing 22904777 + 0 read1 22904777 + 0 read2 31581052 + 0 properly paired (68.94% : N/A) 31766210 + 0 with itself and mate mapped 1764326 + 0 singletons (3.85% : N/A) 84670 + 0 with mate mapped to a different chr 60488 + 0 with mate mapped to a different chr (mapQ>=5) Then i extracted unmapped reads using the command $samtools view -b -f 4 input.bam > unmapped.output and it gave an output bam file

Now I need to blast them to my gene sequence right? but I dont understand this part of your answer " To know where it is inserted in your genome, you can look at the mapping position of their mapped mates" can you please explain it a bit.. Sorry for any inconvenience it may cause :(

ADD REPLYlink written 2.7 years ago by nasim.zeeshan250

No problem ! Imagine a pair of reads that come from the region of insertion. One read is unmapped on the genome because it is located within the insertion. The second read (they are mates), however, mapped on the genome because it is located outside the gene of insertion. Now if you can confirm that the first read (the unmapped one) really comes from the insertion with blast, the second one (the mapped one) will give you the approximate genomic location of the insert.

ADD REPLYlink written 2.7 years ago by Carlo Yague4.4k

Hmm I got it thanks. But how can I extract those mate reads? using: $samtools view -b -f 8 input.bam > mate.output? Thanks for your kind help :)

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by nasim.zeeshan250

You can play with -f and -F. Note that both reads are each other mates so don't be confused. To extract the alignments of the mates of the reads that blast on your sequence, you can simply use grep (both mates have the same name). Here is the code I would use.

# get reads unmapped with mate mapped then extract sequence as fasta (for the blast)
samtools view -f 5 -F 8 unmapped.bam | awk '{OFS="\t"; print ">"$1"\n"$10}' - > oneEndUnmapped.fasta

# get reads mapped with mate unmapped as sam (only primary alignments)
samtools view -f 9 -F 260 accepted_hits.bam  > oneEndMapped.sam

# get header only (will be useful later)
samtools view -H oneEndMapped.bam > oneEndMapped.header

# blast
blastn -query oneEndUnmapped.fasta -task blastn -db my_insertion.fasta  -outfmt 6 -max_target_seqs 1 > oneEndUnmapped.blasted

# get the mates of the hits (you might filter the hits before that)
awk '{print $1}' oneEndUnmapped.blasted | grep -F -f - oneEndMapped.sam | cat oneEndMapped.header - | samtools view -Sb - > mates_of_reads_blasted.bam
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Carlo Yague4.4k

@Carlo Yague, thanks a lot. I tried to execute these codes, first three worked but there are some errors when I try to run the blast like:

Error: (106.18) NCBI C++ Exception: Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: /Users/znasim09/Documents/ncbi-blast-2.2.18+/bin/blastn Error: (CArgException::eSynopsis) Application's initialization failed

I don't know whats wrong with it :(

ADD REPLYlink written 2.7 years ago by nasim.zeeshan250

There were some errors, now its working. I got a blasted file of about 50kb, is it ok to get such small sized file?

ADD REPLYlink written 2.7 years ago by nasim.zeeshan250

I guess its ok, the output gives you one line by significant match. You can try wc -l oneEndUnmapped.blasted to know the exact number of lines.

ADD REPLYlink written 2.7 years ago by Carlo Yague4.4k

The wc -l oneEndUnmapped.blasted gave "468 oneEndUnmapped.blasted" result.

However, by using the last code (mentioned by you), I got a bam file of around 400bytes. I guess there is something wrong with it. As I tried to convert it to fasta but it resulted empty file.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by nasim.zeeshan250

Look at the intermediate sam file :

awk '{print $1}' oneEndUnmapped.blasted | grep -F -f - oneEndMapped.sam > temp.sam
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Carlo Yague4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1705 users visited in the last hour