detecting putative t-dna insertion from genome sequencing data from Arabidopsis
4
0
Entering edit mode
8.1 years ago

Hi

Problem:

I have genome seq PE reads of an Arabidopsis mutant. What I know of this mutant is that: a) it has a t-DNA insertion somewhere, containing a GUS transgene. b) it has at least another mutation, causing a phenotype I am interested in.

What could be a possible strategy to identify where the GUS transgene is placed, and also to find other possible insertions or deletions in the genome?

What I tried I mapped reads to the genome, and assembled contigs from unmapped reads. This way I could assemble a contig containing part of the t-dna insertion, but couldn't find the flanking sequence (so I don't know where it is). I could find this contig just by searching for the GUS sequence. But I also would like to find a way to detect other possible insertions of which I don't know any sequence, and I don't know how.

I also tried to look at all the reads that were mapped and had their mate unmapped. I thought I could detect this way the insertion, but I end up with lots of reads everywhere (I don't know if it is normal to have so many unmapped reads, or if I used wrong mapping parameters).

Any suggestion?

t-dna insertions genome sequencing • 3.1k views
ADD COMMENT
1
Entering edit mode
7.7 years ago

You might try detecting insertions with manta: https://github.com/Illumina/manta (full disclosure: I'm an author). This is a general purpose SV and indel caller which includes insertion detection -- it will report both fully assembled insertions and "imprecise" insertion points where a putative insertion start and end point are found but the insert sequence cannot be fully assembled. Although it is used predominantly on human samples it requires no species-specific data or MIE sequences, just a reference and PE read alignments, so it may work reasonably well for Arabidopsis.

ADD COMMENT
0
Entering edit mode
7.7 years ago

Hello andreabarghetti, Have you find any solution how to find the position of the transgene from the genome sequencing data?

ADD COMMENT
0
Entering edit mode
7.7 years ago

I also tried to look at all the reads that were mapped and had their mate unmapped. I thought I could detect this way the insertion, but I end up with lots of reads everywhere (I don't know if it is normal to have so many unmapped reads, or if I used wrong mapping parameters).

Yes it can be normal, there are always some unmapped reads (you can check the % of mapping with ‘samtools flagstats')...

To reduce the background, you can try to filter those reads further by taking only the mapped mates of the unmapped reads that specifically blast on your transgene sequence.

ADD COMMENT
0
Entering edit mode
7.7 years ago

Thanks @Carlo Yague, Can you explain a bit in detail. First I need to map to know how much % of reads are un-mapped (as my gene is not from Arabidopsis, so it should be in un-mapped reads. Right?). Then I will blast those un-mapped reads to the sequence of my gene? (What tool would you recommend for this?)

ADD COMMENT
0
Entering edit mode
  • To know how many reads are unmapped you need to check the flags : samtools flagstats
  • To extract the unmapped reads whose mate are mapped : samtools view with correct flag filters
  • To blast them on your sequence, you can use... blast+ !

-> At this point you have the unmapped reads that come from your gene of insertion. To know where it is inserted in your genome, you can look at the mapping position of their mapped mates

ADD REPLY
0
Entering edit mode

I used samtools flagstats, it gave : 45809554 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 33530536 + 0 mapped (73.20% : N/A) 45809554 + 0 paired in sequencing 22904777 + 0 read1 22904777 + 0 read2 31581052 + 0 properly paired (68.94% : N/A) 31766210 + 0 with itself and mate mapped 1764326 + 0 singletons (3.85% : N/A) 84670 + 0 with mate mapped to a different chr 60488 + 0 with mate mapped to a different chr (mapQ>=5) Then i extracted unmapped reads using the command $samtools view -b -f 4 input.bam > unmapped.output and it gave an output bam file

Now I need to blast them to my gene sequence right? but I dont understand this part of your answer " To know where it is inserted in your genome, you can look at the mapping position of their mapped mates" can you please explain it a bit.. Sorry for any inconvenience it may cause :(

ADD REPLY
0
Entering edit mode

No problem ! Imagine a pair of reads that come from the region of insertion. One read is unmapped on the genome because it is located within the insertion. The second read (they are mates), however, mapped on the genome because it is located outside the gene of insertion. Now if you can confirm that the first read (the unmapped one) really comes from the insertion with blast, the second one (the mapped one) will give you the approximate genomic location of the insert.

ADD REPLY
0
Entering edit mode

Hmm I got it thanks. But how can I extract those mate reads? using: $samtools view -b -f 8 input.bam > mate.output? Thanks for your kind help :)

ADD REPLY
0
Entering edit mode

You can play with -f and -F. Note that both reads are each other mates so don't be confused. To extract the alignments of the mates of the reads that blast on your sequence, you can simply use grep (both mates have the same name). Here is the code I would use.

# get reads unmapped with mate mapped then extract sequence as fasta (for the blast)
samtools view -f 5 -F 8 unmapped.bam | awk '{OFS="\t"; print ">"$1"\n"$10}' - > oneEndUnmapped.fasta

# get reads mapped with mate unmapped as sam (only primary alignments)
samtools view -f 9 -F 260 accepted_hits.bam  > oneEndMapped.sam

# get header only (will be useful later)
samtools view -H oneEndMapped.bam > oneEndMapped.header

# blast
blastn -query oneEndUnmapped.fasta -task blastn -db my_insertion.fasta  -outfmt 6 -max_target_seqs 1 > oneEndUnmapped.blasted

# get the mates of the hits (you might filter the hits before that)
awk '{print $1}' oneEndUnmapped.blasted | grep -F -f - oneEndMapped.sam | cat oneEndMapped.header - | samtools view -Sb - > mates_of_reads_blasted.bam
ADD REPLY
0
Entering edit mode

@Carlo Yague, thanks a lot. I tried to execute these codes, first three worked but there are some errors when I try to run the blast like:

Error: (106.18) NCBI C++ Exception: Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: /Users/znasim09/Documents/ncbi-blast-2.2.18+/bin/blastn Error: (CArgException::eSynopsis) Application's initialization failed

I don't know whats wrong with it :(

ADD REPLY
0
Entering edit mode

There were some errors, now its working. I got a blasted file of about 50kb, is it ok to get such small sized file?

ADD REPLY
0
Entering edit mode

I guess its ok, the output gives you one line by significant match. You can try wc -l oneEndUnmapped.blasted to know the exact number of lines.

ADD REPLY
0
Entering edit mode

The wc -l oneEndUnmapped.blasted gave "468 oneEndUnmapped.blasted" result.

However, by using the last code (mentioned by you), I got a bam file of around 400bytes. I guess there is something wrong with it. As I tried to convert it to fasta but it resulted empty file.

ADD REPLY
0
Entering edit mode

Look at the intermediate sam file :

awk '{print $1}' oneEndUnmapped.blasted | grep -F -f - oneEndMapped.sam > temp.sam
ADD REPLY

Login before adding your answer.

Traffic: 2457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6