Question

Finding Plasmid Contruct Insertion in Genome through Split Alignment

1

Entering edit mode

6.8 years ago

chase.lewis ▴ 40

Hi guys, I am relatively new to alignment tools and bioinformatics in general. I have seen several questions on the forums pertaining to my question, but after following through on the answers, I can't seem to get the right result. Basically, I am trying to find where a plasmid tDNA insertion of known sequence is in an arabidopsis thaliana genome. My plan was to use split alignment methods for this. I have a reference genome consisting of 5 chromosomes in fasta format, to which I apended the known plasmid contruct sequence with the header ">Plasmid". I have paired end reads that I want to align to multiple places in the "Genome+Plasmid" reference with the hope that I can later use BLAST to find which reads align to one of the chromosomes as well as the plasmid. With this information I hope to use BLAST to determine the gene that the tDNA insertion is next to. It is my understanding thus far that I want split or "chimeric" reads, but not supplementary reads. I have a good understanding of how to use BLAST, but I am having difficulty finding split alignment reads. So far, I have tried bwa-mem, segemehl, and bowtie2 on ubuntu, but none have worked. I'm not sure if this is because I am incorrectly setting parameters or some other issue. Also, once the split alignments are found and marked, I'm not sure how to extract them. I also am unable to open .sam files, (not sure if people normally can?) If anyone could show me a step-by-step process with one of the above alignment tools (or a new tool if necessary) of how to find and extract split reads, this help would be greatly appreciated. If I am unclear in any way, please let me know. Thanks

segemehl commands- resulted in no reads ./segemehl.x -i refindex.idx -d ref.fas -q reads1.fasta -p reads2.fasta > output.sam samtools view -bS -o output.bam output.sam samtools sort output.bam sorted.bam samtools view -h -o sorted.sam sorted.bam ./testrealign.x -d ref.fas -q sorted.sam -n

bowtie2- not sure how to extract bowtie2 -x ref.fas.bowtie2 -1 reads1.fq.gz -2 reads2.fq.gz -S output.sam --local

bwa mem bwa mem ref.fas reads1.fq reads2.fq > output.sam samtools view output.sam I tried several different commands after this but I think I was going in the wrong direction. I believe I tried the virst command with the -M function as well. how can I extract split reads from this point?

alignment gene • 3.2k views

ADD COMMENT • link 6.8 years ago by chase.lewis ▴ 40

0

Entering edit mode

So far, I have tried bwa-mem, segemehl, and bowtie2 on ubuntu, but none have worked.

What does this mean? Do they produce an error? Don't forget to add the exact commands you used for the alignment.

I also am unable to open .sam files, (not sure if people normally can?)

How did you try to open them?

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

I've looked online but have not been able to find any program that can open .sam files in a readable form. Basically I have not idea how to open them, so I haven't tried any method. For the alignment tools, when I used bwa-mem, I went through all the steps but did not find any split reads in the end. I'm not sure why. When I used bowtie2, it told me after the alignment that about 6000 reads aligned more than once, but I was not sure how to extract these. Also, I don't believe these reads are completely what I want, as they include reads where the same parts of the read align to different places. I want reads where different parts of the read align to different places in the reference. With segemehl, the program did not find any split reads in the end. I will soon post the exact commands I used.

ADD REPLY • link 6.8 years ago by chase.lewis ▴ 40

0

Entering edit mode

Look into samtools and supplementary SAM flags (2048): https://samtools.github.io/hts-specs/SAMv1.pdf

ADD REPLY • link 6.8 years ago by cpak1981 ▴ 140

0

Entering edit mode

I've tried something like this, but I will look into it more.

ADD REPLY • link 6.8 years ago by chase.lewis ▴ 40

0

Entering edit mode

Have you seen this thread: Identification of the sequence insertion site in the genome

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

Entering edit mode

I had not seen that forum. I'm not sure its the exact issue as mine because it seems that their insertion sequence is unknown. I forgot to mention that mine is known, which should make my problem simpler.

ADD REPLY • link 6.8 years ago by chase.lewis ▴ 40

0

Entering edit mode

I forgot to mention that mine is known, which should make my problem simpler

Indeed!

So what is the exact problem. BTW, SAM files are plain text and should be viewable using cat/less etc on unix. Don't try to open them on Windows with a GUI based editor.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

Entering edit mode

The problem is that when I go through all the steps for the programs, then try to extract using SAMtools view (and setting different flags), it does not appear to find any split reads. Is there a typical protocol for finding split reads with programs such as bwa mem? Should I be marking short split reads with the -M function? And how should I be attempting to extract the split reads after alignment?

ADD REPLY • link 6.8 years ago by chase.lewis ▴ 40

0

Entering edit mode

Very Large Insertion Detection Methodology Query In this thread, the answers basically say that bwa mem can make a sam file for me that contains any read where one end maps to one place in the genome and one end maps to another place in the genome. I'm not sure how to get bwa mem to do this for me. Somehow, I guess maybe with Samtools, although I'm not totally sure how, I want to extract these reads and individually BLAST them against the reference to make sure they are actually aligning to a place in the genome as well as the appended plasmid sequence (and not just two normal places in the genome).

ADD REPLY • link 6.8 years ago by chase.lewis ▴ 40

score 1 · Answer 1 · 2017-06-23

Well, I manged to solve the issue. I aligned the reads to the (Genome+Plasmid) reference using bwa mem without -M function. Then I converted to bam with samtools. I then used the program lumpy (lumpyexpress function) which had a specific built in script that allowed me to extract the split reads and put them in a new bam file, then I sorted this with samtools sort. How to activate the script can be found here https://github.com/arq5x/lumpy-sv under "pre-processing". (Heads up: lumpy has many prerequisites, at least compared to what I am used to, and these have their own prerequisites, so enjoy your journey down the rabbit hole) I then used this site here https://carleshf87.wordpress.com/2013/10/28/extracting-reads-for-a-single-chromosome-from-bamsam-file-with-samtools/ which gives a guide on how to extract reads that have a primary alignment to a certain chromosome. I chose as "Plasmid" because I had appended the sequence to my reference, and I wanted split reads that aligned to both this and another place in one of the chromosomes, which would indicate an insertion site. The file is extracted as a sam file. Then I used the "less" command followed by my new sam file name, which brought up the file in terminal. All the reads have "SA: (chromosome #), (sequence #)" following the read, indicating where the rest of the read aligned to (hence, split read). The SA tag indicates that the reads are split alignments. Many of them aligned to other places in the Plasmid (somewhat odd), but many aligned to the same place in one of the chromosomes, indicating that there was an insertion site here. I sorted the original bam file and indexed it with Samtools, and then opened it with the sequence graphing program "igv", then I zoomed in on the sequence on the right chromosome, which I got from the "SA: chr# sequence" from the split read file. Sure enough, I could visually see there was an insertion site, as none of the reads were covering a certain area, and were instead cut off, which makes sense as the ends of the reads read into the plasmid instead. Hopefully posting this helps someone! Thanks to those who answered.