Question: How to obtain insert sequence from paired-end mapping result (SAM/BAM)
0
gravatar for gundalav
2.9 years ago by
gundalav280
La La Land
gundalav280 wrote:

I have paired-end reads R1 and R2.

|-----R1-----|----------------------------|-----R2-----|
       A                     B                  C

Total insert here is A + B + C. How can I get this sequence from the BAM/SAM output?

The reads are mapped using classical Bowtie2 parameter.

    bowtie2 -x myindex \
   -1 R1.fastq \
   -2 R2.fastq \
   --no-discordant \
   --no-mixed -p 4 \
    --very-sensitive  \
   --no-mixed \
    | samtools view -bS - > output bam
ADD COMMENTlink modified 2.8 years ago by Fabio Marroni2.3k • written 2.9 years ago by gundalav280
1

You can't really get that solely from a SAM/BAM file, since the sequence appropriate for B for a particular fragment is unknown. What is your actual biological goal?

ADD REPLYlink written 2.9 years ago by Devon Ryan91k

If we know the mapping location of A and C, we can get A+B+C there right? With Picard we can get the A+B+C length using pathtopicard\CollectInsertSizeMetrics.jar I wonder if there is any existing tool to actually extract the fragment from the genome. At the end of the day want to translate A+B+C into protein and derive MHClass-I core from there.

ADD REPLYlink written 2.9 years ago by gundalav280

It's unclear if you want the sequence of the fragment or the sequence of the reference genome covered by the fragment. For the latter, sure, that's easy enough with a small bit of scripting (I'd use pysam in python, but to each his/her own). For the former, see my earlier comment.

ADD REPLYlink written 2.9 years ago by Devon Ryan91k
1
gravatar for Fabio Marroni
2.8 years ago by
Fabio Marroni2.3k
Italy
Fabio Marroni2.3k wrote:

Ok, if your insert strt at ChrXXX position Astart and end on the same ChrXXX at position Cend.

Then you create a bedfile and you call it yourbed.bed

ChrXXX Astart Cend

and then use

bedtools getfasta -fi yourfasta.fa -bed yourbed.bed

Done!

However, I re-emphasize what Devon just said: you do not know what the real sequence of B is. You are assuming that your sample has identical sequence to the reference.

ADD COMMENTlink written 2.8 years ago by Fabio Marroni2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1120 users visited in the last hour