Reads and Read Segments in Alignment
9.2 years ago
jrowan ▴ 20

My understanding of SAM files and the format is fairly good, but there are some things I haven't quite grasped. I'm not sure how obvious you all may find these questions, but they're what have come to mind.

I'm interested in recovering the original sequenced read after alignment has been done. I'd like to know which pieces of the read/read segments I need. How do I know what the full read sequence was that came off the sequencer? Can I reconstruct it by connecting the read segments in the same template together? If so, what is the template? It's not the read, is it? The SAM format specification doesn't suggest that it is; it says the template is some DNA/RNA fragment.

Here are some questions:

  • What is the difference between the read I get from sequencing and the read segments I see in a SAM file?
  • Intuition tells me that read segments are mapped portions of the larger read, but are they arbitrarily segmented in the SAM presentation?
  • Are segments contiguous? Can they also be non-contiguous?
  • Can I reconstruct the full read from the multiple read segments?
  • How does template correspond to a sequence read?

I'm very grateful for any clarification I can get on these questions.

alignment sequencing SAM • 3.9k views

You can look into picard SamToFastq


That worked for me. Thanks!

9.2 years ago
Renesh ★ 2.2k

A given read from your query file (the FASTQ file) can match multiple locations in the genome. You can check this with the NH tag in the SAM file (it is an optional field, so it is only present if your aligner writes it). Reads can also overlap one another, since they are sequenced from DNA fragments; you can see this by comparing the mapping coordinates in the SAM file.
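As a rough illustration of the NH check, here is a minimal Python sketch that reads the NH value from a single SAM alignment line. The function name `parse_nh` and the example record are made up for illustration; the only things taken from the SAM format are that optional fields start at column 12 and look like `TAG:TYPE:VALUE`. Since NH is optional, the function returns `None` when the tag is missing.

```python
def parse_nh(sam_line):
    """Return the NH value of one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Columns 1-11 are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        tag, typ, value = field.split(":", 2)
        if tag == "NH" and typ == "i":
            return int(value)
    return None


# Hypothetical record, invented for this example only.
record = "read1\t0\tchr1\t100\t1\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:3"
nh = parse_nh(record)
if nh is not None and nh > 1:
    print(f"read1 has {nh} reported alignments")
```

A value greater than 1 suggests the read was reported at more than one location; a missing tag says nothing either way.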

The aligner takes only the read sequences from the FASTQ file and maps them to the reference sequence. If you want to use a contiguous sequence (contig), you need to assemble it first and then map it to the reference sequence. Because a contig is much longer than a read, you need to be careful about which aligner you use and how you configure it.


I think this answers the question if I change your sentence to be: "The aligner takes - one at a time - a single read sequence from the fastq file for mapping to the reference." I assume that's what you meant.

I'm now curious as to how alignment is presented in the SAM file if I used a contig as opposed to mapping each read. Does the SAM output change? I can't see anything in the format specification that says it would. No flag or anything.

9.2 years ago
Renesh ★ 2.2k

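The read segment in the SAM file (column 10, SEQ) is the same as the read sequence in your query file (as per the output from Bowtie2 and BWA). One caveat: for reads aligned to the reverse strand (FLAG bit 0x10), SEQ is stored as the reverse complement of the original read.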
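To make the column 10 point concrete, here is a small Python sketch of getting back the sequence as it came off the sequencer from one SAM record. It assumes the record is a primary alignment without hard clipping (an `H` in the CIGAR string means some bases are not in SEQ at all). The helper names `original_read` and `has_hard_clip` are hypothetical, not part of any library.

```python
# Translation table for complementing bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_read(flag, seq, qual):
    """Return (sequence, quality) in the orientation the sequencer reported."""
    if flag & 0x10:
        # Reverse-strand alignment: SEQ is reverse-complemented and
        # QUAL is reversed, so undo both.
        return seq.translate(COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


def has_hard_clip(cigar):
    """True if the CIGAR string contains a hard clip (bases missing from SEQ)."""
    return "H" in cigar


# Hypothetical values, invented for this example only.
seq, qual = original_read(16, "AACCG", "ABCDE")
print(seq, qual)
```

Tools such as Picard SamToFastq handle these cases for whole files; the sketch only shows the idea for a single record.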

You can construct the full read from multiple segments using any transcriptome/genome assembly tool.
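Separately from assembly, it may help to see how segments relate to a template in the SAM file itself: records from the same template share a QNAME (column 1), and FLAG bits 0x40 and 0x80 mark the first and last segment. The Python sketch below groups primary records by QNAME under those assumptions; the function name `segments_by_template` and the example lines are hypothetical, and SEQ is reported as stored (not re-oriented).

```python
from collections import defaultdict


def segments_by_template(sam_lines):
    """Map each QNAME to a list of (label, SEQ) for its primary records."""
    templates = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        qname, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        if flag & 0x40:
            label = "first"
        elif flag & 0x80:
            label = "last"
        else:
            label = "single"
        templates[qname].append((label, seq))
    return dict(templates)


# Hypothetical paired-end records, invented for this example only.
lines = [
    "@HD\tVN:1.6",
    "pair1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "pair1\t147\tchr1\t300\t60\t4M\t=\t100\t-204\tTTGA\tIIII",
]
print(segments_by_template(lines))
```

For paired-end data this yields two segments per template, one per sequenced read, which is consistent with the template being the fragment and each segment being a read from it.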


Thank you for the reply. Again, please forgive me if these questions are obvious to others.

I can see that these are the read sequences as they appear in a FASTA (or similar) file. It is possible that a given segment mapped to multiple locations, right? Isn't it also possible to have several segments that overlap to varying degrees?

With this in mind, how was alignment done? Does alignment take the full read or sections of the read in aligning? That is, is alignment performed with each read segment or the full read sequence against the reference? (Is this something I'll need to crawl through code for? Bowtie2's source, for instance?)

Edit: Ah, I think you've already given the answer in a somewhat roundabout fashion. You used the phrase query file, which makes me believe that each sequence therein is a query sequence for alignment. So, I take it that the read segments presented on column 10 of SAM files were used in alignment (and that alignment did not use the full read sequence as one long, contiguous string). Is this correct?
