Question

Uniquely mapped paired-end reads on the same chromosome

17 months ago

barzilayrom1 ▴ 10

Hi all,

I am new to paired-end read analysis and would appreciate your feedback on a few issues in analyzing paired-end alignment data:

In general, how can paired-end reads be mapped to different chromosomes? When processing the FASTQ files in paired-end mode, doesn't the aligner map a read pair only if the two reads point toward each other on opposite strands (one aligned to the forward strand and the other to the reverse strand) at a known distance from one another? Or is the aligner just looking for the best fit for each read, which could result in mapping to different chromosomes, even though a pair of reads should in theory represent a genomic sequence from a single chromosome?
Following the alignment, how can I keep only the uniquely mapped reads that are on the same chromosome? I know that samtools flagstat reports the number of pairs mapped to different chromosomes, but how can I extract only the pairs that are mapped to the same chromosome while making sure that they are uniquely mapped? Are these just the reads tagged as 'properly paired' (and if so, how can I keep only them)?

Many thanks!

samtools alignment • 1.0k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 17 months ago by barzilayrom1 ▴ 10

Paired-end reads can be mapped to multiple chromosomes because sequence duplication exists in biology for all sorts of reasons. The aligner is looking for alignments. It doesn't know the design of your experiment, how your library was made, or what assumptions it should bring to the analysis. Maybe there are gene fusions in your organism involving genes from different chromosomes (e.g. leukemia, various oncogenes, etc.). Some experiments are designed to look for these explicitly. However, there are many aligners that take various parameters that can help you narrow down what is reported back as a valid alignment. And you can learn to filter your reads on their alignment characteristics using the SAM bitwise flags. There are many posts here discussing issues like this.
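
As a rough illustration of filtering on the SAM bitwise flags, here is a minimal Python sketch that keeps only pairs whose two mates are both mapped, on the same chromosome, and above a MAPQ cutoff. The flag bit values and the meaning of `=` in the RNEXT column come from the SAM specification; everything else is an assumption for illustration: the function names are hypothetical, "uniquely mapped" is approximated by a MAPQ threshold (how MAPQ relates to uniqueness depends on the aligner), and the default of 30 is an arbitrary placeholder. It also holds all records in memory, so it is only suitable for small files.

```python
from collections import defaultdict

# SAM flag bits, as defined in the SAM specification
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_candidate(fields, min_mapq, require_proper):
    """Check one alignment record (already split on tabs)."""
    flag = int(fields[1])
    rname, mapq, rnext = fields[2], int(fields[4]), fields[6]
    if not flag & PAIRED:
        return False
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    if require_proper and not flag & PROPER_PAIR:
        return False
    # RNEXT is "=" (or repeats RNAME) when the mate is on the same reference
    if rnext not in ("=", rname):
        return False
    # Assumption: MAPQ >= min_mapq is used as a stand-in for "uniquely mapped"
    return mapq >= min_mapq


def same_chrom_unique_pairs(sam_lines, min_mapq=30, require_proper=False):
    """Return header lines plus the records of pairs where BOTH mates pass."""
    header = []
    by_name = defaultdict(list)
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        fields = line.split("\t")
        by_name[fields[0]].append(fields)

    kept = []
    for records in by_name.values():
        # Exactly two records: pairs with extra (secondary or supplementary)
        # alignments are dropped entirely, which is a deliberately strict choice.
        if len(records) == 2 and all(
            is_candidate(f, min_mapq, require_proper) for f in records
        ):
            kept.extend("\t".join(f) for f in records)
    return header + kept
```

Passing `require_proper=True` additionally requires the aligner's "properly paired" bit (0x2), whose exact definition is aligner-specific. The input can be any iterable of SAM text lines, for example an open file handle.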

ADD REPLY • link 17 months ago by seidel 11k

Thanks a lot for the answer!

As you mentioned, I understand how sequence duplications could result in a single read multi-mapping to different locations on one or several chromosomes. In the case of paired-end reads (call them R1 and R2), as far as I understand, several alignment outcomes are possible:

1. R1 is multi-mapped (to different locations in the same chromosome and/or to different chromosomes), whereas R2 is uniquely mapped.
2. R1 is uniquely mapped, whereas R2 is multi-mapped.
3. Both R1 and R2 are multi-mapped (I assume this is very common).
4. Both R1 and R2 are uniquely mapped (I assume this is very unlikely).

I also understand that in paired-end sequencing the expected fragment length is known, and the two mates of a pair should be relatively close to each other.
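
To make that expectation concrete, here is a small Python sketch of the check I have in mind for a "well-behaved" pair: same chromosome, opposite strands, mates pointing toward each other, and a fragment no longer than some maximum. This only encodes my own description above and is not the rule used by any particular aligner; the `Mate` class, the `is_concordant` name, and the 500 bp default are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Mate:
    chrom: str
    start: int     # 1-based leftmost aligned position on the reference
    length: int    # number of reference bases covered by the alignment
    reverse: bool  # True if aligned to the reverse strand


def is_concordant(a: Mate, b: Mate, max_fragment: int = 500) -> bool:
    """True if the two mates look like the two ends of one short fragment."""
    if a.chrom != b.chrom:
        return False
    if a.reverse == b.reverse:
        return False  # both mates on the same strand
    fwd, rev = (b, a) if a.reverse else (a, b)
    if fwd.start > rev.start:
        return False  # mates point away from each other
    # Fragment spans from the forward mate's start to the reverse mate's end
    fragment = rev.start + rev.length - fwd.start
    return fragment <= max_fragment
```

For example, a forward mate at chr1:100 and a reverse mate at chr1:300, each 50 bp long, would give a 250 bp fragment and pass, while the same two mates on different chromosomes would not.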

Considering that, in each of the multi-mapping cases above (1-3), if the aligner just looked for the best alignment for each read independently, how would the advantage of paired-end sequencing mentioned above (knowing that the two reads are related to each other, and the expected fragment size) come into play? In other words, shouldn't the aligner look for pairwise alignments in paired-end mode, in effect enforcing that both mates of a read pair are aligned to one chromosome (chr-A, chr-B, etc.) and never allowing the two mates to be aligned to different chromosomes, since the two reads of a pair are known to be derived from the same chromosome?

Thanks a lot!

ADD REPLY • link 17 months ago by barzilayrom1 ▴ 10