Question: Do read pairs matter in alignment?
gravatar for mmorenozambrano
5.3 years ago by
mmorenozambrano10 wrote:


I have this bam file that originally contained paired sequences, after converting it into fastq sequences I got 2 different files. My idea is to make an alignment with a chromosome sequence, and from there obtain a VCF. My question is if the sequence that I will use will matter in my future results?, there is a way to know which sequence should I use?

I got the sequences using samtools:

bam2fastq -o blah_aligned_R#.fastq --no-unaligned Bamfilename.bam


Thank you very much for the comments, I will explain myself better (and sorry for any misunderstanding, but this is the first time that I am working in bioinformatics).

I am wanting to see the occurrence of KRAS-variant mutation (as described by Ratner et al, in in chromosome 12 in ovarian cancer cell lines. For this, I have downloaded some bam files from the nci60-project ( These bam files are presented as paired sequences, so after extracting chromosome 12 information from them (using the bai files that are in the same webpage as index) by the code:

$ samtools view  IGR-OV1_reord_mdups_ralgn_fmate_recal.bam chr12>chr12.bam

Then, I sorted the bam file, and instead of using the original code that I posted at the beginning, I converted them to fastq using:

$ bedtools bamtofastq -i chr12.qsort.bam \
                      -fq chr12.end1.fq \
                      -fq2 chr12.end2.fq

So now, I have two sequences in fastq format per each original bam file.

My following steps, that I have in mind, is to align these fastq files with the human assembly 37, convert then again to bam files, and make the variant calling with mpileup so from there start with the annotation using annovar or something alike. So my questions in fact are:

1) Is my sort of pipeline acceptable?

2) Should I use both sequences for the variant calling?

Thank you in advance for your help and comments

sequence next-gen alignment • 2.0k views
ADD COMMENTlink modified 5.3 years ago by Bioinfosm620 • written 5.3 years ago by mmorenozambrano10

I assume you got a pair of fastq files and one file with unaligned reads. You should use both fastq files and align them in a paired end mode. 

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Ashutosh Pandey12k

I am not sure I can understand the question(s). Here's my guess.

1) Question in the title: Do read pairs matter in alignment?. Answer: Yes. Follow the suggestion of Ashutosh Pandey

2) First question in the post "the sequence that I will use will matter in my future results?". I assume you are talking about the reference sequence. Yes. Everything you will obtain will be referred to the sequence you use. Is like when you have to point out where New York is, and you have to choose on which map you are going to work. Different maps might give different results, especially if they are very old!

3) Second question in the post: there is a way to know which sequence should I use? Quite often I use the rule: the newest!

You might consider clarifying the questions, so people can help you more efficiently!


ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Fabio Marroni2.5k

Note to Ashutosh and Fabio: The post seems to have been updated.

To the original question, you can simply call variants with the original BAM file and restrict the caller to the region around the KRAS gene. That way you don't need to bother even remapping things.

Note that otherwise the ideal method is to extract reads as pairs, since aligners can use the pairing information to not mistakenly think that coverage is doubled where the mates overlap.

ADD REPLYlink written 5.3 years ago by Devon Ryan95k

I agree 100% with Devon Ryan.

ADD REPLYlink written 5.3 years ago by Fabio Marroni2.5k
gravatar for Bioinfosm
5.3 years ago by
Bioinfosm620 wrote:

The steps sound fine, and as others have suggested, using both read1 and read2 for alignment is more efficient. The paired mapping allows for better read placement, especially over repetitive or homologous regions.

Am not sure why you dont directly use the aligned BAM file to call variants on chr12 and then annotate them or intersect for any falling in KRAS?

If it is just a single position in KRAS, seems the paper you ref its a 3' UTR mutation, you can simply genotype that coordinate (chromosome-position) on all the BAM files and get a quick answer if the mutation is present/absent.

Hope this helps.

ADD COMMENTlink written 5.3 years ago by Bioinfosm620

Thank you very much for your answers. The reason why I am not calling variants directly from the BAM files, is because a suggestion of my promoter, who told me that I should demonstrate that I am able to remap the BAM files going first by their transformation to fastq files. But know that you have helped me, I will put in practice the different approaches to compare my results, and see (hopefully) that my remaps are going to be the same.

ADD REPLYlink written 5.3 years ago by mmorenozambrano10

Sure, or even better. If you are not sure what aligner was originally used or have a better one to try! You can use something like samtools flagstat and compare how your aligned BAM file is same/better than the original one.

ADD REPLYlink written 5.3 years ago by Bioinfosm620
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1464 users visited in the last hour