Question: Converting BAM to Fastq - losing reads
3
6 months ago by
UCL, London
adampennycuick80 wrote:

Hi all,

I am trying to realign a whole genome BAM file from one reference genome to another. The reason for this is that I am interested in HLA regions, and the original reference genome does not include these regions. The process involves converting the name-sorted BAM file to fastq, then realigning the fastq to a new reference.

I seem to be losing reads when converting from BAM to fastq. I have tried a number of ways to do this, including:

  • samtools fastq -1 <file1.fq> -2 <file2.fq> <input.bam>
  • bamToFastq -i <input.bam> -fq <file1.fq> -fq2 <file2.fq>
  • Following the process here:

In each case the number of reads in my output fastq file (counted using wc -l <file> / 4) is slightly lower than the number reported for the original BAM file (counted using samtools flagstat).
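
For anyone repeating this comparison, here is a minimal sketch of the counting step. It assumes uncompressed FASTQ with exactly four lines per record. The helper name count_fastq_records and the file names are made up for illustration. On the BAM side, the flagstat total also includes any secondary and supplementary records, so counting only primary records may give a fairer comparison.

```shell
# Count records in one or more uncompressed, 4-line-per-record FASTQ files.
# count_fastq_records is a hypothetical helper name, not part of samtools.
count_fastq_records() {
    cat "$@" | awk 'END { print int(NR / 4) }'
}

# Assumed usage (file names are placeholders):
#   count_fastq_records file1.fq file2.fq
#   samtools view -c -F 0x900 input.bam   # primary records only (no secondary/supplementary)
```

If the two numbers still differ after excluding non-primary records, the remaining gap is more likely to come from reads the converter skipped.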

When using bamToFastq I get several errors like this:

*****WARNING: Query 6:1219:30638:3260 is marked as paired, but its mate does not occur next to it in your BAM file.  Skipping.

I suspect this is the cause of my read loss. Most of these seem to be in chromosome 6, which is my region of interest. I have tried using samtools fixmate, but still get this same error.

Any ideas would be greatly appreciated!

Many thanks

alignment • 668 views
ADD COMMENTlink modified 6 months ago by h.mon19k • written 6 months ago by adampennycuick80
1

+1 for one of the best drafted questions I've seen on this site!

ADD REPLYlink written 6 months ago by Ram17k
1

Thanks Ram! Let's hope it gets an answer...

ADD REPLYlink written 6 months ago by adampennycuick80

Did you check some of the problematic reads on the original bam file? Something like:

samtools view file.bam | grep "6:1219:30638:3260"

It could be useful to add -n to grep to show the line number, especially for name-sorted files.
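
A possible variation, as a sketch only: grep matches the string anywhere in the line, so it can also hit longer read names or tag values containing the same text. The hypothetical helper below (find_read is not a samtools command) matches the read name column exactly and prefixes each hit with its record number, assuming SAM text on standard input.

```shell
# Print "record_number:SAM line" for every record whose QNAME (column 1)
# equals the given read name exactly. Reads SAM text from stdin.
# find_read is a hypothetical helper name used for illustration.
find_read() {
    awk -F '\t' -v name="$1" '($1 "") == (name "") { print NR ":" $0 }'
}

# Assumed usage:
#   samtools view file.bam | find_read "6:1219:30638:3260"
```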

ADD REPLYlink written 6 months ago by h.mon19k

Good thought. The output is below for one of the read names that gives an error, for a sorted file. I am not great at interpreting these data, but perhaps the problem here is that there is an odd number of records for this read name, so they cannot be appropriately paired? Do you know how I could fix this?

samtools view file.bam | grep -n "6:2224:32617:20858"
1748391:6:2224:32617:20858  2145    5   177974622   0   15H23M113H  22  16606471    0   CACCCACCCACACCCCCCCACAC FAFFKKA,,7A,77,,F<<<F,A AS:i:23 XS:i:23 SA:Z:16,86736939,+,52S26M73S,9,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0;  ci:i:1775658    MD:Z:23 NM:i:0  RG:Z:1-C42D3D1
1748392:6:2224:32617:20858  2145    12  130616201   1   31H20M100H  22  16606471    0   CCCACACCCACAACCCCACC    F<<<F,AK7A7,A,AFFA<<    AS:i:20 XS:i:19 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;   ci:i:1775658    MD:Z:20 NM:i:0  RG:Z:1-C42D3D1
1748393:6:2224:32617:20858  2145    16  3539896 0   4H21M126H   22  16606471    0   ACCCCCCCCCCCACCCACCCA   <AAFAFAF<<AFAFFKKA,,7   AS:i:21 XS:i:21 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;12,130616201,+,31S20M100S,1,0;    ci:i:1775658    MD:Z:21 NM:i:0  RG:Z:1-C42D3D1
1748394:6:2224:32617:20858  97  16  86736939    9   52S26M73S   22  16606471    0   AAACACCCCCCCCCCCACCCACCCACACCCCCCCACACCCACAACCCCACCACCCCCACACACCCACACACCCACACAACTGGAGCCCAGCAAGCACCACCCGCCCGACCGCGAAGACAAGCCGAGGAGCAGAGCAGACACGAAAGAAGGG AA<A<AAFAFAF<<AFAFFKKA,,7A,77,,F<<<F,AK7A7,A,AFFA<<,F7,<(,,AAFF7FFK,F,7<,<,7,,,,,,,,777FF,,,,,77,,,,,((((((,,,((,(,,,,7,,,,7<7(,,,,,7,,,,,,,,,,,,,,,,,, AS:i:26 XS:i:23 SA:Z:5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0; ci:i:1775658    MQ:i:0  MS:i:637    MC:i:16606545   MD:Z:26 NM:i:0  RG:Z:1-C42D3D1
1748395:6:2224:32617:20858  145 22  16606471    0   76S20M55S   16  86736939    0   GTTATCAATCACACCCCATCGCCAGATCACCATTCTCAAACTATCCGTCTCCCAGTCTCTAATACATTGGCGTGGGTGCTGCTGCGTTCTGGGTGTCGCCTCTTTCTTGTTCTGCGCTGGGGGCCGCGTGTGATGTTTGGCGTGTTCCGGG ,,,,,,,,,,,,,,,,,,,(,,,,,,,,,,,,,,,A<7,,77,,,,,A,,,,,77,7,,,,,7,,,7,(((,7(7,,,,,(,7A,,,,,A,,,,(,(,,,,,,,,77F7,,(,,(((((,(,(,(,7,7,,,7,,7,,(,,,,7<,,,,,, AS:i:20 XS:i:20 ci:i:1775658    MQ:i:9  MS:i:2157   MC:i:86736887   MD:Z:20 NM:i:0  RG:Z:1-C42D3D1
ADD REPLYlink modified 6 months ago by genomax55k • written 6 months ago by adampennycuick80

I don't know exactly how samtools flagstat works, but if it is counting supplementary alignments (the records with flag 2145) in its total number of reads, then it is expected that your final fastq files contain fewer reads.

Do you know if these are DNAseq or RNAseq reads? How were they aligned?
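
To make the flag values in the output above easier to read, here is a small sketch that decodes a decimal FLAG into the bit names defined in the SAM specification. The helper name decode_flag and the short labels are my own. I believe samtools also ships a `samtools flags` subcommand for the same purpose.

```shell
# Print the names of the SAM flag bits set in a decimal FLAG value.
# decode_flag is a hypothetical helper; bit values follow the SAM specification.
decode_flag() {
    flag=$1
    names=""
    for pair in \
        1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
        16:reverse 32:mate_reverse 64:read1 128:read2 \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${pair%%:*}      # numeric bit value before the colon
        name=${pair#*:}      # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    printf '%s\n' "$names"
}

# Assumed usage:
#   decode_flag 2145
#   decode_flag 97
```

With this, 2145 decodes to paired, mate_reverse, read1 and supplementary, while 97 is the same without the supplementary bit.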

ADD REPLYlink modified 6 months ago • written 6 months ago by h.mon19k

I think you have cracked it - the supplementary alignments are being lost.

However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome. As I understand it, supplementary reads may indicate structural variation; these are cancer samples so I would expect some structural variation. I don't want to lose this information on realigning my sample. The fact that a lot of these supplementary reads are in HLA regions suggests they could have a significant impact on my analysis.

Is it possible to extract to fastq and include these reads? I don't unfortunately have access to the original unaligned fastq files.

ADD REPLYlink written 6 months ago by adampennycuick80
1

As long as you are recovering all unique read identifiers (including their origin R1/R2) that are present in your BAM file there is not much more you can do.
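
One way to check this, sketched under the assumption that SAM text is piped in on standard input: count the distinct combinations of read name and mate (R1/R2, taken from flag bits 0x40 and 0x80) and compare that number with the total record count of the two fastq files. The helper name count_unique_read_ends is hypothetical.

```shell
# Count distinct (read name, mate) combinations in SAM text read from stdin.
# Mate is taken from FLAG bits 0x40 (read 1) and 0x80 (read 2); 0 means neither.
# count_unique_read_ends is a hypothetical helper name.
count_unique_read_ends() {
    awk -F '\t' '
        /^@/ { next }                               # skip header lines
        {
            mate = 0
            if (int($2 / 64) % 2 == 1)  mate = 1    # bit 0x40 set
            if (int($2 / 128) % 2 == 1) mate = 2    # bit 0x80 set
            key = $1 "/" mate
            if (!(key in seen)) { seen[key] = 1; n++ }
        }
        END { print n + 0 }
    '
}

# Assumed usage:
#   samtools view input.bam | count_unique_read_ends
```

Supplementary and secondary records share the name and mate of their primary record, so they do not inflate this count. Note that the helper holds every key in memory, which may be heavy for a whole genome BAM.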

ADD REPLYlink written 6 months ago by genomax55k

Could it be that you have secondary alignments in your lib_002_map_map.bam file? This could mess with the whole thing. You can check for secondary alignments using samtools flagstat.

ADD REPLYlink written 6 months ago by Carlo Yague4.2k

Thanks but I don't think this is it - there are no secondary alignments identified by samtools flagstat

ADD REPLYlink written 6 months ago by adampennycuick80

Can you try reformat.sh from BBMap suite instead of bam2fastq?

Something like: reformat.sh in=lib_002_mapped.sort.bam out1=lib_002_mapped.1.fastq out2=lib_002_mapped.2.fastq verifypaired=t primaryonly=t

Additional options you may want to try with original files:

mappedonly=f            Toss unmapped reads.
unmappedonly=f          Toss mapped reads.
pairedonly=f            Toss reads that are not mapped as proper pairs.
unpairedonly=f          Toss reads that are mapped as proper pairs.
primaryonly=f           Toss secondary alignments.  Set this to true for sam to fastq conversion.
ADD REPLYlink modified 6 months ago • written 6 months ago by genomax55k
1
6 months ago by
swbarnes24.0k
United States
swbarnes24.0k wrote:

The flag of 2145 indicates supplementary (not secondary) alignments. I'd filter those out first.
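
A sketch of that filtering step, assuming SAM text (optionally with header) on standard input. The helper name drop_non_primary is made up. It drops records with the secondary (0x100) or supplementary (0x800) bit set and passes everything else through. With samtools alone, `samtools view -b -F 0x900 input.bam` should do the equivalent on a BAM file.

```shell
# Keep header lines and primary records; drop records whose FLAG has the
# secondary (0x100 = 256) or supplementary (0x800 = 2048) bit set.
# drop_non_primary is a hypothetical helper name.
drop_non_primary() {
    awk -F '\t' '
        /^@/ { print; next }                                  # header lines pass through
        int($2 / 256) % 2 == 0 && int($2 / 2048) % 2 == 0 { print }
    '
}

# Assumed usage (file names are placeholders):
#   samtools view -h input.bam | drop_non_primary | samtools view -b -o primary.bam -
```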

ADD COMMENTlink written 6 months ago by swbarnes24.0k
1
6 months ago by
h.mon19k
Brazil
h.mon19k wrote:

I will summarize the discussion above as an answer:

Looking at the problematic reads (samtools view file.bam | grep -n "6:2224:32617:20858") revealed they are supplementary alignments.

I think you have cracked it - the supplementary alignments are being lost. However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome.

You are right that supplementary alignments may indicate structural variants, but a supplementary alignment is a (partial) copy of the primary read, at least in the example you selected. If you look at the reads you grepped, all three supplementary alignments (flag 2145) are contained within the corresponding primary read (flag 97), so you are recovering all of the original reads with your procedure. The primary alignment is soft-clipped, so the full read sequence is present in the SAM record. Supplementary alignments are hard-clipped, so only a fragment of the original read is present (but I think there are BWA flags that may change this behaviour).
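
The containment claim can be checked directly on the SEQ columns. A minimal sketch, with hypothetical helper names (revcomp, contains_fragment): it tests whether the hard-clipped fragment occurs inside the primary read sequence, either as is or as its reverse complement, since SEQ is stored relative to the forward reference strand. It assumes plain A/C/G/T sequences passed as arguments.

```shell
# Reverse-complement a DNA string (A/C/G/T only; other characters are only reversed).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# Succeed if the fragment (e.g. SEQ of a hard-clipped supplementary record)
# occurs in the primary SEQ on either strand. Hypothetical helper name.
contains_fragment() {
    primary=$1
    fragment=$2
    case "$primary" in *"$fragment"*) return 0 ;; esac
    case "$primary" in *"$(revcomp "$fragment")"*) return 0 ;; esac
    return 1
}

# Assumed usage: paste the SEQ field of the flag-97 record and of one flag-2145 record
#   contains_fragment "$primary_seq" "$supplementary_seq" && echo contained
```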

There is a discussion on the samtools GitHub issues page about what supplementary reads are.

ADD COMMENTlink modified 6 months ago • written 6 months ago by h.mon19k
0
6 months ago by
Devon Ryan84k
Freiburg, Germany
Devon Ryan84k wrote:

At least with samtools fastq you seem to be forgetting -s, which is where the missing reads would be.

ADD COMMENTlink written 6 months ago by Devon Ryan84k

Thanks Devon, but that's not it. When I add the -s option it returns an empty file. And this doesn't explain the behaviour of bamToFastq.

ADD REPLYlink written 6 months ago by adampennycuick80