How to retain SEQ and QUAL fields in BAM file after filtering?
2
0
Entering edit mode
3.6 years ago
toddknutson ▴ 60

I would like to retain the actual SEQ and QUAL fields in the SAM/BAM file after it has been filtered.

For example, I mapped a single end R1.fastq file. The resulting BAM file contains primary and supplementary alignments (this is especially true when using the bwa mem -a flag). Then the BAM was filtered based on chromosome location (using a BED file), which causes some of the primary alignments to be removed, retaining only the supplementary alignment. However, now the BAM file does not contain the SEQ and QUAL information for that read (only an asterisk (*) in their place). This is problematic for downstream analysis that needs the SEQ and QUAL info.

How can I retain SEQ and QUAL info for all alignments, both primary and supplementary in the final BAM?

bwa mem -a -Y ref.fasta R1.fastq | samtools view -b -h > R1.bam


Display one read ("readX" has both primary and supplementary alignments), where supplementary read contains (*) for SEQ and QUAL:

samtools view R1.bam | grep "readX"
readX   0   chr7    6776845 7   151M    *   0   0   ACACAACAGAGAACAAGCCTCACAGCCCAAAGACTCCAGAACCAAGAAGGAGCCCTCTAGGACAGAAAAGGGGTATGTCGTCTTCTAGCACTTCAGATGCCATCTCTGACAAAGGCGTCCTGAGACCCCAGAAAGAGGCAGTGAGTTCCAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF: NM:i:1  MD:Z:127T23 AS:i:2999   XS:i:2936   RG:Z:INV_4SMALL_Aug_17_2017_1
readX   272 chr7    6026963 0   151M    *   0   0   *   *   NM:i:4MD:Z:23A30C17A0G77    AS:i:2936   RG:Z:INV_4SMALL_Aug_17_2017_1


Filter the BAM using bedtools:

bedtools intersect -a R1.bam -b small_region.bed > R1_small_region.bam


After filtering, only supplementary alignment is retained (with missing SEQ and QUAL info):

samtools view R1_small_region.bam | grep "readX"
readX   272 chr7    6026963 0   151M    *   0   0   *   *   NM:i:4MD:Z:23A30C17A0G77    AS:i:2936   RG:Z:INV_4SMALL_Aug_17_2017_1

bam bwa samtools • 3.0k views
0
Entering edit mode

Hello toddknutson!

I suspect your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/5579/how-to-output-all-sequences-with-bwa-mem-not
This is typically not recommended as it runs the risk of annoying people in both communities.

0
Entering edit mode

I'm sorry, but this was not a cross post. I didn't see that other posting after days of google searching! But I appreciate the link to another related question!

0
Entering edit mode

can you please post a gist with a small SAM file, with the header + with a few reads with and without the SEQ/QUAL (e.g: the one with readX + header)

0
Entering edit mode

Sure, please let me know if you need anything else! I appreciate your interest/help! https://gist.github.com/toddknutson/90430a0dd736898037ed18bcd044df7f

0
Entering edit mode

Hmm, could probably write a script to fill in the missing data.

0
Entering edit mode

I was thinking about that, but I think there might be some things to consider, regarding reverse and reverse-complement alignments. I saw this example and gave up trying to write my own script.

6
Entering edit mode
3.6 years ago

A read A00223:8:H7YG3DMXX:1:1101:1163:35383 in the remote bam file below is missing SEQ/QUAL while another has this information

$wget -q -O - "https://gist.githubusercontent.com/toddknutson/90430a0dd736898037ed18bcd044df7f/raw/87c1ea5a548ac71c628d2b72f5bd6ee6415efbcd/gistfile1.txt" | grep -F "A00223:8:H7YG3DMXX:1:1101:1163:35383" A00223:8:H7YG3DMXX:1:1101:1163:35383 16 chr7 6027025 9 151M * 0 0 GCTAGAAGACAGCAGACCCCTTGTCTGTCCTAGAGGGCTCCTTCTTGGTTCTGGAGTCTTTGGGCTGTGAGGCTTGTTCTCTGTTGTGTGACGAAGAGAAAAGGCCTCTCGCAGTCTGGAAATGGACACGTCTTTTTTTTCTTCTCCAGTC FFFFFFFFFFFFFF,:FFFF,F,FFFFFFFF:FFFFFFFFF:FF::FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFF NM:i:2 MD:Z:14T7T128 AS:i:2978 XS:i:2894 RG:Z:INV_4SMALL_Aug_17_2017_1 A00223:8:H7YG3DMXX:1:1101:1163:35383 256 chr7 6776783 0 151M * 0 0 * * NM:i:6 MD:Z:17G0G109A7A2T0C10 AS:i:2894 RG:Z:INV_4SMALL_Aug_17_2017_1  get the sam/bam file, sort it on queryname using picard $ wget -q -O - "https://gist.githubusercontent.com/toddknutson/90430a0dd736898037ed18bcd044df7f/raw/87c1ea5a548ac71c628d2b72f5bd6ee6415efbcd/gistfile1.txt" |\
java -jar /path/to/picard.jar SortSam I=/dev/stdin O=/dev/stdout SO=queryname VALIDATION_STRINGENCY=LENIENT |\
java -jar dist/biostar352930.jar


output:

@HD VN:1.5  SO:queryname
(...)
@RG ID:INV_4SMALL_Aug_17_2017_1 SM:0X   PL:ILLUMINA LB:libeX
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 12 -R @RG\tID:sample1\tSM:0X\tPL:ILLUMINA\tLB:libeX -Y -a -k 11 -O 2 -B 1 -A 20 hg19_canonical_PARmaskedonchrY.fasta R1.fastq
@CO biostar352930. compilation: 2018-12-05:10-12-50 githash: 03f79caf6a62458937b2d55d0661737617334733 htsjdk: 2.15.0. cmd:
A00223:8:H7YG3DMXX:1:1101:1163:35383    256 chr7    6776783 0   151M    *   0   0   GACTGGAGAAGAAAAAAAAGACGTGTCCATTTCCAGACTGCGAGAGGCCTTTTCTCTTCGTCACACAACAGAGAACAAGCCTCACAGCCCAAAGACTCCAGAACCAAGAAGGAGCCCTCTAGGACAGACAAGGGGTCTGCTGTCTTCTAGC FFFFFF,FF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF::FF:FFFFFFFFF:FFFFFFFF,F,FFFF:,FFFFFFFFFFFFFF MD:Z:17G0G109A7A2T0C10  RG:Z:INV_4SMALL_Aug_17_2017_1   NM:i:6  AS:i:2894
A00223:8:H7YG3DMXX:1:1101:1163:35383    16  chr7    6027025 9   151M    *   0   0   GCTAGAAGACAGCAGACCCCTTGTCTGTCCTAGAGGGCTCCTTCTTGGTTCTGGAGTCTTTGGGCTGTGAGGCTTGTTCTCTGTTGTGTGACGAAGAGAAAAGGCCTCTCGCAGTCTGGAAATGGACACGTCTTTTTTTTCTTCTCCAGTC FFFFFFFFFFFFFF,:FFFF,F,FFFFFFFF:FFFFFFFFF:FF::FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFF MD:Z:14T7T128   RG:Z:INV_4SMALL_Aug_17_2017_1   NM:i:2  AS:i:2978   XS:i:2894
A00223:8:H7YG3DMXX:1:1101:15338:10113   256 chr7    6777077 0   151M    *   0   0   GTTCAGCATCCCAGACACGGGCAGTCACTGCAGCAGCGAGTATGCGGCCAGCTCCCCAGGGGACAGGGGCTCGCAGGAACATGTGGACTCTCAGGAGAAAGCGCCTGAAACTGACGACTCTTTTTCAGATGTGGACTGCCATTCAAACCAG ,FFFFF,F,FFFFFFFFFFFF:F,F:F::F,FFFFFFFFFFFFFFFFFFFFFFF,F:FFF,:FFFFFFFFFF,FFF:FFFFF,FF,F:FFFFFF:F,FFFF:,::,,FFFFF:FFFF::F,FFF,:,,F:,FF,F,FF,FFF,FFFF,FF, MD:Z:41G2T7A98  RG:Z:INV_4SMALL_Aug_17_2017_1   NM:i:3  AS:i:2957
A00223:8:H7YG3DMXX:1:1101:15338:10113   16  chr7    6026731 4   151M    *   0   0   CTGGTTTGAATGGCAGTCCACATCTGAAAAAGAGTCGTCAGTTTCAGGCGCTTTCTCCTGAGAGTCCACATGTTCCTGCGAGCCCCTGTCCCCTGGGGAGCTGGCCGCATACTCGCTGCTGCAGTGACTGCCCGTGTCTGGGATGCTGAAC ,FF,FFFF,FFF,FF,F,FF,:F,,:,FFF,F::FFFF:FFFFF,,::,:FFFF,F:FFFFFF:F,FF,FFFFF:FFF,FFFFFFFFFF:,FFF:F,FFFFFFFFFFFFFFFFFFFFFFF,F::F:F,F:FFFFFFFFFFFF,F,FFFFF, MD:Z:44T106 RG:Z:INV_4SMALL_Aug_17_2017_1   NM:i:1  AS:i:2999   XS:i:2957
A00223:8:H7YG3DMXX:1:1101:15881:12242   0   chr7    6026908 11  151M    *   0   0   GTGCCCCGAGTCCTTCTCCACCTCCGCTCTGTCCGTAGGGTCACTGGGTCCGTGACTGGAACTCACTGCCTCTTTCTGAGATCTCAGGACGCCTTTGTCAGAGATGGCACCTGAAGTGCTAGAAGACAGCATACCCCTTTTCTGTCCTAGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFF:F,FFFFFFFFFFFFFFF:FFFFFFFFF:FF:FFF:FFFF:FFFFFFFF,FFFFFFFFF MD:Z:80G70  RG:Z:INV_4SMALL_Aug_17_2017_1   NM:i:1  AS:i:2999   XS:i:2895
A00223:8:H7YG3DMXX:1:1101:15881:12242   272 chr7    6776900 0   151M    *   0   0   TCTAGGACAGAAAAGGGGTATGCTGTCTTCTAGCACTTCAGGTGCCATCTCTGACAAAGGCGTCCTGAGATCTCAGAAAGAGGCAGTGAGTTCCAGTCACGGACCCAGTGACCCTACGGACAGAGCGGAGGTGGAGAAGGACTCGGGGCAC FFFFFFFFF,FFFFFFFF:FFFF:FFF:FF:FFFFFFFFF:FFFFFFFFFFFFFFF,F:FFFFFFFFFFF:F,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MD:Z:22T0C17A28C28G50T0 RG:Z:INV_4SMALL_Aug_17_2017_1   NM:i:6  AS:i:2895
(...)


the read A00223:8:H7YG3DMXX:1:1101:1163:35383 is now fixed.

0
Entering edit mode

Awesome! Thanks so much! Our system is currently down for maintenance, but I will fully test ASAP, and then mark as "accepted".

0
Entering edit mode

@Pierre Lindenbaum: I'm glad you posted a link to this answer over on the Bioinformatics Stack Exchange question. Someone posted a comment there about getting errors downstream when using samtools view, i.e.. [E::sam_parse1] CIGAR and query sequence are of different length. I used bwa -Y -a` options when mapping, which will only soft clip sequences (not hard clip them). I think this may solve the problem?? -- however, I don't have any reputation on that site to post this comment...

0
Entering edit mode

It shouldn't be a problem, I though I handled this. I'd better like to see a minimal SAM with the problem.

0
Entering edit mode

Anyone having a similar issue and needing a fix for this, I've put together a small python script to fix the unclipped SEQ/QUAL fields for secondary reads. https://gist.github.com/knowah/24c8665334180120b94be3274c5cc706

1
Entering edit mode
3.6 years ago

The READ and QUAL data stuck with the primary alignment, and in this case, that was probably the correct call, as that read aligns to the 67Mb coordinates with fewer base discrepancies. That read likely does not align to your region.

Maybe you should filter your .bam for primary alignment only, then all reads will have their READ information for downstream processing.

0
Entering edit mode

Thanks, that's a good idea under most circumstances. However, in my situation, I need the supplementary alignments. I am mapping reads against a region that contains both a protein encoding gene and pseudogene, with zero bases that distinguish the two features. Thus, many of my supplementary reads are actually valid alignments (despite the primary being mapped to the pseudogene).

0
Entering edit mode

If the target sequence that shares 100% homology is greater than the read length, how can you ever possibly distinguish them? I have come across various regions of the genome that are 'impossible' to faithfully sequence by short-read NGS, such as VWF and exons of some of the CYP genes. I am sure that you are aware of this.

1
Entering edit mode

The convention (sam specifications) is just that for a multimapping read the SEQ and QUAL is only stored once to limit file size. Ideally that would be the best alignment, but in all other cases it's just a random alignment that will be the primary.

0
Entering edit mode

Thanks for you comment -- you are correct. This is my situation as well. I am working on PMS2 and PMS2CL (pseudogene). My strategy is to map the R1 and R2 reads independently with bwa -a to get all alignments (which results in one alignment being in PMS2 and one being in PMS2CL). This results in one read being marked "primary", but that is likely arbitrary. Then, I filter the alignments based on PMS2 region only, and combine the two PMS2-only BAMs, and use my own criteria for "good pairs" to further select only good R1/R2 pairs. I think this method might (?) be better than masking PMS2CL and forcing all reads to only map to PMS2. I am basically using the strategy published by these researchers https://www.ncbi.nlm.nih.gov/pubmed/30268105.

0
Entering edit mode

Oh, I have come across that gene before in relation to mismatch-repair and endometrial cancer. I see that the authors are from Counsyl. The method looks convincing. Others here are more knowledgeable of the SAM specification and BAM encodings than I, and you seem to have got the info that you needed. I was more interested in just the biology part. Thank you for sharing.

0
Entering edit mode

No problem! Yes, this is a case where the biology and the informatics are complex and interesting!