Question: Capping coverage in a BAM file
Danielk (Karolinska Institutet, Stockholm, Sweden) wrote, 3.3 years ago:

Colleagues, 

I have some BAM files with excessive coverage in certain regions that I want to cap to a given coverage. In regions with lower coverage, I do not want the coverage to change. Here's an illustration of what I want: 

I want read pairs to be intact in the output file. 

Does anyone have a solution to this? I want to output a BAM file for other tools to use. Google has not been my friend. 

(Note: GATK PrintReads with -dcov looked promising until I realised that "For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position", from https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--downsample_to_coverage. Close, but no CIGAR (pun intended).)

Tags: coverage, downsample, bam
modified 3.3 years ago by Christian, written 3.3 years ago by Danielk

You can achieve a very similar effect by normalizing the fastq data prior to mapping.  This randomly discards reads with high kmer coverage to achieve a specified limit.  For example, with BBNorm:

bbnorm.sh in=reads.fq out=normalized.fq min=0 target=100

This will flatten peaks with over 100x coverage, capping coverage as in your diagram.  The output will start to diverge from what you would get by reducing coverage based on actual mapping results if you have a highly repetitive genome, but on mammalian or bacterial genomes it should be very similar.  As a result, it's particularly useful if you don't have a reference, or if you have too much data to map: normalization is faster than mapping to a large reference.
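The core idea can be sketched in a few lines of Python. This is a toy model of kmer-based normalization, the idea behind BBNorm, not its actual implementation: a read whose median kmer count exceeds the target is kept only with probability target/median, so deep peaks are thinned while low-coverage reads pass through untouched.

```python
import random
from collections import Counter

def normalize(reads, k=4, target=10, seed=0):
    """Toy sketch of kmer-based normalization (not BBNorm's real code).

    Reads whose median kmer count exceeds `target` are kept with
    probability target/median; low-coverage reads always survive.
    """
    rng = random.Random(seed)
    # Count every kmer across all reads.
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    kept = []
    for r in reads:
        kmers = sorted(counts[r[i:i + k]] for i in range(len(r) - k + 1))
        median = kmers[len(kmers) // 2]
        if median <= target or rng.random() < target / median:
            kept.append(r)
    return kept
```

With a thousand copies of one read and a single unique read, the deep read is cut down to roughly `target` copies while the unique read is always retained.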

modified 3.3 years ago, written 3.3 years ago by Brian Bushnell

We cannot see the picture; please use something other than Dropbox, e.g. http://imgur.com/

Why is "excessive coverage" a problem?

written 3.3 years ago by Pierre Lindenbaum

Thanks for the heads up with the image - it now lives on imgur.

Excessive coverage in certain regions makes some downstream tools slow and increases their RAM usage. I'd like to fix this by capping the coverage in the file, rather than downsampling the whole file, since I'm happy with the coverage in most parts of the genome. 

written 3.3 years ago by Danielk

do you have a BED with the regions of interest ?

written 3.3 years ago by Pierre Lindenbaum

I guess I could generate a BED of the regions with high coverage for each BAM if that helps. So, sure-ish. :) 

How does it help?

modified 3.3 years ago, written 3.3 years ago by Danielk
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 3.3 years ago:

I wrote two tools:

* the first sorts a BAM on REF/read-name https://github.com/lindenb/jvarkit/wiki/SortSamRefName

* the second reads the BAM above. For each set of reads sharing the same name, it checks whether adding those reads would keep the depth below the expected cap; if so, the reads are written to the output BAM. Edit: doc is https://github.com/lindenb/jvarkit/wiki/Biostar154220
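The selection step can be sketched in Python. This is a toy model of the idea as I read it from the description above, not jvarkit's actual code: each group holds the intervals of all reads sharing one name (i.e. a whole pair), and a group is written out only if every base it touches is still below the cap, which keeps pairs intact.

```python
from collections import defaultdict

def cap_coverage(read_groups, cap):
    """Toy model of per-name depth capping (not jvarkit's real code).

    read_groups: list of groups; each group is a list of (start, end)
    half-open intervals for all reads sharing one name (a pair).
    A group is accepted only if adding it keeps depth <= cap at every
    base it covers; otherwise the whole pair is dropped.
    """
    depth = defaultdict(int)
    kept = []
    for group in read_groups:
        covered = [p for start, end in group for p in range(start, end)]
        if all(depth[p] < cap for p in covered):
            for p in covered:
                depth[p] += 1
            kept.append(group)
    return kept, depth
```

With three identical pairs and a cap of 2, the first two pairs are accepted and the third is rejected whole, so both mates of every kept read survive.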

$ java -jar dist/sortsamrefname.jar -T bam /input.bam  |\
  java -jar dist/biostar154220.jar  -n 20 -T bam |\
  samtools sort - output


$ samtools mpileup output.bam  | cut -f 4 | sort | uniq -c

  12692 0
 596893 1
  94956 10
  56715 11
  76947 12
  57912 13
  66585 14
  51961 15
  63184 16
  47360 17
  65189 18
  65014 19
 364524 2
 169064 20
  72078 3
 118288 4
  54802 5
  82555 6
  53175 7
  78474 8
  54052 9

(What were those 21 and 22 counts? I suppose it was because I didn't consider supplementary, secondary, or duplicate reads.) (Bug fixed.) Anyway, can you tell me if that's what you need?

Pierre.


modified 3.3 years ago, written 3.3 years ago by Pierre Lindenbaum

Brilliant! I'll give it a spin with some real data and get back to you.

written 3.3 years ago by Danielk

small update: reads with MAPQ=0 are ignored now. https://github.com/lindenb/jvarkit/commit/484ae4fc0e3333893aa6f343db8774f0ede40dda#diff-98f5c184931195eb78ac6dbdaa58f976R165

written 3.3 years ago by Pierre Lindenbaum

This works great. Here's a region where the BAM in the lower panel was capped to 200x. (It's a bit hard to see, but the range in the upper panel is 0-2000, and in the lower 0-200.) 


An interesting effect: in the two slightly lower peaks in input.bam, the original coverage is 1000+, but in the output it's not 200; it's about 160 in the left peak and around 90 in the right (rather than the set cap of 200x). I assume the reason is related to the read-pair insert size (here around 450), since reads are added as pairs. I've read the source but haven't fully understood the implications of the algorithm. 
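That under-the-cap behaviour is consistent with a whole-pair veto. A toy illustration of my reading of the behaviour, not the actual jvarkit code: each fragment below is the interval one whole pair covers, and a fragment is accepted only if every base it spans is below the cap, so a single saturated base vetoes the entire pair and nearby bases can end up well under the cap.

```python
from collections import defaultdict

def greedy_cap(fragments, cap):
    """Toy whole-pair veto (not jvarkit's real code): a fragment is
    accepted only if every base it spans is still below the cap."""
    depth = defaultdict(int)
    for start, end in fragments:
        if all(depth[p] < cap for p in range(start, end)):
            for p in range(start, end):
                depth[p] += 1
    return depth

# Two pairs saturate bases 0-9; the next two pairs overlap that
# saturated stretch, so they are rejected wholesale and bases 10-14
# end at depth 0 even though the input coverage there was 2.
depth = greedy_cap([(0, 10), (0, 10), (5, 15), (5, 15)], cap=2)
```

The longer the insert, the wider the stretch of bases a single saturated position can drag below the cap, which would explain peaks landing at 160 or 90 instead of 200.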

input.bam: https://dl.dropboxusercontent.com/u/1769569/input.bam
output.bam: https://dl.dropboxusercontent.com/u/1769569/output.bam

commands:

~/bin/jvarkit/dist-1.133/sortsamrefname -T bam /proj/b2010040/input.bam |\
~/bin/jvarkit/dist-1.133/biostar154220 -n 200 -T bam |\
~/bin/autoseq/tools/bin/samtools sort /dev/stdin /proj/b2010040/output

edit: for future reference, input.bam contains reads mapped to M on GRCh37.

modified 3.3 years ago, written 3.3 years ago by Danielk

Thanks for letting me know that it works (I want to be first author on any paper! :-P)

written 3.3 years ago by Pierre Lindenbaum

Hey! You have amazing tools, thank you! But I'm struggling to do this exactly as you say:

$ java -jar ~/Programs/jvarkit/dist/sortsamrefname.jar -T bam /input.bam

[SEVERE][Launcher]There was an error in the command line arguments. Please check your parameters : Expected one or zero argument but got 3 : [-T, bam, /input.bam]

Should be something basic! Cheers,

modified 9 days ago, written 9 days ago by ricardoguerreiro212

The tool has changed since this post: http://lindenb.github.io/jvarkit/SortSamRefName.html

try:

 java -jar ~/Programs/jvarkit/dist/sortsamrefname.jar  input.bam
modified 9 days ago, written 9 days ago by Pierre Lindenbaum

So fast, thanks! It was basic; I was tired. Now I have a more relevant question: do I have to create a sequence dictionary for this to work?

Ignoring SAM validation error due to lenient parsing: Error parsing text SAM file. RG ID on SAMRecord not found in header: 10x; File tmp.bam; Line 6 Line: J00137:124:HV5G7BBXX:4:1121:15016:43902 147 1 5 0 120S27M4S = 5 -27 GGTGGTGGGTGGTTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTGTGTTTAGATGTGGCGGTGTGTTTATGCTCTGGGTCACATTTAAAGGTCAAATGTGACCCAGAGCATAAA 7)--)))-----F-JJ7FA-JJJFFJAJJJJFFFFJAAJJJFJJJJFJJFFFJJJJJJJ77JAJFJJF7AAJJJFJFAFAAA MD:Z:27 PG:Z:MarkDuplicates RG:Z:10x NM:i:0 AS:i:27 XS:i:27 [SEVERE][SortSamRefName]Reference index for '1' not found in sequence dictionary. java.lang.IllegalArgumentException: Reference index for '1' not found in sequence dictionary.

modified 9 days ago, written 9 days ago by ricardoguerreiro212

I imagine you're working with a BAM without a header. Most/all mapping tools include a sequence dictionary. This read is mapped to contig '1', but there is no '1' in the header.
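For illustration, a minimal header that would satisfy the parser here declares the contig and the read group the error complains about. The LN value is the GRCh37 chr1 length; the SM value is a placeholder you would replace with your sample name:

```
@HD	VN:1.6	SO:coordinate
@SQ	SN:1	LN:249250621
@RG	ID:10x	SM:sample1
```

`samtools reheader` can apply a header like this to an existing BAM.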

written 9 days ago by Pierre Lindenbaum

Once again simple, merci bien!

Keep up the helpfulness, it really makes a difference for starting researchers like me.

written 9 days ago by ricardoguerreiro212
Christian (Cambridge, US) wrote, 3.3 years ago:

You could also use "digital normalization" to address this problem: https://arxiv.org/abs/1203.4802

modified 22 months ago by Ian, written 3.3 years ago by Christian
Powered by Biostar version 2.3.0