Question: Add head to coverage output for many files
gravatar for Cindy
2.9 years ago by
Cindy30 wrote:

Hi, all, I used samtools depth to output read depth for all samples, those files have three columns but no head. I want to merge all samples together to calculate the mean read depth and sd, so I need to add sample name as a header before merge them together.

When I use echo, it also output ref pos and $F as head, but what I need is that $F is the sample name. My code is

find ReadDepth/ -name "OUT_*" | while read F ; 
echo "$(echo 'ref pos $F' | cat - $F)" > ${F%}_head

Please help me to correct it.


sequencing • 774 views
ADD COMMENTlink modified 2.9 years ago by James Ashmore2.9k • written 2.9 years ago by Cindy30

Can you illustrate by an example? Btw the sequencing tag in Q is irrelevant!

ADD REPLYlink written 2.9 years ago by Santosh Anand5.1k
gravatar for James Ashmore
2.9 years ago by
James Ashmore2.9k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.9k wrote:

You could also use the bedtools suite to create your coverage matrix with header information:

Calculate read coverage across the entire genome for multiple BAM files:

bedtools genomecov -bga -i sample1.bam -g genome.txt > sample1_genomecov.bedGraph
bedtools genomecov -bga -i sample2.bam -g genome.txt > sample2_genomecov.bedGraph
bedtools genomecov -bga -i sample3.bam -g genome.txt > sample3_genomecov.bedGraph

#  Easier and faster to do this with GNU parallel if you have it installed...
parallel 'bedtools genomecov -bga -i {} -g genome.txt > {.}_genomecov.bedGraph' ::: sample*.bam

Combine read coverage files into a single file with header:

bedtools unionbedg -header -names sample1 sample2 sample3 -g genome.txt -empty -i sample1_genomecov.bedGraph sample2_genomecov.bedGraph sample3_genomecov.bedGraph > genomecov.txt

You could then write a quick script to extract the statistics you want from the coverage file:

#!/usr/bin/env python3

import collections
import csv
import numpy

with open('genomecov.txt'. 'r') as handle:
    reader = csv.reader(handle, delimiter = '\t')
    header = next(reader)
    Row = collections.namedtuple('Row', header)
    for line in reader:
        row = Row(*line)
        coverage = row[3:]
        # Print the mean and standard deviation
        print(row.chrom, row.start, row.end, numpy.mean(coverage), numpy.std(coverage))

There are probably more elegant solutions available, but this should work if you're in a squeeze...

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by James Ashmore2.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1124 users visited in the last hour