Question

Add head to coverage output for many files

0

Entering edit mode

6.8 years ago

Cindy ▴ 60

Hi, all, I used samtools depth to output read depth for all samples, those files have three columns but no head. I want to merge all samples together to calculate the mean read depth and sd, so I need to add sample name as a header before merge them together.

When I use echo, it also output ref pos and $F as head, but what I need is that $F is the sample name. My code is

find ReadDepth/ -name "OUT_*" | while read F ; 
do
echo "$(echo 'ref pos $F' | cat - $F)" > ${F%}_head
done

Please help me to correct it.

Thanks

sequencing • 1.9k views

ADD COMMENT • link updated 6.8 years ago by James Ashmore ★ 3.4k • written 6.8 years ago by Cindy ▴ 60

0

Entering edit mode

Can you illustrate by an example? Btw the sequencing tag in Q is irrelevant!

ADD REPLY • link 6.8 years ago by Santosh Anand 5.7k

score 1 · Answer 1 · 2017-07-20

You could also use the bedtools suite to create your coverage matrix with header information:

Calculate read coverage across the entire genome for multiple BAM files:

bedtools genomecov -bga -i sample1.bam -g genome.txt > sample1_genomecov.bedGraph
bedtools genomecov -bga -i sample2.bam -g genome.txt > sample2_genomecov.bedGraph
bedtools genomecov -bga -i sample3.bam -g genome.txt > sample3_genomecov.bedGraph

#  Easier and faster to do this with GNU parallel if you have it installed...
parallel 'bedtools genomecov -bga -i {} -g genome.txt > {.}_genomecov.bedGraph' ::: sample*.bam

Combine read coverage files into a single file with header:

bedtools unionbedg -header -names sample1 sample2 sample3 -g genome.txt -empty -i sample1_genomecov.bedGraph sample2_genomecov.bedGraph sample3_genomecov.bedGraph > genomecov.txt

You could then write a quick script to extract the statistics you want from the coverage file:

#!/usr/bin/env python3

import collections
import csv
import numpy

with open('genomecov.txt'. 'r') as handle:
    reader = csv.reader(handle, delimiter = '\t')
    header = next(reader)
    Row = collections.namedtuple('Row', header)
    for line in reader:
        row = Row(*line)
        coverage = row[3:]
        # Print the mean and standard deviation
        print(row.chrom, row.start, row.end, numpy.mean(coverage), numpy.std(coverage))

There are probably more elegant solutions available, but this should work if you're in a squeeze...