we can do the same by counting the number of lines in coverage.bed

Question

panel specific general statistics

0

Entering edit mode

6.8 years ago

simo017 ▴ 10

I am a total novest in bioinformatics and looking for some kind of guidance/suggestion.

If I have a Bam/vcf file of whole exome and I am interested to do a panel sequencing (i.e. a 100 genes panel) general statistic, for instance to find the number of exons, Bases, median coverage, bases where coverage is 15x, 30x, percent of coverage >20x etc for that specific panel.

Is there any tool to go for? if not, what would be the right approach to do the job?

Any help is very appreciated

panel exome coverage statistic ngs • 2.2k views

ADD COMMENT • link 6.7 years ago by simo017 ▴ 10

0

Entering edit mode

Hi,

May be you can have a look on bedtools : http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

You will need the position of your genes in a bed format.

Best

ADD REPLY • link 6.8 years ago by Titus ▴ 910

0

Entering edit mode

Thank you very much @Titus. Is that mean if I want a pipline to do that I need first to use i.e Awk to convert the genes_names.bed to genes_location.bed and then feed it to a bedtool command. right?

ADD REPLY • link 6.8 years ago by simo017 ▴ 10

0

Entering edit mode

Hi,

Yes you will need to convert you gene list with genomic position. You can use AWK or have looks here : https://genome.ucsc.edu/cgi-bin/hgTables .

ADD REPLY • link 6.8 years ago by Titus ▴ 910

0

Entering edit mode

@Titus I tried to follow your instruction by creating a bed file from a list of gene names from your link and then used the following command with samtools, samtools bedcov /path_to/mygenes1.bed /path_to/mybamfile.bam

But I got the following error: Errors in BED line 'chr1 2524251 2524425 ENST00000469962_exon_1_20_chr1_2524272_r 0 -' Errors in BED line 'chr1 2525232 2525532 ENST00000469962_exon_2_20_chr1_2525253_r 0 -' Errors in BED line 'chr1 2537699 2537825 ENST00000509374_exon_0_20_chr1_2537720_r 0 -' Errors in BED line 'chr1 2538392 2538528 ENST00000509374_exon_1_20_chr1_2538413_r 0 -' Errors in BED line 'chr1 2540757 2540878 ENST00000509374_exon_2_20_chr1_2540778_r 0 -' Errors in BED line 'chr1 2541088 2541290 ENST00000509374_exon_3_20_chr1_2541109_r 0 -' Errors in BED line 'chr1 2542699 2542860 ENST00000509374_exon_4_20_chr1_2542720_r 0 -' Errors in BED line 'chr1 2561176 2561656 ENST00000511099_exon_0_20_chr1_2561197_r 0 -' Errors in BED line 'chr1 2564284 2564438 ENST00000511099_exon_1_20_chr1_2564305_r 0 -' Errors in BED line 'chr1 2563986 2564134 ENST00000427302_exon_0_20_chr1_2564007_f 0 +' Errors in BED line 'chr1 2567774 2568079 ENST00000427302_exon_1_20_chr1_2567795_f 0 +'

Mygenes1.bed looks like that: track name="tb_ensGene" description="table browser query on ensGene" visibility=3 url= chr1 66999045 66999110 ENST00000237247_exon_0_20_chr1_66999066_f 0 + chr1 66999908 67000071 ENST00000237247_exon_1_20_chr1_66999929_f 0 + chr1 67091509 67091613 ENST00000237247_exon_2_20_chr1_67091530_f 0 + chr1 67098732 67098797 ENST00000237247_exon_3_20_chr1_67098753_f 0 + chr1 67099742 67099866 ENST00000237247_exon_4_20_chr1_67099763_f 0 + chr1 67105439 67105536 ENST00000237247_exon_5_20_chr1_67105460_f 0 + chr1 67108472 67108567 ENST00000237247_exon_6_20_chr1_67108493_f 0 + chr1 67109206 67109422 ENST00000237247_exon_7_20_chr1_67109227_f 0 + chr1 67126175 67126227 ENST00000237247_exon_8_20_chr1_67126196_f 0 + chr1 67133192 67133244 ENST00000237247_exon_9_20_chr1_67133213_f 0 + chr1 67136657 67136722 ENST00000237247_exon_10_20_chr1_67136678_f 0 + chr1 67137606 67137698 ENST00000237247_exon_11_20_chr1_67137627_f 0 + chr1 67138943 67139069 ENST00000237247_exon_12_20_chr1_67138964_f 0 + chr1 67142666 67142799 ENST00000237247_exon_13_20_chr1_67142687_f 0 + ..... Any idea what went wrong?

ADD REPLY • link 6.7 years ago by simo017 ▴ 10

0

Entering edit mode

Hi, Your bed looks good , what do you have per line ? something like that :

chr1 66999045 66999110 ENST00000237247_exon_0_20_chr1_66999066_f 0 +

ADD REPLY • link 6.7 years ago by Titus ▴ 910

0

Entering edit mode

Yes, exactly multi-lines in a format like that

ADD REPLY • link 6.7 years ago by simo017 ▴ 10

0

Entering edit mode

Is it tabulation or simple space between columns ?

ADD REPLY • link 6.7 years ago by Titus ▴ 910

0

Entering edit mode

I think it is a tabulation.

I also tried the bedtools command: bedtools coverage -abam /path_to_bam/file.bam -b /path_to_bed/mygenes1.bed -d > /path_to/result.txt

but I got as a result: * WARNING: File - has inconsistent naming convention for record: 13 32890537 32890720 M01363:123:000000000-ANHGM:1:2117:25174:10912 44 +

Any idea what I am doing wrong?

ADD REPLY • link 6.7 years ago by simo017 ▴ 10

0

Entering edit mode

check your chromosome names: chr13 vs 13. Check your reference fasta for chromosome names (>) used in alignment. Your bed file has chr in chromosome names

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

example of the fasta file: @NS500441:186:HLM5FAFXX:4:11401:23513:1026 2:N:0:CAGAGAGG+CTCCTTAC CAATAAACCATGAGAAACCTACCATTGTATTGTGCTGAGAAGTGGGAGTCCTATGGCCACAAGGCTCTTGCGTCAGCAGAGAAACTGGCTGCAATTCCTCAGAGTGAGGCTTGTTNGGANAANN + AAAAA66EAEEEEE/E/EE/EAEEEEEEEEEEAEEEEEEEEEEEEEEEAEEE/EEEEEEE/EE6EEE/<aeaeeeee eaaa="" aeeee="" <="" e="" e<6eeaee="" eaaeee6="" <aa#eea#ea##<="" p="">

and here is an example of my bam file when converted to bed with bamTobed command: 3523 34 + 41276032 41276199 255,0,0 1 167 0 17 41276032 41276199 M01363:123:000000000-ANHGM:1:2108:27336:5913 21 + 41276032 41276199 255,0,0 1 167 0 17 41276032 41276199 M01363:123:000000000-ANHGM:1:1104:25649:11829 34 + 41276032 41276199 255,0,0 1 167 0 17 41276032 41276199 M01363:123:000000000-ANHGM:1:2102:24036:2255 34 + 41276032 41276199 255,0,0 1 167 0 17 41276032 41276199 M01363:123:000000000-ANHGM:1:1115:7116:17263 34 + 41276032 41276199 255,0,0 1 167 0 17 41276032 41276199 M01363:123:000000000-ANHGM:1:2114:21848:11026 21 + 41276032 41276199 255,0,0 1 167 0 17 41276032 41276199 M01363:123:000000000-ANHGM:1:1110:16865:16320 21 + 41276032 41276199 255,0,0 1 167 0

I see in the bam file that the chromosome names are only numbers without "chr" unlike the Mygenes.bed which include the "chr". Hence, since Mygenes.bed file is made by genome.ucsc web browser. is there a way to change the output in that browser to match the format I have in the bam file or the only way I can go is to change the pipline of the fasta alignement in order to get a bam/sam file without the "chr" on it?

ADD REPLY • link 6.7 years ago by simo017 ▴ 10

0

Entering edit mode

Hi, I think the easier is to had chr before number in you chromosome you can do :

cat your_ref.fa | perl -pe 's/^>/>chr/g' > your_ref_modified.fa

After you have to realign to have the good ref, and you should use chr against a number it's easier if you need to read your bam/sam file to check something :)

Best

ADD REPLY • link 6.7 years ago by Titus ▴ 910

score 0 · Answer 1 · 2017-08-24

Hi, At the end I ended up doing this:

bedtools coverage -a target.sorted_no_chr.bed -b sdx_1.bed -d > coverage.bed

awk '$8>-1 {print}' coverage.bed | wc -l

23715

we can do the same by counting the number of lines in coverage.bed

wc -l < coverage.bed

23715

how many base pairs has more than 20x coverage

cat coverage.bed | awk '$8>20 {print}' | wc -l

19848

What is the percentage of bases pairs that have more than 20x

bc -l<<<19848*100/23715

83.69386464263124604680

total number of reads of the target region

awk 'OFS="\t" {SUM += $8} END {print SUM}' coverage.bed

409763804

average coverage of the target region

awk 'OFS="\t" {SUM += $8} END {print SUM/NR}' coverage.bed

17278.7

target.sorted_no_chr.bed example content:

13      32906398        32907534        NM_000059_exon_9_10_chr13_32906409_f    0       +
13      32910391        32915343        NM_000059_exon_10_10_chr13_32910402_f   0       +
13      32918684        32918800        NM_000059_exon_11_10_chr13_32918695_f   0       +
13      32920953        32921043        NM_000059_exon_12_10_chr13_32920964_f   0       +
13      32928987        32929435        NM_000059_exon_13_10_chr13_32928998_f   0       +

sdx_1.bed example content:

13      32890537        32890720        M01363:123:000000000-ANHGM:1:2112:20339:9512    44      +
13      32890537        32890720        M01363:123:000000000-ANHGM:1:2107:18752:22562   44      +
13      32890537        32890720        M01363:123:000000000-ANHGM:1:1108:9974:20382    44      +
13      32890537        32890720        M01363:123:000000000-ANHGM:1:2112:3611:15774    44      +
13      32890537        32890720        M01363:123:000000000-ANHGM:1:2109:22374:6929    44      +

coverage.bed:

13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       1       0
13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       2       0
13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       3       0
13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       4       0
13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       5       0
13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       6       0
13      32889606        32889814        NM_000059_exon_0_10_chr13_32889617_f    0       +       7       0

However, the results was different from the one I got from Qualimap tools, is there something wrong with the commands above?