17 months ago
berndmann
I used
samtools view {path}/{sample_name}.bam -F 4 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort -n | uniq -c > {path}readlength.txt
to create a read length distribution for my nebula.bam. The output looks like this:
98929 30
98321 31
85283 32
93128 33
72783 34
90507 35
81362 36
81355 37
73827 38
70665 39
82116 40
74862 41
68171 42
69581 43
65017 44
74617 45
65990 46
66215 47
63188 48
63776 49
61673 50
69611 51
63448 52
67838 53
57148 54
58645 55
56490 56
57091 57
56761 58
55588 59
53437 60
53376 61
52779 62
53832 63
52846 64
51626 65
50242 66
49143 67
51991 68
48566 69
45442 70
45470 71
42825 72
43132 73
41201 74
42314 75
37014 76
34177 77
29547 78
26587 79
23665 80
22278 81
18312 82
19352 83
16901 84
16819 85
14827 86
14269 87
12903 88
12951 89
11324 90
11640 91
9157 92
10129 93
8531 94
8585 95
7440 96
6783 97
6379 98
6730 99
5959 100
6763 101
3692 102
3033 103
2804 104
2142 105
1844 106
1234 107
1035 108
868 109
635 110
570 111
411 112
441 113
331 114
297 115
247 116
235 117
408 118
281 119
183 120
54 121
72 122
64 123
68 124
16 125
21 126
3 127
2 128
1 129
752448666 150
Is it normal for 30x WGS that almost all reads are exactly 150 bp? I guess the cutoff at 150 is due to an Illumina sequencing limitation, right?
Is there a useful way to plot such data?
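One possible way to plot this, as a minimal sketch: parse the two-column `uniq -c` output (count, then read length) and draw a bar chart with a log-scaled y-axis, so the 150 bp bin does not flatten everything else. The function names and the output file name `readlength.png` are made up for illustration, and the sketch assumes matplotlib is installed.

```python
def parse_length_counts(lines):
    """Parse `uniq -c` lines ("<count> <length>") into sorted (length, count) pairs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, length = int(fields[0]), int(fields[1])
        pairs.append((length, count))
    return sorted(pairs)


def plot_length_counts(pairs, out_png="readlength.png"):
    """Save a bar plot of read counts per read length (log-scaled y-axis)."""
    # Imported here so that parsing still works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # write to a file, no display required
    import matplotlib.pyplot as plt

    lengths = [length for length, _ in pairs]
    counts = [count for _, count in pairs]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(lengths, counts, width=1.0)
    ax.set_yscale("log")  # the 150 bp bin is orders of magnitude larger
    ax.set_xlabel("Read length (bp)")
    ax.set_ylabel("Number of reads (log scale)")
    fig.tight_layout()
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```

Usage would be something like `with open("readlength.txt") as fh: plot_length_counts(parse_length_counts(fh))`. A linear y-axis also works if you only care about how dominant the full-length reads are.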
No, there are Illumina sequencing kits that sequence longer reads (up to 300 cycles). It looks like your BAM contains reads of at most 150 bp, but that has nothing to do with the 30x WGS part.
I'm just looking for a BAM file from standard 30x WGS Illumina sequencing to get a feeling for the read-length distribution. Is there an example BAM they provide that I missed?
That will be completely sample-quality dependent. You may encounter samples where 99% of reads are 150 bp if they come from libraries with longer inserts.
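To put a number on "what share of reads is full length" for a given sample, a small sketch like the following could be used. It takes (length, count) pairs such as those in the table from the question; the function name is hypothetical and the pairs in the usage note are only an excerpt of that table.

```python
def full_length_fraction(pairs):
    """Return (max_length, fraction of reads at max_length) from (length, count) pairs."""
    total = sum(count for _, count in pairs)
    if total == 0:
        raise ValueError("no reads counted")
    max_length = max(length for length, _ in pairs)
    # Sum in case the same length appears in more than one pair.
    at_max = sum(count for length, count in pairs if length == max_length)
    return max_length, at_max / total
```

For example, `full_length_fraction([(30, 98929), (129, 1), (150, 752448666)])` reports the share of 150 bp reads among just those three bins; feeding in the complete table gives the share for the whole BAM.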