Question: Triplet frequencies in human genome
9606 (320, Italy) wrote 11 months ago:

Hello,

Does anybody know whether there is a list of nucleotide triplets with their frequencies in the human genome (hg19 or GRCh38 are both fine)?

Of course I can count them myself, but I would like to save some time.

counts genome • 326 views
modified 11 months ago by finswimmer (13k) • written 11 months ago by 9606 (320)

Not to my knowledge. However, if you need some ideas for computing it yourself: https://unix.stackexchange.com/questions/231213/count-number-of-a-substring-repetition-in-a-string

written 11 months ago by Nicolas Rosewick (8.8k)

Jellyfish will do that efficiently.

written 11 months ago by genomax (83k)
finswimmer (13k, Germany) wrote 11 months ago:

Based on Ensembl's GRCh38 (hg38) primary assembly:

$ parallel -j8 "samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa {2} \
| seqkit seq -w0 \
| tail -n+2 \
| LC_ALL=C grep -io {1} \
| wc -l \
| awk -v kmer={1} '{print kmer,\$0}'" ::: $(echo {C,A,G,T}{C,A,G,T}{C,A,G,T} | tr " " "\n") ::: {1..22} X Y \
| awk -v OFS="\t" '{kmer[$1] += $2} END {for (k in kmer) {print k,kmer[k]}}' \
| LC_ALL=C sort -k1,1 \
| awk 'BEGIN {print "kmer\tcount"} {print}' > kmer_counts.tsv

Please note that grep matches do not overlap. In a homopolymer stretch like TTTTTT, TTT is therefore counted 2 times, not 4.
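If overlapping occurrences are wanted instead, a sliding window over each sequence counts every position. The following is only a minimal sketch in plain awk, not part of the pipeline above: the function name `count_triplets` is made up for illustration, the input is assumed to be an ordinary FASTA file, soft-masked (lowercase) bases are uppercased, and any window containing N or another non-ACGT character is skipped. The last two bases of each line are carried over so that windows spanning a line break are counted, while windows are never bridged across two sequences.

```shell
# Hypothetical helper: count overlapping 3-mers in a FASTA file.
count_triplets() {
  awk '
    /^>/ { carry = ""; next }            # header line: start a new sequence
    {
      seq = carry toupper($0)            # prepend the last 2 bases of the previous line
      n = length(seq)
      for (i = 1; i <= n - 2; i++) {     # slide a window of width 3, step 1
        k = substr(seq, i, 3)
        if (k ~ /^[ACGT][ACGT][ACGT]$/) count[k]++
      }
      carry = (n >= 2) ? substr(seq, n - 1) : seq
    }
    END { for (k in count) print k "\t" count[k] }
  ' "$1" | LC_ALL=C sort -k1,1
}
```

Usage would look like `count_triplets Homo_sapiens.GRCh38.dna.primary_assembly.fa > kmer_counts_overlapping.tsv`. With this approach TTTTTT contributes 4 to TTT. On a whole genome it will probably be much slower than a dedicated k-mer counter.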

modified 11 months ago • written 11 months ago by finswimmer (13k)

seqkit is not part of a standard Unix install and has to be downloaded separately.

written 11 months ago by genomax (83k)

Neither is samtools ;)

written 11 months ago by Nicolas Rosewick (8.8k)