Question: Triplet frequencies in human genome
0
5 weeks ago by
9606320
Italy
9606320 wrote:

Hello,

Does anybody know whether there is a list of nucleotide triplets with their frequencies in the human genome (hg19 or GRCh38 are both OK)?

Of course I can count them myself, but I would like to save some time.

counts genome • 142 views
modified 4 weeks ago by finswimmer (11k) • written 5 weeks ago by 9606 (320)

Not to my knowledge. However, if you need some ideas for computing it: https://unix.stackexchange.com/questions/231213/count-number-of-a-substring-repetition-in-a-string

written 5 weeks ago by Nicolas Rosewick (7.9k)
1

Jellyfish will do that efficiently.

written 5 weeks ago by genomax (70k)
4
4 weeks ago by
finswimmer11k
Germany
finswimmer11k wrote:

Based on Ensembl's GRCh38 (hg38) primary assembly:

$ parallel -j8 "samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa {2} \
| seqkit seq -w0 \
| tail -n+2 \
| LC_ALL=C grep -io  {1} \
| wc -l \
| awk -v kmer={1} '{print kmer,\$0}'" ::: `echo {C,A,G,T}{C,A,G,T}{C,A,G,T}|tr " " "\n"` ::: {1..22} X Y \
| awk -v OFS="\t" '{kmer[$1] += $2} END {for (k in kmer) {print k,kmer[k]}}' \
| { printf 'kmer\tcount\n'; LC_ALL=C sort -k1,1; } > kmer_counts.tsv

Please note that grep matches do not overlap. This means that a homopolymer stretch like TTTTTT is counted as 2 occurrences of TTT, not 4.
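
If overlapping occurrences are wanted instead, a sliding window can replace the grep step. The following is only a minimal sketch in awk, not a drop-in part of the pipeline above. The function name `count_overlapping_kmers` is hypothetical, and the sketch assumes one unwrapped sequence per input line (for example after `seqkit seq -w0 | tail -n+2`). Skipping windows that contain N or any other non-ACGT character is also an assumption about what should be counted. On whole chromosomes this will likely be slow, and some awk implementations may struggle with very long lines.

```shell
# Count overlapping k-mers (default k=3) in sequence lines read from stdin.
# Hypothetical helper; each input line is treated as an independent sequence.
count_overlapping_kmers() {
    awk -v k="${1:-3}" '
        {
            seq = toupper($0)
            # Slide a window of width k along the sequence, one base at a time.
            for (i = 1; i <= length(seq) - k + 1; i++) {
                kmer = substr(seq, i, k)
                # Assumption: ignore windows containing N or other non-ACGT symbols.
                if (kmer ~ /^[ACGT]+$/) counts[kmer]++
            }
        }
        END { for (kmer in counts) print kmer "\t" counts[kmer] }
    ' | LC_ALL=C sort -k1,1
}

# Non-overlapping count, as grep -o reports it: 2 matches of TTT
printf 'TTTTTT\n' | grep -o 'TTT' | wc -l

# Overlapping count with the sliding window: TTT is counted 4 times
printf 'TTTTTT\n' | count_overlapping_kmers 3
```

The same function could be fed one chromosome at a time, with the per-chromosome tables summed afterwards in the same way as in the pipeline above.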

modified 4 weeks ago • written 4 weeks ago by finswimmer (11k)

seqkit is not part of a standard Unix install and will have to be downloaded separately.

written 4 weeks ago by genomax (70k)

Neither is samtools ;)

written 4 weeks ago by Nicolas Rosewick (7.9k)