Question: How to generate a short sequence that does not align to the hg38?
gravatar for r.tor
6 months ago by
r.tor40 wrote:

I would like to find a short sequence (25 bp) that does not match anywhere within Human Reference Genome (hg38/hg19) through R scripts. I am not sure if there is a quick solution to do what I am looking for. So, I guess it would be solve, if I can create a list of random 25 bp fragments and then do an alignment. Ultimately, the lowest score output could be the supposed answer! If it might be a correct approach to do, could you please guide me how it can be done by R scripting. Thank you.

alignment R • 396 views
ADD COMMENTlink modified 6 months ago • written 6 months ago by r.tor40

Just thinking aloud, but you could screen for low abundance kmers in the Human genome, and then create a random sequence from any low abundance/absent kmers.

That should reduce the likelihood of any matching, rather than a 'guess and check' approach.

ADD REPLYlink written 6 months ago by jrj.healey13k

Keep in mind also that to ask for a sequence that does not align you need to decide what makes an alignment valid (e.g, maximum number of mismatches or minimum percent identity, E-value, etc). If you don't define any constraint, then any sequence can be aligned to any other sequence by allowing enough edits. (Presumably you leave this decision to your aligner by executing it with default settings but maybe worth to think about it...)

ADD REPLYlink written 6 months ago by dariober10k

I’ll leave it to OP to edit, as I don’t like changing peoples wording if I can avoid it, but i suggest RefSeq in the thread title be changed to hg38 or whatever. It sounds like trying to create a sequence which doesn’t align anywhere in RefSeq at the moment which is probably nigh on impossible.

ADD REPLYlink written 6 months ago by jrj.healey13k

Use this for some inspiration:

ADD REPLYlink written 6 months ago by genomax69k
gravatar for shenwei356
6 months ago by
shenwei3564.7k wrote:

What's it for? 25 bp sequence not found in human genome?

A previous answer in another post suggested a good method by mapping k-mers to genome and filtering that not mapped.

Here's a brute force method by myself: counting smaller k-mers (k=14) in genome and filtering them out from all theoretical k-mers, smaller K uses less RAM and runs faster but it becomes less specific.

  1. Splitting human genome into files with single sequence using seqkit for speedup in later steps, rush is optional used for parallelization.

    # splitting
    seqkit split -s 1 Homo_sapiens.GRCh38.dna.primary_assembly.fa -O GRCh38/
    # renaming
    ls GRCh38/*.fa | rush 'mv {} {/}/$(seqkit head -n 1 {} | seqkit seq -n -i).fa'
  2. Dumping all k-mers for k = 14 using unikmer:

    # 268435455 == 1<<28 - 1
    time seq 0 268435455 | unikmer decode -k 14 | unikmer dump -K -o all.k14
    # sorting to reduce file size
    time unikmer sort all.k14.unik -o all.k14.sorted
  3. Counting k-mers in chromesomes: (Threads number (-j) depends on your RAM, mine is 48GB).

    ls GRCh38/*.fa  | rush -j 5 'unikmer count -k 14 -s -K -c -o {}.k14 {} --verbose'
  4. Finding k-mers not in GRCh38: (Threads number (-j) depends on your RAM, mine is 48GB).

    time unikmer diff all.k14.sorted.unik GRCh38/*.k14.unik -o all.k14.sorted.uniq@GRCh38 -s -j 4 --verbose
  5. Results: 20,423,600 / 268,435,456 == 7.61% 14-mers are not found in human genome:

    unikmer stats all.k14.sorted.uniq@GRCh38.unik -a
    file                              k  gzipped  compact  canonical  sorted      number
    all.k14.sorted.uniq@GRCh38.unik  14  ✓        ✕        ✓          ✓       20,423,600
  6. Checking:

    # sampling 20 k-mers
    unikmer view all.k14.sorted.uniq@GRCh38.unik -a \
        | seqkit sample -p 0.000001 | seqkit head -n 20 > t.fa
    seqkit seq -s t.fa
    seqkit locate -i -f t.fa Homo_sapiens.GRCh38.dna.primary_assembly.fa
    seqID   patternName     pattern strand  start   end     matched
    # no hit

At last, you can freely extend the 14-kmers to 25bp.

ADD COMMENTlink modified 6 months ago • written 6 months ago by shenwei3564.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1657 users visited in the last hour