How to generate a short sequence that does not align to the hg38?
1
4
Entering edit mode
5.3 years ago
r.tor ▴ 50

I would like to find a short sequence (25 bp) that does not match anywhere within Human Reference Genome (hg38/hg19) through R scripts. I am not sure if there is a quick solution to do what I am looking for. So, I guess it would be solve, if I can create a list of random 25 bp fragments and then do an alignment. Ultimately, the lowest score output could be the supposed answer! If it might be a correct approach to do, could you please guide me how it can be done by R scripting. Thank you.

R alignment • 1.5k views
ADD COMMENT
2
Entering edit mode

Just thinking aloud, but you could screen for low abundance kmers in the Human genome, and then create a random sequence from any low abundance/absent kmers.

That should reduce the likelihood of any matching, rather than a 'guess and check' approach.

ADD REPLY
2
Entering edit mode

Keep in mind also that to ask for a sequence that does not align you need to decide what makes an alignment valid (e.g, maximum number of mismatches or minimum percent identity, E-value, etc). If you don't define any constraint, then any sequence can be aligned to any other sequence by allowing enough edits. (Presumably you leave this decision to your aligner by executing it with default settings but maybe worth to think about it...)

ADD REPLY
1
Entering edit mode

I’ll leave it to OP to edit, as I don’t like changing peoples wording if I can avoid it, but i suggest RefSeq in the thread title be changed to hg38 or whatever. It sounds like trying to create a sequence which doesn’t align anywhere in RefSeq at the moment which is probably nigh on impossible.

ADD REPLY
0
Entering edit mode
ADD REPLY
2
Entering edit mode
5.3 years ago

What's it for? 25 bp sequence not found in human genome?

A previous answer in another post suggested a good method by mapping k-mers to genome and filtering that not mapped.

Here's a brute force method by myself: counting smaller k-mers (k=14) in genome and filtering them out from all theoretical k-mers, smaller K uses less RAM and runs faster but it becomes less specific.

  1. Splitting human genome into files with single sequence using seqkit for speedup in later steps, rush is optional used for parallelization.

    # splitting
    seqkit split -s 1 Homo_sapiens.GRCh38.dna.primary_assembly.fa -O GRCh38/
    
    # renaming
    ls GRCh38/*.fa | rush 'mv {} {/}/$(seqkit head -n 1 {} | seqkit seq -n -i).fa'
    
  2. Dumping all k-mers for k = 14 using unikmer:

    # 268435455 == 1<<28 - 1
    time seq 0 268435455 | unikmer decode -k 14 | unikmer dump -K -o all.k14
    
    # sorting to reduce file size
    time unikmer sort all.k14.unik -o all.k14.sorted
    
  3. Counting k-mers in chromesomes: (Threads number (-j) depends on your RAM, mine is 48GB).

    ls GRCh38/*.fa  | rush -j 5 'unikmer count -k 14 -s -K -c -o {}.k14 {} --verbose'
    
  4. Finding k-mers not in GRCh38: (Threads number (-j) depends on your RAM, mine is 48GB).

    time unikmer diff all.k14.sorted.unik GRCh38/*.k14.unik -o all.k14.sorted.uniq@GRCh38 -s -j 4 --verbose
    
  5. Results: 20,423,600 / 268,435,456 == 7.61% 14-mers are not found in human genome:

    unikmer stats all.k14.sorted.uniq@GRCh38.unik -a
    file                              k  gzipped  compact  canonical  sorted      number
    all.k14.sorted.uniq@GRCh38.unik  14  ✓        ✕        ✓          ✓       20,423,600
    
  6. Checking:

    # sampling 20 k-mers
    unikmer view all.k14.sorted.uniq@GRCh38.unik -a \
        | seqkit sample -p 0.000001 | seqkit head -n 20 > t.fa
    
    seqkit seq -s t.fa
    AAGCGTAAGCGGCA
    ACGCACCGAGAGTG
    ACGGATCCCGGTCG
    ACGTAGAACGGACC
    AGCAGGTGACGTCG
    AGCGCGCGTTCAGA
    ATAAGTCGCGCCCA
    ATATCGACTCCGTG
    ATGCACTTGATCGG
    ATTGACGCAGCGAA
    CACGCGGCAGCGAG
    CGCGTTGGACATAG
    CGCTCGCCAGTTGC
    CGCTTTGTCGAACA
    GACGGCCGACGATA
    GAGTCTCGCACGAA
    GATCGCATATGTTA
    
    seqkit locate -i -f t.fa Homo_sapiens.GRCh38.dna.primary_assembly.fa
    seqID   patternName     pattern strand  start   end     matched
    
    # no hit
    

At last, you can freely extend the 14-kmers to 25bp.

ADD COMMENT

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6