How to get the sequence differences between multiple bacterial genomes
1
1
Entering edit mode
3.4 years ago
Rose ▴ 20

I am working on some closely related bacterial species (complete genomes from NCBI). I would like to extract the sequence differences between them. To be more specific, I want to find unique sequences (50 -100 nts) in each of the bacterial species (n=6) under my study. I found many other related posts suggesting tools like Mauve, Mummer, Artemis ACT etc.. and I tried all of them. Mauve gave pretty good results, but when I extracted the sequences in the range as suggested to be different from other genomes under comparison, and performed blast, I got multiple hits to other bacterial species.

Thanks in advance.

genome alignment • 3.7k views
ADD COMMENT
1
Entering edit mode

If you want the sequence to be inclusive/exclusive to your specific genome then you need to include a representative set of all available bacterial genomes in your exclusion set..

ADD REPLY
0
Entering edit mode

Yes, excluding similar sequences in human genome would be better.

ADD REPLY
4
Entering edit mode
3.4 years ago

Here's one solution with unikmer, for limited number of input files. for loop can be replaced with parallel for saving time.

  1. Generating k-mers.

    for f in *.fa.gz; do
        unikmer count -k 31 -K -s -o $f.k31 $f;
    done
    
  2. Computing k-mers shared by two or more strains. (updated here)

    unikmer common -n 2 *.k31.unik -o shared
    
    # method 2, faster
    unikmer sort  *.k31.unik --repeated -o shared
    
  3. Removing k-mers shared by two or more strainss, left are unique k-mers.

    for f in *.k31.unik; do 
        unikmer diff $f shared.unik -s -o $f.uniq;
    done
    
  4. Retrieving unique sequences from genome with unique k-mers. Output format can be in BED (default) or FASTA (-a), minimum lengths (-m) are configurable.

    for f in *.fa.gz; do
        unikmer uniqs -g $f $f.k31.unik.uniq.unik -M -m 50 -a -o $f.uniq.fa;
    done
    
  5. Stats of result.

    unikmer stats *.uniq.unik -a 
    file                                            k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version  number
    1773-GCA_000706665.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  16,217
    1773-GCA_000738445.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  11,939
    1773-GCA_000738475.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   7,074
    1773-GCA_000756525.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   9,679
    1773-GCA_000756545.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  72,867
    1773-GCA_000934585.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  20,135
    
    seqkit stats *.uniq.fa
    file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
    1773-GCA_000706665.1.fa.gz.uniq.fa  FASTA   DNA        358   28,030       50     78.3    1,005
    1773-GCA_000738445.1.fa.gz.uniq.fa  FASTA   DNA        379   23,088       50     60.9       89
    1773-GCA_000738475.1.fa.gz.uniq.fa  FASTA   DNA        220   13,486       50     61.3       99
    1773-GCA_000756525.1.fa.gz.uniq.fa  FASTA   DNA        239   16,306       50     68.2      354
    1773-GCA_000756545.1.fa.gz.uniq.fa  FASTA   DNA      1,888  129,254       50     68.5    6,865
    1773-GCA_000934585.1.fa.gz.uniq.fa  FASTA   DNA        479  211,054       50    440.6    9,400
    
ADD COMMENT
0
Entering edit mode

Perfect!! Thanks very much @shenwei356. This is exactly what I was looking for.

ADD REPLY
1
Entering edit mode

Sorry, I just notice step 2 used the wrong command, and have edited the answer.

We should removing k-mers shared by >= 2 genomes (unikmer common), not just that shared by all genomes (unikmer inter).

ADD REPLY
0
Entering edit mode

I used this method and I can see that I got unique sequences from my genome, I sorted them according to length using Jalview and blasted them in NCBI blast. Just my question, Is using this method is the best to design PCR primers uniquely for my genome of interest?

ADD REPLY
0
Entering edit mode

Once you got the unique sequence,s then you can design PCR primers using primer blast which designs primers and check the specificity.

Another tool: Fur: Find unique genomic regions for diagnostic PCR.

ADD REPLY
1
Entering edit mode

Thank sir for your answer, much appreciated.

I tried to use the FUR tool via Docker. But It did not give me any unique sequences. maybe because the sequences are very closely related and the bacteria is quite conserved. However, your tool provided me with around 37 sequences with an average len of 60 bp. Uniqueness for them stems from SNPs.(not sure, but I tried the mos lengthy) is there a way to sort them from the perspective of which sequence (out of the 37 ) has the higher variability? instead of manually blasting them one by one?

ADD REPLY
0
Entering edit mode

Sorry, I did not go that far. Please Blast and check them for now.

ADD REPLY
0
Entering edit mode

I blasted them. And indeed I have found what I was looking for. One non-coding fragment lies in an intergenic sequence that represents the best marker for my genome among its own lineage. So is unikemr has an upper hand over FUR if you dealing with very close genomes comparing them with each other (ex: inside a uropathogenic E. coli). Fur is better if you are comparing inside a uropathogenic E. coli to intestinal pathogneic E. coli. or further. Any way my judemnet is based on my own experience. Thanks for your help

ADD REPLY
0
Entering edit mode

use cd-hit with high identity cut-off. optimize identity cutoff till results catch the differences. cd-hit clusters sequences based on user furnished identity cut offs for nucleotide sequences.

ADD REPLY
0
Entering edit mode

Thanks for your suggestions.

ADD REPLY

Login before adding your answer.

Traffic: 2238 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6