Question: How to get the sequence differences between multiple bacterial genomes
0
gravatar for Rose
7 weeks ago by
Rose0
India
Rose0 wrote:

I am working on some closely related bacterial species (complete genomes from NCBI). I would like to extract the sequence differences between them. To be more specific, I want to find unique sequences (50 -100 nts) in each of the bacterial species (n=6) under my study. I found many other related posts suggesting tools like Mauve, Mummer, Artemis ACT etc.. and I tried all of them. Mauve gave pretty good results, but when I extracted the sequences in the range as suggested to be different from other genomes under comparison, and performed blast, I got multiple hits to other bacterial species.

Thanks in advance.

alignment genome • 247 views
ADD COMMENTlink modified 7 weeks ago by shenwei3565.7k • written 7 weeks ago by Rose0
1

If you want the sequence to be inclusive/exclusive to your specific genome then you need to include a representative set of all available bacterial genomes in your exclusion set..

ADD REPLYlink written 7 weeks ago by 5heikki9.1k

Yes, excluding similar sequences in human genome would be better.

ADD REPLYlink written 7 weeks ago by shenwei3565.7k
1
gravatar for shenwei356
7 weeks ago by
shenwei3565.7k
China
shenwei3565.7k wrote:

Here's one solution with unikmer, for limited number of input files. for loop can be replaced with parallel for saving time.

  1. Generating k-mers.

    for f in *.fa.gz; do
        unikmer count -k 31 -K -s -o $f.k31 $f;
    done
    
  2. Computing k-mers shared by two or more strains. (updated here)

    unikmer common -n 2 *.k31.unik -o shared
    
    # method 2, faster
    unikmer sort  *.k31.unik --repeated -o shared
    
  3. Removing k-mers shared by two or more strainss, left are unique k-mers.

    for f in *.k31.unik; do 
        unikmer diff $f shared.unik -s -o $f.uniq;
    done
    
  4. Retrieving unique sequences from genome with unique k-mers. Output format can be in BED (default) or FASTA (-a), minimum lengths (-m) are configurable.

    for f in *.fa.gz; do
        unikmer uniqs -g $f $f.k31.unik.uniq.unik -M -m 50 -a -o $f.uniq.fa;
    done
    
  5. Stats of result.

    unikmer stats *.uniq.unik -a 
    file                                            k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version  number
    1773-GCA_000706665.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  16,217
    1773-GCA_000738445.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  11,939
    1773-GCA_000738475.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   7,074
    1773-GCA_000756525.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   9,679
    1773-GCA_000756545.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  72,867
    1773-GCA_000934585.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  20,135
    
    seqkit stats *.uniq.fa
    file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
    1773-GCA_000706665.1.fa.gz.uniq.fa  FASTA   DNA        358   28,030       50     78.3    1,005
    1773-GCA_000738445.1.fa.gz.uniq.fa  FASTA   DNA        379   23,088       50     60.9       89
    1773-GCA_000738475.1.fa.gz.uniq.fa  FASTA   DNA        220   13,486       50     61.3       99
    1773-GCA_000756525.1.fa.gz.uniq.fa  FASTA   DNA        239   16,306       50     68.2      354
    1773-GCA_000756545.1.fa.gz.uniq.fa  FASTA   DNA      1,888  129,254       50     68.5    6,865
    1773-GCA_000934585.1.fa.gz.uniq.fa  FASTA   DNA        479  211,054       50    440.6    9,400
    
ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by shenwei3565.7k

Perfect!! Thanks very much @shenwei356. This is exactly what I was looking for.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Rose0
1

Sorry, I just notice step 2 used the wrong command, and have edited the answer.

We should removing k-mers shared by >= 2 genomes (unikmer common), not just that shared by all genomes (unikmer inter).

ADD REPLYlink written 7 weeks ago by shenwei3565.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2109 users visited in the last hour
_