Compare genotype genome sequences at basepair level
1
3
Entering edit mode
16 months ago
kashiff007 ★ 1.9k

I know there are several methods to compare two closely related genomes; genotype. But almost all of them compare the stretch of matching sequences. For example blastn table output looks like:

 1.  qseqid      query or source (e.g., gene) sequence id
2.  sseqid      subject  or target (e.g., reference genome) sequence id
3.  pident      percentage of identical matches
4.  length      alignment length (sequence overlap)
5.  mismatch    number of mismatches
6.  gapopen     number of gap openings
7.  qstart      start of alignment in query
8.  qend        end of alignment in query
9.  sstart      start of alignment in subject
10.  send        end of alignment in subject
11.  evalue      expect value
12.  bitscore    bit score


Although, blast provide one is-to many hits, there are many programs which provide one is-to one best hits. Like MashMap which provides a table of matching hits, with first half column has matching stretch from query (first genotype) and last half has information of corresponding hits from second genotype.

 q_chr.  len.    start.  end.    match.  ori. s_chr. len.    start.  end.    match. %_match
Chr1    9589196 0       4999    4999    +    Chr2   2416959 832373  837372  4999    94.2732
Chr1    9589196 10000   49999   39999   -    Chr3   6191988 4926167 4964747 38580   94.7695
Chr1    9589196 30000   224999  194999  -    Chr4   892816  441386  645465  204079  96.8471


The main demerit of this output/or blast output is the mismatch/indel location information is not reported.

I am looking for a program which compare two genotype (in one is-to one hit) and provide the information of each base in tabular form. The output may look like:

Chr1   1    C      Chr1   1    C
Chr1   2    G      Chr1   2    A
Chr1   3    G      Chr1   3    G
Chr1   4    T      insert      _
insert      _      Chr1   5    C
Chr1   5    G      Chr1   6    C
Chr1   6    A      Chr1   7    A

genotype comparative-genomics • 469 views
4
Entering edit mode
16 months ago

I have recently explored various alternatives to a similar problem and came away with the following potential solutions:

## Solution 1

The "easiest" to do this would be to generate a VCF variant file with a SNP calling tool, then transform that variant file into a tabular file with bcftools view. If you don't have reads and the variants are mostly short SNVs you could pretend you ran an experiment, simulate FASTQ reads from one genome call SNPs using the other as reference. This is probably only a good solution if the SNPs are simple one.

## Solution 2

See if dipcall works for you:

dipcall is not that well documented but as I understood it it generates VCF files from genome-wide alignments.

## Solution 3

parse the cs tag yourself.

The tool that I believe operates closest to what you need is minimap2, specifically the cs tag that can be computed for both BAM and PAF output.

That cs tag is a more sane CIGAR format that makes parsing out specific alignments a far easier task:

as the docs say an alignment such as :

CGATCGATAAATAGAGTAG---GAATAGCA
||||||   ||||||||||   |||| |||
CGATCG---AATAGAGTAGGTCGAATtGCA


will be represented in the cs tag of:

:6-ata:10+gtc:4*at:3


There may be tools that already do that:

# Solution 4

I also am developing a tool called bio that has many use cases, one of them is that it can format alignment as tables. Suppose input.fa contains:

>seq1
GATTACA
>seq2
GATCATA


then align with mafft:

mafft input.fa > aligned.fa


and visualize with bio:

cat aligned.fa | bio format --diff


will print:

t4c SNP 4   t   c   seq1    seq2
c6t SNP 6   c   t   seq1    seq2


See more on bio here: