MAF format help
2
0
5.4 years ago

Does anyone know the format of the *.maf files generated by the UCSC tools mafFrags and mafFrag?

I ran this command:

mafFrags -refCoords dm3 multiz15way dm3_mrp1.bed mrp1_maffrags.maf

where dm3_mrp1.bed contains just one line:

chr2L   12727116        12737959        mrp1_dm3        1       +

This generated a file containing an alignment:

maf version=1 scoring=zero
a score=0.000000
s dm3.chr2L 12727116 10843 + 23011544 GTA--AGTCCAT ...
s droYak2          0 10123 +    10123 GTA--AGTCCAT ...
s droPer1          0  8480 +     8480 GTG--AGTCCAT ...
s droMoj3          0  9415 +     9415 GTG--AGTTCGC ...
s droEre2          0 10315 +    10315 GTA--AGTCCGC ...
s droSec1          0  9144 +     9144 GTA--AGTCCAT ...
s droVir3          0  8449 +     8449 GTG--AGTTCAT ...
s droGri2          0  8542 +     8542 GTG--AGTTCAT ...
s droWil1          0  8825 +     8825 GTA--AGCTGAT ...
s dp4              0  8633 +     8633 GTG--AGTCCAT ...
s droAna3          0  9907 +     9907 GTG--AGTTCAT ...
s droSim1          0  8511 +     8511 GTA--AGTCCAT ...
s anoGam1          0  6036 +     6036 GTA......... ...
s apiMel3          0  2144 +     2144 GTA--aatttat ...
s triCas2          0  4719 +     4719 GTGAGTGTTTGT ...

What does this mean? How do I get the actual genomic regions (I eventually need to look at the annotation of these genomes)?

For example, how can I obtain the droPer1 genomic region, given the line

s droPer1 0 8480 + 8480

?

ucsc mafFrags • 3.8k views
ADD COMMENT
0

Also an additional question:

Is there a way to visualize a MAF file? I've tried the IGV browser, but it seems to fail when importing the alignment.

ADD REPLY
0

There's another post on this topic: MAF multiple alignment file viewer

ADD REPLY
2
5.4 years ago

You can find the MAF format description from UCSC here. This is a fairly standard file format and is also used by Ensembl.

There's a section there on 'lines starting with s', which are the sequences within your alignment block. If you want to work out the genomic coordinates, you will need to do a conversion as detailed here, using the source and start fields, which are the 2nd and 3rd columns respectively:

src -- The name of one of the source sequences for the alignment. For sequences that are resident in a browser assembly, the form 'database.chromosome' allows automatic creation of links to other assemblies. Non-browser sequences are typically reference by the species name alone.

start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).
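As a rough illustration of the field layout quoted above, here is a minimal Python sketch that splits an 's' line into its fields and applies the reverse-strand transform. The names MafSeqLine, parse_s_line and forward_interval are made up for this example and are not part of any UCSC tool. It assumes whitespace-separated fields in the order src, start, size, strand, srcSize, text.

```python
from typing import NamedTuple, Tuple


class MafSeqLine(NamedTuple):
    """One 's' line of a MAF block (hypothetical helper type)."""
    src: str       # e.g. "dm3.chr2L", or just "droPer1"
    start: int     # zero-based start, relative to the given strand
    size: int      # number of non-gap bases in the aligned text
    strand: str    # "+" or "-"
    src_size: int  # total length of the source sequence
    text: str      # aligned bases with gaps (may be empty if omitted)


def parse_s_line(line: str) -> MafSeqLine:
    """Split a MAF 's' line into typed fields."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "s":
        raise ValueError(f"not a MAF 's' line: {line!r}")
    text = fields[6] if len(fields) > 6 else ""
    return MafSeqLine(fields[1], int(fields[2]), int(fields[3]),
                      fields[4], int(fields[5]), text)


def forward_interval(rec: MafSeqLine) -> Tuple[int, int]:
    """Return a zero-based, half-open (start, end) on the forward strand.

    For "-" strand records the MAF start counts from the end of the
    reverse-complemented source, so it is flipped using src_size.
    """
    start = rec.start if rec.strand == "+" else rec.src_size - rec.start - rec.size
    return start, start + rec.size


if __name__ == "__main__":
    rec = parse_s_line("s dm3.chr2L 12727116 10843 + 23011544 GTA--AGTCCAT")
    print(rec.src, forward_interval(rec))
```

For the dm3 line from the question this gives the interval 12727116 to 12737959, which matches the BED line passed to mafFrags.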

So I think that in this example, s droPer1 0 8480 + 8480, it is chromosome 1, with the same start coordinate as the source (dm3.chr2L, 12727116), and the length of the alignment is 8,480 bp.

The Ensembl MAF readme has a bit more information on calculating the coordinates, but it seems we store a bit more information in the 2nd column with regard to the coordinates.

ADD COMMENT
0

Thank you for your answer and links. That helps a lot!

But why is the chromosome 1 in my example? Did you mean 2L, the same as dm3?

ADD REPLY
1

No worries. Yes I think that it is correct. The MAF file stores information on alignment blocks, regions of the genomes that are similar. Whichever pipeline was used to create the multiple alignment will have searched across the genome for similar regions. There's no guarantee that the most similar region/block is on the same chromosome in all species.

There's a section in the Ensembl comparative genomics paper that explains the methodology behind multiple alignments if you're interested.

ADD REPLY
0

Thank you again!

I'm getting weird results when checking the alignment region: I ran BLAT on the droPer1 region from the alignment and got this hit:

browser details YourSeq  1377     1  1377  1377   100.0%  super_68    -       78837     80213   1377

What is this super_68 chromosome?

ADD REPLY
0

I'm not sure... it may refer to a super contig (long stretch of sequence not yet assembled into chromosomes). Or a part of an alternate assembly (i.e. not part of the main chromosome level assembly, a haplotypic region). I don't know what 'DroPer1' is so I can't really comment further.

ADD REPLY
0

It's the October 2005 Drosophila persimilis genome assembly, which was produced by the Broad Institute at MIT and Harvard. (link)

It looks like there is no chromosomal assembly for this fly, only contigs.

Thank you anyway for your help!

ADD REPLY
1
5.4 years ago

In case anyone else encounters this problem:

mafFrag doesn't provide the genomic regions of the genomes in the alignment. To get them, you will have to use the mafsInRegion tool:

$ mafsInRegion
mafsInRegion - Extract MAFS in a genomic region
usage:
    mafsInRegion regions.bed out.maf|outDir in.maf(s)
options:
    -outDir - output separate files named by bed name field to outDir
    -keepInitialGaps - keep alignment columns at the beginning and of a block that are gapped in all species
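
Once a MAF has been extracted this way, the per-species regions can be read off its 's' lines. The following is only a sketch, with a hypothetical function name (species_regions). It assumes the src field uses the 'database.chromosome' form from the UCSC format description, and it converts "-" strand starts to forward-strand coordinates.

```python
from typing import Iterable, Iterator, Tuple

# (chromosome or scaffold, start, end, strand); zero-based, half-open
BedRow = Tuple[str, int, int, str]


def species_regions(maf_lines: Iterable[str], db: str) -> Iterator[BedRow]:
    """Yield forward-strand regions for one assembly (e.g. "droPer1")."""
    for line in maf_lines:
        fields = line.split()
        # Only 's' lines carry coordinates; skip headers, 'a' lines, blanks.
        if len(fields) < 6 or fields[0] != "s":
            continue
        src, strand = fields[1], fields[4]
        start, size, src_size = int(fields[2]), int(fields[3]), int(fields[5])
        # Assumption: src looks like "database.chromosome".
        name, _, chrom = src.partition(".")
        if name != db or not chrom:
            continue
        if strand == "-":
            # MAF start is relative to the reverse-complemented source.
            start = src_size - start - size
        yield chrom, start, start + size, strand
```

A typical call would pass an open file handle, for example `list(species_regions(open("out.maf"), "droPer1"))`, where out.maf stands for whatever output name was given to mafsInRegion.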
ADD COMMENT
