MEFIT is a system for microarray integration developed by Curtis Huttenhower in Olga Troyanskaya's lab at Princeton University, with generous help from Chad Myers and Matt Hibbs. As a framework, MEFIT uses the results of many microarray experiments in combination with known biological process annotations (drawn from the Gene Ontology, KEGG, MIPS, or a biologist's own pathways of interest) to predict new gene pair functional relationships within the given biological functions. In other words, MEFIT takes microarray results and known functional annotations as input and produces predicted gene pair functional relationships as output.
For example, the S. cerevisiae genes ADY2, BDF1, BFR1, and so forth are known to participate in the meiotic cell cycle, while ATP1, ATP14, ATP15, etc. are all related to hydrogen transport by way of ATP synthesis. Given these known functional categories in addition to a collection of microarray experiments, MEFIT might predict that YHR159W is related to MMS4 in meiosis (but not hydrogen transport) with high probability, while YNL274C is related to INH1 in hydrogen transport (but not meiosis). For more details (and predictions), see Huttenhower et al. 2006.
To make these predictions, MEFIT uses a Bayesian network that consumes microarray data as input observations and produces predicted functional relationships through a single unobserved (except during training) node. Furthermore, to make predictions within the context of individual biological functions, a single Bayesian network structure is replicated once per function of interest. These networks with identical structure are then trained using known functional annotations (see below) such that each function's network learns its own set of conditional probabilities. These probabilities encode how predictive each microarray experiment is of a particular function; for example, a sporulation time course might be very predictive of meiosis, but not much help in determining which genes perform ATP synthesis.
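To make the idea of "one network structure, one set of learned probabilities per function" concrete, here is a minimal sketch. It assumes a naive Bayes structure (one hidden "functionally related" node with one child per microarray dataset), discretized dataset values, and Laplace smoothing; the function names and data layout are hypothetical and are not MEFIT's actual implementation.

```python
def train_function_network(gold_pairs, datasets, n_bins, pseudocount=1.0):
    """Learn one function's conditional probability tables (assumed naive Bayes).

    gold_pairs: dict mapping gene pair -> True (related in this function) / False
    datasets:   list of dicts mapping gene pair -> bin index in [0, n_bins)
    Returns (prior, cpts), where cpts[d][label][bin] estimates P(bin | label).
    """
    prior = sum(gold_pairs.values()) / len(gold_pairs)
    cpts = []
    for data in datasets:
        # Start every count at the pseudocount so no probability is zero.
        counts = {True: [pseudocount] * n_bins, False: [pseudocount] * n_bins}
        for pair, label in gold_pairs.items():
            if pair in data:  # pairs missing from a dataset are simply skipped
                counts[label][data[pair]] += 1
        cpts.append({label: [c / sum(row) for c in row]
                     for label, row in counts.items()})
    return prior, cpts


def predict_related(pair, prior, cpts, datasets):
    """Posterior probability that a gene pair is related in this function."""
    p_rel, p_unrel = prior, 1.0 - prior
    for data, cpt in zip(datasets, cpts):
        if pair in data:
            p_rel *= cpt[True][data[pair]]
            p_unrel *= cpt[False][data[pair]]
    return p_rel / (p_rel + p_unrel)


def train_all_functions(gold_by_function, datasets, n_bins):
    """Replicate the same structure once per function, each with its own tables."""
    return {name: train_function_network(gold, datasets, n_bins)
            for name, gold in gold_by_function.items()}
```

Under these assumptions, a dataset whose bins are distributed identically for related and unrelated pairs contributes equal factors to both hypotheses and cancels out, which is how an uninformative experiment ends up with little influence on a given function.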
This web site showcases MEFIT's predictions over a subset of the S. cerevisiae genome (specifically, our set of held out test genes in combination with genes annotated in GO to "biological process unknown"). MEFIT's raw output is one set of pairwise probabilities of functional relationship per biological function. To make these more interpretable, we present here a hierarchical clustering for each biological function using these pairwise probabilities as a similarity score; just as you can cluster by grouping well-correlated genes together, you can cluster by grouping genes predicted to relate with high probability.
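As an illustration of clustering on pairwise probabilities, the sketch below performs a simple agglomerative clustering that repeatedly merges the two most similar clusters. The choice of average linkage, the function name, and the input layout are assumptions made for this example; the linkage actually used for the clusterings on this site is not specified here.

```python
def cluster_by_probability(genes, prob):
    """Agglomerative clustering with pairwise probabilities as similarities.

    genes: list of gene names
    prob:  dict mapping frozenset({gene1, gene2}) -> predicted probability
    Returns the merge history as (cluster_a, cluster_b, similarity) tuples.
    """
    def similarity(a, b):
        # Average linkage (an assumption): mean probability across clusters.
        total = sum(prob[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(g,) for g in genes]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        merges.append((a, b, similarity(a, b)))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a + b)
    return merges
```

The merge history can be read as a dendrogram: early merges join genes predicted to be related with high probability, and later merges join progressively less related groups.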
Given a set of microarray results, MEFIT follows a four-step pipeline to prepare it for integration into the Bayesian framework:
This pipeline digests each microarray PCL (or similar) file into a collection of quantized pairwise z-scores. Each z-score collection provides the observations for one node in each function of MEFIT's Bayesian framework.
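One plausible way to turn an expression matrix into quantized pairwise z-scores is sketched below. The specific steps shown (Pearson correlation, Fisher's z-transform, standardization across all pairs, and binning), the bin edges, and every name in the code are assumptions for illustration and may differ from MEFIT's exact procedure.

```python
import bisect
import math
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither vector is constant


def quantized_pair_scores(expression, bin_edges):
    """Map each gene pair to a bin index in [0, len(bin_edges)].

    expression: dict mapping gene -> list of expression values
    bin_edges:  sorted z-score thresholds (hypothetical, e.g. [-0.5, 0.5])
    """
    fisher = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        r = max(min(r, 0.999999), -0.999999)  # keep atanh finite at |r| = 1
        fisher[(g1, g2)] = math.atanh(r)      # Fisher's z-transform
    # Standardize across all pairs in this dataset.
    mu, sigma = mean(fisher.values()), pstdev(fisher.values())
    return {pair: bisect.bisect_right(bin_edges, (f - mu) / sigma)
            for pair, f in fisher.items()}
```

Each resulting dictionary of bin indices would play the role of one dataset's observations. The sketch ignores missing values and assumes at least two distinct pair scores, so that the standard deviation is nonzero.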
The 200 Gene Ontology biological process functions used in our evaluation of MEFIT were derived from a collection of terms deemed to be biologically interesting by a panel of six biologists. These biologists were asked to evaluate whether, given a GO term, it would be biologically or experimentally useful to know that a gene was annotated to the term. This provided a way to prune down the many thousands of terms in GO relevant to S. cerevisiae without resorting to non-biological measures such as the minimum or maximum depth of a term (which fall prey to multiple paths to root) or the number of annotated genes (which can be confused by particularly well or poorly studied biological processes).
We initially required any GO term of interest to receive at least four positive votes from this panel. This defined a set of several thousand "interesting" terms. We added to this set every descendant of an "interesting" term not already included; if a particular function is informative, a more specific function should also be of interest. Finally, we selected only the "uppermost edge" of this set of terms, i.e. those terms possessing a path to root on which no other "interesting" terms fell.
This resulted in the set of 200 biological process terms that we used to generate our global answer set. When evaluating individual functions, any terms with fewer than ten gene pairs (taking into account both the number of annotated genes and the microarray data available for those genes) were discarded.
In summary, we:
Each function resulting from this process was used to train one copy of MEFIT's Bayesian network. The function set as a whole defined MEFIT's global gold standard related/unrelated training gene pairs, and individual functions each defined a function-specific subset of this standard for training and evaluating individual Bayesian networks. For more details, please see Huttenhower et al. 2006. Also note that you can download the final set of Gene Ontology terms in our Download area.
We'd like to extend our thanks to several groups and software packages that have helped to make MEFIT possible:
Please feel free to contact MEFIT's primary author (Curtis Huttenhower) or principal investigator (Olga Troyanskaya) either directly or through our lab web page.
TSR's Dungeons and Dragons defines a mephit as an elemental creature similar to an imp; like imps, they tend to be small, winged, horned, tailed, and unpleasant. Unlike imps, however, mephits are endowed with a particular elemental brand (fire, ice, and so forth), the essence of which they can breathe forth at will to discourage would-be attackers. The name is thus thought to have been derived from the Latin "mephitis" meaning "stench", most commonly appearing today in the family (Mephitidae), genus (Mephitis), and species (mephitis) of the striped skunk.
MEFIT came about in an attempt to find an appropriately themed name containing the letters M (microarray), F (function), and I (integration). Without the inconvenient "ph" in the middle, mephit was a perfect fit, leading to our Microarray Experiment Functional Integration Technology. Any references to unpleasant smells are thus purely coincidental. The logo appearing to the left was derived from the fine work of Jon and Ian Brumby by way of Google image search.