MEFIT: Microarray Experiment Functional Integration Technique
Curtis Huttenhower, Matt Hibbs, Chad Myers, Olga Troyanskaya
Department of Computer Science, Princeton University
Bioinformatics, 2006
Contact: chuttenh@princeton.edu, ogt@princeton.edu

-----------
Quick Start
-----------

For those of you just wanting to run MEFIT and see some results, try something like the following:

MEFIT -r related/ -u unrelated.txt -O global.xdsl -o learned/ -p predictions/ -t trusts.txt data_set_one.pcl data_set_two.pcl

The arguments are as follows:

-r related/
    A directory containing one or more files of related genes, e.g. lists of genes in several GO terms with one file per term.

-u unrelated.txt
    A file containing one or more pairs of unrelated genes, one pair per line.

-O global.xdsl
    The XDSL file to contain the generated global Bayesian network.

-o learned/
    A directory to contain the XDSL files generated for each of the per-function Bayesian networks. This directory must be created prior to running MEFIT!

-p predictions/
    A directory to contain the predicted probabilities of functional relationships, one file per function. This directory must be created prior to running MEFIT!

-t trusts.txt
    A file to contain a table of trust scores indicating how predictive each data set was found to be of each function.

data_set_one.pcl data_set_two.pcl
    One or more PCL files each representing a microarray data set. PCL files are assumed to have two header columns (in addition to the gene ID column), although this can be modified from the command line; see below for details.

------------
Introduction
------------

Thanks for downloading MEFIT! The MEFIT binary provides a basic implementation of the MEFIT system for microarray data set integration as described in the publication listed above. If you use MEFIT for your research, please cite us!
And speaking of research, please keep in mind that MEFIT is (very) research code; it's only mediocrely documented (you're reading it) and undoubtedly contains bugs galore, so feel free to contact us with bug reports or puzzled questions.

With that out of the way, recall that MEFIT is a pipeline that consumes three inputs:

1. One or more sets of known related genes (e.g. GO terms, pathways, complexes, etc.) referred to as _functions_.
2. One set of known unrelated gene pairs (derived however you'd like).
3. One or more microarray data sets (stored as PCL files).

From these inputs, MEFIT produces a variety of outputs:

1. One naive Bayesian network representing the global relationships between correlations in individual data sets and gene pair functional relationships.
2. One naive Bayesian network _per input function_ representing the relationships between data sets and functional relationships specifically within that biological function.
3. One prediction set _per input function_ of the probabilities of gene pair functional relationships for each pair in that function.
4. One set of "trust" scores indicating how informative each data set is within each biological function. These are calculated as the average difference in probability of functional relationship given data from a data set within a function.

This can be a lot of output, and MEFIT can consume a substantial amount of time and/or memory while generating it - processing the 40 data sets (~750 conditions) discussed in our publication took about a day on a standard desktop computer and consumed about 2 GB of memory in the process. Caveat researcher!

------------
Command Line
------------

If executed with no arguments or with the -h flag, MEFIT provides the following command line help:

Usage: MEFIT [OPTIONS]... [FILES]...
  -h, --help                   Print help and exit
  -V, --version                Print version and exit

Inputs:
  -r, --related=directory      Directory containing lists of known related genes
  -u, --unrelated=filename     List of known unrelated gene pairs

Outputs:
  -o, --output=directory       Directory to contain learned per-function Bayesian networks
  -O, --global=filename        Global learned Bayesian network
  -p, --predictions=directory  Directory to contain predicted probabilities of functional relationship
  -t, --trusts=filename        Trust scores learned per data set and function

Optional:
  -b, --bins=filename          Tab separated bin cutoffs
  -d, --distance=STRING        Distance measure (possible values="pearson", "euclidean", "kendalls", "kolm-smir", "spearman", "pearnorm" default='pearnorm')
  -g, --genes=filename         Subset of genes to include in evaluation
  -G, --genex=filename         Subset of genes to exclude from evaluation
  -R, --random=INT             Seed random generator (default=0)
  -s, --skip=INT               Additional columns to skip in input PCLs (default=2)
  -v, --verbosity=INT          Message verbosity (default=5)
  -x, --xdsl                   Output .xdsl files in place of .dsls (default=off)
  -z, --zero                   Zero missing values (default=off)
  -c, --cutoff                 Include only confidences above cutoff (default=0)

A description of each argument follows:

* [FILES]
    You must provide one or more PCL files as unflagged arguments to MEFIT. See below for a detailed description of the PCL file format.

* -r, --related=directory
    This is how you tell MEFIT what biological functions you're interested in. To MEFIT, a function is a list of related genes stored in a file, one gene per line (the GENE LIST file format; see below). The directory you pass to the -r flag should contain only function files; the name of each file will become the name of the function in MEFIT's output (e.g. a file named ribosome_biogenesis.txt will represent the ribosome_biogenesis function).

* -u, --unrelated=filename
    Unrelated genes come in pairs, and the -u flag accepts an input file containing one unrelated gene pair per line.
    Note that this is a single file, not a directory! See the GENE PAIR file format below.

* -o, --output=directory
    The -o flag takes a directory as its argument, which MUST exist before you execute MEFIT! This directory will be filled with (X)DSL files representing the Bayesian networks learned for each biological function of interest. These will all have the same structure with different parameters (i.e. conditional probability tables).

* -O, --global=filename
    The -O flag takes a filename as its argument, which should end in either .dsl or .xdsl, and which represents the global Bayesian network containing probabilities representative of the entire set of input data (not separated per function).

* -p, --predictions=directory
    The -p flag takes a directory as its argument, which MUST exist before you execute MEFIT! This directory will be filled with text files containing pairs of genes and their predicted probabilities of being functionally related, one pair (and probability) per line. See the SCORED GENE PAIR file format below.

* -t, --trusts=filename
    The -t flag takes a filename as its argument and outputs into this file a table indicating how predictive each microarray data set was found to be of each biological function. These "trust" scores are calculated as the average absolute difference in posterior probability given each possible input value from a particular data set within a particular function.

* -b, --bins=filename
    The optional -b flag takes a filename as its argument and, if given, reads from that file a list of quantization cutoffs used for transforming continuous z-scores (calculated from microarray correlation values) into discrete values for Bayesian inference.
    The number of bins used per conditional probability table will be equal to the number of quantization cutoffs; for example, the default values are -1, 0, 1, 2, 3, creating a Bayesian network in which each data set's correlations are binned into values less than -1, between -1 and 0, between 0 and 1, between 1 and 2, and above 2 (the last value is ignored). See the QUANTIZATION file format below.

* -d, --distance=STRING
    The optional -d flag takes a named similarity metric as its argument (defaulting to "pearnorm", Pearson correlations normalized using Fisher's z-transform). The given similarity metric will be used for calculating pairwise scores between genes in each microarray data set.

* -g, --genes=filename
    The optional -g flag takes a filename as its argument and, if given, forces MEFIT to operate only on pairs consisting of genes from this list. See the GENE LIST file format below.

* -G, --genex=filename
    The optional -G flag takes a filename as its argument and, if given, forces MEFIT to operate only on pairs not containing any genes from this list. See the GENE LIST file format below.

* -R, --random=INT
    The optional -R flag takes an integer as its argument and, if given, seeds the random number generator with this value. If -R is -1, the random number generator will be seeded pseudorandomly using the current time (tick count).

* -s, --skip=INT
    The optional -s flag takes an integer as its argument and, if given, skips that many columns _in addition to the initial gene ID column_ in the input PCL files. For example, if your PCL files have three columns preceding the condition names (e.g. GID, NAME, and GWEIGHT), -s should be 2 (the default). If your PCL files contain only a single gene ID column, -s should be 0. See the PCL file format below.

* -v, --verbosity=INT
    The optional -v flag takes an integer as its argument and, if given, increases or decreases the amount of trace output generated during MEFIT's execution.
    Making it too high may produce large globs of output!

* -x, --xdsl
    The optional -x flag, if present, generates DSL files in the learned output (-o) directory rather than XDSL files (the default).

* -z, --zero
    The optional -z flag, if present, zeros missing values during Bayesian learning rather than ignoring them. This is NOT recommended for general MEFIT usage, but could conceivably be useful if you have some sort of very odd data with very little disagreement between microarray data sets.

* -c, --cutoff
    The optional -c flag, if present, suppresses all prediction outputs less than the given cutoff. This can be useful if MEFIT is producing gargantuan text files for each of your functions of interest; turning the cutoff up to even a fairly small nonzero value can greatly reduce your output (since most gene pairs are not expected to be functionally related).

------------
File Formats
------------

* PCL
    MEFIT expects PCL files in the standard tab delimited format. Each PCL file should contain at least one initial column listing gene identifiers and exactly one initial row listing column and condition identifiers. If present, an EWEIGHT row will be ignored; any other row is considered to be a gene record. If additional initial columns are present (e.g. NAME, GWEIGHT, etc.) they should be skipped using the -s command line argument (see above). PCL files with missing values should be imputed and/or filtered before processing by MEFIT, although missing values will not (in most cases) prevent MEFIT from doing something useful with the data (it'll just be more accurate if you impute things first). For example:

==========
GID	NAME	GWEIGHT	T1	T2	T3
EWEIGHT			1	1	1
YFL039C	ACT1	1	1.0	-0.5	0.33
YDR211W	GCD6	1	-0.05	2.11	0.09
YJL005W	CYR1	1	0.0	-0.5	-1.5
==========

* GENE LIST
    A GENE LIST file is a text file containing one gene identifier per line.
    For example:

==========
YFL039C
YDR211W
YJL005W
==========

* GENE PAIR
    A GENE PAIR file is a text file containing one pair of gene identifiers per line, separated by a tab. For example:

==========
YFL039C	YDR211W
YFL039C	YJL005W
YDR211W	YJL005W
==========

* SCORED GENE PAIR
    A SCORED GENE PAIR file is a text file containing one pair of gene identifiers followed by a numerical score per line, all separated by tabs. For example:

==========
YFL039C	YDR211W	0.1
YFL039C	YJL005W	0.5
YDR211W	YJL005W	0.9
==========

* QUANTIZATION
    A QUANTIZATION file is a text file containing one line consisting of numerical values separated by tabs. For example:

==========
-1	0	1	2	3
==========

* DSL/XDSL
    The DSL and XDSL file formats are text based storage formats for Bayesian networks defined by the University of Pittsburgh Decision Systems Laboratory's SMILE library, which MEFIT uses for all of its Bayesian inference. For more information, see:

    http://genie.sis.pitt.edu

---------------
Version History
---------------

* 1.2 08-14-07
    Addition of the -c flag and a bug fix in the negative pair loading code, both thanks to Jim Costello at Indiana University!

* 1.1 03-01-07
    Minor refresh to pick up changes in the underlying libraries; no major changes.

* 1.0 09-01-06
    Initial release.
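
--------------------
Quantization Example
--------------------

The binning described for the -b flag and the QUANTIZATION file format can be illustrated with a minimal Python sketch. This is not MEFIT's own code: the function names parse_cutoffs and bin_index are hypothetical, and the treatment of a score that exactly equals a cutoff (here it goes to the higher bin) is an assumption, since the description above does not say which side such a value falls on.

```python
from bisect import bisect_right


def parse_cutoffs(line):
    """Parse one QUANTIZATION line: numeric cutoffs separated by tabs."""
    return [float(field) for field in line.strip().split("\t")]


def bin_index(value, cutoffs):
    """Map a continuous score to a bin number in 0..len(cutoffs)-1.

    The final cutoff is ignored, so the last bin is open-ended.
    Assumption: a score exactly equal to a cutoff goes to the higher bin.
    """
    return bisect_right(cutoffs[:-1], value)


# The default cutoffs, written as they would appear in a QUANTIZATION file.
cutoffs = parse_cutoffs("-1\t0\t1\t2\t3\n")

for score in (-2.5, -0.5, 0.5, 1.5, 7.0):
    print(score, bin_index(score, cutoffs))
```

With the default cutoffs, the five sample scores land in bins 0 through 4 respectively, giving five bins for five cutoffs, with everything above 2 sharing the last bin.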