GXNA Quick Start

The following is an informal guide to the software; read the paper for the theoretical framework.

INPUT. GXNA requires the following inputs: an expression and a phenotype file, an interaction graph, and an annotation file for the microarray you are using. These are all normal text files using simple formats. You need to generate the expression and phenotype files, since these depend on your data. We provide the interaction graph and annotation files for several arrays, though you can also make your own.

We recommend the extensions .exp, .phe, .gra and .ann for the four types of file. They are described in detail below; the download package contains examples for each of them. For most microarray packages (or Excel), it should be easy to create the files you need using cut-and-paste or export features. If you are using the R package Bioconductor, read these instructions.

Expression data. Each line of the file corresponds to a probe on the microarray. The first word of each line is the probe ID, followed by numbers giving expression levels for each array, separated by spaces. The usual preprocessing steps (normalization etc.) should be done before running GXNA; the better the quality of the data, the better the results.

In general we recommend that the expression values be on a log scale. Most packages will do this for you (in R/Bioconductor, these are the so-called M-values), but if they do not, make sure you take logarithms (base 2) before feeding the data to GXNA. This way differences in means correspond to gene expression fold changes. The algorithm works best if the data is approximately normal, and the log transform usually helps in this regard.

If there are multiple probes for a gene, their expression values are averaged. Missing values are not yet supported; converting all of them to 0 is coarse but works. We hope to fix this in the near future.
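
As an illustration, here is a minimal Python sketch that produces lines in this layout from raw intensities. The function name format_exp_lines and the probe IDs are hypothetical; the log2 transform and the substitution of 0 for missing values simply follow the recommendations above.

```python
import math

def format_exp_lines(intensities):
    """Return .exp lines: probe ID followed by log2 expression values.

    `intensities` maps probe ID -> list of raw (unlogged) values, one per array.
    Missing values (None) are written as 0, the coarse workaround noted above.
    """
    lines = []
    for probe_id, values in intensities.items():
        fields = [probe_id]
        for v in values:
            if v is None:
                fields.append("0")
            else:
                fields.append(f"{math.log2(v):.4f}")
        lines.append(" ".join(fields))
    return lines

# Hypothetical probe IDs and intensities, for illustration only.
example = {"probe_A": [8.0, 16.0, None], "probe_B": [2.0, 4.0, 1.0]}
exp_lines = format_exp_lines(example)
```

Writing "\n".join(exp_lines) to a file such as sim.exp would then give an expression file in the format described here.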

Phenotype data. This is simply a one-line file of space-separated phenotypes. Each phenotype can be any string, for example 0 for cases and 1 for controls, or B, CD4 and CD8 for different cell types.

The number of phenotypes determines the comparison being performed. If there are only two phenotypes, genes are scored using T-statistics with unequal variances. If there are more than two phenotypes, F-statistics (ANOVA) are used.
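
For the two-phenotype case, the following Python sketch shows a t-statistic with unequal variances (Welch's form) for a single gene. It assumes that the i-th phenotype in the .phe line belongs to the i-th expression column and that the score is the first group minus the second; both are assumptions for illustration, and the helper names are hypothetical rather than part of GXNA.

```python
from statistics import mean, variance

def split_by_phenotype(values, phenotypes):
    """Group expression values by phenotype label (assumed to be in column order)."""
    groups = {}
    for v, p in zip(values, phenotypes):
        groups.setdefault(p, []).append(v)
    return groups

def welch_t(x, y):
    """Two-sample t-statistic that does not assume equal variances."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return (mean(x) - mean(y)) / se2 ** 0.5

phenotypes = "0 0 0 1 1 1".split()           # contents of a .phe file
values = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]      # made-up log2 values for one gene
groups = split_by_phenotype(values, phenotypes)
t = welch_t(groups["0"], groups["1"])
```

With more than two distinct labels in `groups`, an F-statistic would be used instead, as described above.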

The interaction graph. We provide interaction graphs for human and mouse, based on information from the databases Entrez Gene and KEGG. Each line in the graph file defines an edge; the first two words are the Entrez IDs for the interacting genes. These are optionally followed by the interaction type and source (e.g. which KEGG pathway it comes from); currently these are only used when drawing the output graphs.

By default, GXNA will use the human.gra interaction graph; to use the mouse graph instead, pass the command line option -edgeFile mouse.gra.
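
To make the graph file layout concrete, here is a short Python sketch that reads such lines into an adjacency map. Treating edges as undirected and ignoring the optional type and source columns are simplifying assumptions; the Entrez IDs and the pathway label in the sample are made up.

```python
def parse_gra(lines):
    """Build an adjacency map (gene ID -> set of neighbour IDs) from .gra lines."""
    adjacency = {}
    for line in lines:
        words = line.split()
        if len(words) < 2:
            continue  # skip blank or malformed lines
        a, b = words[0], words[1]
        # Anything after the first two words (type, source) is ignored here.
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

# Made-up edges: the first carries an interaction type and source, the second does not.
sample = ["101 202 B+ pathwayX", "202 303"]
adjacency = parse_gra(sample)
```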

The array annotation file. Since the expression values are indexed by probes, while the interaction network is indexed by gene Entrez IDs, GXNA needs a way to map one to the other. This is done by the array annotation file; each line contains a probe ID, followed by an Entrez ID (or NA if not available). Most manufacturers provide this information; so does Bioconductor.
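
The probe-to-gene step can be sketched in Python as follows. This is only an illustration of the mapping and of the averaging of multiple probes mentioned earlier; dropping probes annotated as NA is an assumption, and the function name and IDs are hypothetical.

```python
from collections import defaultdict

def average_by_gene(expression, probe_to_entrez):
    """Average probe-level rows into one row per Entrez ID.

    `expression` maps probe ID -> list of values (one per array);
    `probe_to_entrez` maps probe ID -> Entrez ID or "NA".
    """
    rows_by_gene = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_entrez.get(probe, "NA")
        if gene == "NA":
            continue  # assumption: probes without an Entrez ID are dropped
        rows_by_gene[gene].append(values)
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }

# Made-up probes: p1 and p2 map to the same gene, p3 has no annotation.
expression = {"p1": [1.0, 3.0], "p2": [3.0, 5.0], "p3": [9.0, 9.0]}
probe_to_entrez = {"p1": "11", "p2": "11", "p3": "NA"}
gene_expr = average_by_gene(expression, probe_to_entrez)
```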

GXNA release 2.0 comes with the following annotation files:

hgu133.ann     Affy Human 133 (works also for 133A and 133B)
hgu95av2.ann   Affy Human U95 (older Affy array)
hu6800.ann     Affy Human 6800 (really old Affy array)
whg.ann        Agilent Whole Human Genome
wmg.ann        Agilent Whole Mouse Genome
human1av2.ann  Agilent Human 1A Version 2 (older Agilent array)

We have done a fair amount of testing, but let us know if you encounter problems using them.

RUNNING GXNA. For starters, let us look at a standard, single-gene analysis:

 ./gxna -name sim -mapFile human1av2.ann

GXNA is currently command-line based. The -name option tells it to use the sim.exp and sim.phe expression and phenotype files, and the -mapFile option tells it to use the Agilent Human 1A V2 array annotation file. These are the only two options that are always required.

The program performs the computations and generates several output files; the most user-friendly is the HTML one. Point your browser to the file

 sim_000.html

The browser window is split in two. In the left panel, you have a list of gene names and Entrez IDs, ranked in order of statistical significance according to their adjusted p-value. The gene score depends on the comparison being performed; in this case, it is the t-statistic. You can click on each of the top results to get more information in the right panel (see below).

NETWORK ANALYSIS. Instead of single genes, we can do an adapted search for differentially expressed networks. The command is:

 ./gxna -name sim -mapFile human1av2.ann -algoType 1 -version 001

The -version option is used to keep different runs separate; the output files are indexed by name and version. Point the browser to the file

 sim_001.html

(or simply refresh the previous browser window). Now the left panel contains networks, sorted again by their significance. Each network is indexed by the name and Entrez ID of its root (note that the root need not be the most important gene in the network). They all have size 15, which is the default size that GXNA searches for. For the top networks, click on their rank to get more information in the right panel. This displays, for each gene in the network, its name and ID, the number of probes that map to it, its fold change, standard deviation (across all samples), and score (in this case, the t-statistic).

You will notice that there are gaps in the ranks of the top networks. This is because some of these have many genes in common, and by default GXNA does not display networks that are more than 75% identical to higher scoring networks. This threshold can be adjusted with the -maxOverlap option.
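
The effect of this filter can be sketched in Python. How GXNA actually measures "identical" is not specified here, so the sketch assumes it is the fraction of a network's genes shared with an already displayed, higher ranked network; the function name and gene IDs are hypothetical.

```python
def filter_overlapping(networks, max_overlap=0.75):
    """Return the positions of networks that survive the overlap filter.

    `networks` is a list of non-empty gene ID collections, best score first.
    A network is skipped if more than `max_overlap` of its genes already
    appear in some network that was kept earlier.
    """
    kept = []
    for position, genes in enumerate(networks):
        genes = set(genes)
        redundant = any(
            len(genes & earlier) / len(genes) > max_overlap
            for _, earlier in kept
        )
        if not redundant:
            kept.append((position, genes))
    return [position for position, _ in kept]

# Made-up networks of size 5, ordered by decreasing score.
networks = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},   # shares 4 of 5 genes (80%) with the first: skipped
    {1, 2, 7, 8, 9},   # shares 2 of 5 genes (40%): kept
]
kept_positions = filter_overlapping(networks)
```

The skipped middle entry is what produces a gap in the displayed ranks.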

GRAPHICS. GXNA also outputs the most significant networks (by default, the top 25) in DOT format, one file for each graph. You can use the free Graphviz software (pre-installed on many Unix platforms; you will probably need to install it yourself on Windows) to visualize the networks from those files. To see if Graphviz is installed, try

  dotty sim_001_0.dot

This should display a drawing of the network. Every node contains the gene name and its fold change.

If Graphviz is installed, GXNA can produce the graph drawings automatically. Try

 ./gxna -name sim -mapFile human1av2.ann -algoType 1 -runDOT 1 -version 002

Now if you look at sim_002.html in your browser, the right panel displays a drawing of each network, in SVG format. Edges are marked according to the interaction type and direction. The interaction types come from KEGG, as follows:

E enzymatic
T transcription
    T+ activation
    T- inhibition
B protein to protein binding
    Bc  compound
    B+  activation
    B-  inhibition
    Bi  indirect effect
    Bs  state change
p+ phosphorylation
p- dephosphorylation
m methylation
u ubiquitination
g glycosylation
none: missing type information, probably protein-to-protein interaction

To clean up your directory, you can now erase all files whose names start with sim_00. To prevent output files from clogging the directory, use the -outputDir option (see below).

MORE DETAILS ON OUTPUT. GXNA writes several output files; in our first example, their names all start with sim_000. In general, they will start with name_version, where the name is the one you provide, and the default version is 000 (it can be changed with -version). The default output directory is the current directory, but we recommend you make a directory just for output (called e.g. results) to keep things tidy.

The results file. This is named name_version.res, and has one line per network. By default, every single gene in the graph is considered as a root, and a network is generated for each root. The columns are rank, root name and ID; followed by the IDs of all genes in the network; followed by network score, raw and adjusted p-values.
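
If you want to post-process the results file, a line can be split along the column order given above. The Python sketch below assumes the fields are separated by whitespace and that the root name contains no spaces; neither is guaranteed by this guide, so check against a real .res file first. The sample line is made up.

```python
def parse_res_line(line):
    """Split one .res line into named fields (whitespace-separated, assumed)."""
    f = line.split()
    return {
        "rank": int(f[0]),
        "root_name": f[1],
        "root_id": f[2],
        "gene_ids": f[3:-3],      # variable number of network members
        "score": float(f[-3]),
        "raw_p": float(f[-2]),
        "adj_p": float(f[-1]),
    }

# Made-up line: rank, root name, root ID, three member IDs, score, raw p, adjusted p.
record = parse_res_line("0 GENEX 101 101 202 303 4.25 0.01 0.05")
```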

The html files. The main one is name_version.html; it points to the HTML frame name_version_frame1.html. The format of the two panels is described above. The right panel displays graph images or text descriptions, depending on the value of the -runDOT argument.

The network files. For each of the top networks (by default, the top 25), GXNA creates files name_version_n.txt and name_version_n.dot, where n is the rank of the network. The text file contains information for each gene in the network, as described above; the dot file describes its graph structure in DOT format. If -runDOT is used, GXNA uses the neato program in the Graphviz package to also generate SVG images of each network.

The scores. When two phenotypes are compared, the graph DOT files and SVG drawings show gene fold changes. When multiple phenotypes are compared, they show gene F-statistics (converted to normal z-scores), since pairwise fold change is less relevant. The TXT files always show gene fold change (for multiple phenotypes, this is relative to the first two phenotypes in the .phe file) AND gene score, which is T (raw) OR F (converted to z), depending on the number of phenotypes.

The argument file. Names and values for all command line arguments are written to name_version.arg; this way you can keep track of different runs.

COMMAND LINE ARGUMENTS. Here is a list of the most useful arguments. They divide roughly in input, algorithm, output, and advanced parameters. Each argument must be followed by its value, either a string or a number. For booleans, use 1 for true and 0 for false.
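
When GXNA is driven from a script, the "argument followed by value" convention is easy to generate. The Python helper below is a hypothetical convenience, not part of GXNA; it only uses option names that appear in this guide and converts booleans to 1 or 0 as required.

```python
def gxna_command(name, map_file, **options):
    """Build an argument list for ./gxna; every option is followed by its value."""
    cmd = ["./gxna", "-name", name, "-mapFile", map_file]
    for key, value in options.items():
        if isinstance(value, bool):
            value = int(value)  # booleans are passed as 1 (true) or 0 (false)
        cmd += [f"-{key}", str(value)]
    return cmd

# Same run as the Graphviz example earlier in this guide.
cmd = gxna_command("sim", "human1av2.ann", algoType=1, runDOT=True, version="002")
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the directory that contains the gxna executable.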

INPUT

-name specifies the name for input and output files e.g. -name cells will tell the program to read phenotypes from cells.phe and expression data from cells.exp.

-version (default: 000) specifies the version tag used in output file names; this way you can run GXNA several times with different arguments without overwriting the output files.

-mapFile specifies the name for the annotation file that maps probe IDs to Entrez IDs. We chose not to have this set by -name, since the file is specific to the microarray platform, not to the experiment.

ALGORITHM

-algoType determines the type of analysis to be performed. The default value of 0 does single-gene analysis; a value of 1 does an adapted search for gene clusters.

-depth controls the size of the clusters in the adapted search. By default, clusters of 15 genes are sought. We recommend trying several sizes e.g. 5, 15 and 25.

-flexSize is a boolean flag; if set to 1 (true), it allows for clusters of variable size. By default, it is set to 0 (false), meaning all clusters must be of the size fixed by -depth (unless there are not enough genes in the component of the graph being parsed). It currently works only for two phenotypes.

-nPerms sets the number of permutations used to compute p-values. The default is 100, which is fast but not very precise. We suggest using 100 for exploratory analyses, and 1000 or 10000 for getting definitive estimates.
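
To see why the number of permutations limits precision, consider a generic permutation test. The Python sketch below is not GXNA's scoring procedure (which uses T- or F-statistics and adjusted p-values); it uses a plain difference of means and a hypothetical function name, purely to show that the smallest attainable p-value is about 1 / nPerms.

```python
import random
from statistics import mean

def permutation_p_value(x, y, n_perms=100, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # randomly reassign samples to the two groups
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(mean(px) - mean(py)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perms + 1)

p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0], n_perms=100)
```

With 100 permutations no p-value can fall below 1/101, which is why larger values such as 1000 or 10000 are suggested for definitive estimates.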

OUTPUT

-runDOT if true, GXNA creates SVG graph drawings (make sure Graphviz is installed) and the HTML output links to those; if false, the HTML output links to the TXT files

-outputDir sets the directory for the output files; the default is . (the current directory)

-graphCount determines how many top networks get TXT, DOT and SVG descriptions; the default is 25

-maxOverlap determines which networks are shown in the HTML output and get TXT, DOT and SVG files; the default value of 0.75 means that once a network is included, further networks that are more than 75% identical are skipped. Set to 1 if you want the HTML to include everything; this will make sure you will not lose information, but is likely to produce many redundant networks

-maxRows determines how many total networks are in the HTML output table; the default is 250

ADVANCED

-iterations, -kMet. These parameters control the randomized search that follows the greedy search. By default -iterations is set to 0, meaning no randomized search. You can set it to 100 or 1000 to find networks with higher scores at the cost of longer runtimes. -kMet can be used to fine-tune the randomized search, though its default value of 1.0 should be adequate.

-shrinkDF, -shrinkVar. These parameters allow for the use of moderated T- and F-statistics (Smyth 2004). We are working on having them computed automatically, but for now you have to supply them yourself: shrinkDF is the number of degrees of freedom that determines to what extent gene variances are shrunk towards the common value shrinkVar. The default value shrinkDF = 0 means no shrinkage.
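
For reference, the variance shrinkage in Smyth (2004) is a weighted average of the gene's own variance and the common prior variance, with the degrees of freedom as weights. The Python sketch below states that formula; whether GXNA applies it in exactly this form is an assumption, and the function name is hypothetical.

```python
def moderated_variance(sample_var, df, shrink_var, shrink_df):
    """Shrink a gene's sample variance towards a common value.

    sample_var, df       : the gene's own variance and its degrees of freedom
    shrink_var, shrink_df: the common variance and its prior degrees of freedom
    """
    return (shrink_df * shrink_var + df * sample_var) / (shrink_df + df)

no_shrinkage = moderated_variance(4.0, df=6, shrink_var=1.0, shrink_df=0)
some_shrinkage = moderated_variance(4.0, df=6, shrink_var=1.0, shrink_df=2)
```

With shrink_df = 0 the gene variance is returned unchanged, matching the default described above; larger values pull it further towards shrink_var.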

CHANGING DEFAULT ARGUMENTS. Using command line options can get tedious, for example if you always use the same array and have to type -mapFile each time. To address this, each time GXNA runs it reads a file named defaults.txt before processing command line arguments. Each line in this file is the name of an argument, followed by a new default value. For example, you could try:

runDOT 1
outputDir results
algoType 1
depth 20

With these lines in defaults.txt, GXNA always searches for graphs of size 20, writes the results to the "results" directory, and runs Graphviz automatically. You can still use command line options to override these values.
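
The precedence between the two sources of settings can be pictured with a short Python sketch. It only mirrors the behaviour described here (defaults.txt is read first, command line values win); the function names are hypothetical and this is not GXNA's own parsing code.

```python
def load_defaults(lines):
    """Read 'argument value' pairs, one per line, as in defaults.txt."""
    defaults = {}
    for line in lines:
        words = line.split()
        if len(words) >= 2:
            defaults[words[0]] = words[1]
    return defaults

def effective_args(defaults, command_line):
    """Command line values override the defaults read from the file."""
    merged = dict(defaults)
    merged.update(command_line)
    return merged

defaults = load_defaults(["runDOT 1", "outputDir results", "algoType 1", "depth 20"])
# Equivalent to adding "-depth 5" on the command line.
args = effective_args(defaults, {"depth": "5"})
```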

QUESTIONS? FEEDBACK? Email serban at stat dot stanford dot edu.

Last update April 2, 2008.