docs.javahelp.manual.genetics.genetics.html Maven / Gradle / Ivy
GeneticSimulator
Simulating Genetic Regulatory Networks
Version
1.2
Richard Scheines and Joseph Ramsey
- Specifying
the Graph
- Specifying
the Parametric Model
- Specifying
the Instantiated Model
- Initializing
the Cells
- Aggregation
and Measurement Error
- Running
a Simulation
To simulate data in Tetrad 4, you must specify a Lag
graph and then interpret it as a parametric model. There are 4 choices of
parametric models,
where in each either i) the value of each variable at the current time is a
function of its causal parents and an error term, or ii) the probability distribution
of the variable is a function of its causal parents.
Understanding the General Model
Consider a discrete time series involving N individuals,
each with M variables G1 through Gm. For example, Figure 11 represents a time
series for three variables, each measured at 5 times.
Understanding Time Series Graphs
If we assume that the process modelled is stable over
time, then we can represent the causal structure of the series with a time
series graph that includes the smallest fragment of the series that repeats.
The number of temporal slices in the time series graph is the longest lag of
direct influence plus one. For example, the time series graph in Figure 33,
which represents the series in Figure 11, needs three temporal slices to represent
a repeating sequence, because G2 has a direct effect on G3 with a temporal lag
of two.
In order to simulate data from this class of models,
we need to express each variable (or its probability distribution) at an arbitrary
"current time" as a function of its direct causes. To represent only the set
of direct causes for each variable, we construct a Lag
graph. For example, the Lag
graph in Figure 55 expresses the causal information needed to simulate the
time series represented in Figure 11 and Figure 33. The Lag
graph consists of the 3 variables repeated in temporal slices from lag =
mlag (most remote direct influence) to lag=0 (current time), but includes only
edges that are into variables at lag = 0, that is, it includes only edges
that represent direct influences into the current time.
Note: We connect all pairs of variables at the
beginning of the repeating sequence with a double-headed arrow to represent
an unconstrained causal connection, so that m-separation applied to this graph
does not entail that these variables are independent, contrary to the fact that
later versions of the same variable are causally connected in the time series
graph
The time lag i in a Lag
graph is indexed by ":Li". Thus, variable G1 at the current time (lag of
0) in an Lag graph is G1:L0, and the same variable two time slices in the past
G1:L2. The constant mlag is the maximum lag of direct influence.
Undertanding the Data Generated
To simulate data in Tetrad 4, you must specify a Lag
graph and then interpret it as a parametric model. There are 4 choices of
parametric models, where in each either i) the value of each variable at the
current time is a function of its causal parents and an error term, or ii) the
probability distribution of the variable is a function of its causal parents.
Given an instantiation of the chosen parametric model, and the assumption that
the causal relations remain constant over time for each individual, values for
N individuals (cells) over T times can be simulated to produce a data cube that
is:
N (individuals) x M (variables) x T (times).
Although the cube has three dimensions, we will store it in the standard two,
by repeating M columns for each time slice, as so:
The simulation to produce these data is run in two stages:
an "initialization"
phase and an "update" phase. The initialization phase must assign values
for each individual for at least as many times as the maximum
lag in the time series model, i.e, from time t = 1 until time t = mlag.
After time � mlag, values can be assigned to variables for a given individual
using the given instantiation of the parametric model that interprets the Lag
graph.
Data collection regimes for protein expression involve tissues that contain
thousands of cells, not all of which behave strictly identically and not all
of which can be measured in isolation.
Since current technology cannot perfectly measure the levels, even relatively,
of gene (or mRNA) expression, or levels of protein synthesis in cells, we also
allow the user to specify a measurement model that aggregates cells by dish
and models measurement error.
Using Tetrad: Setting Up the Program
This section will explain how to specify the graph, choose a parametric model
to interpret the graph, instantiate the parameters of the model, pick an initialization
routine, and finally specify the measurement model.
When the program opens, it will display the empty workbench.
In order to generate data, you need to specify a graph, a parametric
model (PM), an instantiation
of this model (IM), and connect them all to a Data modelNode. These objects can
be deposited on the workbench by clicking on their icon in the left and clicking
where you want them located on the workbench, and then by clicking on the "Flow-Charter"
icon on the left and then dragging from one modelNode to another to connect them.
The skeleton for a simulation looks like Figure 9.
After double-clicking on the graph modelNode - you will be
given a choice (Figure 11) of whether you want to specify a regular graph or
a Lag graph.
You will then be prompted for whether you want to specify
the graph manually or randomly, with manually the default.
2.1 Manual Graph Specification
- User Specifies M, the number of variables
(range: 1 to 500, default = 5)
- User Specifies mlag
(range: 1 to 5, default = 1)
- A default Lag graph is drawn for user in the graph
drawing window with an arrow from each Gi:L1 to Gi:L0, e.g., with defaults:
mlag = 1 and M = 5 (Figure 13)
- User completes the Lag
graph manually, that is, they add edges from variables at lag > 0 into
variables at lag = 0. No other edges are allowed in this representation of
the series.
2.2 Random Graph Specification
If you choose Random Graph Specification, you will be
confronted with the dialogue box in Figure 15, which asks you to set four parameters.
- ) User Specifies M, the number of variables in each
individual
(range: 1 to 500, default = 5)
- User Specifies mlag
(range: 1 to 5, default = 1)
- User chooses and sets the value of exactly one of:
a) constant indegree ci (range: 0 to M*mlag, default = 2)
b) max indegree mxi (range: 0 to M*mlag, default = 2)
c) mean indegree mni (range: 0 to M, default = 2)
a) constant indegree - choose parents(Gi) by putting a uniform distribution
over the possible parents of Gi (that is, all variables earlier in time) and
drawing without replacement until |parents(Gi)| = ci
b) max indegree - for each variable in the set of possible parents of Gi ,include
it in parents(Gi) if a random draw from a uniform [0,1] is greater than cutoff
= 1/|possible parents of Gi|, until either the possible parents of Gi are
exhaused, or |parents(Gi)| = mxi whichever is first.
c) mean indegree - for each variable Gi, decide for each variable in the set
of possible parents of Gi to include it in parents(Gi) if a random draw from
a uniform [0,1] is greater than cutoff = 1/|possible parents of Gi|, until
the possible parents of Gi are exhaused.
- User chooses the approximate percent of unregulated
genes. A gene is "unregulated" if its value at a time does not depend on any
other gene besides itself at a previous time.
(range: 0-100), default = 10)
2.3 Output and Storage
If the graph involves a managable number of vertices
(<30), then the graph editor represents it pictorially, allowing you to edit
the graph by adding or taking away edges. For example, in Figure 17 we show
a Lag graph with five genes and a maxlag = 2 after adding a few other regulating
connections.
Figure 17: Lag Graph with 5 Genes and Mlag=2
If the number of genes is large, e.g. 5,000, then representing the Lag
graph pictorially is a waste. In that case we change to a textual representation
in which each gene is given a row, and the genes that regulate (cause) it are
listed on its row. For example, Figure 19 shows the top 20 or so lines of a
Lag graph
with 500 variables. In this representation, Gene 5 (G005) is regulated by itself
one time step back (G005:1) and by Gene 461 two time steps back (G461:2). This
representation can be saved as a text file to disk and then processed by software
of your choice.
Figure 19: Large Lag Graph Represented Textually
Specifying the Parametric Model
Having specified the Lag
graph, you must now intrepret the graph as a parametric model. Currently,
only one type of parametric model is implemented-- the "Boolean Glass Gene
Parametric Model." When you double-click the PM modelNode at this, point, that
type of parametric model will automatically be chosen for you.
Eventually, four types of parametric models will be implemented, as follows:
- Boolean
Glass Gene Parametric Model
- General
Glass Gene Parametric Model
- General
Additive Model Gene Parametric Model
- Bayes
Net Standard Error Model
These families are not exclusive, but each has advantages.
3.1 Glass Updating
- Implemented.
In both Glass (Edwards & Glass, 2000) and General Updating
parametric models, the value of each gene Gi at the current time (a lag of 0)
Gi:L0 is set by a function with the following general form:
where: is an
"error" term drawn from a given probability distribution (discussed below),
all variables are continuous, and Fi is a function specified by the user. The
difference between Glass and General updating models involves only the form
of the functions Fi.
3.1.1 Preliminary Binary Projection
Even though the variables in this parametric model class are continuous, Glass
functions take boolean valued inputs, so some pre-processing is necessary. The
inputs to the function Fi are the members of the set P = {parents(Gi:L0)/Gi:L1},
that is, the parents of Gi:L0 except for Gi:L1, which is already in the updating
function. The idea behind Glass functions is to simplify the input to whether
a given gene is expressing at "high" or "low" levels. We map each Gp � P to
0 (low, that is, below its average expression level of 0) or 1 (high, that is,
above 0). We do this with the following binary projection BP:
Simulated values in for variables Gi will range mostly over the interval [-2.0,
2.0] and will oscillate above and below 0.0. Since raw microarray data is typically
given as a ratio of microarray spot brightness to average spot intensity, where
this ratio typically ranges from near 0.0 up to about 10.0, with averages around
1.0, it is useful to think of the simulated data as loglinear with respect to
raw data-viz., log(x) of each intensity ratio x is recorded in the simulation
in place of the intensity ratio x.
Note: We thank Stuart Kauffmann for pointing us to the Glass model.
3.1.2 The Function Table for Fi
Since genetic regulators either inhibit or activate their targets, Edwards
and Glass (2000, p.3) set the range of Fi to {-1,1}. To specify a particular
Fi, construct a truth-table with 1 column for each
and one for the output of Fi. Thus the truth table will have 2 to the power
of P rows. For example, if P = {G3:L1, G5:L2}, then the function table for Fi
is:
Filling in the Fi column in this function table specifies
the particular instantiation of Fi used in the update function for variable
Gi, thus this step is left to the Instantiated Model specification below.
By picking the Glass Updating parametric family, you are committing yourself
to the class of models parametrized by the update function above and Glass functions
for the Fi.
3.2 General
Updating - Unimplemented.
Again, in both Glass and General Updating parametric models, the current value
of each variable Gi:0 is set by a function with the following general form.
where: is
an "error" term drawn from a given probability distribution (discussed below),
all variables are continuous, and Fi is a function specified by the user. The
difference between Glass and General updating models involves only the form
of the functions Fi. In General Updating parametric models, the user is free
to specify any function for the Fi, not just Glass functions. Here we will use
Tianjiao's GAM specifier to allow the user to specify, for each i, the function
Fi.
3.3 GRN
GAM - Unimplemented.
In this parametric family, the variables are continuous and the update function
is slightly more general than the ones used in 3.1 and 3.2.
GAM stands for general additive model, so the only constraint on this parametric
model is that the Gi are additive functions. The particular instantiaions of
Fi for each i are fixed when you specify the instantiated model (IM). Here we
will use Tianjiao's GAM specifier to allow the user to specify, for each i,
the function Gi.
3.4 BN SEM
- Unimplemented.
In this parametric family, the variables are discrete and the causal system
is just interpreted as a standard discrete Bayes Network, that is, the update
function is just a conditional probability table. <br>
At the parametric level, you need to specify the number of categories for each
variable, as well as the value of each category. This is identical to the Bayes
Net parametric model in general Tetrad 4 models.
Specifying the Instantiated Model
Once the parametric model has been specified, you need only specfiy values for
the parameters in the corresponding instantiated
model (IM). To do this, double-click on the IM modelNode in the session workbench.
Close the IM editor to finish. Since the parameters depend on the PM chosen,
we cover each of the above classes in turn.
4.1 Glass Updating
Recall that the update
function used in Glass model is:
The free parameters for a Glass model are thus:
- the rate parameters rate1 and rate 2
- the distributions over the error terms Ei
- the Boolean functions F1
The error distributions are all assumed to be pairwise indedpendent,
and normal with mean 0 and standard deviation SD(Ei).
Parameter Defaults
Rate 1 = .1
Rate 2 = .5
Ei ~ N(0,.05)
By the structure of the Glass function, an unregulated
gene (no other genes besides itself affecting it) will have an equilbirum value
of 0. Rate 1 determines how quickly such genes return to equilbrium. If Rate
1 is set = .1, for example (its default), then in each time step a gene with
no regulators will return 1/10th of the way toward 0, whatever its previous
value, before noise. Rate2 governs the influence of other regulators. The Glass
functions Fi output +1 (promoting expression) or -1 (inhibiting expression).
If Rate 2 is set = .5, then in one time step the contribution to a gene is either
+.5 or -.5.
The error terms are meant to simulate random biological error in the process
of gene expression. We give a default of mean 0 and standard deviation = .05.
The range of Fi is {-1,1}. The number of possible functions for each Fi is ,
where P is the set of parents of Gi. For example, if P = {G3:L1, G5:L2}, then
the uninstantiated function table for Fi is:
which can be filled out in 16 different ways. If P contains 5 parents, then
there are 32 rows in Fi and there are 2^32 different ways you can instantiate
Fi. Thus, we give you a choice of filling in all Fi randomly or manually, with
randomly the default.
Randomly:
- Draw Fi from: (p(Fi = 1+) = .5, and p(Fi = -1) = .5,
but
- Ensure all inputs to Fi are non-trivial, that is,
for each input Pk, insist that at least one pair of rows in the function table
exist that are
i) identical on all inputs besides Pk,
ii) different on Pk, and different on the output Fi.
Manually: Allow the user to assign the value of Fi in each row manually.
The window that appears in Tetrad 4 in which you set the parameters for a Boolean
Glass Function IM is shown in Figure 23. The defaults for the rate parameters
appear on the bottom of the screen. The opening view is of the Boolean function
for gene G1, but any gene's function can be selected. In , the function regulating
G1 promotes G1 in all cases except when both G2 one time lag back and G3 one time
lag back are both expressing above their mean, in which case G1 is inhibited.
Boolean functions can be manually edited by clicking in the appropriate Fcn. Value
box and replacing the value with +1 or -1.
The standard deviation of the error distributions can be changed by selecting
the "Errors" tab, and then editing the default of .05 to whatever value you prefer.
5. Initializing the Cells
Given the full parametric model and its instantiation, and assuming that the
same model is used for each individual cell, then values for N individuals over
T times can be simulated in two stages: an "initialization" phase and an "update"
phase. The initialization phase must assign values for all the genes in each
individual for time step 1. After this, the update function specified above
can be used to assign values to variables for a given individual, with the caveat
that if mlag > 1, then not all the "parents" of a variable in time 2 will exist.
In that case the update function will simply use the latest value of that variable
that does exist as input.
Biologically, cells from the same tissue can be starved of nutrients, or manipulated
in some other way and all brought into roughly the same state, that is, synchronized.
They can then be shocked, e.g., exposed to nutrients, and let run.
Further, many genes in a cell are considered "housekeeping genes," which express
protein at some basal rate stably over time. In this version of the simulator
we allow the user to choose among two methods for initializing values: synchronized
or random.
5.1 Synchronized Initialization
- For each gene Gi in cell 1,
a. if Gi is a housekeeping gene, that is,
its only parent in the Lag graph is itself, then let Gi:1 = 0,
b. else draw Gi from a standard normal distribution, i.e, N(0,1)
- Duplicate cell 1 N times, where N is the total number
of individual cells.
5.2 Random Initialization
- For each gene Gi in each cell,
a. if Gi is a housekeeping gene, that is,
its only parent in the Lag graph is itself, then let Gi:1 = 0,
b. else draw Gi:1 from a standard normal distribution, i.e, N(0,1)
Aggregation and Measurement Error
6.1 Aggregation by Dish
So far, we have modelled the progression of gene activity over time for individual
cells. We cannot currently obtain biological data on this level, however. In
microarray experiments, thousands (or millions) of cells are grown in each of
several dishes. For example, Figure 26 shows two dishes with a million cells
each.
A microarray experiment involves at least the following sequence of steps to
begin the experiment:
- Separate single cell culture into dishes
- Grow cells in dishes
- Synchronize Cells
- Expose dishes to treatment
Initial Dish to Dish Differences
Although they are intentionally minimized, small variations in nutrient or temperature
make dish to dish variations inevitable, even at the beginning of an experiment.
To simulate these differences, we need to specify how much variation in expression
levels we can expect between dishes. Let that quantity be the standard deviation
of expression level difference due to the dish, which we will call sd(DV),
with default at 10%.
For each dish Dj, we draw Dev(Dj) from a normal distribution with mean 100 and
standard deviation = sd(DV), that is, N(100, sd(DV)).
We then adjust the initial^3 expression levels in all genes in all the cells
in dish Dj by Dev(Dj):
For each dish Dj
For each cell Ck in Dj,
For each gene Gi in Ck at time 1 (Gi:1),
Let Gi = Dev(Dj) % of Gi
Dishes and Time
In the usual studies, although several dishes might begin an experiment in a
relatively "synchronized" state, each dish must be "frozen" before it can be
processed for measurement, and thus we cannot use the same dish to measure cells
at two different time points in the experiment. To measure two time points,
we take a sample from one dish at time 1 and a sample from a different dish
at time 2, e.g., Figure 29.
If the goal is to compare the expression levels of a particular gene at two
different times to see if they substantially differ, then this constraint forces
us to compare the level of a sample from one dish against a sample from another.
This constraint will NOT be imposed by the simulator, but rather by the data
analyst after data is generated and stored.
6.2 Measurement Error
After a dish is "frozen" at a time, its RNA is extracted. From this RNA, several
samples can be drawn and each "measured" by exposing it to a microarray chip
(chips can be re-used) and then digitally converting pixel intensities to gene
expression levels (Figure 31).
Chips can be reused approximately four times. Samples from the same dish might
vary slightly, chips definitely vary, the same chip functions differently from
one use to another, and the pixel digitization routine might include error as
well. Each "measurement" therefore, has five sources of potential error:
- dish,
- sample,
- chip,
- re-use of a chip, and
- pixel digitization
We discussed how we will model the dish variability above. In this version of
the simulator, we will NOT model dish to dish variability beyond initialization.
To model the other sources of measurement error, we will proceed as follows.
After initializing the cells in an experiment, and adding dish to dish variabiltiy,
we will have N cells partitioned into D dishes. After we run the experiment,
in which each cell obeys the same causal laws but updates independently of each
other, each cell has an expression level at each of t times.
After Running the Experiment
We call this the Raw data from the experiment.
After RNA Extraction
Next, we model the RNA extraction process for each dish by aggregating all the
cells in a dish, averaging the expression levels for each gene, and recording
an average expression level for each gene in each dish at each time (Figure
34):
The average expression level for a gene Gi at time t
in a dish d, is:
Adding Measurement Error There remain 4 sources of measurement error:
sample, chip to chip, chip re-use, and pixel digitization error. In this version
of the simulator - we will NOT model chip re-use variability. We treat each
of the other as additive normal error with mean 0 and standard deviation
Let Sample to Sample Variability error
Let Chip to Chip error
Pixel digitization error
For each sample s taken from a dish d and measured, new values of
are drawn. For each gene in each sample, a new value of
is drawn. Thus the average measured expression level for each gene Gi at time
t is:
If we draw 4 samples from each dish, and don't re-use any chips, then our final
data table would look as follows:
We call this the Measurement Data.
To summarize, the parameters the user must specify for the measurement error
model, with defaults in brackets are:
- Dish
to Dish variability = sd(DV) [10]
- Number of samples per dish = S [4]
- Sample to Sample Variability S,
[.025]
- Chip
to Chip Variability c,
[.1]
- Pixel Digitization pd,
[.025]
Running a Simulation
To initiate a data generation process, double click on the red-die on the arrow
from an IM modelNode to a Data modelNode (). The program will only produce data if it
has all the info it needs, otherwise it will prompt you.
Simulation Run Parameters:
After clicking the red die, a dialogue box comes up showing the parameters you
must specify, along with their default values (Figure 38).
We discussed the measurement model parameters in section 6. For the Simulation
Parameters, the number of dishes defaults to 1 and the number of cells per dish
to 100,000. The number of time steps generated defaults at 4, and the user has
the option of when to start storing data (default at time step 1). The user
can also choose how often to store data, with the default every step.
Finally, because there are often thousands of cells and thousands of genes in
each cell, many simulations will involve millions of gene expression levels
at each time. Call this the raw data. The measured, as opposed to raw data,
involves only a few samples from each dish, each of which contains as many measurements
as there are genes. Since storing the raw data is expensive relative to the
aggregated data with measurement error - we give the user a choice as to whether
to store raw data, with the default = no. If the user chooses to store raw data,
both data sets will be contained in the data nodes on the workbench in Tetrad.
- Bayes Network
- definition
- chip
- A micro-array
- chip to chip variability
- definition
- dish to dish variability
- definition
- edge
- definition
- error term
- definition
- Generally Adititve Model (GAM)
- This model outputs an additive function that is not
alwasys linear.
- Initialization
- Synchonized: all copies of orginial
- Random: Different levels of gene expression
- Instantiated Model (IM)
- definition
- Lag Graph
- definition
- Maximum Lag
- definition
- Measurement Data
- definition
- Parametric Model (PM)
- definition
- Pixel Dish
- definition
- Rate Parameters
- definition
- Raw Data
- definition
- Regular Graph
- definition
- Regulation
- definition
- Sample to Sample variability
- definition
- SEM
- definition
- Boolean Glass Gene Parametric
Model
- General Glass Gene Parametric
Model
- General Additive Model Gene Parametric
Model
- Bayes Net Standard Error Model
- definition
- Time Series Graph
- definition
- Update Function
- Defines how influence is transmitted to gene in current
time, can include variables from past times
-
Edwards, R., & Glass
L. (2000). Combinatorial explosion in model gene networks., Chaos, V. 10, N.
3, September., pp. 1-14.