All Downloads are FREE. Search and download functionalities are using the official Maven repository.

resources.javahelp.manual.genetics.genetics_old.html Maven / Gradle / Ivy

There is a newer version: 7.6.6
Show newest version


    Genetic Simulation
    



 Genetic Simulation -Version 1.0-

By: Richard Scheines and Joe Ramsey
Table of Conents

1. The General Model

2. Specifying the Graph

3. Specifying the Parametric Model

4. Specifying the Instantiated Model

5. Initializing the Cells

6. Running a Simulation

7. Aggregation and Measurement Error

8. References

 
1. The General Model
1.1 An Introduction
Consider a discrete time series involving N individuals, each with M variables V1 … Vm. For example, Figure 1 represents a time series for three variables, each measured at 5 times.
   
Figure 1: Causal Graph for 5 Times    
If we assume that the process modelled is stable over time, then we can represent the causal structure of the series with a Repeating Graph that includes the smallest fragment of the series that repeats. The number of temporal slices in the repeating graph is the longest lag of direct influence plus one. For example, the repeating graph in Figure 3*, which represents the series in Figure 1, needs three temporal slices to represent a repeating sequence, because V2 has a direct effect on V3 with a temporal lag of two.

* We connect all pairs of variables at the beginning of the repeating sequence with a double-headed arrow to represent an unconstrained causal connection, so that d-separation applied to this graph does not entail that these variables are independent, contrary to the fact that later versions of the same variable are causally connected in the repeating graph.

   
Figure 2: Repeating Graph    
In order to simulate data from this class of models, we need to express each variable (or its probability distribution) at an arbitrary "current time" as a function of its direct causes. To represent the set of direct causes for each variable, we construct an update graph. For example, the update graph in Figure 5 expresses the causal information needed to simulate the time series represented in Figure 1 and Figure 3. The update graph consists of the M variables repeated in temporal slices from lag = mlag (most remote direct influence) to lag=0 (current time), but includes only edges that are into variables at lag = 0, that is, it includes only edges that represent direct influences into the current time.
   
Figure 3: Update Graph    
The time lag i in an update graph is indexed by ":Li". Thus, variable V1 at the current time (lag of 0) in an update graph is V1:L0, and the same variable two time slices in the past V1:L2. The constant mlag is the maximum lag of direct influence.
To simulate data in Tetrad 4, you must specify an update graph and then interpret it as a parametric model. There are 4 choices of parametric models, where in each either i) the value of each variable at the current time is a function of its causal parents and an error term, or ii) the probability distribution of the variable is a function of its causal parents. Given an instantiation of the chosen parametric model, and the assumption that the causal relations remain constant over time for each individual, values for N individuals (cells) over T times can be simulated to produce a data cube that is:
N (individuals) x M (variables) x T (times).
Although the cube has three dimensions, we will store it in the standard two, by repeating M columns for each time slice, as so:
Figure 4: Data Cube
The simulation to produce these data is run in two stages: an "initialization" phase and an "update" phase. The initialization phase must assign values for each individual for at least as many times as the maximum lag in the time series model, i.e, from time t = 1 until time t = mlag. After time � mlag, values can be assigned to variables for a given individual using the given instantiation of the parametric model that interprets the update graph.
Data collection regimes for protein expression involve tissues that contain thousands of cells, not all of which behave strictly identically and not all of which can be measured in isolation.
Since current technology cannot perfectly measure the levels, even relatively, of gene (or mRNA) expression, or levels of protein synthesis in cells, we also allow the user to specify a measurement model that aggregates cells by dish and models measurement error.
In what follows, we explain how to specify the graph, choose a parametric model to interpret the graph, instantiate the parameters of the model, pick an initialization routine, and finally specify the measurement model.
 
1.2 Starting the Program
The overall simulation specification begins by loading Tetrad-4, which is a Java application that can be downloaded from: http://www.phil.cmu.edu/tetrad/. The download will write a .jar file to your disk. 1) Make sure version 1.3 or higher of the Java SDK (or JRE) is installed. If it is not - go to http://java.sun.com/j2se/1.3/jre/index.html to download it. 2) On a Windows machine, double clicking the jar file should start the program. If it doesn't, you need to modify the File Type mapping for the .jar extension so that the "open" action is "javaw -jar %1". If you don’t know how to do this, either find someone who does, or take the command line option, which is: 3) In any case, the jar can be run on all machines (Linux included) by typing "java -jar <path-to-filename.jar>" at the command line.
TETRAD 4 on Start-up
Figure 5: Tetrad 4 on Start-up
When the program opens, it will display the empty workbench we show in Figure 8. In order to generate data, you need to specify a graph, a parametric model (PM), an instantiation of this model (IM), and connect them all to a Data modelNode. These objects can be deposited on the workbench by clicking on their icon in the left and clicking where you want them located on the workbench, and then by clicking on the "Flow-Charter" icon on the left and then dragging from one modelNode to another to connect them. The skeleton for a simulation looks like Figure 9.
Simulation Session
Figure 6: Simulation Session
To fully specify each piece of the simulation, you need to double-click on each modelNode, beginning with the graph and moving down to the PM and then to the IM. We cover each in turn.
 
2. Specifying the Graph
After double-clicking on the graph modelNode - you will be given a choice (Figure 11) of whether you want to specify a general graph or an update graph, with the default an update graph.
Specifying a Graph
Figure 7
The Regular graph allows you to specify a Causal Graph interpretable as a Bayes Net or Structural Equation Model. An update graph allows you to specify a specialized structure for genetic simulations. Choose Update graph.
You will then be prompted to specify the graph manually or randomly, with randomly the default.
 
2.1 Manual Graph Specification

1) User Specifies M, the number of variables
(range: 1 to 500, default = 5)

2) User Specifies mlag
(range: 1 to 5, default = 1)

3) Default graph is drawn for user in graph drawing window with arrow from each Vi:L1 to Vi:L0, for mlag = 2 and M = 5

Figure 8

4) User completes graph manually, that is, they add edges from variables at lag > 0 into variables at lag = 0. No other edges are allowed in this representation of the series.

 
2.2 Random Graph Specification

1) User Specifies M, the number of variables in each individual
(range: 1 to 500, default = 5)

2) User Specifies mlag
(range: 1 to 5, default = 1)

3) User chooses and sets the value of exactly one of:
    a) constant indegree ci (range: 0 to M*mlag, default = 2)
    b) max indegree mxi (range: 0 to M*mlag, default = 2)
    c) mean indegree mni (range: 0 to M, default = 2)

a) constant indegree - choose parents(Vi) by putting a uniform distribution over the possible parents of Vi (that is, all variables earlier in time) and drawing without replacement until |parents(Vi)| = ci

b) max indegree - for each variable in the set of possible parents of Vi ,include it in parents(Vi) if a random draw from a uniform [0,1] is greater than cutoff = 1/|possible parents of V-i|, until either the possible parents of Vi are exhaused, or |parents(Vi)| = mxi whichever is first.

c) mean indegree - for each variable Vi, decide for each variable in the set of possible parents of Vi to include it in parents(Vi) if a random draw from a uniform [0,1] is greater than cutoff = 1/|possible parents of V-i|, until the possible parents of Vi are exhaused.

 
3. Specifying the Parametric Model
You must now interpret the graph as a parametric model. To begin this, double-click on the PM modelNode in the session editor (Figure 9). When you are finished, just close the PM editor window.
If you have specified an update graph, then you will be given four choices (Figure 14) of parametric model families, each of which we describe below.
Figure 9
These families are not exclusive, but each has advantages.
 
3.1 Glass Updating
In both Glass and General Updating parametric models, the current value of each variable Vi:L0 is set by a function with the following general form.
Update Function: Vi:L0 = Vi:L1 + rate[-Vi:L1 + Fi(parents(Vi:L0)/Vi:L1 )] + ei
where: 0 < rate £ 1, and ei is an "error" term drawn from a given probability distribution (discussed below), all variables are continuous, and Fi is a function specified by the user. The difference between Glass and General updating models involves only the form of the functions Fi.
 
3.1.1 Preliminary Binary Projection
Even though the variables in this parametric model class are continuous, Glass functions take boolean valued inputs, so some pre-processing is necessary. The inputs to the function Fi are the members of the set P = {parents(Vi:L0)/Vi:L1}, that is, the parents of Vi:L0 except for Vi:L1, which is already in the updating function. The idea behind Glass functions is to simplify the input to whether a given gene is expressing at “high” or “low” levels. We map each Vp Œ P to 0 (low, that is, below its average expression level of 0) or 1 (high, that is, above 0). We do this with the following binary projection BP:
BP: For each Vp Œ P, let Vp* = 1 if Vp > 0, and 0 otherwise.
Simulated values in for variables Vi will range mostly over the interval [-2.0, 2.0] and will oscillate above and below 0.0. Since raw microarray data is typically given as a ratio of microarray spot brightness to average spot intensity, where this ratio typically ranges from near 0.0 up to about 10.0, with averages around 1.0, it is useful to think of the simulated data as loglinear with respect to raw data—viz., log(x) of each intensity ratio x is recorded in the simulation in place of the intensity ratio x.
 
3.1.2 The Function Table for Fi
Since genetic regulators either inhibit or activate their targets, Edwards and Glass (2000, p.3) set the range of Fi to {-1,1}. To specify a particular Fi, construct a truth-table with 1 column for each Vp Œ P and one for the output of F-i. Thus the truth table will have 2|P| rows. For example, if P = {V3:L1, V5:L2}, then the function table for Fi is:
V3:L1*
V5:L2*
F
0
0
0
1
1
0
1
1
Filling in the Fi column in this function table specifies the particular instantiation of Fi used in the update function for variable Vi, thus this step is left to the Instantiated Model specification below.
By picking the Glass Updating parametric family, you are committing yourself to the class of models parametrized by the update function above and Glass functions for the Fi.
 
3.2 General Updating
Again, in both Glass and General Updating parametric models, the current value of each variable Vi:0 is set by a function with the following general form.

Update Function: Vi:L0 = Vi:L1 + rate[-Vi:L1 + Fi(parents(Vi:L0)/Vi:L1 )] + ei

where: 0 < rate £ 1, and ei is an "error" term drawn from a given probability distribution (discussed below), all variables are continuous, and Fi is a function specified by the user. The difference between Glass and General updating models involves only the form of the functions Fi. In General Updating parametric models, the user is free to specify any function for the Fi, not just Glass functions. Here we will use Tianjiao's GAM specifier to allow the user to specify, for each i, the function Fi.
 
3.3 GRN GAM
In this parametric family, the variables are continuous and the update function is slightly more general than the ones used in 3.1 and 3.2.
Update Function: Vi:L0 = Gi[parents(Vi:L0)] + ei
GAM stands for general additive model, so the only constraint on this parametric model is that the G-i are additive functions. The particular instantiaions of Gi for each i are fixed when you specify the instantiated model (IM). Here we will use Tianjiao's GAM specifier to allow the user to specify, for each i, the function Gi.
 
3.4 BN SEM
In this parametric family, the variables are discrete and the causal system is just interpreted as a standard discrete Bayes Network, that is, the update function is just a conditional probability table.
Update Function: P(Vi:L0) = Gi[parents(Vi:L0)]
At the parametric level, you need to specify the number of categories for each variable, as well as the value of each category. This is identical to the Bayes Net parametric model in general Tetrad 4 models.
 
4. Specifying the Instantiated Model
Once the parametric model has been specified, you need only specfiy values for the parameters in the corresponding instantiated model (IM). To do this, double-click on the IM modelNode in the session workbench. Close the IM editor to finish. Since the parameters depend on the PM chosen, we cover each of the above classes in turn.
 
4.1 Glass Updating
The free parameters of a Glass Updating model are:
  • the rate
  • the distributions over the error terms ei
  • the Boolean functions Fi
The error distributions are all assumed to be pairwise indedpendent, and normal with mean 0 and standard deviation SD(ei). The rate and error distributions default as follows.
 
Parameter Defaults

Rate = .1
ei ~ N(0,.05)

The range of Fi is {-1,1}. The number of possible functions for each Fi is , where P is the set of parents of Vi. For example, if P = {V3:L1, V5:L2}, then the uninstantiated function table for Fi is:
V3:L1*
V5:L2*
F
0
0
0
1
1
0
1
1
which can be filled out in 16 different ways. If P contains 5 parents, then there are 32 rows in Fi and there are 232 different ways you can instantiate Fi. Thus, we give you a choice of filling in all Fi randomly or manually, with randomly the default.
 
Randomly:
  • Draw Fi from: (p(Fi = 1) = .5, and p(Fi = -1) = .5, but
  • Ensure all inputs to Fi are non-trivial, that is, for each input Pk, insist that at least one pair of rows in the function table exist that are i) identical on all inputs besides Pk, ii) different on Pk, and different on the output Fi.
Manually: Allow user to assign the value of Fi in each row manually.
 
4.2 General Updating Models
Same as Glass Updating, but use Tianjiao's GAM interface for specifying the functions Gi.
 
4.3 GRN GAM
Here we will use Tianjiao's GAM specifier to allow the user to specify, for each i, the function Gi, and specify the error distributions as above.
 
4.4 BN SEM
Same as Tetrad 4 Bayes Net instantiated model specifier
 
5. Initializing the Cells
Given the full parametric model and its instantiation, and assuming that the same model is used for each individual cells, then values for N individuals over T times can be simulated in two stages: an "initialization" phase and an "update" phase. The initialization phase must assign values for all the variables (genes) in each individual for time step 1. After this, the update function specified above can be used to assign values to variables for a given individual, with the caveat that if mlag > 1, then not all the “parents” of a variable in time 2 will exist. In that case the update function will simply use the latest value of that variable that does exist as input.
Biologically, cells from the same tissue can be starved of nutrients, or manipulated in some other way and all brought into roughly the same state, that is, synchronized. They can then be shocked, e.g., exposed to nutrients, and let run.
Further, many genes in a cell are considered “housekeeping genes,” which express protein at some basal rate stably over time. In this version of the simulator we allow the user to choose among two methods for initializing values: synchronized or random.
 
5.1 Synchronized Initialization
  1. For each gene Vi in cell 1
    a. if Vi is a housekeeping gene, that is, its only parent in the update graph is itself, then let Vi:1 = 0
    b. else draw Vi from a standard normal distribution, i.e, N(0,1).

  2. Duplicate cell 1 N times, where N is the total number of individual cells.
5.2 Random Initialization
  1. For each gene Vi in each cell,
    a. if Vi is a housekeeping gene, that is, its only parent in the update graph is itself, then let Vi:1 = 0,
    b. else draw Vi:1 from a standard normal distribution, i.e, N(0,1)
 
6. Running a Simulation
To initiate a data generation process, double click on the red-die on the arrow from an IM modelNode to a Data modelNode (Figure 16). The program will only produce data if it has all the info it needs, otherwise it will prompt you.
Starting a Simulation
Figure 10: Starting a Simulation Run
The simulation will initialize the cells as specified above, and then generate as many time steps as desired. Not all time steps generated need to be stored, however. The updating process might happen at quite a different pace than data recording, and for inference this might matter. The simulator thus requires you to specify the first time step you want to store, and how often you want to store data.

Simulation Run Parameters:
Thus the parameters you must set, with their defaults in brackets, are as follows.

  1. Number of individual cells = N [100,000]
  2. Total time steps generated = T [4]
  3. First time step stored = TSstart [1]
  4. Interval for storage (step size ) = step [1]
 
7. Aggregation and Measurement Error
7.1 Aggregation by Dish
So far, we have modelled the progression of gene activity over time for individual cells. We cannot currently obtain biological data on this level, however. In microarray experiments, thousands (or millions) of cells are grown in each of several dishes. For example, Figure 18 shows two dishes with a million cells each.
Aggregation by Dish
Figure 11: Aggregation by Dish
Initial Dish to Dish Differences
Although they are intentionally minimized, small variations in nutrient or temperature make dish to dish variations inevitable, even at the beginning of an experiment.
To simulate these differences, we prompt the user for how much variation in expression levels we can expect between dishes (Figure 20). Let that quantity be the standard deviation of expression level difference due to the dish, which we will call sd(DV), with default at 10%.
Dish-to-Dish Variation Parameter
Figure 12: Dish-to-Dish Variation Parameter
For each dish Dj, we draw Dev(Dj) from a normal distribution with mean 100 and standard deviation = sd(DV), that is, N(100, sd(DV)).

We then adjust the initial* expression levels in all genes in all the cells in dish Dj by Dev(Dj).
For each dish Dj
    For each cell Ck in Dj,
        For each gene Gi in Ck at time 1 (Vi:1),
            Let Gi = Dev(Dj) % of Gi
 
Dishes and Time
In the usual studies, although several dishes might begin an experiment in a relatively “synchronized” state, each dish must be “frozen” before it can be processed for measurement, and thus we cannot use the same dish to measure cells at two different time points in the experiment. To measure two time points, we take a sample from one dish at time1 1and a sample from a different dish at time 2, e.g., Figure 22.
We discussed how we initialize the cells in section 5 above.
Time and Dishes
Figure 13:Time and Dishes
If the goal is to compare the expression levels of a particular gene at two different times to see if they substantially differ, then this constraint forces us to compare the level of a sample from one dish against a sample from another. This constraint will NOT be imposed by the simulator, but rather by the data analyst after data is generated and stored.
 
7.2 Measurement Error
After a dish is “frozen” at a time, its RNA is extracted. From this RNA, several samples can be drawn and each “measured” by exposing it to a microarray chip (chips can be re-used) and then digitally converting pixel intensities to gene expression levels (Figure 24).
Measurement Error
Figure 14: Measurement Error
Chips can be reused approximately four times. Samples from the same dish might vary slightly, chips definitely vary, the same chip functions differently from one use to another, and the pixel digitization routine might include error as well. Each “measurement” therefore, has five sources of potential error:

1. dish,
2. sample,
3. chip,
4. re-use of a chip, and
5. pixel digitization

We discussed how we will model the dish variability above. In this version of the simulator, we will NOT model dish to dish variability beyond initialization. To model the other sources of measurement error, we will proceed as follows.
After initializing the cells in an experiment, and adding dish to dish variabiltiy, we will have N cells partitioned into D dishes. After we run the experiment, in which each cell obeys the same causal laws but updates independently of each other, each cell has an expression level at each of t times.
 
After Running the Experiment
Gene Levels Before Measurement Error
Figure 15: Gene Levels Before Measurement Error
After RNA Extraction
Next, we model the RNA extraction process for each dish by aggregating all the cells in a dish, averaging the expression levels for each gene, and recording an average expression level for each gene in each dish at each time (Figure 27):
After RNA Extraction
Figure 16: After RNA Extraction
Next, we model the RNA extraction process for each dish by aggregating all the cells in a dish, averaging the expression levels for each gene, and recording an average expression level for each gene in each dish at each time (Figure 27):
pic of nasty formula pg17
 
Adding Measurement Error

There remain 4 sources of measurement error: sample, chip to chip, chip re-use, and pixel digitization error. In this version of the simulator - we will NOT model chip re-use variability. We treat each of the other as additive normal error with mean 0 and standard deviation = s.
Although we could just as easily treat 3 additive twoCycleErrors as one - for conceptual transparency we keep them separate.
Let Sample to Sample Variability error = es ~ N(0, sS)
Let Chip to Chip error = ec ~ N(0, sc)
Let Pixel digitization error = epd ~ N(0, spd)
Thus for each sample s taken from a dish d and measured, new vavalues of es , ec , and epd are drawn, and the average expression level for each gene Vi at time t is
Ms[Vi:t(d)] = Vi:t(d) + es + ec + epd
If we draw 4 samples from each dish, and don’t re-use any chips, then our final data table would look as follows:
After Measurement Error
Figure 17: After Measurement Error
Measurement Error Parameters
The parameters the user must specify for the measurement error model, with defaults in brackets, are:

1. Dish to Dish variability = sd(DV) [10]
2. Number of samples per dish = S [4]
3. Sample to Sample Variability sS, [.025]
4. Chip to Chip Variability sc, [.1]
5. Pixel Digitization spd, [.025]

Return to Top

 
8. References
Edwards, R., & Glass L. (2000). Combinatorial explosion in model gene networks., Chaos, V. 10, N. 3, September., pp. 1-14.
 
 

 





© 2015 - 2025 Weber Informatics LLC | Privacy Policy