All Downloads are FREE. Search and download functionalities are using the official Maven repository.

docs.javahelp.manual.genetics.genetics.html Maven / Gradle / Ivy

There is a newer version: 7.6.6
Show newest version






    GeneticSimulator


    




Simulating Genetic Regulatory Networks

Version 1.2
Richard Scheines and Joseph Ramsey

The General Model

  1. Time Series Graphs
  2. Lag Graphs
  3. Simulation Theory

Tetrad 4

  1. Specifying the Graph
  2. Specifying the Parametric Model
  3. Specifying the Instantiated Model
  4. Initializing the Cells
  5. Aggregation and Measurement Error
  6. Running a Simulation

Glossary of Terms
References


The General Model

To simulate data in Tetrad 4, you must specify a Lag graph and then interpret it as a parametric model. There are 4 choices of parametric models, where in each either i) the value of each variable at the current time is a function of its causal parents and an error term, or ii) the probability distribution of the variable is a function of its causal parents.

Understanding the General Model

Consider a discrete time series involving N individuals, each with M variables G1 through Gm. For example, Figure 11 represents a time series for three variables, each measured at 5 times.

 


Understanding Time Series Graphs

If we assume that the process modelled is stable over time, then we can represent the causal structure of the series with a time series graph that includes the smallest fragment of the series that repeats. The number of temporal slices in the time series graph is the longest lag of direct influence plus one. For example, the time series graph in Figure 33, which represents the series in Figure 11, needs three temporal slices to represent a repeating sequence, because G2 has a direct effect on G3 with a temporal lag of two.

 


Understanding Lag Graphs

In order to simulate data from this class of models, we need to express each variable (or its probability distribution) at an arbitrary "current time" as a function of its direct causes. To represent only the set of direct causes for each variable, we construct a Lag graph. For example, the Lag graph in Figure 55 expresses the causal information needed to simulate the time series represented in Figure 11 and Figure 33. The Lag graph consists of the 3 variables repeated in temporal slices from lag = mlag (most remote direct influence) to lag=0 (current time), but includes only edges that are into variables at lag = 0, that is, it includes only edges that represent direct influences into the current time.

 

Note: We connect all pairs of variables at the beginning of the repeating sequence with a double-headed arrow to represent an unconstrained causal connection, so that m-separation applied to this graph does not entail that these variables are independent, contrary to the fact that later versions of the same variable are causally connected in the time series graph

The time lag i in a Lag graph is indexed by ":Li". Thus, variable G1 at the current time (lag of 0) in an Lag graph is G1:L0, and the same variable two time slices in the past G1:L2. The constant mlag is the maximum lag of direct influence.

 


Undertanding the Data Generated

To simulate data in Tetrad 4, you must specify a Lag graph and then interpret it as a parametric model. There are 4 choices of parametric models, where in each either i) the value of each variable at the current time is a function of its causal parents and an error term, or ii) the probability distribution of the variable is a function of its causal parents. Given an instantiation of the chosen parametric model, and the assumption that the causal relations remain constant over time for each individual, values for N individuals (cells) over T times can be simulated to produce a data cube that is:

N (individuals) x M (variables) x T (times).



Although the cube has three dimensions, we will store it in the standard two, by repeating M columns for each time slice, as so:

The simulation to produce these data is run in two stages: an "initialization" phase and an "update" phase. The initialization phase must assign values for each individual for at least as many times as the maximum lag in the time series model, i.e, from time t = 1 until time t = mlag. After time � mlag, values can be assigned to variables for a given individual using the given instantiation of the parametric model that interprets the Lag graph.

Data collection regimes for protein expression involve tissues that contain thousands of cells, not all of which behave strictly identically and not all of which can be measured in isolation.

Since current technology cannot perfectly measure the levels, even relatively, of gene (or mRNA) expression, or levels of protein synthesis in cells, we also allow the user to specify a measurement model that aggregates cells by dish and models measurement error.

 


Using Tetrad: Setting Up the Program


This section will explain how to specify the graph, choose a parametric model to interpret the graph, instantiate the parameters of the model, pick an initialization routine, and finally specify the measurement model.


When the program opens, it will display the empty workbench. In order to generate data, you need to specify a graph, a parametric model (PM), an instantiation of this model (IM), and connect them all to a Data modelNode. These objects can be deposited on the workbench by clicking on their icon in the left and clicking where you want them located on the workbench, and then by clicking on the "Flow-Charter" icon on the left and then dragging from one modelNode to another to connect them. The skeleton for a simulation looks like Figure 9.

After double-clicking on the graph modelNode - you will be given a choice (Figure 11) of whether you want to specify a regular graph or a Lag graph.

 


Specifying The Graph

You will then be prompted for whether you want to specify the graph manually or randomly, with manually the default.

2.1 Manual Graph Specification

  1. User Specifies M, the number of variables
    (range: 1 to 500, default = 5)

  2. User Specifies mlag
    (range: 1 to 5, default = 1)

  3. A default Lag graph is drawn for user in the graph drawing window with an arrow from each Gi:L1 to Gi:L0, e.g., with defaults: mlag = 1 and M = 5 (Figure 13)


  4. User completes the Lag graph manually, that is, they add edges from variables at lag > 0 into variables at lag = 0. No other edges are allowed in this representation of the series.

2.2 Random Graph Specification

If you choose Random Graph Specification, you will be confronted with the dialogue box in Figure 15, which asks you to set four parameters.

  1. ) User Specifies M, the number of variables in each individual
    (range: 1 to 500, default = 5)

  2. User Specifies mlag
    (range: 1 to 5, default = 1)

  3. User chooses and sets the value of exactly one of:

    a) constant indegree ci (range: 0 to M*mlag, default = 2)
    b) max indegree mxi (range: 0 to M*mlag, default = 2)
    c) mean indegree mni (range: 0 to M, default = 2)

    a) constant indegree - choose parents(Gi) by putting a uniform distribution over the possible parents of Gi (that is, all variables earlier in time) and drawing without replacement until |parents(Gi)| = ci
    b) max indegree - for each variable in the set of possible parents of Gi ,include it in parents(Gi) if a random draw from a uniform [0,1] is greater than cutoff = 1/|possible parents of Gi|, until either the possible parents of Gi are exhaused, or |parents(Gi)| = mxi whichever is first.
    c) mean indegree - for each variable Gi, decide for each variable in the set of possible parents of Gi to include it in parents(Gi) if a random draw from a uniform [0,1] is greater than cutoff = 1/|possible parents of Gi|, until the possible parents of Gi are exhaused.

  4. User chooses the approximate percent of unregulated genes. A gene is "unregulated" if its value at a time does not depend on any other gene besides itself at a previous time.
    (range: 0-100), default = 10)

2.3 Output and Storage

If the graph involves a managable number of vertices (<30), then the graph editor represents it pictorially, allowing you to edit the graph by adding or taking away edges. For example, in Figure 17 we show a Lag graph with five genes and a maxlag = 2 after adding a few other regulating connections.


Figure 17: Lag Graph with 5 Genes and Mlag=2

If the number of genes is large, e.g. 5,000, then representing the Lag graph pictorially is a waste. In that case we change to a textual representation in which each gene is given a row, and the genes that regulate (cause) it are listed on its row. For example, Figure 19 shows the top 20 or so lines of a Lag graph with 500 variables. In this representation, Gene 5 (G005) is regulated by itself one time step back (G005:1) and by Gene 461 two time steps back (G461:2). This representation can be saved as a text file to disk and then processed by software of your choice.


Figure 19: Large Lag Graph Represented Textually

 


Specifying the Parametric Model


Having specified the Lag graph, you must now intrepret the graph as a parametric model. Currently, only one type of parametric model is implemented-- the "Boolean Glass Gene Parametric Model." When you double-click the PM modelNode at this, point, that type of parametric model will automatically be chosen for you.

Eventually, four types of parametric models will be implemented, as follows:

  1. Boolean Glass Gene Parametric Model
  2. General Glass Gene Parametric Model
  3. General Additive Model Gene Parametric Model
  4. Bayes Net Standard Error Model

These families are not exclusive, but each has advantages.


3.1 Glass Updating
- Implemented.

In both Glass (Edwards & Glass, 2000) and General Updating parametric models, the value of each gene Gi at the current time (a lag of 0) Gi:L0 is set by a function with the following general form:


where: is an "error" term drawn from a given probability distribution (discussed below), all variables are continuous, and Fi is a function specified by the user. The difference between Glass and General updating models involves only the form of the functions Fi.


3.1.1 Preliminary Binary Projection

Even though the variables in this parametric model class are continuous, Glass functions take boolean valued inputs, so some pre-processing is necessary. The inputs to the function Fi are the members of the set P = {parents(Gi:L0)/Gi:L1}, that is, the parents of Gi:L0 except for Gi:L1, which is already in the updating function. The idea behind Glass functions is to simplify the input to whether a given gene is expressing at "high" or "low" levels. We map each Gp � P to 0 (low, that is, below its average expression level of 0) or 1 (high, that is, above 0). We do this with the following binary projection BP:


Simulated values in for variables Gi will range mostly over the interval [-2.0, 2.0] and will oscillate above and below 0.0. Since raw microarray data is typically given as a ratio of microarray spot brightness to average spot intensity, where this ratio typically ranges from near 0.0 up to about 10.0, with averages around 1.0, it is useful to think of the simulated data as loglinear with respect to raw data-viz., log(x) of each intensity ratio x is recorded in the simulation in place of the intensity ratio x.

Note: We thank Stuart Kauffmann for pointing us to the Glass model.

3.1.2 The Function Table for Fi

Since genetic regulators either inhibit or activate their targets, Edwards and Glass (2000, p.3) set the range of Fi to {-1,1}. To specify a particular Fi, construct a truth-table with 1 column for each and one for the output of Fi. Thus the truth table will have 2 to the power of P rows. For example, if P = {G3:L1, G5:L2}, then the function table for Fi is:

Filling in the Fi column in this function table specifies the particular instantiation of Fi used in the update function for variable Gi, thus this step is left to the Instantiated Model specification below.

By picking the Glass Updating parametric family, you are committing yourself to the class of models parametrized by the update function above and Glass functions for the Fi.


3.2 General Updating - Unimplemented.

Again, in both Glass and General Updating parametric models, the current value of each variable Gi:0 is set by a function with the following general form.


where: is an "error" term drawn from a given probability distribution (discussed below), all variables are continuous, and Fi is a function specified by the user. The difference between Glass and General updating models involves only the form of the functions Fi. In General Updating parametric models, the user is free to specify any function for the Fi, not just Glass functions. Here we will use Tianjiao's GAM specifier to allow the user to specify, for each i, the function Fi.


3.3 GRN GAM - Unimplemented.

In this parametric family, the variables are continuous and the update function is slightly more general than the ones used in 3.1 and 3.2.


GAM stands for general additive model, so the only constraint on this parametric model is that the Gi are additive functions. The particular instantiaions of Fi for each i are fixed when you specify the instantiated model (IM). Here we will use Tianjiao's GAM specifier to allow the user to specify, for each i, the function Gi.



3.4 BN SEM
- Unimplemented.

In this parametric family, the variables are discrete and the causal system is just interpreted as a standard discrete Bayes Network, that is, the update function is just a conditional probability table. <br>


At the parametric level, you need to specify the number of categories for each variable, as well as the value of each category. This is identical to the Bayes Net parametric model in general Tetrad 4 models.


Specifying the Instantiated Model


Once the parametric model has been specified, you need only specfiy values for the parameters in the corresponding instantiated model (IM). To do this, double-click on the IM modelNode in the session workbench. Close the IM editor to finish. Since the parameters depend on the PM chosen, we cover each of the above classes in turn.


4.1 Glass Updating

Recall that the update function used in Glass model is:


The free parameters for a Glass model are thus:

  • the rate parameters rate1 and rate 2
  • the distributions over the error terms Ei
  • the Boolean functions F1
The error distributions are all assumed to be pairwise indedpendent, and normal with mean 0 and standard deviation SD(Ei).

Parameter Defaults
Rate 1 = .1
Rate 2 = .5
Ei ~ N(0,.05)

By the structure of the Glass function, an unregulated gene (no other genes besides itself affecting it) will have an equilbirum value of 0. Rate 1 determines how quickly such genes return to equilbrium. If Rate 1 is set = .1, for example (its default), then in each time step a gene with no regulators will return 1/10th of the way toward 0, whatever its previous value, before noise. Rate2 governs the influence of other regulators. The Glass functions Fi output +1 (promoting expression) or -1 (inhibiting expression). If Rate 2 is set = .5, then in one time step the contribution to a gene is either +.5 or -.5.

The error terms are meant to simulate random biological error in the process of gene expression. We give a default of mean 0 and standard deviation = .05.

The range of Fi is {-1,1}. The number of possible functions for each Fi is , where P is the set of parents of Gi. For example, if P = {G3:L1, G5:L2}, then the uninstantiated function table for Fi is:



which can be filled out in 16 different ways. If P contains 5 parents, then there are 32 rows in Fi and there are 2^32 different ways you can instantiate Fi. Thus, we give you a choice of filling in all Fi randomly or manually, with randomly the default.


Randomly:

  • Draw Fi from: (p(Fi = 1+) = .5, and p(Fi = -1) = .5, but
  • Ensure all inputs to Fi are non-trivial, that is, for each input Pk, insist that at least one pair of rows in the function table exist that are
    i) identical on all inputs besides Pk,
    ii) different on Pk, and different on the output Fi.

Manually: Allow the user to assign the value of Fi in each row manually.

The window that appears in Tetrad 4 in which you set the parameters for a Boolean Glass Function IM is shown in Figure 23. The defaults for the rate parameters appear on the bottom of the screen. The opening view is of the Boolean function for gene G1, but any gene's function can be selected. In , the function regulating G1 promotes G1 in all cases except when both G2 one time lag back and G3 one time lag back are both expressing above their mean, in which case G1 is inhibited. Boolean functions can be manually edited by clicking in the appropriate Fcn. Value box and replacing the value with +1 or -1.


The standard deviation of the error distributions can be changed by selecting the "Errors" tab, and then editing the default of .05 to whatever value you prefer.


5. Initializing the Cells


Given the full parametric model and its instantiation, and assuming that the same model is used for each individual cell, then values for N individuals over T times can be simulated in two stages: an "initialization" phase and an "update" phase. The initialization phase must assign values for all the genes in each individual for time step 1. After this, the update function specified above can be used to assign values to variables for a given individual, with the caveat that if mlag > 1, then not all the "parents" of a variable in time 2 will exist. In that case the update function will simply use the latest value of that variable that does exist as input.

Biologically, cells from the same tissue can be starved of nutrients, or manipulated in some other way and all brought into roughly the same state, that is, synchronized. They can then be shocked, e.g., exposed to nutrients, and let run.

Further, many genes in a cell are considered "housekeeping genes," which express protein at some basal rate stably over time. In this version of the simulator we allow the user to choose among two methods for initializing values: synchronized or random.


5.1 Synchronized Initialization

  1. For each gene Gi in cell 1,
  2. a. if Gi is a housekeeping gene, that is, its only parent in the Lag graph is itself, then let Gi:1 = 0,
    b. else draw Gi from a standard normal distribution, i.e, N(0,1)
  3. Duplicate cell 1 N times, where N is the total number of individual cells.

5.2 Random Initialization

  1. For each gene Gi in each cell,
  2. a. if Gi is a housekeeping gene, that is, its only parent in the Lag graph is itself, then let Gi:1 = 0,
    b. else draw Gi:1 from a standard normal distribution, i.e, N(0,1)

 


Aggregation and Measurement Error


6.1 Aggregation by Dish

So far, we have modelled the progression of gene activity over time for individual cells. We cannot currently obtain biological data on this level, however. In microarray experiments, thousands (or millions) of cells are grown in each of several dishes. For example, Figure 26 shows two dishes with a million cells each.


A microarray experiment involves at least the following sequence of steps to begin the experiment:

  1. Separate single cell culture into dishes
  2. Grow cells in dishes
  3. Synchronize Cells
  4. Expose dishes to treatment

Initial Dish to Dish Differences

Although they are intentionally minimized, small variations in nutrient or temperature make dish to dish variations inevitable, even at the beginning of an experiment.

To simulate these differences, we need to specify how much variation in expression levels we can expect between dishes. Let that quantity be the standard deviation of expression level difference due to the dish, which we will call sd(DV), with default at 10%.

For each dish Dj, we draw Dev(Dj) from a normal distribution with mean 100 and standard deviation = sd(DV), that is, N(100, sd(DV)).

We then adjust the initial^3 expression levels in all genes in all the cells in dish Dj by Dev(Dj):

For each dish Dj

For each cell Ck in Dj,
For each gene Gi in Ck at time 1 (Gi:1),

Let Gi = Dev(Dj) % of Gi

Dishes and Time

In the usual studies, although several dishes might begin an experiment in a relatively "synchronized" state, each dish must be "frozen" before it can be processed for measurement, and thus we cannot use the same dish to measure cells at two different time points in the experiment. To measure two time points, we take a sample from one dish at time 1 and a sample from a different dish at time 2, e.g., Figure 29.



If the goal is to compare the expression levels of a particular gene at two different times to see if they substantially differ, then this constraint forces us to compare the level of a sample from one dish against a sample from another. This constraint will NOT be imposed by the simulator, but rather by the data analyst after data is generated and stored.

 



6.2 Measurement Error


After a dish is "frozen" at a time, its RNA is extracted. From this RNA, several samples can be drawn and each "measured" by exposing it to a microarray chip (chips can be re-used) and then digitally converting pixel intensities to gene expression levels (Figure 31).




Chips can be reused approximately four times. Samples from the same dish might vary slightly, chips definitely vary, the same chip functions differently from one use to another, and the pixel digitization routine might include error as well. Each "measurement" therefore, has five sources of potential error:

  1. dish,
  2. sample,
  3. chip,
  4. re-use of a chip, and
  5. pixel digitization


We discussed how we will model the dish variability above. In this version of the simulator, we will NOT model dish to dish variability beyond initialization. To model the other sources of measurement error, we will proceed as follows.

After initializing the cells in an experiment, and adding dish to dish variabiltiy, we will have N cells partitioned into D dishes. After we run the experiment, in which each cell obeys the same causal laws but updates independently of each other, each cell has an expression level at each of t times.

After Running the Experiment



We call this the Raw data from the experiment.

After RNA Extraction

Next, we model the RNA extraction process for each dish by aggregating all the cells in a dish, averaging the expression levels for each gene, and recording an average expression level for each gene in each dish at each time (Figure 34):

 

The average expression level for a gene Gi at time t in a dish d, is:


Adding Measurement Error There remain 4 sources of measurement error: sample, chip to chip, chip re-use, and pixel digitization error. In this version of the simulator - we will NOT model chip re-use variability. We treat each of the other as additive normal error with mean 0 and standard deviation

Let Sample to Sample Variability error

Let Chip to Chip error

Pixel digitization error

For each sample s taken from a dish d and measured, new values of are drawn. For each gene in each sample, a new value of is drawn. Thus the average measured expression level for each gene Gi at time t is:



If we draw 4 samples from each dish, and don't re-use any chips, then our final data table would look as follows:



We call this the Measurement Data.
To summarize, the parameters the user must specify for the measurement error model, with defaults in brackets are:

  1. Dish to Dish variability = sd(DV) [10]
  2. Number of samples per dish = S [4]
  3. Sample to Sample Variability S, [.025]
  4. Chip to Chip Variability c, [.1]
  5. Pixel Digitization pd, [.025]


Running a Simulation


To initiate a data generation process, double click on the red-die on the arrow from an IM modelNode to a Data modelNode (). The program will only produce data if it has all the info it needs, otherwise it will prompt you.



Simulation Run Parameters:

After clicking the red die, a dialogue box comes up showing the parameters you must specify, along with their default values (Figure 38).



We discussed the measurement model parameters in section 6. For the Simulation Parameters, the number of dishes defaults to 1 and the number of cells per dish to 100,000. The number of time steps generated defaults at 4, and the user has the option of when to start storing data (default at time step 1). The user can also choose how often to store data, with the default every step.

Finally, because there are often thousands of cells and thousands of genes in each cell, many simulations will involve millions of gene expression levels at each time. Call this the raw data. The measured, as opposed to raw data, involves only a few samples from each dish, each of which contains as many measurements as there are genes. Since storing the raw data is expensive relative to the aggregated data with measurement error - we give the user a choice as to whether to store raw data, with the default = no. If the user chooses to store raw data, both data sets will be contained in the data nodes on the workbench in Tetrad.

 

Glossary

Bayes Network
definition

chip
A micro-array

chip to chip variability
definition

dish to dish variability
definition

edge
definition

error term
definition

Generally Adititve Model (GAM)
This model outputs an additive function that is not alwasys linear.

Initialization
Synchonized: all copies of orginial
Random: Different levels of gene expression

Instantiated Model (IM)
definition

Lag Graph
definition

Maximum Lag
definition

Measurement Data
definition

Parametric Model (PM)
definition

Pixel Dish
definition

Rate Parameters
definition

Raw Data
definition

Regular Graph
definition

Regulation
definition

Sample to Sample variability
definition

SEM
definition

Boolean Glass Gene Parametric Model


General Glass Gene Parametric Model


General Additive Model Gene Parametric Model


Bayes Net Standard Error Model
definition

Time Series Graph
definition

Update Function
Defines how influence is transmitted to gene in current time, can include variables from past times

 

References

Edwards, R., & Glass L. (2000). Combinatorial explosion in model gene networks., Chaos, V. 10, N. 3, September., pp. 1-14.




© 2015 - 2025 Weber Informatics LLC | Privacy Policy