docs.javahelp.manual.genetics.genetics.html Maven / Gradle / Ivy

Go to download






    GeneticSimulator


    





    
        Simulating Genetic Regulatory Networks
    

Version
    1.2

    Richard Scheines and Joseph Ramsey
The
    General Model

    Time
        Series Graphs
    Lag
        Graphs 
    Simulation
        Theory

Tetrad
    4

    Specifying
        the Graph
    Specifying
        the Parametric Model
    Specifying
        the Instantiated Model
    Initializing
        the Cells
    Aggregation
        and Measurement Error
    Running
        a Simulation

Glossary
    of Terms

References
    

    


    The General Model
To simulate data in Tetrad 4, you must specify a Lag
    graph and then interpret it as a parametric model. There are 4 choices of
    parametric models,
    where in each either i) the value of each variable at the current time is a
    function of its causal parents and an error term, or ii) the probability distribution
    of the variable is a function of its causal parents. 
Understanding the General Model
Consider a discrete time series involving N individuals,
    each with M variables G1 through Gm. For example, Figure 11 represents a time
    series for three variables, each measured at 5 times. 

 
  

    Understanding Time Series Graphs
If we assume that the process modelled is stable over
    time, then we can represent the causal structure of the series with a time
        series graph that includes the smallest fragment of the series that repeats.
    The number of temporal slices in the time series graph is the longest lag of
    direct influence plus one. For example, the time series graph in Figure 33,
    which represents the series in Figure 11, needs three temporal slices to represent
    a repeating sequence, because G2 has a direct effect on G3 with a temporal lag
    of two.

 
 

    Understanding Lag Graphs
In order to simulate data from this class of models,
    we need to express each variable (or its probability distribution) at an arbitrary
    "current time" as a function of its direct causes. To represent only the set
    of direct causes for each variable, we construct a Lag
        graph. For example, the Lag
        graph in Figure 55 expresses the causal information needed to simulate the
    time series represented in Figure 11 and Figure 33. The Lag
        graph consists of the 3 variables repeated in temporal slices from lag =
    mlag (most remote direct influence) to lag=0 (current time), but includes only
        edges that are into variables at lag = 0, that is, it includes only edges
        that represent direct influences into the current time.

  
Note: We connect all pairs of variables at the
    beginning of the repeating sequence with a double-headed arrow to represent
    an unconstrained causal connection, so that m-separation applied to this graph
    does not entail that these variables are independent, contrary to the fact that
    later versions of the same variable are causally connected in the time series
    graph 
The time lag i in a Lag
    graph is indexed by ":Li". Thus, variable G1 at the current time (lag of
    0) in an Lag graph is G1:L0, and the same variable two time slices in the past
    G1:L2. The constant mlag is the maximum lag of direct influence. 
 
 

    Undertanding the Data Generated
To simulate data in Tetrad 4, you must specify a Lag
    graph and then interpret it as a parametric model. There are 4 choices of
    parametric models, where in each either i) the value of each variable at the
    current time is a function of its causal parents and an error term, or ii) the
    probability distribution of the variable is a function of its causal parents.
    Given an instantiation of the chosen parametric model, and the assumption that
    the causal relations remain constant over time for each individual, values for
    N individuals (cells) over T times can be simulated to produce a data cube that
    is: 

    N (individuals) x M (variables) x T (times).



    

    Although the cube has three dimensions, we will store it in the standard two,
    by repeating M columns for each time slice, as so:

The simulation to produce these data is run in two stages:
    an "initialization"
        phase and an "update" phase. The initialization phase must assign values
    for each individual for at least as many times as the maximum
        lag in the time series model, i.e, from time t = 1 until time t = mlag.
    After time � mlag, values can be assigned to variables for a given individual
    using the given instantiation of the parametric model that interprets the Lag
        graph. 

    

    Data collection regimes for protein expression involve tissues that contain
    thousands of cells, not all of which behave strictly identically and not all
    of which can be measured in isolation. 

    

    Since current technology cannot perfectly measure the levels, even relatively,
    of gene (or mRNA) expression, or levels of protein synthesis in cells, we also
    allow the user to specify a measurement model that aggregates cells by dish
    and models measurement error.
 
 

    Using Tetrad: Setting Up the Program 


    This section will explain how to specify the graph, choose a parametric model
    to interpret the graph, instantiate the parameters of the model, pick an initialization
    routine, and finally specify the measurement model. 


    
When the program opens, it will display the empty workbench.
    In order to generate data, you need to specify a graph, a parametric
        model (PM), an instantiation
        of this model (IM), and connect them all to a Data modelNode. These objects can
    be deposited on the workbench by clicking on their icon in the left and clicking
    where you want them located on the workbench, and then by clicking on the "Flow-Charter"
    icon on the left and then dragging from one modelNode to another to connect them.
    The skeleton for a simulation looks like Figure 9. 

    

After double-clicking on the graph modelNode - you will be
    given a choice (Figure 11) of whether you want to specify a regular graph or
    a Lag graph.

 
 

    Specifying The Graph 
You will then be prompted for whether you want to specify
    the graph manually or randomly, with manually the default.
2.1 Manual Graph Specification 

     User Specifies M, the number of variables 

        (range: 1 to 500, default = 5)

        

    
     User Specifies mlag 

        (range: 1 to 5, default = 1)

        

    
     A default Lag graph is drawn for user in the graph
        drawing window with an arrow from each Gi:L1 to Gi:L0, e.g., with defaults:
        mlag = 1 and M = 5 (Figure 13)

         

        

    
    User completes the Lag
        graph manually, that is, they add edges from variables at lag > 0 into
        variables at lag = 0. No other edges are allowed in this representation of
        the series.

2.2 Random Graph Specification 
If you choose Random Graph Specification, you will be
    confronted with the dialogue box in Figure 15, which asks you to set four parameters.

    

     

     ) User Specifies M, the number of variables in each
        individual 

        (range: 1 to 500, default = 5)

        

    
     User Specifies mlag 

        (range: 1 to 5, default = 1) 

        

    
     User chooses and sets the value of exactly one of:

        

        a) constant indegree ci (range: 0 to M*mlag, default = 2) 

        b) max indegree mxi (range: 0 to M*mlag, default = 2) 

        c) mean indegree mni (range: 0 to M, default = 2) 

        

        a) constant indegree - choose parents(Gi) by putting a uniform distribution
        over the possible parents of Gi (that is, all variables earlier in time) and
        drawing without replacement until |parents(Gi)| = ci 

        b) max indegree - for each variable in the set of possible parents of Gi ,include
        it in parents(Gi) if a random draw from a uniform [0,1] is greater than cutoff
        = 1/|possible parents of Gi|, until either the possible parents of Gi are
        exhaused, or |parents(Gi)| = mxi whichever is first. 

        c) mean indegree - for each variable Gi, decide for each variable in the set
        of possible parents of Gi to include it in parents(Gi) if a random draw from
        a uniform [0,1] is greater than cutoff = 1/|possible parents of Gi|, until
        the possible parents of Gi are exhaused. 

        

    
    User chooses the approximate percent of unregulated
        genes. A gene is "unregulated" if its value at a time does not depend on any
        other gene besides itself at a previous time. 

        (range: 0-100), default = 10) 

2.3 Output and Storage 
If the graph involves a managable number of vertices
    (<30), then the graph editor represents it pictorially, allowing you to edit
    the graph by adding or taking away edges. For example, in Figure 17 we show
    a Lag graph with five genes and a maxlag = 2 after adding a few other regulating
    connections. 

    

    

    Figure 17: Lag Graph with 5 Genes and Mlag=2 

    

    If the number of genes is large, e.g. 5,000, then representing the Lag
        graph pictorially is a waste. In that case we change to a textual representation
    in which each gene is given a row, and the genes that regulate (cause) it are
    listed on its row. For example, Figure 19 shows the top 20 or so lines of a
    Lag graph
    with 500 variables. In this representation, Gene 5 (G005) is regulated by itself
    one time step back (G005:1) and by Gene 461 two time steps back (G461:2). This
    representation can be saved as a text file to disk and then processed by software
    of your choice.
 

    Figure 19: Large Lag Graph Represented Textually
 
 

    Specifying the Parametric Model

    

    Having specified the Lag
        graph, you must now intrepret the graph as a parametric model. Currently,
        only one type of parametric model is implemented-- the "Boolean Glass Gene
        Parametric Model." When you double-click the PM modelNode at this, point, that
        type of parametric model will automatically be chosen for you.
Eventually, four types of parametric models will be implemented, as follows:

    Boolean
        Glass Gene Parametric Model 
    General
        Glass Gene Parametric Model
    General
        Additive Model Gene Parametric Model 
    Bayes
        Net Standard Error Model 

These families are not exclusive, but each has advantages.

 

    3.1 Glass Updating
    - Implemented.
In both Glass (Edwards & Glass, 2000) and General Updating
    parametric models, the value of each gene Gi at the current time (a lag of 0)
    Gi:L0 is set by a function with the following general form:

    

    

    where:  is an
    "error" term drawn from a given probability distribution (discussed below),
    all variables are continuous, and Fi is a function specified by the user. The
    difference between Glass and General updating models involves only the form
    of the functions Fi.
 

    3.1.1 Preliminary Binary Projection

    

    Even though the variables in this parametric model class are continuous, Glass
    functions take boolean valued inputs, so some pre-processing is necessary. The
    inputs to the function Fi are the members of the set P = {parents(Gi:L0)/Gi:L1},
    that is, the parents of Gi:L0 except for Gi:L1, which is already in the updating
    function. The idea behind Glass functions is to simplify the input to whether
    a given gene is expressing at "high" or "low" levels. We map each Gp � P to
    0 (low, that is, below its average expression level of 0) or 1 (high, that is,
    above 0). We do this with the following binary projection BP:

     

    

    Simulated values in for variables Gi will range mostly over the interval [-2.0,
    2.0] and will oscillate above and below 0.0. Since raw microarray data is typically
    given as a ratio of microarray spot brightness to average spot intensity, where
    this ratio typically ranges from near 0.0 up to about 10.0, with averages around
    1.0, it is useful to think of the simulated data as loglinear with respect to
    raw data-viz., log(x) of each intensity ratio x is recorded in the simulation
    in place of the intensity ratio x.

    

    Note: We thank Stuart Kauffmann for pointing us to the Glass model. 
3.1.2 The Function Table for Fi

    

    Since genetic regulators either inhibit or activate their targets, Edwards
        and Glass (2000, p.3) set the range of Fi to {-1,1}. To specify a particular
    Fi, construct a truth-table with 1 column for each 
    and one for the output of Fi. Thus the truth table will have 2 to the power
    of P rows. For example, if P = {G3:L1, G5:L2}, then the function table for Fi
    is:

    

 

Filling in the Fi column in this function table specifies
    the particular instantiation of Fi used in the update function for variable
    Gi, thus this step is left to the Instantiated Model specification below.

    

    By picking the Glass Updating parametric family, you are committing yourself
    to the class of models parametrized by the update function above and Glass functions
    for the Fi. 


 

    3.2 General
        Updating - Unimplemented.

        

        Again, in both Glass and General Updating parametric models, the current value
        of each variable Gi:0 is set by a function with the following general form.

        

        

        where: is
        an "error" term drawn from a given probability distribution (discussed below),
        all variables are continuous, and Fi is a function specified by the user. The
        difference between Glass and General updating models involves only the form
        of the functions Fi. In General Updating parametric models, the user is free
        to specify any function for the Fi, not just Glass functions. Here we will use
        Tianjiao's GAM specifier to allow the user to specify, for each i, the function
        Fi. 
 

    3.3 GRN
        GAM - Unimplemented.

        

        In this parametric family, the variables are continuous and the update function
        is slightly more general than the ones used in 3.1 and 3.2.

         

        

        GAM stands for general additive model, so the only constraint on this parametric
        model is that the Gi are additive functions. The particular instantiaions of
        Fi for each i are fixed when you specify the instantiated model (IM). Here we
        will use Tianjiao's GAM specifier to allow the user to specify, for each i,
        the function Gi. 


     

        3.4 BN SEM
     - Unimplemented.

        

        In this parametric family, the variables are discrete and the causal system
        is just interpreted as a standard discrete Bayes Network, that is, the update
        function is just a conditional probability table. <br>

         

        

        At the parametric level, you need to specify the number of categories for each
        variable, as well as the value of each category. This is identical to the Bayes
        Net parametric model in general Tetrad 4 models. 
 

    Specifying the Instantiated Model

    

    Once the parametric model has been specified, you need only specfiy values for
    the parameters in the corresponding instantiated
        model (IM). To do this, double-click on the IM modelNode in the session workbench.
    Close the IM editor to finish. Since the parameters depend on the PM chosen,
    we cover each of the above classes in turn.

    

    4.1 Glass Updating 

    

    Recall that the update
        function used in Glass model is:

     

    

    The free parameters for a Glass model are thus: 

    the rate parameters rate1 and rate 2
    the distributions over the error terms Ei
    the Boolean functions F1

The error distributions are all assumed to be pairwise indedpendent,
    and normal with mean 0 and standard deviation SD(Ei). 

    

    Parameter Defaults 

    Rate 1 = .1

    Rate 2 = .5

    Ei ~ N(0,.05) 
By the structure of the Glass function, an unregulated
    gene (no other genes besides itself affecting it) will have an equilbirum value
    of 0. Rate 1 determines how quickly such genes return to equilbrium. If Rate
    1 is set = .1, for example (its default), then in each time step a gene with
    no regulators will return 1/10th of the way toward 0, whatever its previous
    value, before noise. Rate2 governs the influence of other regulators. The Glass
    functions Fi output +1 (promoting expression) or -1 (inhibiting expression).
    If Rate 2 is set = .5, then in one time step the contribution to a gene is either
    +.5 or -.5.

    

    The error terms are meant to simulate random biological error in the process
    of gene expression. We give a default of mean 0 and standard deviation = .05.

    

    The range of Fi is {-1,1}. The number of possible functions for each Fi is ,
    where P is the set of parents of Gi. For example, if P = {G3:L1, G5:L2}, then
    the uninstantiated function table for Fi is:

    

    

    

    which can be filled out in 16 different ways. If P contains 5 parents, then
    there are 32 rows in Fi and there are 2^32 different ways you can instantiate
    Fi. Thus, we give you a choice of filling in all Fi randomly or manually, with
    randomly the default. 


    Randomly: 

    Draw Fi from: (p(Fi = 1+) = .5, and p(Fi = -1) = .5,
        but
    Ensure all inputs to Fi are non-trivial, that is,
        for each input Pk, insist that at least one pair of rows in the function table
        exist that are

        i) identical on all inputs besides Pk, 

        ii) different on Pk, and different on the output Fi.



    Manually: Allow the user to assign the value of Fi in each row manually.

    

    The window that appears in Tetrad 4 in which you set the parameters for a Boolean
    Glass Function IM is shown in Figure 23. The defaults for the rate parameters
    appear on the bottom of the screen. The opening view is of the Boolean function
    for gene G1, but any gene's function can be selected. In , the function regulating
    G1 promotes G1 in all cases except when both G2 one time lag back and G3 one time
    lag back are both expressing above their mean, in which case G1 is inhibited.
    Boolean functions can be manually edited by clicking in the appropriate Fcn. Value
    box and replacing the value with +1 or -1.

    

    

    The standard deviation of the error distributions can be changed by selecting
    the "Errors" tab, and then editing the default of .05 to whatever value you prefer.
 

    5. Initializing the Cells

    

    Given the full parametric model and its instantiation, and assuming that the
    same model is used for each individual cell, then values for N individuals over
    T times can be simulated in two stages: an "initialization" phase and an "update"
    phase. The initialization phase must assign values for all the genes in each
    individual for time step 1. After this, the update function specified above
    can be used to assign values to variables for a given individual, with the caveat
    that if mlag > 1, then not all the "parents" of a variable in time 2 will exist.
    In that case the update function will simply use the latest value of that variable
    that does exist as input.

    

    Biologically, cells from the same tissue can be starved of nutrients, or manipulated
    in some other way and all brought into roughly the same state, that is, synchronized.
    They can then be shocked, e.g., exposed to nutrients, and let run. 

    

    Further, many genes in a cell are considered "housekeeping genes," which express
    protein at some basal rate stably over time. In this version of the simulator
    we allow the user to choose among two methods for initializing values: synchronized
    or random. 


    5.1 Synchronized Initialization 



    For each gene Gi in cell 1,

    
     a. if Gi is a housekeeping gene, that is,
        its only parent in the Lag graph is itself, then let Gi:1 = 0,

        b. else draw Gi from a standard normal distribution, i.e, N(0,1) 
    Duplicate cell 1 N times, where N is the total number
        of individual cells.

5.2 Random Initialization 



    For each gene Gi in each cell, 

    
     a. if Gi is a housekeeping gene, that is,
        its only parent in the Lag graph is itself, then let Gi:1 = 0, 

        b. else draw Gi:1 from a standard normal distribution, i.e, N(0,1) 

 
  

    Aggregation and Measurement Error

    

    6.1 Aggregation by Dish 

    

    So far, we have modelled the progression of gene activity over time for individual
    cells. We cannot currently obtain biological data on this level, however. In
    microarray experiments, thousands (or millions) of cells are grown in each of
    several dishes. For example, Figure 26 shows two dishes with a million cells
    each. 

    

    

    A microarray experiment involves at least the following sequence of steps to
    begin the experiment: 

    Separate single cell culture into dishes 
    Grow cells in dishes
    Synchronize Cells
    Expose dishes to treatment

Initial Dish to Dish Differences 

    

    Although they are intentionally minimized, small variations in nutrient or temperature
    make dish to dish variations inevitable, even at the beginning of an experiment.
    

    

    To simulate these differences, we need to specify how much variation in expression
    levels we can expect between dishes. Let that quantity be the standard deviation
        of expression level difference due to the dish, which we will call sd(DV),
    with default at 10%. 

    

    For each dish Dj, we draw Dev(Dj) from a normal distribution with mean 100 and
    standard deviation = sd(DV), that is, N(100, sd(DV)). 

    

    We then adjust the initial^3 expression levels in all genes in all the cells
    in dish Dj by Dev(Dj): 

    

    For each dish Dj 
 For each cell Ck in Dj, 

     For each gene Gi in Ck at time 1 (Gi:1),
    
        
            Let Gi = Dev(Dj) % of Gi 
        
    

Dishes and Time

    

    In the usual studies, although several dishes might begin an experiment in a
    relatively "synchronized" state, each dish must be "frozen" before it can be
    processed for measurement, and thus we cannot use the same dish to measure cells
    at two different time points in the experiment. To measure two time points,
    we take a sample from one dish at time 1 and a sample from a different dish
    at time 2, e.g., Figure 29. 

    

    

    

    If the goal is to compare the expression levels of a particular gene at two
    different times to see if they substantially differ, then this constraint forces
    us to compare the level of a sample from one dish against a sample from another.
    This constraint will NOT be imposed by the simulator, but rather by the data
    analyst after data is generated and stored.
 


    

    6.2 Measurement Error

    

    After a dish is "frozen" at a time, its RNA is extracted. From this RNA, several
    samples can be drawn and each "measured" by exposing it to a microarray chip
    (chips can be re-used) and then digitally converting pixel intensities to gene
    expression levels (Figure 31). 

    

    

    

    

    Chips can be reused approximately four times. Samples from the same dish might
    vary slightly, chips definitely vary, the same chip functions differently from
    one use to another, and the pixel digitization routine might include error as
    well. Each "measurement" therefore, has five sources of potential error: 

    



    dish,
    sample,
    chip,
    re-use of a chip, and
    pixel digitization



    We discussed how we will model the dish variability above. In this version of
    the simulator, we will NOT model dish to dish variability beyond initialization.
    To model the other sources of measurement error, we will proceed as follows.

    

    After initializing the cells in an experiment, and adding dish to dish variabiltiy,
    we will have N cells partitioned into D dishes. After we run the experiment,
    in which each cell obeys the same causal laws but updates independently of each
    other, each cell has an expression level at each of t times. 
After Running the Experiment

    

     

    

    We call this the Raw data from the experiment.

    

    After RNA Extraction 

    

    Next, we model the RNA extraction process for each dish by aggregating all the
    cells in a dish, averaging the expression levels for each gene, and recording
    an average expression level for each gene in each dish at each time (Figure
    34):

    

    
  
The average expression level for a gene Gi at time t
    in a dish d, is:

     


    Adding Measurement Error There remain 4 sources of measurement error:
    sample, chip to chip, chip re-use, and pixel digitization error. In this version
    of the simulator - we will NOT model chip re-use variability. We treat each
    of the other as additive normal error with mean 0 and standard deviation 
    

    

    Let Sample to Sample Variability error 

    

    Let Chip to Chip error 

    

    Pixel digitization error 

    

    For each sample s taken from a dish d and measured, new values of 
    are drawn. For each gene in each sample, a new value of 
    is drawn. Thus the average measured expression level for each gene Gi at time
    t is:

    

    

    

    If we draw 4 samples from each dish, and don't re-use any chips, then our final
    data table would look as follows:

    

    

    

    We call this the Measurement Data.

    To summarize, the parameters the user must specify for the measurement error
    model, with defaults in brackets are:

    



    Dish
        to Dish variability = sd(DV) [10]
    Number of samples per dish = S [4]
    Sample to Sample Variability S,
        [.025]
    Chip
        to Chip Variability c,
        [.1]
    Pixel Digitization pd,
        [.025]

 

    Running a Simulation

    

    To initiate a data generation process, double click on the red-die on the arrow
    from an IM modelNode to a Data modelNode (). The program will only produce data if it
    has all the info it needs, otherwise it will prompt you. 

    

    

    

    Simulation Run Parameters:

    

    After clicking the red die, a dialogue box comes up showing the parameters you
    must specify, along with their default values (Figure 38).

    

    

    

    We discussed the measurement model parameters in section 6. For the Simulation
    Parameters, the number of dishes defaults to 1 and the number of cells per dish
    to 100,000. The number of time steps generated defaults at 4, and the user has
    the option of when to start storing data (default at time step 1). The user
    can also choose how often to store data, with the default every step. 

    

    Finally, because there are often thousands of cells and thousands of genes in
    each cell, many simulations will involve millions of gene expression levels
    at each time. Call this the raw data. The measured, as opposed to raw data,
    involves only a few samples from each dish, each of which contains as many measurements
    as there are genes. Since storing the raw data is expensive relative to the
    aggregated data with measurement error - we give the user a choice as to whether
    to store raw data, with the default = no. If the user chooses to store raw data,
    both data sets will be contained in the data nodes on the workbench in Tetrad.

 
 Glossary 

    Bayes Network 
    
definition

        

    
    
chip 
    
A micro-array 

        

    
    
chip to chip variability
    
    
definition

        

    
    
dish to dish variability
    
    
definition

        

    
    
edge 
    
definition

        

    
    
error term 
    
definition

        

    
    
Generally Adititve Model (GAM)
    
    
This model outputs an additive function that is not
        alwasys linear.

        

    
    
Initialization 
    
Synchonized: all copies of orginial 
    
Random: Different levels of gene expression 

        

    
    
Instantiated Model (IM) 
    
definition

        

    
    
Lag Graph 
    
definition

        

    
    
Maximum Lag 
    
definition

        

    
    
Measurement Data 
    
definition

        

    
    
Parametric Model (PM) 
    
definition

        

    
    
Pixel Dish 
    
definition

        

    
    
Rate Parameters 
    
definition

        

    
    
Raw Data 
    
definition

        

    
    
Regular Graph 
    
definition

        

    
    
Regulation 
    
definition

        

    
    
Sample to Sample variability
    
    
definition

        

    
    
SEM 
    
definition

        

    
    
Boolean Glass Gene Parametric
        Model 
    


        

    
    
General Glass Gene Parametric
        Model 
    


        

    
    
General Additive Model Gene Parametric
        Model 
    


        

    
    
Bayes Net Standard Error Model
    
    
definition

        

    
    
Time Series Graph 
    
definition

        

    
    
Update Function 
    
Defines how influence is transmitted to gene in current
        time, can include variables from past times

        

    
    
 

 References 
 Edwards, R., & Glass
    L. (2000). Combinatorial explosion in model gene networks., Chaos, V. 10, N.
    3, September., pp. 1-14.