


    Search Algorithms: PC
    



    
    
        
        
    
    



The PC algorithm is designed to search for causal explanations of
    observational or mixed observational and experimental data in which it
    may be assumed that the true causal hypothesis is acyclic and there is
    no hidden common cause between any two variables in the dataset. (It
    is also assumed that no relationship between variables in the data is
    deterministic--see PCD).

The algorithm operates by asking a conditional independence oracle
    to make judgements about the independence of pairs of variables (e.g.,
    X, Z) conditional on sets of variables (e.g., {Y}). Conditional
    independence tests are available for datasets that consist either
    entirely of continuous variables or entirely of discrete variables;
    hence, datasets of these types can be used as input to the
    algorithm. To see how the algorithm should behave in the ideal case,
    where independence tests always give correct answers, one may also
    use a DAG as an input to the algorithm, in which case graphical
    m-separation will be substituted for an actual independence test.
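
As a rough illustration of the DAG-as-oracle idea, the sketch below answers independence queries by graphical separation. For a DAG, m-separation coincides with d-separation, which is checked here with the standard moralized ancestral graph construction. The function names, the parent-map representation, and the five-variable example DAG are assumptions made for illustration; none of this is Tetrad's API.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """Return `nodes` plus all of their ancestors; `dag` maps node -> set of parents."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag[n])
    return seen

def d_separated(dag, x, y, given):
    """True if x and y are d-separated by `given` in the DAG."""
    given = set(given)
    keep = ancestors(dag, {x, y} | given)
    # Moralize the ancestral subgraph: drop directions and link co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        parents = dag[child]
        for p in parents:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(parents, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # x and y are separated iff no path avoids the conditioning set.
    seen, stack = {x}, [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in nbrs[n] - given - seen:
            seen.add(m)
            stack.append(m)
    return True

# Hypothetical DAG: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4, X4 -> X5
dag = {"X1": set(), "X2": {"X1"}, "X3": {"X1"},
       "X4": {"X2", "X3"}, "X5": {"X4"}}

print(d_separated(dag, "X2", "X3", {"X1"}))        # common cause blocked
print(d_separated(dag, "X2", "X3", {"X1", "X4"}))  # conditioning on the collider X4
```

Used as an oracle, such a function never makes a statistical error, which is why it is useful for checking what the search should return in the ideal case.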

In the case where a continuous dataset is used as input, the
    available conditional independence tests assume that the direct causal
    influence of any variable on any other is linear and that the
    distribution of each variable is Normal. 
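
One common test under these linear, Normal assumptions is the Fisher Z test of a vanishing partial correlation. The sketch below is a minimal pure-Python version, included only to make the idea concrete; whether it matches the exact test used by the software is an assumption, and the function names and the example correlation matrix are hypothetical.

```python
import math

def partial_corr(corr, i, j, cond):
    """Partial correlation of variables i and j given the indices in `cond`.

    Uses the standard recursion that removes one conditioning variable at a time.
    """
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def fisher_z_p_value(corr, n, i, j, cond=()):
    """Two-sided p-value for the hypothesis that the partial correlation is zero.

    Assumes |r| < 1 and a sample size n > len(cond) + 3.
    """
    r = partial_corr(corr, i, j, tuple(cond))
    z = math.atanh(r)                                  # Fisher's z transform
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))

# Hypothetical correlations for a chain X -> Y -> Z (indices 0, 1, 2).
corr = [[1.0, 0.6, 0.3],
        [0.6, 1.0, 0.5],
        [0.3, 0.5, 1.0]]

p_marginal = fisher_z_p_value(corr, n=500, i=0, j=2)             # X vs Z
p_given_y = fisher_z_p_value(corr, n=500, i=0, j=2, cond=(1,))   # X vs Z given Y
print(p_marginal < 0.05, p_given_y > 0.05)
```

In this example X and Z are judged dependent marginally but independent given Y, which is the pattern a chain through Y should produce.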

Some of the above assumptions are not
    testable using observational data. They should come from prior
    knowledge or partial experiments.

Pseudocode for the version of PC implemented in Tetrad IV is given
    below. As shown in the pseudocode, the algorithm can be broken into
    two phases: an adjacency phase and an orientation phase. In the
    adjacency phase, a complete undirected graph over the variables is
    initially constructed and then edges X---Y are removed if some set S
    among either the adjacents of X or the adjacents of Y can be found (of
    a certain size, or "depth") such that I(X, Y | S). Once the
    adjacency structure over V has been well estimated by this procedure,
    an orientation phase is begun. The first step of the orientation phase
    is to examine unshielded triples and consider whether to orient them
    as colliders. An unshielded triple is a triple <X, Y, Z> where X
    is adjacent to Y, Y is adjacent to Z, but X is not adjacent to
    Z. Since X is not adjacent to Z, the edge X---Z must have been removed
    during the adjacency search by conditioning on some set Sxz; <X, Y,
    Z> is oriented as a collider X-->Y<--Z just in case Y is not
    in this Sxz. Once every unshielded triple that this rule can orient as
    a collider has been oriented, a series of orientation rules is
    applied (in this case, the complete orientation rule set from Meek
    1995) to orient any edges whose orientations are implied by previous
    orientations. The log of particular decisions the algorithm makes, as
    described above, when searching on an actual dataset is available
    through the Logging menu in the interface. 


 Entering PC parameters

Consider the following "true" causal hypothesis (a
    DAG):

    

When the PC algorithm is chosen from the Search dropdown, a window
    appears in which one may enter an alpha value and edit
    knowledge. The alpha value is the significance level of the
    statistical test used as a conditional independence oracle for the
    algorithm. The default value is 0.05, although it is useful to
    experiment with different alpha levels to test the sensitivity of the
    analysis to this parameter. (Typical values for experimenting are
    0.01, 0.05, and 0.10.)
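
To make the sensitivity point concrete, the sketch below shows how the same test results can lead to different edges being kept at different significance levels (alpha). The p-values are invented for illustration and do not come from any dataset; the helper name `edges_kept` is likewise hypothetical.

```python
# Hypothetical p-values from independence tests on three candidate edges.
p_values = {("X1", "X2"): 0.003, ("X2", "X4"): 0.03, ("X3", "X5"): 0.08}

def edges_kept(p_values, alpha):
    """An edge survives a test when independence is rejected, i.e. p < alpha."""
    return sorted(edge for edge, p in p_values.items() if p < alpha)

for alpha in (0.01, 0.05, 0.10):
    print(alpha, edges_kept(p_values, alpha))
```

A larger significance level rejects independence more readily, so it tends to keep more edges; a smaller one tends to produce a sparser graph.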

PC is sensitive to background knowledge--that is, sensitive to
    specifications that certain edges are either required in the model or
    forbidden to be in the model. To edit this information, click the edit
    button for background knowledge and enter the information in that
    interface. 

When parameters are set to their desired values, click
    "Execute" to run the algorithm. The output will be a CPDAG
    like the following: 


 

Interpreting the output

There are basically two types of edges that can
    appear in PC output:

    a directed edge, A-->B:
        
        In this case, the PC algorithm deduced
            that A is a direct cause of B, i.e., the causal effect goes from A to B
            and is not mediated by any of the other observed variables.
    
    an undirected edge, A---B:
        
        In this case, the PC algorithm cannot tell
            whether A causes B or B causes A.
    

The absence of an edge between any pair of
    nodes means that they are independent, or that the causal effect of one node
    on the other is mediated by other observed variables.

Sometimes a double-directed edge
    appears in PC search output. Such edges are the result of a
    partial failure of the PC search. They may appear due to failure of
    assumptions (e.g., relationships are non-linear, the population graph
    is cyclic, etc.) or because the sample is not large enough and some
    statistical decisions are inconsistent. In such a situation, the
    user may introduce prior knowledge to constrain the direction such an
    edge may take, collect more data, or use a different
    algorithm. Knowledge of the domain will be essential.

Finally, a triplet of nodes may assume the following CPDAG:

    

In other words, in such patterns, A and B are connected by an
    undirected edge, A and C are connected by an undirected edge, and B and
    C are not connected by an edge. By the PC search assumptions, this
    means that B and C cannot both be causes of A. The three possible
    scenarios are:

    A is a common cause of B and C
    B is a direct cause of A, and A is a direct cause of C
    C is a direct cause of A, and A is a direct cause of B
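
The three scenarios can be recovered mechanically: enumerate every orientation of the two undirected edges and discard the one that would make A a collider. The snippet below is a small sketch of that enumeration; the variable names are chosen for illustration only.

```python
from itertools import product

edges = [("A", "B"), ("A", "C")]   # undirected skeleton B---A---C

def is_collider_at_a(oriented):
    """True if every edge points into A, i.e. B-->A<--C."""
    return all(head == "A" for _tail, head in oriented)

members = []
for flips in product([False, True], repeat=len(edges)):
    # Each oriented edge is a (tail, head) pair.
    oriented = [(v, u) if flip else (u, v) for (u, v), flip in zip(edges, flips)]
    if not is_collider_at_a(oriented):
        members.append(oriented)

for dag in members:
    print(", ".join(f"{t}-->{h}" for t, h in dag))
```

Of the four possible orientations, only the collider is ruled out, leaving the three DAGs listed above.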

In our example, some edges were compelled to be directed: X2 and X3
    are causes of X4, and X4 is a cause of X5. We cannot tell much
    about the triplet (X1, X2, X3), but we do know that X2 and X3 cannot both
    be causes of X1.

    
Pseudocode for PC
The following is pseudocode representing the way PC is implemented in Tetrad.

Step A:

Form the complete undirected graph G over v1,...,vn.

Step B (Fast Adjacency Search):

For each depth d = 0, 1, ...:
   For each variable x:

      "next_y":
      For each node y adjacent to x:
         Let adjX = adj(x) - {y}
         Let adjY = adj(y) - {x}

         For each subset Sx of adjX up to size d:
            If x _||_ y | Sx:
               Remove x---y from G.
               Continue "next_y".

         For each subset Sy of adjY up to size d:
            If x _||_ y | Sy:
               Remove x---y from G.
               Continue "next_y".
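
Steps A and B can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the Tetrad implementation: the oracle `indep` is a hypothetical callable, the example independence facts are hand-written for the DAG X-->Y<--Z, Y-->W, and the sketch tests only sets of size exactly d at depth d, since with a deterministic oracle the smaller sets were already tried at earlier depths. The loop stops once no node has enough remaining neighbours to form a conditioning set of the current size.

```python
from itertools import combinations

def fast_adjacency_search(nodes, indep):
    """Return (adjacency map, separating sets) for an independence oracle.

    `indep(x, y, S)` returns True when x and y are judged independent given S.
    """
    adj = {v: set(nodes) - {v} for v in nodes}        # Step A: complete graph
    sepsets = {}
    depth = 0
    while any(len(adj[x]) - 1 >= depth for x in nodes):
        for x in nodes:
            for y in sorted(adj[x]):                  # sorted() copies, so removal is safe
                for pool in (adj[x] - {y}, adj[y] - {x}):
                    found = next((set(S) for S in combinations(sorted(pool), depth)
                                  if indep(x, y, set(S))), None)
                    if found is not None:
                        adj[x].discard(y)             # remove x---y
                        adj[y].discard(x)
                        sepsets[frozenset((x, y))] = found
                        break                         # move on to the next y
        depth += 1
    return adj, sepsets

# Hand-written independence facts for the hypothetical DAG X-->Y<--Z, Y-->W.
FACTS = {
    (frozenset("XZ"), frozenset()),
    (frozenset("XW"), frozenset("Y")),
    (frozenset("ZW"), frozenset("Y")),
    (frozenset("XW"), frozenset("YZ")),
    (frozenset("ZW"), frozenset("XY")),
}

def indep(x, y, S):
    return (frozenset((x, y)), frozenset(S)) in FACTS

adj, sepsets = fast_adjacency_search(list("XYZW"), indep)
print(adj)
print(sepsets)
```

The separating sets are recorded because the collider orientation step described in the text needs to know which set removed each edge.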


Step C:

Orient colliders in G, as follows:

For each node x:
   For each pair of nodes y, z adjacent to x:
      If y and z are not adjacent:
         If ~(y _||_ z | x):
            Orient y-->x<--z as a collider.
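
A minimal Python sketch of Step C as written above follows. The adjacency map and the oracle are hypothetical inputs for the DAG X-->Y<--Z, Y-->W, and the function name is chosen for illustration. The sketch re-tests independence given the middle node, as the pseudocode does; checking whether the middle node belongs to the recorded separating set, as described earlier in the text, gives the same answer on this example.

```python
from itertools import combinations

def orient_colliders(adj, indep):
    """Return the directed edges (tail, head) produced by collider orientation."""
    directed = set()
    for x in adj:
        for y, z in combinations(sorted(adj[x]), 2):
            # Unshielded triple y - x - z whose endpoints stay dependent given x.
            if z not in adj[y] and not indep(y, z, {x}):
                directed.add((y, x))
                directed.add((z, x))
    return directed

# Hypothetical skeleton and oracle for X-->Y<--Z, Y-->W.
adj = {"X": {"Y"}, "Y": {"X", "Z", "W"}, "Z": {"Y"}, "W": {"Y"}}
FACTS = {
    (frozenset("XZ"), frozenset()),
    (frozenset("XW"), frozenset("Y")),
    (frozenset("ZW"), frozenset("Y")),
}

def indep(x, y, S):
    return (frozenset((x, y)), frozenset(S)) in FACTS

colliders = orient_colliders(adj, indep)
print(sorted(colliders))
```

Only the triple X, Y, Z is oriented; the triples through Y that involve W are left alone because W is independent of X and of Z given Y.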

Step D:

Apply orientation rules until no more orientations are possible.
Rules to use: away from collider, away from cycle, kite1, kite2.
(These are Meek's rules R1, R2, R3, and R4.)

Away from collider:

For each node a:
   For each b, c in adj(a) such that b is not adjacent to c:
      If b-->a---c:
         Orient b-->a-->c.
      Else if c-->a---b:
         Orient c-->a-->b.


Away from cycle:

For each node a:
   For each b, c in adj(a):
      If a-->b-->c and a---c:
         Orient a-->c.
      Else if c-->b-->a and c---a:
         Orient c-->a.

Kite 1:

For each node a:
   For each b, c, d in adj(a) such that a---b, a---c,
   a---d, and !(c---d):
      If c-->b and d-->b:
         Orient a-->b.


Kite 2:

For each node a:
   For each b, c, d in adj(a) such that a---b, a---d,
   b is not adjacent to d, and either a---c, a-->c, or c-->a,
      If b-->c and c-->d:
         Orient a-->d.
      Else if d-->c and c-->b:
         Orient a-->b.
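
The propagation loop of Step D can be sketched as below. This is an illustration only, not the Tetrad code: it implements the first three rules (away from collider, away from cycle, kite 1) and omits kite 2, which matters only when background knowledge has already oriented some edges. The graph representation, the function name, and the example skeleton (chosen to be consistent with the X1..X5 example discussed above) are assumptions.

```python
from itertools import permutations

def apply_meek_rules(nodes, directed, undirected):
    """Apply Meek rules R1-R3 until no further edge can be oriented.

    `directed` holds (tail, head) pairs; `undirected` holds unordered pairs.
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(u, v):
        return ((u, v) in directed or (v, u) in directed
                or frozenset((u, v)) in undirected)

    def und(u, v):
        return frozenset((u, v)) in undirected

    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            others = [n for n in nodes if n not in e]
            for a, c in permutations(e):          # try orienting a-->c, then c-->a
                if e not in undirected:
                    break
                # R1 (away from collider): b-->a---c with b, c not adjacent.
                r1 = any((b, a) in directed and not adjacent(b, c) for b in others)
                # R2 (away from cycle): a-->b-->c and a---c.
                r2 = any((a, b) in directed and (b, c) in directed for b in others)
                # R3 (kite 1): a---b, a---d, b-->c, d-->c, with b, d not adjacent.
                r3 = any(und(a, b) and und(a, d) and (b, c) in directed
                         and (d, c) in directed and not adjacent(b, d)
                         for b, d in permutations(others, 2))
                if r1 or r2 or r3:
                    undirected.discard(e)
                    directed.add((a, c))
                    changed = True
    return directed, undirected

# Assumed state after collider orientation: X2-->X4<--X3, other edges undirected.
nodes = ["X1", "X2", "X3", "X4", "X5"]
directed, undirected = apply_meek_rules(
    nodes,
    directed={("X2", "X4"), ("X3", "X4")},
    undirected=[("X1", "X2"), ("X1", "X3"), ("X4", "X5")],
)
print(sorted(directed))
print(sorted(sorted(e) for e in undirected))
```

Here the first rule orients X4-->X5, while X1---X2 and X1---X3 stay undirected because no rule applies to them.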


 
References: 
Spirtes, Glymour, and Scheines (2000). Causation, Prediction, and Search.
Meek (1995). "Causal inference and causal explanation with background knowledge."