All Downloads are FREE. Search and download functionalities are using the official Maven repository.

docs.javahelp.manual.boxes.search.pc.html Maven / Gradle / Ivy

There is a newer version: 7.6.6
Show newest version



    Search Algorithms: PC
    


Search Algorithms: PC


The PC algorithm is designed to search for causal explanations of observational or mixed observational and experimental data in which it may be assumed that the true causal hypothesis is acyclic and there is no hidden common cause between any two variables in the dataset. (It is also assumed that no relationship between variables in the data is deterministic--see PCD).

The algorithm operates by asking a conditional independence oracle to make judgements about the independence of pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional indepedence tests are available for datasets that consist either entirely of continuous variables or entirely of discrete variables; hence, datasets of these types can be used as input to the algorithm. As a way of getting one's head around how the algorithm should behave in the ideal, when independence tests always give correct answers, one may also use a DAG as an input to the algorithm, in which case graphical m-separation will be substituted for an actual independence test.

In the case where a continuous dataset is used as input, the available conditional independence tests assume that the direct causal influence of any variable on any other is linear and that the distribution of each variable is Normal.

Some of the above assumptions are not testable using observational data. They should come from prior knowledge or partial experiments.

Pseudocode for the version of PC implemented in Tetrad IV is given below. As shown in the pseudocode, the algorithm can be broken into two phases: an adjacency phase and an orientation phase. In the adjacency phase, a complete undirected graph over the variables is initially constructed and then edges X---Y are removed if some set S among either the adjacents of X or the adjacents of Y can be found (of a certain size, or "depth") such that I(X, Y | S). Once the adjacency structure over V has been well estimated by this procedure, an orientation phase is begun. The first step of the orientation phase is to examine unshielded triples and consider whether to orient them as colliderDiscovery. An unshielded triple is a triple <X, Y, Z> where X is adjacent to Y, Y is adjacent to Z, but X is not adjacent to Z. Since X is not adjacent to Z, the edge X---Z must have been removed during the adjacency search by conditioning on some set Sxz; <X, Y, Z> is oriented as a collider X-->Y<--Z just in case Y is not in this Sxz. Once all such unshielded triples have been oriented as colliderDiscovery by this rule that can be, a series of orientation rules is applied (in this case, the complete orientation rule set from Meek 1995) to orient any edges whose orientations are implied by previous orientations. The log of particular decisions the algorithm makes, as described above, when searching on an actual dataset is available through the Logging menu in the interface.


Entering PC parameters

Consider the following "true" causal hypothesis (a DAG):

When the PC algorithm is chosen from the Search dropdown, window appears in which on may enter an depErrorsAlpha value and edit knowledge. The depErrorsAlpha value is the significance level of the statistical test used as a conditional independence oracle for the algorithm. The default value is 0.05, although it is useful to experiment with different depErrorsAlpha levels to test the sensitivity of the analysis to this parameter. (Typical values for experimenting are 0.01, 0.05, and 0.10.)

PC is sensitive to background knowledge--that is, sensitive to specifications that certain edges are either required in the model or forbidden to be in the model. To edit this information, click the edit button for background knowledge and enter the information in that interface.

When parameters are set to their desired values, click "Execute" to run the algorithm. The output will be a CPDAG like the following:


Interpreting the output

The are basically two types of edges that can appear in PC output:

  • a directed edge:

    In this case, the PC algorithm deduced that A is a direct cause of B, i.e., the causal effect goes from A to B and it is not intermediated by any of the other observed variable

  • a undirected edge:

    In this case, the PC algorithm cannot tell if A causes B or if B causes A.

The absence of an edge between any pair of nodes means they are independent, or that the causal effect of one modelNode in the other is intermediate by other observed variables.

Sometimes a double directed edge sometimes appear in a PC search output. Such edges are the result of a partial failure of the PC search. They may appear due to failure of assumptions (e.g., relationships are non-linear, the population graph is cyclic, etc.) or because the sample is not large enough and some statistical decisions are inconsistent. In a situation like that, the user may introduce prior knowledge to constraint the direction such edge may assume, collect more data or use a different algorithm. Knowledge of the domain will be essential.

Finally, a triplet of nodes may assume the following CPDAG:

In other words, in such patterns, A and B are connected by an undirected edge, A and C are connected by an undirected edge, and B and C are not connected by an edge. By the PC search assumptions, this means that B and C cannot both be cause of A. The three possible scenarios are:

  • A is a common cause of B and C
  • B is a direct cause of A, and A is a direct cause of C
  • C is a direct cause of A, and A is a direct cause of B

In our example, some edges were compelled to be directed: X2 and X3 are causes of X4, and X4 is a cause of X5. However, we cannot tell much about the triplet (X1, X2, X3), but we know that X2 and X3 cannot both be causes of X1.

Pseudocode for PC

The following is pseudocode representing the way PC is implemented in Tetrad.

Step A:

Form the complete undirected graph G over v1,...,vn.

Step B (Fast Adjacency Search):

For each depth d = 0, 1, ...:
   For for each variable x:

      "next_y":
      For each adjacent modelNode y to v:
         Let adjX = adj(x) - {y}
         Let adjY = adj(y) - {x}

         For each subset Sx of adjX up to size d:
            If x _||_ y | Sx, remove x---y from G.
            Continue "next_y."

         For each subset Sy of adjY up to size d:
            if x _||_ y | Sy, remove x---y from G.
            Continue "next_y."


Step C:

Orient colliderDiscovery in G, as follows:

For each modelNode x:
   For each pair of nodes y, z adjacent to x:
      If y and z are not adjacent:
         If ~(y _||_ z | x):
            Orient y-->x<--z as a collider.

Step D:

Apply orientation rules until no more orientations are possible.
Rules to use: away from collider, away from cycle, kite1, kite2.
(These are Meek's rules R1, R2, R3, and R4.)

Away from collider:

For each modelNode a:
   For each b, c in adj(a):
      If b-->a---c:
         Orient b-->a-->c.
      Else if c-->a---b:
         Orient c-->a-->b.


Away from cycle:

For each modelNode a:
   For b, c in adj(a):
      If a-->b-->c and a---c:
         Orient a-->c.
      Else if c-->b-->a and c---a:
         Orient c-->a.

Kite 1:

For each modelNode a:
   For each nodes b, c, d in adj(a) such that a---b, a---c,
   a---d, and !(c---d):
      If c-->b and d-->b:
         Orient a-->b.


Kite 2:

For each modelNode a:
   For each nodes b, c, d in adj(a) such that a---b, a---d,
   b is not adjacent to d, and either a---c, a-->c, or c-->a,
      If b-->c and c-->d:
         Orient a-->d.
      Else if d-->c and c-->b:
         Orient a-->b.

 

References:

Spirtes, Glymour, and Scheines (2000). Causation, Prediction, and Search.

Chris Meek (1995), "Causal inference and causal explanation with background knowledge."





© 2015 - 2025 Weber Informatics LLC | Privacy Policy