/*
* Copyright (c) 2010-2015 Pivotal Software, Inc. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You
* may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License. See accompanying
* LICENSE file.
*/
package com.gemstone.gemfire;
import com.gemstone.gemfire.internal.LocalLogWriter;
import com.gemstone.gemfire.internal.LogWriterImpl;
import com.gemstone.gemfire.internal.LogWriterImpl.GemFireThreadGroup;
import com.gemstone.gemfire.internal.SystemFailureTestHook;
import com.gemstone.gemfire.internal.admin.remote.RemoteGfManagerAgent;
import com.gemstone.gemfire.internal.cache.GemFireCacheImpl;
import com.gemstone.gemfire.internal.i18n.LocalizedStrings;
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;
/**
* Catches and responds to JVM failure
*
* This class represents a catastrophic failure of the system,
* especially the Java virtual machine. Any class may,
* at any time, indicate that a system failure has occurred by calling
* {@link #initiateFailure(Error)} (or, less commonly,
* {@link #setFailure(Error)}).
*
* In practice, the most common type of failure that is likely to be
* reported by an otherwise healthy JVM is {@link OutOfMemoryError}. However,
* GemFire will report any occurrence of other {@link VirtualMachineError}s
* excluding {@link StackOverflowError} as a JVM failure.
*
* When a failure is reported, you must assume that the JVM has broken
* its fundamental execution contract with your application.
* No programming invariant can be assumed to be true, and your
* entire application must be regarded as corrupted.
*
* <h3>Failure Hooks</h3>
* GemFire uses this class to disable its distributed system (group
* communication) and any open caches. It also provides a hook for you
* to respond to after GemFire disables itself.
* <h3>Failure WatchDog</h3>
* When {@link #startThreads()} is called, a "watchdog" {@link Thread} is started that
* periodically checks to see if system corruption has been reported. When
* system corruption is detected, this thread proceeds to:
*
* <ol>
* <li>
* <em>Close GemFire</em> -- Group communication is ceased (this cache
* member recuses itself from the distributed system) and the cache
* is further poisoned (it is pointless to try to cleanly close it at this
* point).
* <p>
* After this has successfully ended, we launch a
* </li>
* <li>
* <em>failure action</em>, a user-defined Runnable set via
* {@link #setFailureAction(Runnable)}.
* By default, this Runnable prints the failure stack trace to
* {@link System#err}. If you feel you need to perform
* an action before exiting the JVM, this hook gives you a
* means of attempting some action. Whatever you attempt should be extremely
* simple, since your Java execution environment has been corrupted.
* <p>
* GemStone recommends that you employ Java Service Wrapper to detect when
* your JVM exits and to perform appropriate failure and restart actions.
* </li>
* <li>
* Finally, if the application has granted the watchdog permission to exit the JVM
* (via {@link #setExitOK(boolean)}), the watchdog calls {@link System#exit(int)} with
* an argument of 1. If you have not granted this class permission to
* exit the JVM, you are strongly advised to call it in your
* failure action (in the previous step); a configuration sketch is given
* at the end of this section.
* </li>
* </ol>
*
* Each of these actions will be run exactly once in the above described
* order. However, if any step throws any type of error ({@link Throwable}),
* the watchdog will assume that the JVM is still under duress (esp. an
* {@link OutOfMemoryError}), will wait a bit, and then retry the failed action.
*
* It bears repeating that you should be very cautious of any Runnables you
* ask this class to run. By definition the JVM is very sick
* when failure has been signalled.
*
*
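*
* As an illustration only (not something GemFire requires), a dedicated
* cache server might wire up these hooks at startup roughly as follows;
* the message text is just a placeholder:
* <pre>
SystemFailure.setExitOK(true); // allow the watchdog to call System.exit(1)
SystemFailure.setFailureAction(new Runnable() {
  public void run() {
    // keep this trivial; the JVM is corrupted when this runs
    System.err.println("Fatal JVM failure detected; process will terminate");
  }
});
* </pre>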
* <h3>Failure Proctor</h3>
* In addition to the failure watchdog, {@link #startThreads()} creates a second
* thread (the "proctor") that monitors free memory. It does this by examining
* {@link Runtime#freeMemory() free memory},
* {@link Runtime#totalMemory() total memory} and
* {@link Runtime#maxMemory() maximum memory}. If the amount of available
* memory stays below a given
* {@link #setFailureMemoryThreshold(long) threshold}, for
* more than {@link #MEMORY_MAX_WAIT} seconds, the watchdog is notified.
*
* Note that the proctor can be effectively disabled by
* {@link SystemFailure#setFailureMemoryThreshold(long) setting} the failure memory threshold
* to a negative value.
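*
* For example, the following call (illustrative only) turns the check off:
* <pre>
SystemFailure.setFailureMemoryThreshold(-1); // any negative value disables the proctor's memory check
* </pre>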
*
* The proctor is a second line of defense, attempting to detect
* OutOfMemoryError conditions in circumstances where nothing alerted the
* watchdog. For instance, a third-party jar might incorrectly handle this
* error and leave your virtual machine in a "stuck" state.
*
* Note that the proctor does not relieve you of the obligation to
* follow the best practices in the next section.
*
* <h3>Best Practices</h3>
* <h4>Catch and Handle Fatal JVM Errors</h4>
* If you feel obliged to catch either {@link Error} or
* {@link Throwable}, you must also check for a
* fatal JVM error, like so:
*
* <pre>
catch (Error e) {
if (SystemFailure.{@link #isJVMFailureError(Error) isJVMFailureError}(e)) {
SystemFailure.{@link #initiateFailure(Error) initiateFailure}(e);
// If this ever returns, rethrow the error. We're poisoned
// now, so don't let this thread continue.
throw e;
}
...
}
* </pre>
*
* <h4>Periodically Check For Errors</h4>
* Check for serious system errors at
* appropriate points in your algorithms. You may elect to use
* the {@link #checkFailure()} utility function, but you are
* not required to (you could just see if {@link SystemFailure#getFailure()}
* returns a non-null result).
*
* A job processing loop is a good candidate, for
* instance, in com.gemstone.org.jgroups.protocols.UDP#run(),
* which implements {@link Thread#run}:
*
* <pre>
for (;;) {
SystemFailure.{@link #checkFailure() checkFailure}();
if (mcast_recv_sock == null || mcast_recv_sock.isClosed()) break;
if (Thread.currentThread().isInterrupted()) break;
  ...
}
* </pre>
*
* <h4>Create Logging ThreadGroups</h4>
* If you create any Thread, a best practice is to catch severe errors
* and signal failure appropriately. One trick to do this is to create a
* ThreadGroup that handles uncaught exceptions by overriding
* {@link ThreadGroup#uncaughtException(Thread, Throwable)} and to declare
* your thread as a member of that {@link ThreadGroup}. This also has a
* significant side-benefit in that most uncaught exceptions
* can be detected:
*
* <pre>
ThreadGroup tg = new ThreadGroup("Worker Threads") {
    public void uncaughtException(Thread t, Throwable e) {
      // Do this *before* any object allocation in case of
      // OutOfMemoryError (for instance)
      if (e instanceof Error && SystemFailure.{@link #isJVMFailureError(Error) isJVMFailureError}(
          (Error)e)) {
        SystemFailure.{@link #setFailure(Error) setFailure}((Error)e); // don't throw
      }
      String s = "Uncaught exception in thread " + t;
      system.getLogWriter().severe(s, e);
    }
  };
Thread t = new Thread(tg, myRunnable, "My Thread");
t.start();
* </pre>
*
*
* <h4>Catches of Error and Throwable Should Check for Failure</h4>
* Keep in mind that peculiar or flat-out impossible exceptions may
* ensue after a fatal JVM error has been thrown anywhere in
* your virtual machine. Whenever you catch {@link Error} or {@link Throwable},
* you should also make sure that you aren't dealing with a corrupted JVM:
*
* <pre>
catch (Throwable t) {
Error err;
if (t instanceof Error && SystemFailure.{@link #isJVMFailureError(Error) isJVMFailureError}(
err = (Error)t)) {
SystemFailure.{@link #initiateFailure(Error) initiateFailure}(err);
// If this ever returns, rethrow the error. We're poisoned
// now, so don't let this thread continue.
throw err;
}
// Whenever you catch Error or Throwable, you must also
// check for fatal JVM error (see above). However, there is
// _still_ a possibility that you are dealing with a cascading
// error condition, so you also need to check to see if the JVM
// is still usable:
SystemFailure.{@link #checkFailure() checkFailure}();
...
}
* </pre>
*
* @author jpenney
* @author swale
* @since 5.1
*/
@SuppressFBWarnings(value="DM_GC", justification="This class performs System.gc as last ditch effort during out-of-memory condition.")
public final class SystemFailure {
/**
* Preallocated error messages.
* LocalizedStrings may use memory (in the form of an iterator)
* so we must get the translated messages in advance.
**/
static final String JVM_CORRUPTION = LocalizedStrings.SystemFailure_JVM_CORRUPTION_HAS_BEEN_DETECTED.toLocalizedString();
static final String CALLING_SYSTEM_EXIT = LocalizedStrings.SystemFailure_SINCE_THIS_IS_A_DEDICATED_CACHE_SERVER_AND_THE_JVM_HAS_BEEN_CORRUPTED_THIS_PROCESS_WILL_NOW_TERMINATE_PERMISSION_TO_CALL_SYSTEM_EXIT_INT_WAS_GIVEN_IN_THE_FOLLOWING_CONTEXT.toLocalizedString();
public static final String DISTRIBUTION_HALTED_MESSAGE = LocalizedStrings.SystemFailure_DISTRIBUTION_HALTED_DUE_TO_JVM_CORRUPTION.toLocalizedString();
public static final String DISTRIBUTED_SYSTEM_DISCONNECTED_MESSAGE = LocalizedStrings.SystemFailure_DISTRIBUTED_SYSTEM_DISCONNECTED_DUE_TO_JVM_CORRUPTION.toLocalizedString();
/**
* the underlying failure
*
* This is usually an instance of {@link VirtualMachineError}, but it
* is not required to be such.
*
* @see #getFailure()
* @see #initiateFailure(Error)
*/
protected static volatile Error failure = null;
/**
* user-defined runnable to run last
*
* @see #setFailureAction(Runnable)
*/
private static volatile Runnable failureAction = new Runnable() {
public void run() {
System.err.println(JVM_CORRUPTION);
failure.printStackTrace();
}
};
/**
* @see #setExitOK(boolean)
*/
private static volatile boolean exitOK = false;
/**
* If we're going to exit the JVM, I want to be accountable for who
* told us it was OK.
*/
private static volatile Throwable exitExcuse;
/**
* Indicate whether it is acceptable to call {@link System#exit(int)} after
* failure processing has completed.
*
* This may be dynamically modified while the system is running.
*
* @param newVal true if it is OK to exit the process
* @return the previous value
*/
public static boolean setExitOK(boolean newVal) {
boolean result = exitOK;
exitOK = newVal;
if (exitOK) {
exitExcuse = new Throwable("SystemFailure exitOK set");
}
else {
exitExcuse = null;
}
return result;
}
/**
* Returns true if the given Error is fatal to the JVM and it should be shut
* down. Code should call {@link #initiateFailure(Error)} or
* {@link #setFailure(Error)} if this returns true.
*/
public static boolean isJVMFailureError(Error err) {
// not all VirtualMachineErrors are fatal to the JVM; in particular
// StackOverflowError is not
if (err instanceof OutOfMemoryError) {
  // ignore OOMEs thrown by Spark; note that an OOME may carry no message
  // at all, in which case it is treated as a genuine JVM failure
  String message = err.getMessage();
  return message == null ||
      (!message.contains("Unable to acquire") &&
       !message.contains("error while calling spill") &&
       !message.contains("enough memory for aggregation") &&
       !message.contains("enough memory to grow"));
} else {
return false;
}
}
/**
* Check to see if a throwable is a JVM error and handle it if so.
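*
* A typical (purely hypothetical) call site simply forwards any caught
* throwable; doWork() and handleOrdinaryFailure() are placeholders:
* <pre>
try {
  doWork();
} catch (Throwable t) {
  // rethrows t if it is a fatal JVM error, or throws the recorded failure
  // if the JVM is already known to be corrupted
  SystemFailure.checkThrowable(t);
  handleOrdinaryFailure(t);
}
* </pre>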
*/
public static void checkThrowable(Throwable e) {
Error err;
if (e instanceof Error && SystemFailure.isJVMFailureError(
err = (Error)e)) {
SystemFailure.initiateFailure(err);
// If this ever returns, rethrow the error. We're poisoned
// now, so don't let this thread continue.
throw err;
}
// Whenever you catch Error or Throwable, you must also
// check for fatal JVM error (see above). However, there is
// _still_ a possibility that you are dealing with a cascading
// error condition, so you also need to check to see if the JVM
// is still usable:
SystemFailure.checkFailure();
}
/**
* Disallow instance creation
*/
private SystemFailure() {
}
/**
* Synchronizes access to state variables, used to notify the watchdog
* when to run
*/
private static final Object failureSync = new Object();
/**
* True if we have closed GemFire
*
* @see #emergencyClose()
*/
private static volatile boolean gemfireCloseCompleted = false;
/**
* True if we have completed the user-defined failure action
*
* @see #setFailureAction(Runnable)
*/
private static volatile boolean failureActionCompleted = false;
/**
* This is a logging ThreadGroup, created only once.
*/
private final static ThreadGroup tg;
static {
tg = new GemFireThreadGroup("SystemFailure Watchdog Threads") {
// If the watchdog is correctly written, this will never get executed.
// However, there's no reason for us not to eat our own dog food
// (har, har) -- see the javadoc above.
@Override
public void uncaughtException(Thread t, Throwable e) {
// Uhhh...if the watchdog is running, we *know* there's some
// sort of serious error, no need to check for it here.
System.err.println("Internal error in SystemFailure watchdog:" + e);
e.printStackTrace();
}
};
}
/**
* This is the interval, in seconds, at which the watchdog awakens
* to check whether the system has been corrupted.
*
* The watchdog will be explicitly awakened by calls to
* {@link #setFailure(Error)} or {@link #initiateFailure(Error)}, but
* it will awaken of its own accord periodically to check for failure even
* if the above calls do not occur.
*
* This can be set with the system property
* <code>gemfire.WATCHDOG_WAIT</code>. The default is 15 sec.
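* For example (an illustrative setting only; the property is read once, when
* this class is first loaded):
* <pre>
// equivalent to passing -Dgemfire.WATCHDOG_WAIT=30 on the java command line,
// provided it runs before the SystemFailure class is initialized
System.setProperty("gemfire.WATCHDOG_WAIT", "30");
* </pre>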
*/
static public final int WATCHDOG_WAIT = Integer
.getInteger("gemfire.WATCHDOG_WAIT", 15).intValue();
/**
* This is the watchdog thread
*
* @guarded.By {@link #failureSync}
*/
private static Thread watchDog;
/**
* Start the watchdog thread, if it isn't already running.
*/
private static void startWatchDog() {
if (failureActionCompleted) {
// Our work is done, don't restart
return;
}
synchronized (failureSync) {
if (watchDog != null && watchDog.isAlive()) {
return;
}
watchDog = new Thread(tg, new Runnable() {
public void run() {
runWatchDog();
}
}, "SystemFailure WatchDog");
watchDog.setDaemon(true);
watchDog.start();
}
}
private static void stopWatchDog() {
synchronized (failureSync) {
stopping = true;
if (watchDog != null && watchDog.isAlive()) {
failureSync.notifyAll();
try {
watchDog.join(100);
} catch (InterruptedException ignore) {
}
if (watchDog.isAlive()) {
watchDog.interrupt();
try {
watchDog.join(1000);
} catch (InterruptedException ignore) {
}
}
}
watchDog = null;
}
}
/**
* This is the run loop for the watchdog thread.
*/
static protected void runWatchDog() {
boolean warned = false;
logFine(WATCHDOG_NAME, "Starting");
try {
basicLoadEmergencyClasses();
}
catch (ExceptionInInitializerError e) {
// Uhhh...are we shutting down?
boolean noSurprise = false;
Throwable cause = e.getCause();
if (cause != null) {
if (cause instanceof IllegalStateException) {
String msg = cause.getMessage();
if (msg.indexOf("Shutdown in progress") >= 0) {
noSurprise = true;
}
}
}
if (!noSurprise) {
logWarning(WATCHDOG_NAME, "Unable to load GemFire classes: ", e);
}
// In any event, we're toast
return;
}
catch (CancelException e) {
// ignore this because we are shutting down anyway
}
catch (Throwable t) {
logWarning(WATCHDOG_NAME, "Unable to initialize watchdog", t);
return;
}
for (;;) {
if (stopping) {
return;
}
try {
// Sleep or get notified...
synchronized (failureSync) {
if (stopping) {
return;
}
logFine(WATCHDOG_NAME, "Waiting for disaster");
try {
failureSync.wait(WATCHDOG_WAIT * 1000);
}
catch (InterruptedException e) {
// Ignore
}
if (stopping) {
return;
}
}
// Poke nose in the air, take a sniff...
if (failureActionCompleted) {
// early out, for testing
logInfo(WATCHDOG_NAME, "all actions completed; exiting");
}
if (failure == null) {
// Tail wag. Go back to sleep.
logFine(WATCHDOG_NAME, "no failure detected");
continue;
}
// BOW WOW WOW WOW WOW! Corrupted system.
if (!warned ) {
warned = logWarning(WATCHDOG_NAME, "failure detected", failure);
}
// If any of the following fail, we will go back to sleep and
// retry.
if (!gemfireCloseCompleted) {
logInfo(WATCHDOG_NAME, "closing GemFire");
try {
emergencyClose();
}
catch (Throwable t) {
logWarning(WATCHDOG_NAME, "trouble closing GemFire", t);
continue; // go back to sleep
}
gemfireCloseCompleted = true;
}
if (!failureActionCompleted) {
// avoid potential race condition setting the runnable
Runnable r = failureAction;
if (r != null) {
logInfo(WATCHDOG_NAME, "running user's runnable");
try {
r.run();
}
catch (Throwable t) {
logWarning(WATCHDOG_NAME, "trouble running user's runnable", t);
continue; // go back to sleep
}
}
failureActionCompleted = true;
}
stopping = true;
stopProctor();
if (exitOK) {
logWarning(WATCHDOG_NAME,
// No "+" in this long message, we're out of memory!
CALLING_SYSTEM_EXIT,
exitExcuse);
// ATTENTION: there are VERY FEW places in GemFire where it is
// acceptable to call System.exit. This is one of those
// places...
System.exit(1);
}
// Our job here is done
logInfo(WATCHDOG_NAME, "exiting");
return;
}
catch (Throwable t) {
// We *never* give up. NEVER EVER!
logWarning(WATCHDOG_NAME, "thread encountered a problem: " + t, t);
}
} // for
}
/**
* Spies on system statistics looking for low memory threshold
*
* Well, if you're gonna have a watchdog, why not a watch CAT????
*
* @guarded.By {@link #failureSync}
* @see #minimumMemoryThreshold
*/
private static Thread proctor;
/**
* This mutex controls access to {@link #firstStarveTime} and
* {@link #minimumMemoryThreshold}.
*
* I'm hoping that a fat lock is never created here, so that
* an object allocation isn't necessary to acquire this
* mutex. You'd have to have A LOT of contention on this mutex
* in order for a fat lock to be created, which indicates IMHO
* a serious problem in your applications.
*/
private static final Object memorySync = new Object();
/**
* This is the minimum amount of memory that the proctor will
* tolerate before declaring a system failure.
*
* @see #setFailureMemoryThreshold(long)
* @guarded.By {@link #memorySync}
*/
static long minimumMemoryThreshold = Long.getLong(
"gemfire.SystemFailure.chronic_memory_threshold", 1048576).longValue();
/**
* This is the interval, in seconds, that the proctor
* thread will awaken and poll system free memory.
*
* The default is 1 sec. This can be set using the system property
* <code>gemfire.SystemFailure.MEMORY_POLL_INTERVAL</code>.
*
* @see #setFailureMemoryThreshold(long)
*/
static final public long MEMORY_POLL_INTERVAL = Long.getLong(
"gemfire.SystemFailure.MEMORY_POLL_INTERVAL", 1).longValue();
/**
* This is the maximum amount of time, in seconds, that the proctor thread
* will tolerate seeing free memory stay below
* {@link #setFailureMemoryThreshold(long)}, after which point it will
* declare a system failure.
*
* The default is 15 sec. This can be set using the system property
* <code>gemfire.SystemFailure.MEMORY_MAX_WAIT</code>.
*
* @see #setFailureMemoryThreshold(long)
*/
static final public long MEMORY_MAX_WAIT = Long.getLong(
"gemfire.SystemFailure.MEMORY_MAX_WAIT", 15).longValue();
/**
* Flag that determines whether or not we monitor memory on our own.
* If this flag is set, we will check free memory and declare a failure with
* our own {@link OutOfMemoryError} if free memory stays chronically below
* the {@link #setFailureMemoryThreshold(long) failure threshold}.
*
* The default is false, so this monitoring is turned off. This monitoring has been found
* to be unreliable in non-Sun VMs when the VM is under stress or behaves in unpredictable ways.
*
* @since 6.5
*/
static final public boolean MONITOR_MEMORY = Boolean.getBoolean(
"gemfire.SystemFailure.MONITOR_MEMORY");
/**
* Start the proctor thread, if it isn't already running.
*
* @see #proctor
*/
private static void startProctor() {
if (failure != null) {
// no point!
notifyWatchDog(failure);
return;
}
synchronized (failureSync) {
if (proctor != null && proctor.isAlive()) {
return;
}
proctor = new Thread(tg, new Runnable() {
public void run() {
runProctor();
}
}, "SystemFailure Proctor");
proctor.setDaemon(true);
proctor.start();
}
}
private static void stopProctor() {
synchronized (failureSync) {
stopping = true;
if (proctor != null && proctor.isAlive()) {
proctor.interrupt();
try {
proctor.join(1000);
} catch (InterruptedException ignore) {
}
}
proctor = null;
}
}
/**
* Symbolic representation of an invalid starve time
*/
static private final long NEVER_STARVED = Long.MAX_VALUE;
/**
* this is the last time we saw memory starvation
*
* @guarded.By {@link #memorySync}
*/
static private long firstStarveTime = NEVER_STARVED;
/**
* This is the previous measure of total memory. If it changes,
* we reset the proctor's starve statistic.
*/
static private long lastTotalMemory = 0;
/**
* This is the run loop for the proctor thread (formerly known
* as the "watchcat", grin).
*/
static protected void runProctor() {
// Note that the javadocs say this can return Long.MAX_VALUE.
// If it does, the proctor will never do its job...
final long maxMemory = Runtime.getRuntime().maxMemory();
// Allocate this error in advance, since it's too late once
// it's been detected!
final OutOfMemoryError oome = new OutOfMemoryError(LocalizedStrings.SystemFailure_0_MEMORY_HAS_REMAINED_CHRONICALLY_BELOW_1_BYTES_OUT_OF_A_MAXIMUM_OF_2_FOR_3_SEC.toLocalizedString(new Object[] {PROCTOR_NAME, Long.valueOf(minimumMemoryThreshold), Long.valueOf(maxMemory), Integer.valueOf(WATCHDOG_WAIT)}));
// Catenation, but should be OK when starting up
logFine(PROCTOR_NAME, "Starting, threshold = " + minimumMemoryThreshold
+ "; max = " + maxMemory);
for (;;) {
if (stopping) {
return;
}
try {
//*** catnap...
try {
Thread.sleep(MEMORY_POLL_INTERVAL * 1000);
}
catch (InterruptedException e) {
// ignore
}
if (stopping) {
return;
}
//*** Twitch ear, take a bath...
if (failureActionCompleted) {
// it's all over, we're late
return;
}
if (failure != null) {
notifyWatchDog(failure); // wake the dog, just in case
logFine(PROCTOR_NAME, "Failure has been reported, exiting");
return;
}
if(!MONITOR_MEMORY) {
continue;
}
//*** Sit up, stretch...
long totalMemory = Runtime.getRuntime().totalMemory();
if (totalMemory < maxMemory) {
// We haven't finished growing the heap, so no worries...yet
if (DEBUG) {
// This message has catenation, we don't want this in
// production code :-)
logFine(PROCTOR_NAME, "totalMemory (" + totalMemory
+ ") < maxMemory (" + maxMemory + ")");
}
firstStarveTime = NEVER_STARVED;
continue;
}
if (lastTotalMemory < totalMemory) {
// Don't get too impatient if the heap just now grew
lastTotalMemory = totalMemory; // now we're maxed
firstStarveTime = NEVER_STARVED; // reset the clock
continue;
}
lastTotalMemory = totalMemory; // make a note of this
//*** Hey, is that the food bowl?
// At this point, freeMemory really indicates how much
// trouble we're in.
long freeMemory = Runtime.getRuntime().freeMemory();
if(freeMemory==0) {
/*
* This is to workaround X bug #41821 in JRockit.
* Often, Jrockit returns 0 from Runtime.getRuntime().freeMemory()
* Allocating this one object and calling again seems to workaround the problem.
*/
new Object();
freeMemory = Runtime.getRuntime().freeMemory();
}
// Grab the threshold and starve time once, under mutex, because
// it's publicly modifiable.
long curThreshold;
long lastStarveTime;
synchronized (memorySync) {
curThreshold = minimumMemoryThreshold;
lastStarveTime = firstStarveTime;
}
if (freeMemory >= curThreshold /* enough memory */
|| curThreshold == 0 /* disabled */) {
// Memory is FINE, reset everything
if (DEBUG) {
// This message has catenation, we don't want this in
// production code :-)
logFine(PROCTOR_NAME, "Current free memory is: " + freeMemory);
}
if (lastStarveTime != NEVER_STARVED) {
logFine(PROCTOR_NAME, "...low memory has self-corrected.");
}
synchronized (memorySync) {
firstStarveTime = NEVER_STARVED;
}
continue;
}
// Memory is low
//*** Leap to feet, nose down, tail switching...
long now = System.currentTimeMillis();
if (lastStarveTime == NEVER_STARVED) {
// first sighting
if (DEBUG) {
// Catenation in this message, don't put in production
logFine(PROCTOR_NAME, "Noting current memory " + freeMemory
+ " is less than threshold " + curThreshold);
}
else {
logWarning(
PROCTOR_NAME,
"Noting that current memory available is less than the currently designated threshold", null);
}
synchronized (memorySync) {
firstStarveTime = now;
}
// Trust the JVM to do a full gc when it is needed
//System.gc(); // at least TRY...
continue;
}
//*** squirm, wait for the right moment...wait...wait...
if (now - lastStarveTime < MEMORY_MAX_WAIT * 1000) {
// Very recent; problem may correct itself.
if (DEBUG) {
// catenation
logFine(PROCTOR_NAME, "...memory is still below threshold: "
+ freeMemory);
}
else {
logWarning(
PROCTOR_NAME,
"Noting that current memory available is still below currently designated threshold", null);
}
continue;
}
//*** Meow! Meow! MEOWWWW!!!!!
// Like any smart cat, let the Dog do all the work.
logWarning(PROCTOR_NAME, "Memory is chronically low; setting failure!", null);
SystemFailure.setFailure(oome);
notifyWatchDog(failure);
return; // we're done!
}
catch (Throwable t) {
logWarning(PROCTOR_NAME, "thread encountered a problem", t);
// We *never* give up. NEVER EVER!
}
} // for
}
/**
* Enables some fine logging
*/
static private final boolean DEBUG = false;
/**
* If true, we track the progress of emergencyClose
* on System.err
*/
static public final boolean TRACE_CLOSE = false;
/**
* the level at which to log
*/
static private final int LOG_LEVEL =
DEBUG ? LogWriterImpl.FINE_LEVEL : LogWriterImpl.INFO_LEVEL;
/**
* This is a desperation logger that prints to System.out.
*/
static protected final LogWriterImpl log
= new LocalLogWriter(LOG_LEVEL, System.out);
static protected final String WATCHDOG_NAME = "SystemFailure Watchdog";
static protected final String PROCTOR_NAME = "SystemFailure Proctor";
/**
* break any potential circularity in {@link #loadEmergencyClasses()}
*/
private static volatile boolean emergencyClassesLoaded = false;
/**
* Since it requires object memory to unpack a jar file,
* make sure this JVM has loaded the classes necessary for
* closure before it becomes necessary to use them.
*
* Note that just touching the class in order to load it
* is usually sufficient, so all an implementation needs
* to do is to reference the same classes used in
* {@link #emergencyClose()}. Just make sure to do it while
* you still have memory to succeed!
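*
* A minimal sketch of that pattern; <code>MyEmergencyResource</code> is a
* hypothetical class that an emergency-close path would later need:
* <pre>
static void loadMyEmergencyClasses() {
  // merely referencing the class forces it to be loaded (and its jar entry
  // unpacked) while memory is still available
  MyEmergencyResource.class.getName();
}
* </pre>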
*/
public static void loadEmergencyClasses() {
// This method was called to basically load this class
// and invoke its static initializers. Now that we don't
// use statics to start the threads all we need to do is
// call startThreads. The watchdog thread will call basicLoadEmergencyClasses.
startThreads();
}
private static void basicLoadEmergencyClasses() {
if (emergencyClassesLoaded) return;
emergencyClassesLoaded = true;
SystemFailureTestHook.loadEmergencyClasses(); // bug 50516
GemFireCacheImpl.loadEmergencyClasses();
RemoteGfManagerAgent.loadEmergencyClasses();
}
/**
* Attempt to close any and all GemFire resources.
*
* The contract of this method is that it should not
* acquire any synchronization mutexes nor create any objects.
*
* The former is because the system is in an undefined state and
* attempting to acquire the mutex may cause a hang.
*
* The latter is because the likelihood is that we are invoking
* this method due to memory exhaustion, so any attempt to create
* an object will also cause a hang.
*
* This method is not meant to be called directly (but, well, I
* guess it could). It is public to document the contract
* that is implemented by <code>emergencyClose</code> in other
* parts of the system.
*/
public static void emergencyClose() {
// Make the cache (more) useless and inaccessible...
if (TRACE_CLOSE) {
System.err.println("SystemFailure: closing GemFireCache");
}
GemFireCacheImpl.emergencyClose();
// Arcane strange DS's exist in this class:
if (TRACE_CLOSE) {
System.err.println("SystemFailure: closing admins");
}
RemoteGfManagerAgent.emergencyClose();
// If memory was the problem, make an explicit attempt at
// this point to clean up.
// Trust the JVM to do a full gc when it is needed
//System.gc(); // This will fail if we're out of memory?
if (TRACE_CLOSE) {
System.err.println("SystemFailure: end of emergencyClose");
}
}
/**
* Throw the system failure.
*
* This method does not return normally.
*
* Unfortunately, attempting to create a new Throwable at this
* point may cause the thread to hang (instead of generating
* another OutOfMemoryError), so we have to make do with whatever
* Error we have, instead of wrapping it with one pertinent
* to the current context. See bug 38394.
*
* @throws Error
*/
static private void throwFailure() throws InternalGemFireError, Error {
// Do not return normally...
if (failure != null) throw failure;
}
/**
* Notifies the watchdog thread (assumes that {@link #failure} has been set)
*/
private static void notifyWatchDog(Error err) {
startWatchDog(); // just in case
synchronized (failureSync) {
failure = err; // We (re)set failure here to make findbugs happy.
failureSync.notifyAll();
}
}
/**
* Utility function to check for failures. If a failure is
* detected, this method throws the recorded failure.
*
* @see #initiateFailure(Error)
* @throws InternalGemFireError if the system has been corrupted
* @throws Error if the system has been corrupted and a thread-specific
* AssertionError cannot be allocated
*/
public static void checkFailure() throws InternalGemFireError, Error {
if (failure == null) {
return;
}
notifyWatchDog(failure);
throwFailure();
}
/**
* Signals that a system failure has occurred and then throws the
* given failure.
*
* @param f the failure to set
* @throws IllegalArgumentException if f is null
* @throws InternalGemFireError always; this method does not return normally.
* @throws Error if a thread-specific AssertionError cannot be allocated.
*/
public static void initiateFailure(Error f) throws InternalGemFireError, Error {
SystemFailure.setFailure(f);
throwFailure();
}
/**
* Set the underlying system failure, if not already set.
*
* This method does not generate an error, and should only be used
* in circumstances where execution needs to continue, such as when
* re-implementing {@link ThreadGroup#uncaughtException(Thread, Throwable)}.
*
* @param failure the system failure
* @throws IllegalArgumentException if you attempt to set the failure to null
*/
public static void setFailure(Error failure) {
if (failure == null) {
throw new IllegalArgumentException(LocalizedStrings.SystemFailure_YOU_ARE_NOT_PERMITTED_TO_UNSET_A_SYSTEM_FAILURE.toLocalizedString());
}
if (SystemFailureTestHook.errorIsExpected(failure)) {
return;
}
// We deliberately set this volatile without synchronization: under
// duress no locks should be acquired and no objects should be
// created (OutOfMemoryError), and no stack frames are created
// (StackOverflowError). There is a slight chance that the
// very first error may get overwritten, but this avoids the
// potential of object creation via a fat lock
SystemFailure.failure = failure;
notifyWatchDog(failure);
}
/**
* Returns the catastrophic system failure, if any.
*
* This is usually (though not necessarily) an instance of
* {@link VirtualMachineError}.
*
* A return value of null indicates that no system failure has yet been
* detected.
*
* Object synchronization can implicitly require object creation (fat locks
* in JRockit for instance), so the underlying value is not synchronized
* (it is a volatile). This means the return value from this call is not
* necessarily the first failure reported by the JVM.
*
* Note that even if it were synchronized, it would only be a
* proximal indicator near the time that the JVM crashed, and may not
* actually reflect the underlying root cause that generated the failure.
* For instance, if your JVM is running short of memory, this Throwable is
* probably an innocent victim and not the actual allocation (or
* series of allocations) that caused your JVM to exhaust memory.
*
* If this function returns a non-null value, keep in mind that the JVM is
* very limited. In particular, any attempt to allocate objects may fail
* if the original failure was an OutOfMemoryError.
*
* @return the failure, if any
*/
public static Error getFailure() {
return failure;
}
/**
* Sets a user-defined action that is run in the event
* that failure has been detected.
*
* This action is run after the GemFire cache has been shut down.
* If it throws any error, it will be reattempted indefinitely until it
* succeeds. This action may be dynamically modified while the system
* is running.
*
* The default action prints the failure stack trace to System.err.
*
* @see #initiateFailure(Error)
* @param action the Runnable to use
* @return the previous action
*/
public static Runnable setFailureAction(Runnable action) {
Runnable old = SystemFailure.failureAction;
SystemFailure.failureAction = action;
return old;
}
/**
* Set the memory threshold under which system failure will be
* notified.
*
* This value may be dynamically modified while the system
* is running. The default is 1048576 bytes. This can be set using the
* system property <code>gemfire.SystemFailure.chronic_memory_threshold</code>.
*
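* For example (illustrative values only):
* <pre>
// require at least 4 MB of free heap before the proctor declares a failure
long previous = SystemFailure.setFailureMemoryThreshold(4L * 1024 * 1024);
* </pre>
*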
* @param newVal threshold in bytes
* @return the old threshold
* @see Runtime#freeMemory()
*/
public static long setFailureMemoryThreshold(long newVal) {
long result;
synchronized (memorySync) {
result = minimumMemoryThreshold;
minimumMemoryThreshold = newVal;
firstStarveTime = NEVER_STARVED; // reset
}
startProctor(); // just in case
return result;
}
// /**
// * For use by GemStone Quality Assurance Only
// *
// * @deprecated TODO remove this
// */
// public static void reset() {
// System.gc();
// logWarning("DJP", "do not commit SystemFailure#reset", null);
// failure = null;
// failureAction = new Runnable() {
// public void run() {
// System.err.println("(SystemFailure) JVM corruption has been detected!");
// failure.printStackTrace();
// }
// };
// gemfireCloseCompleted = false;
// failureActionCompleted = false;
// synchronized (failureSync) {
// if (watchDog != null) {
// watchDog.interrupt();
// }
// watchDog = null;
// if (watchCat != null) {
// watchCat.interrupt();
// }
// watchCat = null;
// }
//
// startWatchDog();
// startWatchCat();
// }
static private boolean logStdErr(String kind, String name, String s, Throwable t) {
// As far as I can tell, this code path doesn't allocate
// any objects!!!!
try {
System.err.print(name);
System.err.print(": [");
System.err.print(kind);
System.err.print("] ");
System.err.println(s);
if (t != null) {
t.printStackTrace();
}
return true;
}
catch (Throwable t2) {
// out of luck
return false;
}
}
/**
* Logging can require allocation of objects, so we wrap the
* logger so that failures are silently ignored.
*
* @param s string to print
* @param t the call stack, if any
* @return true if the warning got printed
*/
static protected boolean logWarning(String name, String s, Throwable t) {
return logStdErr("warning", name, s, t);
// if (PREFER_STDERR) {
// return logStdErr("warning", name, s, t);
// }
// try {
// log.warning(name + ": " + s, t);
// return true;
// }
// catch (Throwable t2) {
// return logStdErr("warning", name, s, t);
// }
}
/**
* Logging can require allocation of objects, so we wrap the
* logger so that failures are silently ignored.
*
* @param s string to print
*/
static protected void logInfo(String name, String s) {
logStdErr("info", name, s, null);
// if (PREFER_STDERR) {
// logStdErr("info", name, s, null);
// return;
// }
// try {
// log.info(name + ": " + s);
// }
// catch (Throwable t) {
// logStdErr("info", name, s, t);
// }
}
/**
* Logging can require allocation of objects, so we wrap the
* logger so that failures are silently ignored.
*
* @param s string to print
*/
static protected void logFine(String name, String s) {
if (DEBUG) {
logStdErr("fine", name, s, null);
}
// if (DEBUG && PREFER_STDERR) {
// logStdErr("fine", name, s, null);
// return;
// }
// try {
// log.fine(name + ": " + s);
// }
// catch (Throwable t) {
// if (DEBUG) {
// logStdErr("fine", name, s, null);
// }
// }
}
private static volatile boolean stopping;
/**
* This starts up the watchdog and proctor threads.
* This method is called when a Cache is created.
*/
public static void startThreads() {
stopping = false;
startWatchDog();
startProctor();
}
/**
* This stops the threads that implement this service.
* This method is called when a Cache is closed.
*/
public static void stopThreads() {
// this method fixes bug 45409
stopping = true;
stopProctor();
stopWatchDog();
}
}