
com.gemstone.gemfire.SystemFailure Maven / Gradle / Ivy

/*
 * Copyright (c) 2010-2015 Pivotal Software, Inc. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you
 * may not use this file except in compliance with the License. You
 * may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
 * implied. See the License for the specific language governing
 * permissions and limitations under the License. See accompanying
 * LICENSE file.
 */
package com.gemstone.gemfire;

import com.gemstone.gemfire.internal.LocalLogWriter;
import com.gemstone.gemfire.internal.LogWriterImpl;
import com.gemstone.gemfire.internal.LogWriterImpl.GemFireThreadGroup;
import com.gemstone.gemfire.internal.SystemFailureTestHook;
import com.gemstone.gemfire.internal.admin.remote.RemoteGfManagerAgent;
import com.gemstone.gemfire.internal.cache.GemFireCacheImpl;
import com.gemstone.gemfire.internal.i18n.LocalizedStrings;

import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;

/**
 * Catches and responds to JVM failure.
 * <p>
 * This class represents a catastrophic failure of the system,
 * especially the Java virtual machine. Any class may,
 * at any time, indicate that a system failure has occurred by calling
 * {@link #initiateFailure(Error)} (or, less commonly,
 * {@link #setFailure(Error)}).
 * <p>
 * In practice, the most common type of failure that is likely to be
 * reported by an otherwise healthy JVM is {@link OutOfMemoryError}. However,
 * GemFire will report any occurrence of other {@link VirtualMachineError}s
 * excluding {@link StackOverflowError} as a JVM failure.
 * <p>
 * When a failure is reported, you must assume that the JVM has broken
 * its fundamental execution contract with your application.
 * No programming invariant can be assumed to be true, and your
 * entire application must be regarded as corrupted.

Failure Hooks

 * GemFire uses this class to disable its distributed system (group
 * communication) and any open caches. It also provides a hook that lets
 * you respond after GemFire has disabled itself.

Failure WatchDog

 * When {@link #startThreads()} is called, a "watchdog" {@link Thread} is started that
 * periodically checks to see if system corruption has been reported. When
 * system corruption is detected, this thread proceeds to:
 * <ol>
 * <li>
 * Close GemFire -- Group communication is ceased (this cache
 * member recuses itself from the distributed system) and the cache
 * is further poisoned (it is pointless to try to cleanly close it at this
 * point). After this has successfully ended, the watchdog launches the
 * failure action.
 * </li>
 * <li>
 * Run the failure action, a user-defined Runnable set with
 * {@link #setFailureAction(Runnable)}.
 * By default, this Runnable prints the failure's stack trace to System.err.
 * If you feel you need to perform an action before exiting the JVM, this
 * hook gives you a means of attempting some action. Whatever you attempt
 * should be extremely simple, since your Java execution environment has
 * been corrupted.
 * <p>
 * GemStone recommends that you employ Java Service Wrapper to detect when
 * your JVM exits and to perform appropriate failure and restart actions.
 * </li>
 * <li>
 * Finally, if the application has granted the watchdog permission to exit the JVM
 * (via {@link #setExitOK(boolean)}), the watchdog calls {@link System#exit(int)} with
 * an argument of 1. If you have not granted this class permission to
 * close the JVM, you are strongly advised to call {@link System#exit(int)}
 * yourself in your failure action (in the previous step).
 * </li>
 * </ol>
 * <p>
 * Each of these actions will be run exactly once in the order described above.
 * However, if any step throws any type of error ({@link Throwable}),
 * the watchdog will assume that the JVM is still under duress (esp. an
 * {@link OutOfMemoryError}), will wait a bit, and then retry the failed action.
 * <p>
 * It bears repeating that you should be very cautious of any Runnables you
 * ask this class to run. By definition the JVM is very sick
 * when failure has been signalled.

*
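As a rough illustration of the run-once, in-order, retry-on-failure behaviour described above, here is a minimal standalone sketch. It uses no GemFire classes: OrderedRetrySteps and its methods are hypothetical names, and where the real watchdog waits before retrying, this sketch simply returns so the caller can decide when to try again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GemFire: runs each step exactly once,
// in order, and retries a step on a later pass if it throws.
final class OrderedRetrySteps {
    private final List<Runnable> steps = new ArrayList<>();
    private int nextStep = 0;   // index of the first step not yet completed
    private int attempts = 0;   // total attempts made, kept for illustration

    OrderedRetrySteps add(Runnable step) {
        steps.add(step);
        return this;
    }

    // Makes one pass over the remaining steps.
    // Returns true once every step has completed, false if a step threw.
    boolean runOnePass() {
        while (nextStep < steps.size()) {
            attempts++;
            try {
                steps.get(nextStep).run();
            } catch (Throwable t) {
                // Leave nextStep unchanged so this step is retried next pass.
                return false;
            }
            nextStep++;         // completed steps are never run again
        }
        return true;
    }

    int attempts() {
        return attempts;
    }
}
```

A caller would invoke runOnePass() repeatedly, pausing between calls, until it returns true.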

Failure Proctor

 * In addition to the failure watchdog, {@link #startThreads()} creates a second
 * thread (the "proctor") that monitors free memory. It does this by examining
 * {@link Runtime#freeMemory() free memory},
 * {@link Runtime#totalMemory() total memory} and
 * {@link Runtime#maxMemory() maximum memory}. If the amount of available
 * memory stays below a given
 * {@link #setFailureMemoryThreshold(long) threshold} for
 * more than {@link #MEMORY_MAX_WAIT} seconds, the watchdog is notified.
 * Memory is only polled when {@link #MONITOR_MEMORY} is enabled, which it
 * is not by default.
 * <p>
 * Note that the proctor can be effectively disabled by
 * {@link SystemFailure#setFailureMemoryThreshold(long) setting} the failure memory threshold
 * to zero or a negative value.
 * <p>
 * The proctor is a second line of defense, attempting to detect
 * OutOfMemoryError conditions in circumstances where nothing alerted the
 * watchdog. For instance, a third-party jar might incorrectly handle this
 * error and leave your virtual machine in a "stuck" state.
 * <p>
 * Note that the proctor does not relieve you of the obligation to
 * follow the best practices in the next section.
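The proctor's decision can be approximated as a small state machine. The following is a simplified, standalone sketch, not the GemFire implementation: StarvationTracker is a hypothetical name, the threshold and wait are passed in rather than read from system properties, and timestamps are supplied by the caller so the logic can be exercised deterministically.

```java
// Hypothetical, simplified model of the proctor's decision; not GemFire code.
final class StarvationTracker {
    private static final long NEVER_STARVED = Long.MAX_VALUE;

    private final long thresholdBytes;
    private final long maxWaitMillis;
    private long firstStarveTime = NEVER_STARVED;

    StarvationTracker(long thresholdBytes, long maxWaitMillis) {
        this.thresholdBytes = thresholdBytes;
        this.maxWaitMillis = maxWaitMillis;
    }

    // Feeds one memory sample. Returns true once free memory has stayed
    // below the threshold for at least maxWaitMillis on a fully grown heap.
    boolean sample(long freeBytes, long totalBytes, long maxBytes, long nowMillis) {
        if (totalBytes < maxBytes || freeBytes >= thresholdBytes) {
            // The heap can still grow, or memory is fine: reset the clock.
            firstStarveTime = NEVER_STARVED;
            return false;
        }
        if (firstStarveTime == NEVER_STARVED) {
            firstStarveTime = nowMillis;   // first sighting of low memory
            return false;
        }
        return nowMillis - firstStarveTime >= maxWaitMillis;
    }

    // Convenience wrapper that samples the current JVM.
    boolean sampleJvm() {
        Runtime rt = Runtime.getRuntime();
        return sample(rt.freeMemory(), rt.totalMemory(), rt.maxMemory(),
                System.currentTimeMillis());
    }
}
```

With a negative threshold the first branch is always taken, which mirrors the way the proctor can be disabled.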

Best Practices

*

Catch and Handle fatal JVM errors

 * If you feel obliged to catch either {@link Error} or
 * {@link Throwable}, you must also check for a
 * fatal JVM error like so:

*

        catch (Error e) {
          if (SystemFailure.{@link #isJVMFailureError(Error) isJVMFailureError}(e)) {
            SystemFailure.{@link #initiateFailure(Error) initiateFailure}(e);
            // If this ever returns, rethrow the error. We're poisoned
            // now, so don't let this thread continue.
            throw e;
          }
          ...
        }
 * 
*

Periodically Check For Errors

 * Check for serious system errors at
 * appropriate points in your algorithms. You may elect to use
 * the {@link #checkFailure()} utility function, but you are
 * not required to (you could just see if {@link SystemFailure#getFailure()}
 * returns a non-null result).
 * <p>
 * A job processing loop is a good candidate, for
 * instance, in com.gemstone.org.jgroups.protocols.UDP#run(),
 * which implements {@link Thread#run}:

*

         for (;;)  {
           SystemFailure.{@link #checkFailure() checkFailure}();
           if (mcast_recv_sock == null || mcast_recv_sock.isClosed()) break;
           if (Thread.currentThread().isInterrupted()) break;
          ...
 * 
*
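The shape of such a loop can be tried out without GemFire. The sketch below is an assumption-laden stand-in: FailureAwareLoop is a hypothetical class whose checkFailure() only mimics the behaviour of SystemFailure.checkFailure() by rethrowing a locally recorded Error.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the shared failure state; not GemFire code.
final class FailureAwareLoop {
    private final AtomicReference<Error> failure = new AtomicReference<>();
    private int completed = 0;

    // Records the first reported failure; later reports are ignored.
    void reportFailure(Error e) {
        failure.compareAndSet(null, e);
    }

    // Throws the recorded failure, if any; otherwise returns normally.
    void checkFailure() {
        Error e = failure.get();
        if (e != null) {
            throw e;
        }
    }

    // Runs the jobs in order, checking for a reported failure before each
    // one. The recorded Error propagates to the caller and ends the loop.
    void process(Runnable[] jobs) {
        for (Runnable job : jobs) {
            checkFailure();
            job.run();
            completed++;
        }
    }

    int completed() {
        return completed;
    }
}
```

Because the check happens at the top of each iteration, a failure reported during one job stops the loop before the next job starts.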

Create Logging ThreadGroups

 * If you create any Thread, a best practice is to catch severe errors
 * and signal failure appropriately. One trick to do this is to create a
 * ThreadGroup that handles uncaught exceptions by overriding
 * {@link ThreadGroup#uncaughtException(Thread, Throwable)} and to declare
 * your thread as a member of that {@link ThreadGroup}. This also has a
 * significant side-benefit in that most uncaught exceptions
 * can be detected:

*

    ThreadGroup tg = new ThreadGroup("Worker Threads") {
        public void uncaughtException(Thread t, Throwable e) {
          // Do this *before* any object allocation in case of
          // OutOfMemoryError (for instance)
          if (e instanceof Error && SystemFailure.{@link #isJVMFailureError(Error) isJVMFailureError}(
              (Error)e)) {
            SystemFailure.{@link #setFailure(Error) setFailure}((Error)e); // don't throw
          }
          String s = "Uncaught exception in thread " + t;
          system.getLogWriter().severe(s, e);
        }
      };
    // Thread's constructor takes the group first, then the Runnable
    Thread t = new Thread(tg, myRunnable, "My Thread");
    t.start();
*
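A self-contained variant of the same idea, runnable without GemFire, might look like the sketch below. RecordingThreadGroupDemo, looksFatal and recordedFailure are hypothetical stand-ins for SystemFailure.isJVMFailureError and SystemFailure.setFailure; the sketch only shows how a ThreadGroup can capture an Error that escapes one of its threads.

```java
import java.util.concurrent.atomic.AtomicReference;

// Standalone sketch: the failure state is modelled with an AtomicReference
// so that the example needs no GemFire classes.
final class RecordingThreadGroupDemo {
    static final AtomicReference<Error> recordedFailure = new AtomicReference<>();

    // Hypothetical predicate standing in for SystemFailure.isJVMFailureError.
    static boolean looksFatal(Error e) {
        return e instanceof OutOfMemoryError;
    }

    static ThreadGroup newGroup() {
        return new ThreadGroup("Worker Threads") {
            public void uncaughtException(Thread t, Throwable e) {
                if (e instanceof Error && looksFatal((Error) e)) {
                    // Keep only the first failure; do not throw from here.
                    recordedFailure.compareAndSet(null, (Error) e);
                }
            }
        };
    }

    // Runs the task on a thread in the group and waits for it to finish.
    static void runInGroup(Runnable task) {
        Thread t = new Thread(newGroup(), task, "My Thread");
        t.start();
        try {
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The group is passed as the first constructor argument, so any Throwable that escapes the task is routed to uncaughtException before the thread terminates.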

*

Catches of Error and Throwable Should Check for Failure

 * Keep in mind that peculiar or flat-out impossible exceptions may
 * ensue after a fatal JVM error has been thrown anywhere in
 * your virtual machine. Whenever you catch {@link Error} or {@link Throwable},
 * you should also make sure that you aren't dealing with a corrupted JVM:

*

        catch (Throwable t) {
          Error err;
          if (t instanceof Error && SystemFailure.{@link #isJVMFailureError(Error) isJVMFailureError}(
              err = (Error)t)) {
            SystemFailure.{@link #initiateFailure(Error) initiateFailure}(err);
            // If this ever returns, rethrow the error. We're poisoned
            // now, so don't let this thread continue.
            throw err;
          }
          // Whenever you catch Error or Throwable, you must also
          // check for fatal JVM error (see above).  However, there is
          // _still_ a possibility that you are dealing with a cascading
          // error condition, so you also need to check to see if the JVM
          // is still usable:
          SystemFailure.{@link #checkFailure() checkFailure}();
          ...
        }
 * 
 * @author jpenney
 * @author swale
 * @since 5.1
 */
@SuppressFBWarnings(value="DM_GC", justification="This class performs System.gc as last ditch effort during out-of-memory condition.")
public final class SystemFailure {

  /**
   * Preallocated error messages.
   * LocalizedStrings may use memory (in the form of an iterator)
   * so we must get the translated messages in advance.
   **/
  static final String JVM_CORRUPTION = LocalizedStrings.SystemFailure_JVM_CORRUPTION_HAS_BEEN_DETECTED.toLocalizedString();
  static final String CALLING_SYSTEM_EXIT = LocalizedStrings.SystemFailure_SINCE_THIS_IS_A_DEDICATED_CACHE_SERVER_AND_THE_JVM_HAS_BEEN_CORRUPTED_THIS_PROCESS_WILL_NOW_TERMINATE_PERMISSION_TO_CALL_SYSTEM_EXIT_INT_WAS_GIVEN_IN_THE_FOLLOWING_CONTEXT.toLocalizedString();
  public static final String DISTRIBUTION_HALTED_MESSAGE = LocalizedStrings.SystemFailure_DISTRIBUTION_HALTED_DUE_TO_JVM_CORRUPTION.toLocalizedString();
  public static final String DISTRIBUTED_SYSTEM_DISCONNECTED_MESSAGE = LocalizedStrings.SystemFailure_DISTRIBUTED_SYSTEM_DISCONNECTED_DUE_TO_JVM_CORRUPTION.toLocalizedString();

  /**
   * the underlying failure
   *
   * This is usually an instance of {@link VirtualMachineError}, but it
   * is not required to be such.
   *
   * @see #getFailure()
   * @see #initiateFailure(Error)
   */
  protected static volatile Error failure = null;

  /**
   * user-defined runnable to run last
   *
   * @see #setFailureAction(Runnable)
   */
  private static volatile Runnable failureAction = new Runnable() {
    public void run() {
      System.err.println(JVM_CORRUPTION);
      failure.printStackTrace();
    }
  };

  /**
   * @see #setExitOK(boolean)
   */
  private static volatile boolean exitOK = false;

  /**
   * If we're going to exit the JVM, I want to be accountable for who
   * told us it was OK.
   */
  private static volatile Throwable exitExcuse;

  /**
   * Indicate whether it is acceptable to call {@link System#exit(int)} after
   * failure processing has completed.
   *

   * This may be dynamically modified while the system is running.
   *
   * @param newVal true if it is OK to exit the process
   * @return the previous value
   */
  public static boolean setExitOK(boolean newVal) {
    boolean result = exitOK;
    exitOK = newVal;
    if (exitOK) {
      exitExcuse = new Throwable("SystemFailure exitOK set");
    } else {
      exitExcuse = null;
    }
    return result;
  }

  /**
   * Returns true if the given Error is fatal to the JVM and it should be shut
   * down. Code should call {@link #initiateFailure(Error)} or
   * {@link #setFailure(Error)} if this returns true.
   */
  public static boolean isJVMFailureError(Error err) {
    // not all VirtualMachineErrors are fatal to the JVM; in particular
    // StackOverflowError is not
    if (err instanceof OutOfMemoryError) {
      // ignore OOMEs thrown by Spark
      String message = err.getMessage();
      // an OutOfMemoryError may carry no message; treat that case as fatal
      return message == null
          || (!message.contains("Unable to acquire")
              && !message.contains("error while calling spill")
              && !message.contains("enough memory for aggregation")
              && !message.contains("enough memory to grow"));
    } else {
      return false;
    }
  }

  /**
   * Check to see if a throwable is a JVM error and handle it if so.
   */
  public static void checkThrowable(Throwable e) {
    Error err;
    if (e instanceof Error && SystemFailure.isJVMFailureError(
        err = (Error)e)) {
      SystemFailure.initiateFailure(err);
      // If this ever returns, rethrow the error. We're poisoned
      // now, so don't let this thread continue.
      throw err;
    }
    // Whenever you catch Error or Throwable, you must also
    // check for fatal JVM error (see above).  However, there is
    // _still_ a possibility that you are dealing with a cascading
    // error condition, so you also need to check to see if the JVM
    // is still usable:
    SystemFailure.checkFailure();
  }

  /**
   * Disallow instance creation
   */
  private SystemFailure() {
  }

  /**
   * Synchronizes access to state variables, used to notify the watchdog
   * when to run
   */
  private static final Object failureSync = new Object();

  /**
   * True if we have closed GemFire
   *
   * @see #emergencyClose()
   */
  private static volatile boolean gemfireCloseCompleted = false;

  /**
   * True if we have completed the user-defined failure action
   *
   * @see #setFailureAction(Runnable)
   */
  private static volatile boolean failureActionCompleted = false;

  /**
   * This is a logging ThreadGroup, created only once.
   */
  private final static ThreadGroup tg;
  static {
    tg = new GemFireThreadGroup("SystemFailure Watchdog Threads") {
      // If the watchdog is correctly written, this will never get executed.
      // However, there's no reason for us not to eat our own dog food
      // (har, har) -- see the javadoc above.
      @Override
      public void uncaughtException(Thread t, Throwable e) {
        // Uhhh...if the watchdog is running, we *know* there's some
        // sort of serious error, no need to check for it here.
        System.err.println("Internal error in SystemFailure watchdog:" + e);
        e.printStackTrace();
      }
    };
  }

  /**
   * This is the amount of time, in seconds, the watchdog periodically awakens
   * to see if the system has been corrupted.
   *

* The watchdog will be explicitly awakened by calls to * {@link #setFailure(Error)} or {@link #initiateFailure(Error)}, but * it will awaken of its own accord periodically to check for failure even * if the above calls do not occur. *

* This can be set with the system property * gemfire.WATCHDOG_WAIT. The default is 15 sec. */ static public final int WATCHDOG_WAIT = Integer .getInteger("gemfire.WATCHDOG_WAIT", 15).intValue(); /** * This is the watchdog thread * * @guarded.By {@link #failureSync} */ private static Thread watchDog; /** * Start the watchdog thread, if it isn't already running. */ private static void startWatchDog() { if (failureActionCompleted) { // Our work is done, don't restart return; } synchronized (failureSync) { if (watchDog != null && watchDog.isAlive()) { return; } watchDog = new Thread(tg, new Runnable() { public void run() { runWatchDog(); } }, "SystemFailure WatchDog"); watchDog.setDaemon(true); watchDog.start(); } } private static void stopWatchDog() { synchronized (failureSync) { stopping = true; if (watchDog != null && watchDog.isAlive()) { failureSync.notifyAll(); try { watchDog.join(100); } catch (InterruptedException ignore) { } if (watchDog.isAlive()) { watchDog.interrupt(); try { watchDog.join(1000); } catch (InterruptedException ignore) { } } } watchDog = null; } } /** * This is the run loop for the watchdog thread. */ static protected void runWatchDog() { boolean warned = false; logFine(WATCHDOG_NAME, "Starting"); try { basicLoadEmergencyClasses(); } catch (ExceptionInInitializerError e) { // Uhhh...are we shutting down? boolean noSurprise = false; Throwable cause = e.getCause(); if (cause != null) { if (cause instanceof IllegalStateException) { String msg = cause.getMessage(); if (msg.indexOf("Shutdown in progress") >= 0) { noSurprise = true; } } } if (!noSurprise) { logWarning(WATCHDOG_NAME, "Unable to load GemFire classes: ", e); } // In any event, we're toast return; } catch (CancelException e) { // ignore this because we are shutting down anyway } catch (Throwable t) { logWarning(WATCHDOG_NAME, "Unable to initialize watchdog", t); return; } for (;;) { if (stopping) { return; } try { // Sleep or get notified... 
synchronized (failureSync) { if (stopping) { return; } logFine(WATCHDOG_NAME, "Waiting for disaster"); try { failureSync.wait(WATCHDOG_WAIT * 1000); } catch (InterruptedException e) { // Ignore } if (stopping) { return; } } // Poke nose in the air, take a sniff... if (failureActionCompleted) { // early out, for testing logInfo(WATCHDOG_NAME, "all actions completed; exiting"); } if (failure == null) { // Tail wag. Go back to sleep. logFine(WATCHDOG_NAME, "no failure detected"); continue; } // BOW WOW WOW WOW WOW! Corrupted system. if (!warned ) { warned = logWarning(WATCHDOG_NAME, "failure detected", failure); } // If any of the following fail, we will go back to sleep and // retry. if (!gemfireCloseCompleted) { logInfo(WATCHDOG_NAME, "closing GemFire"); try { emergencyClose(); } catch (Throwable t) { logWarning(WATCHDOG_NAME, "trouble closing GemFire", t); continue; // go back to sleep } gemfireCloseCompleted = true; } if (!failureActionCompleted) { // avoid potential race condition setting the runnable Runnable r = failureAction; if (r != null) { logInfo(WATCHDOG_NAME, "running user's runnable"); try { r.run(); } catch (Throwable t) { logWarning(WATCHDOG_NAME, "trouble running user's runnable", t); continue; // go back to sleep } } failureActionCompleted = true; } stopping = true; stopProctor(); if (exitOK) { logWarning(WATCHDOG_NAME, // No "+" in this long message, we're out of memory! CALLING_SYSTEM_EXIT, exitExcuse); // ATTENTION: there are VERY FEW places in GemFire where it is // acceptable to call System.exit. This is one of those // places... System.exit(1); } // Our job here is done logInfo(WATCHDOG_NAME, "exiting"); return; } catch (Throwable t) { // We *never* give up. NEVER EVER! logWarning(WATCHDOG_NAME, "thread encountered a problem: " + t, t); } } // for } /** * Spies on system statistics looking for low memory threshold * * Well, if you're gonna have a watchdog, why not a watch CAT???? 
* * @guarded.By {@link #failureSync} * @see #minimumMemoryThreshold */ private static Thread proctor; /** * This mutex controls access to {@link #firstStarveTime} and * {@link #minimumMemoryThreshold}. *

* I'm hoping that a fat lock is never created here, so that * an object allocation isn't necessary to acquire this * mutex. You'd have to have A LOT of contention on this mutex * in order for a fat lock to be created, which indicates IMHO * a serious problem in your applications. */ private static final Object memorySync = new Object(); /** * This is the minimum amount of memory that the proctor will * tolerate before declaring a system failure. * * @see #setFailureMemoryThreshold(long) * @guarded.By {@link #memorySync} */ static long minimumMemoryThreshold = Long.getLong( "gemfire.SystemFailure.chronic_memory_threshold", 1048576).longValue(); /** * This is the interval, in seconds, that the proctor * thread will awaken and poll system free memory. * * The default is 1 sec. This can be set using the system property * gemfire.SystemFailure.MEMORY_POLL_INTERVAL. * * @see #setFailureMemoryThreshold(long) */ static final public long MEMORY_POLL_INTERVAL = Long.getLong( "gemfire.SystemFailure.MEMORY_POLL_INTERVAL", 1).longValue(); /** * This is the maximum amount of time, in seconds, that the proctor thread * will tolerate seeing free memory stay below * {@link #setFailureMemoryThreshold(long)}, after which point it will * declare a system failure. * * The default is 15 sec. This can be set using the system property * gemfire.SystemFailure.MEMORY_MAX_WAIT. * * @see #setFailureMemoryThreshold(long) */ static final public long MEMORY_MAX_WAIT = Long.getLong( "gemfire.SystemFailure.MEMORY_MAX_WAIT", 15).longValue(); /** * Flag that determines whether or not we monitor memory on our own. * If this flag is set, we will check freeMemory, invoke GC if free memory * gets low, and start throwing our own OutOfMemoryException if * * The default is false, so this monitoring is turned off. This monitoring has been found * to be unreliable in non-Sun VMs when the VM is under stress or behaves in unpredictable ways. 
* * @since 6.5 */ static final public boolean MONITOR_MEMORY = Boolean.getBoolean( "gemfire.SystemFailure.MONITOR_MEMORY"); /** * Start the proctor thread, if it isn't already running. * * @see #proctor */ private static void startProctor() { if (failure != null) { // no point! notifyWatchDog(failure); return; } synchronized (failureSync) { if (proctor != null && proctor.isAlive()) { return; } proctor = new Thread(tg, new Runnable() { public void run() { runProctor(); } }, "SystemFailure Proctor"); proctor.setDaemon(true); proctor.start(); } } private static void stopProctor() { synchronized (failureSync) { stopping = true; if (proctor != null && proctor.isAlive()) { proctor.interrupt(); try { proctor.join(1000); } catch (InterruptedException ignore) { } } proctor = null; } } /** * Symbolic representation of an invalid starve time */ static private final long NEVER_STARVED = Long.MAX_VALUE; /** * this is the last time we saw memory starvation * * @guarded.By {@link #memorySync}}} */ static private long firstStarveTime = NEVER_STARVED; /** * This is the previous measure of total memory. If it changes, * we reset the proctor's starve statistic. */ static private long lastTotalMemory = 0; /** * This is the run loop for the proctor thread (formally known * as the "watchcat" (grin) */ static protected void runProctor() { // Note that the javadocs say this can return Long.MAX_VALUE. // If it does, the proctor will never do its job... final long maxMemory = Runtime.getRuntime().maxMemory(); // Allocate this error in advance, since it's too late once // it's been detected! 
final OutOfMemoryError oome = new OutOfMemoryError(LocalizedStrings.SystemFailure_0_MEMORY_HAS_REMAINED_CHRONICALLY_BELOW_1_BYTES_OUT_OF_A_MAXIMUM_OF_2_FOR_3_SEC.toLocalizedString(new Object[] {PROCTOR_NAME, Long.valueOf(minimumMemoryThreshold), Long.valueOf(maxMemory), Integer.valueOf(WATCHDOG_WAIT)})); // Catenation, but should be OK when starting up logFine(PROCTOR_NAME, "Starting, threshold = " + minimumMemoryThreshold + "; max = " + maxMemory); for (;;) { if (stopping) { return; } try { //*** catnap... try { Thread.sleep(MEMORY_POLL_INTERVAL * 1000); } catch (InterruptedException e) { // ignore } if (stopping) { return; } //*** Twitch ear, take a bath... if (failureActionCompleted) { // it's all over, we're late return; } if (failure != null) { notifyWatchDog(failure); // wake the dog, just in case logFine(PROCTOR_NAME, "Failure has been reported, exiting"); return; } if(!MONITOR_MEMORY) { continue; } //*** Sit up, stretch... long totalMemory = Runtime.getRuntime().totalMemory(); if (totalMemory < maxMemory) { // We haven't finished growing the heap, so no worries...yet if (DEBUG) { // This message has catenation, we don't want this in // production code :-) logFine(PROCTOR_NAME, "totalMemory (" + totalMemory + ") < maxMemory (" + maxMemory + ")"); } firstStarveTime = NEVER_STARVED; continue; } if (lastTotalMemory < totalMemory) { // Don't get too impatient if the heap just now grew lastTotalMemory = totalMemory; // now we're maxed firstStarveTime = NEVER_STARVED; // reset the clock continue; } lastTotalMemory = totalMemory; // make a note of this //*** Hey, is that the food bowl? // At this point, freeMemory really indicates how much // trouble we're in. long freeMemory = Runtime.getRuntime().freeMemory(); if(freeMemory==0) { /* * This is to workaround X bug #41821 in JRockit. * Often, Jrockit returns 0 from Runtime.getRuntime().freeMemory() * Allocating this one object and calling again seems to workaround the problem. 
*/ new Object(); freeMemory = Runtime.getRuntime().freeMemory(); } // Grab the threshold and starve time once, under mutex, because // it's publicly modifiable. long curThreshold; long lastStarveTime; synchronized (memorySync) { curThreshold = minimumMemoryThreshold; lastStarveTime = firstStarveTime; } if (freeMemory >= curThreshold /* enough memory */ || curThreshold == 0 /* disabled */) { // Memory is FINE, reset everything if (DEBUG) { // This message has catenation, we don't want this in // production code :-) logFine(PROCTOR_NAME, "Current free memory is: " + freeMemory); } if (lastStarveTime != NEVER_STARVED) { logFine(PROCTOR_NAME, "...low memory has self-corrected."); } synchronized (memorySync) { firstStarveTime = NEVER_STARVED; } continue; } // Memory is low //*** Leap to feet, nose down, tail switching... long now = System.currentTimeMillis(); if (lastStarveTime == NEVER_STARVED) { // first sighting if (DEBUG) { // Catenation in this message, don't put in production logFine(PROCTOR_NAME, "Noting current memory " + freeMemory + " is less than threshold " + curThreshold); } else { logWarning( PROCTOR_NAME, "Noting that current memory available is less than the currently designated threshold", null); } synchronized (memorySync) { firstStarveTime = now; } // Trust the JVM do a full gc when it is needed //System.gc(); // at least TRY... continue; } //*** squirm, wait for the right moment...wait...wait... if (now - lastStarveTime < MEMORY_MAX_WAIT * 1000) { // Very recent; problem may correct itself. if (DEBUG) { // catenation logFine(PROCTOR_NAME, "...memory is still below threshold: " + freeMemory); } else { logWarning( PROCTOR_NAME, "Noting that current memory available is still below currently designated threshold", null); } continue; } //*** Meow! Meow! MEOWWWW!!!!! // Like any smart cat, let the Dog do all the work. 
logWarning(PROCTOR_NAME, "Memory is chronically low; setting failure!", null); SystemFailure.setFailure(oome); notifyWatchDog(failure); return; // we're done! } catch (Throwable t) { logWarning(PROCTOR_NAME, "thread encountered a problem", t); // We *never* give up. NEVER EVER! } } // for } /** * Enables some fine logging */ static private final boolean DEBUG = false; /** * If true, we track the progress of emergencyClose * on System.err */ static public final boolean TRACE_CLOSE = false; /** * the level at which to log */ static private final int LOG_LEVEL = DEBUG ? LogWriterImpl.FINE_LEVEL : LogWriterImpl.INFO_LEVEL; /** * This is a desperation logger that prints to System.out. */ static protected final LogWriterImpl log = new LocalLogWriter(LOG_LEVEL, System.out); static protected final String WATCHDOG_NAME = "SystemFailure Watchdog"; static protected final String PROCTOR_NAME = "SystemFailure Proctor"; /** * break any potential circularity in {@link #loadEmergencyClasses()} */ private static volatile boolean emergencyClassesLoaded = false; /** * Since it requires object memory to unpack a jar file, * make sure this JVM has loaded the classes necessary for * closure before it becomes necessary to use them. *

* Note that just touching the class in order to load it * is usually sufficient, so all an implementation needs * to do is to reference the same classes used in * {@link #emergencyClose()}. Just make sure to do it while * you still have memory to succeed! */ public static void loadEmergencyClasses() { // This method was called to basically load this class // and invoke its static initializers. Now that we don't // use statics to start the threads all we need to do is // call startThreads. The watchdog thread will call basicLoadEmergencyClasses. startThreads(); } private static void basicLoadEmergencyClasses() { if (emergencyClassesLoaded) return; emergencyClassesLoaded = true; SystemFailureTestHook.loadEmergencyClasses(); // bug 50516 GemFireCacheImpl.loadEmergencyClasses(); RemoteGfManagerAgent.loadEmergencyClasses(); } /** * Attempt to close any and all GemFire resources. * * The contract of this method is that it should not * acquire any synchronization mutexes nor create any objects. *

* The former is because the system is in an undefined state and * attempting to acquire the mutex may cause a hang. *

* The latter is because the likelihood is that we are invoking * this method due to memory exhaustion, so any attempt to create * an object will also cause a hang. *

* This method is not meant to be called directly (but, well, I * guess it could). It is public to document the contract * that is implemented by emergencyClose in other * parts of the system. */ public static void emergencyClose() { // Make the cache (more) useless and inaccessible... if (TRACE_CLOSE) { System.err.println("SystemFailure: closing GemFireCache"); } GemFireCacheImpl.emergencyClose(); // Arcane strange DS's exist in this class: if (TRACE_CLOSE) { System.err.println("SystemFailure: closing admins"); } RemoteGfManagerAgent.emergencyClose(); // If memory was the problem, make an explicit attempt at // this point to clean up. // Trust the JVM do a full gc when it is needed //System.gc(); // This will fail if we're out of memory?/ if (TRACE_CLOSE) { System.err.println("SystemFailure: end of emergencyClose"); } } /** * Throw the system failure. * * This method does not return normally. *

   * Unfortunately, attempting to create a new Throwable at this
   * point may cause the thread to hang (instead of generating
   * another OutOfMemoryError), so we have to make do with whatever
   * Error we have, instead of wrapping it with one pertinent
   * to the current context.  See bug 38394.
   *
   * @throws Error
   */
  static private void throwFailure() throws InternalGemFireError, Error {
    // Do not return normally...
    if (failure != null) throw failure;
  }

  /**
   * Notifies the watchdog thread (assumes that {@link #failure} has been set)
   */
  private static void notifyWatchDog(Error err) {
    startWatchDog(); // just in case
    synchronized (failureSync) {
      failure = err; // We (re)set failure here to make findbugs happy.
      failureSync.notifyAll();
    }
  }

  /**
   * Utility function to check for failures. If a failure is
   * detected, this method throws the recorded failure.
   *
   * @see #initiateFailure(Error)
   * @throws InternalGemFireError if the system has been corrupted
   * @throws Error if the system has been corrupted and a thread-specific
   * AssertionError cannot be allocated
   */
  public static void checkFailure() throws InternalGemFireError, Error {
    if (failure == null) {
      return;
    }
    notifyWatchDog(failure);
    throwFailure();
  }

  /**
   * Signals that a system failure has occurred and then throws
   * that failure.
   *
   * @param f the failure to set
   * @throws IllegalArgumentException if f is null
   * @throws InternalGemFireError always; this method does not return normally.
   * @throws Error if a thread-specific AssertionError cannot be allocated.
   */
  public static void initiateFailure(Error f) throws InternalGemFireError, Error {
    SystemFailure.setFailure(f);
    throwFailure();
  }

  /**
   * Set the underlying system failure, if not already set.
   *

* This method does not generate an error, and should only be used * in circumstances where execution needs to continue, such as when * re-implementing {@link ThreadGroup#uncaughtException(Thread, Throwable)}. * * @param failure the system failure * @throws IllegalArgumentException if you attempt to set the failure to null */ public static void setFailure(Error failure) { if (failure == null) { throw new IllegalArgumentException(LocalizedStrings.SystemFailure_YOU_ARE_NOT_PERMITTED_TO_UNSET_A_SYSTEM_FAILURE.toLocalizedString()); } if (SystemFailureTestHook.errorIsExpected(failure)) { return; } // created (OutOfMemoryError), and no stack frames are created // (StackOverflowError). There is a slight chance that the // very first error may get overwritten, but this avoids the // potential of object creation via a fat lock SystemFailure.failure = failure; notifyWatchDog(failure); } /** * Returns the catastrophic system failure, if any. *

   * This is usually (though not necessarily) an instance of
   * {@link VirtualMachineError}.
   * <p>
   * A return value of null indicates that no system failure has yet been
   * detected.
   * <p>
   * Object synchronization can implicitly require object creation (fat locks
   * in JRockit for instance), so the underlying value is not synchronized
   * (it is a volatile). This means the return value from this call is not
   * necessarily the first failure reported by the JVM.
   * <p>
   * Note that even if it were synchronized, it would only be a
   * proximal indicator near the time that the JVM crashed, and may not
   * actually reflect the underlying root cause that generated the failure.
   * For instance, if your JVM is running short of memory, this Throwable is
   * probably an innocent victim and not the actual allocation (or
   * series of allocations) that caused your JVM to exhaust memory.
   * <p>
   * If this function returns a non-null value, keep in mind that the JVM is
   * very limited. In particular, any attempt to allocate objects may fail
   * if the original failure was an OutOfMemoryError.
   *
   * @return the failure, if any
   */
  public static Error getFailure() {
    return failure;
  }

  /**
   * Sets a user-defined action that is run in the event
   * that failure has been detected.
   * <p>
   * This action is run after the GemFire cache has been shut down.
   * If it throws any error, it will be reattempted indefinitely until it
   * succeeds. This action may be dynamically modified while the system
   * is running.
   * <p>
   * The default action prints the failure stack trace to System.err.
   *
   * @see #initiateFailure(Error)
   * @param action the Runnable to use
   * @return the previous action
   */
  public static Runnable setFailureAction(Runnable action) {
    Runnable old = SystemFailure.failureAction;
    SystemFailure.failureAction = action;
    return old;
  }

  /**
   * Set the memory threshold under which system failure will be
   * notified.
   *
   * This value may be dynamically modified while the system
   * is running. The default is 1048576 bytes. This can be set using the
   * system property gemfire.SystemFailure.chronic_memory_threshold.
   *
   * @param newVal threshold in bytes
   * @return the old threshold
   * @see Runtime#freeMemory()
   */
  public static long setFailureMemoryThreshold(long newVal) {
    long result;
    synchronized (memorySync) {
      result = minimumMemoryThreshold;
      minimumMemoryThreshold = newVal;
      firstStarveTime = NEVER_STARVED; // reset
    }
    startProctor(); // just in case
    return result;
  }

//  /**
//   * For use by GemStone Quality Assurance Only
//   *
//   * @deprecated TODO remove this
//   */
//  public static void reset() {
//    System.gc();
//    logWarning("DJP", "do not commit SystemFailure#reset", null);
//    failure = null;
//    failureAction = new Runnable() {
//      public void run() {
//        System.err.println("(SystemFailure) JVM corruption has been detected!");
//        failure.printStackTrace();
//      }
//    };
//    gemfireCloseCompleted = false;
//    failureActionCompleted = false;
//    synchronized (failureSync) {
//      if (watchDog != null) {
//        watchDog.interrupt();
//      }
//      watchDog = null;
//      if (watchCat != null) {
//        watchCat.interrupt();
//      }
//      watchCat = null;
//    }
//
//    startWatchDog();
//    startWatchCat();
//  }

  static private boolean logStdErr(String kind, String name, String s, Throwable t) {
    // As far as I can tell, this code path doesn't allocate
    // any objects!!!!
    try {
      System.err.print(name);
      System.err.print(": [");
      System.err.print(kind);
      System.err.print("] ");
      System.err.println(s);
      if (t != null) {
        t.printStackTrace();
      }
      return true;
    }
    catch (Throwable t2) {
      // out of luck
      return false;
    }
  }

  /**
   * Logging can require allocation of objects, so we wrap the
   * logger so that failures are silently ignored.
   *
   * @param s string to print
   * @param t the call stack, if any
   * @return true if the warning got printed
   */
  static protected boolean logWarning(String name, String s, Throwable t) {
    return logStdErr("warning", name, s, t);
//    if (PREFER_STDERR) {
//      return logStdErr("warning", name, s, t);
//    }
//    try {
//      log.warning(name + ": " + s, t);
//      return true;
//    }
//    catch (Throwable t2) {
//      return logStdErr("warning", name, s, t);
//    }
  }

  /**
   * Logging can require allocation of objects, so we wrap the
   * logger so that failures are silently ignored.
   *
   * @param s string to print
   */
  static protected void logInfo(String name, String s) {
    logStdErr("info", name, s, null);
//    if (PREFER_STDERR) {
//      logStdErr("info", name, s, null);
//      return;
//    }
//    try {
//      log.info(name + ": " + s);
//    }
//    catch (Throwable t) {
//      logStdErr("info", name, s, t);
//    }
  }

  /**
   * Logging can require allocation of objects, so we wrap the
   * logger so that failures are silently ignored.
   *
   * @param s string to print
   */
  static protected void logFine(String name, String s) {
    if (DEBUG) {
      logStdErr("fine", name, s, null);
    }
//    if (DEBUG && PREFER_STDERR) {
//      logStdErr("fine", name, s, null);
//      return;
//    }
//    try {
//      log.fine(name + ": " + s);
//    }
//    catch (Throwable t) {
//      if (DEBUG) {
//        logStdErr("fine", name, s, null);
//      }
//    }
  }

  private static volatile boolean stopping;

  /**
   * This starts up the watchdog and proctor threads.
   * This method is called when a Cache is created.
   */
  public static void startThreads() {
    stopping = false;
    startWatchDog();
    startProctor();
  }

  /**
   * This stops the threads that implement this service.
   * This method is called when a Cache is closed.
   */
  public static void stopThreads() {
    // this method fixes bug 45409
    stopping = true;
    stopProctor();
    stopWatchDog();
  }
}




