/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.solr.cloud.api.collections;

import com.codahale.metrics.Timer;
import java.lang.invoke.MethodHandles;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.solr.cloud.OverseerTaskProcessor;
import org.apache.solr.cloud.Stats;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.ZkNodeProps;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.util.stats.MetricUtils;
import org.apache.zookeeper.data.Stat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * This command returns stats about the Overseer, the cluster state updater and collection API
 * activity occurring within the current Overseer node (this is important because distributed
 * operations occurring on other nodes are not included in these stats, for example
 * distributed cluster state updates or Per Replica States updates).
 *
 * <p>More fundamentally, when the Collection API command execution is distributed, this specific
 * command is not being run on the Overseer anyway (but then not much is running on the
 * overseer as cluster state updates are distributed as well) so Overseer stats and status can't be
 * returned and actually do not even make sense. Zookeeper based queue metrics do not make sense
 * either because Zookeeper queues are then not used.
 *
 * <p>The {@link Stats} instance returned by {@link CollectionCommandContext#getOverseerStats()}
 * when running in the Overseer is created in Overseer.start() and passed to the cluster state
 * updater from where it is also propagated to the various Zookeeper queues to register various
 * events. This class is the only place where it is used in the Collection API implementation, and
 * only to return results.
 *
 * <p>TODO: create a new command returning node specific Collection API/Config set API/cluster state
 * updates stats such as success and failures?
 *
 * <p>The structure of the returned results is as follows:
 *
 * <ul>
 *   <li>{@code leader}: {@code ID} of the current overseer leader node
 *   <li>{@code overseer_queue_size}: count of entries in the {@code /overseer/queue} Zookeeper
 *       queue/directory
 *   <li>{@code overseer_work_queue_size}: count of entries in the {@code /overseer/queue-work}
 *       Zookeeper queue/directory
 *   <li>{@code overseer_collection_queue_size}: count of entries in the {@code
 *       /overseer/collection-queue-work} Zookeeper queue/directory
 *   <li>{@code overseer_operations}: map (of maps) of success and error counts for operations.
 *       The operations (keys) tracked in this map are:
 *       <ul>
 *         <li>{@code am_i_leader} (Overseer checking it is still the elected Overseer as it
 *             processes cluster state update messages)
 *         <li>{@code configset_}{@code }
 *         <li>Cluster state change operation names from {@link
 *             org.apache.solr.common.params.CollectionParams.CollectionAction} (not all of them!)
 *             and {@link org.apache.solr.cloud.overseer.OverseerAction} (the complete list:
 *             {@code create}, {@code delete}, {@code createshard}, {@code deleteshard}, {@code
 *             addreplica}, {@code addreplicaprop}, {@code deletereplicaprop}, {@code
 *             balanceshardunique}, {@code modifycollection}, {@code state}, {@code leader},
 *             {@code deletecore}, {@code addroutingrule}, {@code removeroutingrule}, {@code
 *             updateshardstate}, {@code downnode} and {@code quit} with this last one unlikely to
 *             be observed since the Overseer is exiting right away)
 *         <li>{@code update_state} (when Overseer cluster state updater persists changes in
 *             Zookeeper)
 *       </ul>
 *       For each key, the value is a map composed of:
 *       <ul>
 *         <li>{@code requests}: success count of the given operation
 *         <li>{@code errors}: error count of the operation
 *         <li>More metrics (see below)
 *       </ul>
 *   <li>{@code collection_operations}: map (of maps) of success and error counts for collection
 *       related operations. The operations (keys) tracked in this map are all operations that
 *       start with {@code collection_}, but the {@code collection_} prefix is stripped from the
 *       returned value. Possible keys are therefore:
 *       <ul>
 *         <li>{@code am_i_leader}: originating in a stat called {@code collection_am_i_leader}
 *             representing Overseer checking it is still the elected Overseer as it processes
 *             Collection API and Config Set API messages.
 *         <li>Collection API operation names from {@link
 *             org.apache.solr.common.params.CollectionParams.CollectionAction} (the stripped
 *             {@code collection_} prefix gets added in {@link
 *             OverseerCollectionMessageHandler#getTimerName(String)})
 *       </ul>
 *       For each key, the value is a map composed of:
 *       <ul>
 *         <li>{@code requests}: success count of the given operation
 *         <li>{@code errors}: error count of the operation
 *         <li>{@code recent_failures}: an optional entry containing a list of maps, each map
 *             having two entries, one with key {@code request} with a failed request properties
 *             (a {@link ZkNodeProps}) and the other with key {@code response} with the
 *             corresponding response properties (a {@link
 *             org.apache.solr.client.solrj.SolrResponse}).
 *         <li>More metrics (see below)
 *       </ul>
 *   <li>{@code overseer_queue}: metrics on operations done on the Zookeeper queue {@code
 *       /overseer/queue} (see metrics below).<br>
 *       The operations that can be done on the queue and that can be keys whose values are a
 *       metrics map are:
 *       <ul>
 *         <li>{@code offer}
 *         <li>{@code peek}
 *         <li>{@code peek_wait}
 *         <li>{@code peek_wait_forever}
 *         <li>{@code peekTopN_wait}
 *         <li>{@code peekTopN_wait_forever}
 *         <li>{@code poll}
 *         <li>{@code remove}
 *         <li>{@code remove_event}
 *         <li>{@code take}
 *       </ul>
 *   <li>{@code overseer_internal_queue}: same as above but for queue {@code
 *       /overseer/queue-work}
 *   <li>{@code collection_queue}: same as above but for queue {@code
 *       /overseer/collection-queue-work}
 * </ul>
 *
 * <p>Maps returned as values of keys in {@code overseer_operations}, {@code
 * collection_operations}, {@code overseer_queue}, {@code overseer_internal_queue} and {@code
 * collection_queue} include additional stats. These stats are provided by {@link MetricUtils},
 * and represent metrics on each type of operation execution (be it failed or successful), see
 * calls to {@link Stats#time(String)}. The metric keys are:
 *
 * <ul>
 *   <li>{@code avgRequestsPerSecond}
 *   <li>{@code 5minRateRequestsPerSecond}
 *   <li>{@code 15minRateRequestsPerSecond}
 *   <li>{@code avgTimePerRequest}
 *   <li>{@code medianRequestTime}
 *   <li>{@code 75thPcRequestTime}
 *   <li>{@code 95thPcRequestTime}
 *   <li>{@code 99thPcRequestTime}
 *   <li>{@code 999thPcRequestTime}
 * </ul>
 */
public class OverseerStatusCmd implements CollApiCmds.CollectionApiCommand {
  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
  private final CollectionCommandContext ccc;

  public OverseerStatusCmd(CollectionCommandContext ccc) {
    this.ccc = ccc;
  }

  @Override
  public void call(ClusterState state, ZkNodeProps message, NamedList<Object> results)
      throws Exception {
    // If Collection API execution is distributed, we're not running on the Overseer node so can't
    // return any Overseer stats.
    if (ccc.getCoreContainer().getDistributedCollectionCommandRunner().isPresent()) {
      // TODO: introduce a per node status command allowing insight into how Cluster state updates,
      // Collection API and config set API execution went on that node...
      return;
    }

    ZkStateReader zkStateReader = ccc.getZkStateReader();
    String leaderNode = OverseerTaskProcessor.getLeaderNode(zkStateReader.getZkClient());
    results.add("leader", leaderNode);
    Stat stat = new Stat();
    zkStateReader.getZkClient().getData("/overseer/queue", null, stat, true);
    results.add("overseer_queue_size", stat.getNumChildren());
    stat = new Stat();
    zkStateReader.getZkClient().getData("/overseer/queue-work", null, stat, true);
    results.add("overseer_work_queue_size", stat.getNumChildren());
    stat = new Stat();
    zkStateReader.getZkClient().getData("/overseer/collection-queue-work", null, stat, true);
    results.add("overseer_collection_queue_size", stat.getNumChildren());

    NamedList<Object> overseerStats = new NamedList<>();
    NamedList<Object> collectionStats = new NamedList<>();
    NamedList<Object> stateUpdateQueueStats = new NamedList<>();
    NamedList<Object> workQueueStats = new NamedList<>();
    NamedList<Object> collectionQueueStats = new NamedList<>();
    Stats stats = ccc.getOverseerStats();
    for (Map.Entry<String, Stats.Stat> entry : stats.getStats().entrySet()) {
      String key = entry.getKey();
      NamedList<Object> lst = new SimpleOrderedMap<>();
      if (key.startsWith("collection_")) {
        collectionStats.add(key.substring(11), lst);
        int successes = stats.getSuccessCount(entry.getKey());
        int errors = stats.getErrorCount(entry.getKey());
        lst.add("requests", successes);
        lst.add("errors", errors);
        List<Stats.FailedOp> failureDetails = stats.getFailureDetails(key);
        if (failureDetails != null) {
          List<SimpleOrderedMap<Object>> failures = new ArrayList<>();
          for (Stats.FailedOp failedOp : failureDetails) {
            SimpleOrderedMap<Object> fail = new SimpleOrderedMap<>();
            fail.add("request", failedOp.req.getProperties());
            fail.add("response", failedOp.resp.getResponse());
            failures.add(fail);
          }
          lst.add("recent_failures", failures);
        }
      } else if (key.startsWith("/overseer/queue_")) {
        stateUpdateQueueStats.add(key.substring(16), lst);
      } else if (key.startsWith("/overseer/queue-work_")) {
        workQueueStats.add(key.substring(21), lst);
      } else if (key.startsWith("/overseer/collection-queue-work_")) {
        collectionQueueStats.add(key.substring(32), lst);
      } else {
        // overseer stats
        overseerStats.add(key, lst);
        int successes = stats.getSuccessCount(entry.getKey());
        int errors = stats.getErrorCount(entry.getKey());
        lst.add("requests", successes);
        lst.add("errors", errors);
      }
      Timer timer = entry.getValue().requestTime;
      MetricUtils.addMetrics(lst, timer);
    }
    results.add("overseer_operations", overseerStats);
    results.add("collection_operations", collectionStats);
    results.add("overseer_queue", stateUpdateQueueStats);
    results.add("overseer_internal_queue", workQueueStats);
    results.add("collection_queue", collectionQueueStats);
  }
}
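
The loop in `call()` decides which result section a stat belongs to purely from the prefix of its key, then strips that prefix with a hard-coded offset (11, 16, 21, 32). The standalone sketch below isolates that routing step so it can be read and tested without Solr or ZooKeeper. `StatKeyRouter` and `route` are hypothetical names that do not exist in Solr, and the sample keys in `main` are illustrative values built from the operation names listed in the Javadoc; only the four prefixes and the five section names are taken from the class above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Solr): maps a raw stat key to its result section. */
final class StatKeyRouter {
  // Prefix -> result section, in the same order as the checks in OverseerStatusCmd.call().
  private static final Map<String, String> PREFIX_TO_SECTION = new LinkedHashMap<>();

  static {
    PREFIX_TO_SECTION.put("collection_", "collection_operations");
    PREFIX_TO_SECTION.put("/overseer/queue_", "overseer_queue");
    PREFIX_TO_SECTION.put("/overseer/queue-work_", "overseer_internal_queue");
    PREFIX_TO_SECTION.put("/overseer/collection-queue-work_", "collection_queue");
  }

  /** Returns a two-element array: {section name, key as reported inside that section}. */
  static String[] route(String key) {
    for (Map.Entry<String, String> e : PREFIX_TO_SECTION.entrySet()) {
      String prefix = e.getKey();
      if (key.startsWith(prefix)) {
        // Strip the prefix by its own length rather than a hard-coded offset.
        return new String[] {e.getValue(), key.substring(prefix.length())};
      }
    }
    // Keys without a known prefix are Overseer operations and keep their full name.
    return new String[] {"overseer_operations", key};
  }

  public static void main(String[] args) {
    List<String> sampleKeys =
        List.of(
            "collection_create",
            "/overseer/queue_offer",
            "/overseer/queue-work_poll",
            "/overseer/collection-queue-work_take",
            "am_i_leader");
    for (String key : sampleKeys) {
      String[] routed = route(key);
      System.out.println(key + " -> " + routed[0] + " / " + routed[1]);
    }
  }
}
```

Deriving the offset from `prefix.length()` is a design choice of this sketch, not of the original class; it keeps the prefix and the amount stripped from drifting apart if a prefix is ever renamed.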