
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Benchmarking Lucene By Tasks
*
* This package provides "task based" performance benchmarking of Lucene. One can use the
* predefined benchmarks, or create new ones.
*
*
 * Contained packages:
 *
 * - stats: Statistics maintained when running benchmark tasks.
 * - tasks: Benchmark tasks.
 * - feeds: Sources for benchmark inputs: documents and queries.
 * - utils: Utilities used for the benchmark, and for the reports.
 * - programmatic: Sample performance test written programmatically.
*
 * Table Of Contents
 *
 * - Benchmarking By Tasks
 * - How to use
 * - Benchmark "algorithm"
 * - Supported tasks/commands
 * - Benchmark properties
 * - Example input algorithm and the resulting benchmark report
 * - Results record counting clarified
*
* Benchmarking By Tasks
*
* Benchmark Lucene using task primitives.
*
*
 * A benchmark is composed of some predefined tasks, allowing for creating an index, adding
* documents, optimizing, searching, generating reports, and more. A benchmark run takes an
* "algorithm" file that contains a description of the sequence of tasks making up the run, and some
* properties defining a few additional characteristics of the benchmark run.
*
*
 * How to use
 *
 * The easiest way to run a benchmark is using gradle:
*
*
 * - ./gradlew -p lucene/benchmark getReuters run
 *   would run the micro-standard.alg "algorithm".
 * - ./gradlew -p lucene/benchmark getReuters run -Ptask.alg=conf/compound-penalty.alg
 *   would run the compound-penalty.alg "algorithm".
 * - ./gradlew -p lucene/benchmark getReuters run -Ptask.alg=[full-path-to-your-alg-file]
 *   would run your perf test "algorithm".
 * - java org.apache.lucene.benchmark.byTask.programmatic.Sample
 *   would run a performance test programmatically - without using an alg file. This is less
 *   readable, and less convenient, but possible.
*
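 *
 * If you prefer to drive an algorithm file from your own code instead of gradle, a minimal
 * sketch could look like the following. It assumes the alg path conf/micro-standard.alg relative
 * to the working directory, and relies on the Benchmark(Reader) constructor and its execute()
 * method:
 *
 * import java.io.Reader;
 * import java.nio.charset.StandardCharsets;
 * import java.nio.file.Files;
 * import java.nio.file.Paths;
 *
 * import org.apache.lucene.benchmark.byTask.Benchmark;
 *
 * public class RunAlg {
 *   public static void main(String[] args) throws Exception {
 *     // Parse the algorithm file and execute it; any reports are produced by the
 *     // report tasks contained in the algorithm itself.
 *     try (Reader algReader =
 *         Files.newBufferedReader(Paths.get("conf/micro-standard.alg"), StandardCharsets.UTF_8)) {
 *       new Benchmark(algReader).execute();
 *     }
 *   }
 * }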
*
 * You may find the existing tasks sufficient for defining the benchmark you need; otherwise,
 * you can extend the framework to meet your needs, as explained herein.
*
*
 * Each benchmark run has a DocMaker and a QueryMaker. These two should usually match, so that
* "meaningful" queries are used for a certain collection. Properties set at the header of the alg
* file define which "makers" should be used. You can also specify your own makers, extending
* DocMaker and implementing QueryMaker.
*
*
*
* Note: since 2.9, DocMaker is a concrete class which accepts a ContentSource. In most
* cases, you can use the DocMaker class to create Documents, while providing your own ContentSource
* implementation. For example, the current Benchmark package includes ContentSource implementations
* for TREC, Enwiki and Reuters collections, as well as others like LineDocSource which reads a
* 'line' file produced by WriteLineDocTask.
*
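 * For example, a few header lines pairing the stock DocMaker with LineDocSource might look as
 * follows (the line-file path is just an illustrative value):
 *
 * content.source=org.apache.lucene.benchmark.byTask.feeds.LineDocSource
 * docs.file=work/enwiki.lines.txt
 * doc.maker=org.apache.lucene.benchmark.byTask.feeds.DocMaker
 *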
*
*
 * The benchmark .alg file contains the benchmark "algorithm". The syntax is described below. Within
* the algorithm, you can specify groups of commands, assign them names, specify commands that
* should be repeated, do commands in serial or in parallel, and also control the speed of "firing"
* the commands.
*
*
 * This allows, for instance, specifying that an index should be opened for update, documents
 * should be added to it one by one but no faster than 20 docs a minute, and, in parallel with
 * this, some N queries should be searched against that index, again, no more than 2 queries a
 * second. You can have the searches all share an index reader, or have each of them open its own
 * reader and close it afterwards.
*
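 * An illustrative fragment of such an algorithm (names, counts and rates are arbitrary) might be:
 *
 * OpenIndex
 * [
 *   { "AddAtRate"    AddDoc } : 200 : 20/min
 *   { "SearchAtRate" Search } : 50  : 2/sec
 * ]
 * CloseIndex
 *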
*
 * If the commands available for use in the algorithm do not meet your needs, you can add
* commands by adding a new task under org.apache.lucene.benchmark.byTask.tasks - you should extend
* the PerfTask abstract class. Make sure that your new task class name is suffixed by Task. Assume
* you added the class "WonderfulTask" - doing so also enables the command "Wonderful" to be used in
* the algorithm.
*
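 * A skeletal sketch of such a task (the class and its body are purely illustrative) might be:
 *
 * package org.apache.lucene.benchmark.byTask.tasks;
 *
 * import org.apache.lucene.benchmark.byTask.PerfRunData;
 *
 * public class WonderfulTask extends PerfTask {
 *
 *   public WonderfulTask(PerfRunData runData) {
 *     super(runData);
 *   }
 *
 *   // Perform the actual work of the task; the returned value is the number of
 *   // "work items" done here, and it contributes to the records count.
 *   public int doLogic() throws Exception {
 *     // ... do something wonderful ...
 *     return 1;
 *   }
 * }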
*
 * External classes: It is sometimes useful to invoke the benchmark package with your
 * external alg file that configures the use of your own doc/query maker and/or HTML parser. You can
* work this out without modifying the benchmark package code, by passing your class path with the
* benchmark.ext.classpath property:
*
*
* - ./gradlew -p lucene/benchmark run -Ptask.alg=[full-path-to-your-alg-file] -Dbenchmark.ext.classpath=/mydir/classes -Dtask.mem=512M
*
*
 * External tasks: When writing your own tasks under a package other than
 * org.apache.lucene.benchmark.byTask.tasks, specify that package through the alt.tasks.packages property.
*
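 * For example (the package name below is just a placeholder for your own):
 *
 * alt.tasks.packages=com.example.mybench.tasks
 *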
*
Benchmark "algorithm"
*
* The following is an informal description of the supported syntax.
*
*
* - Measuring: When a command is executed, statistics for the elapsed execution time and
* memory consumption are collected. At any time, those statistics can be printed, using one
* of the available ReportTasks.
 *
 * - Comments start with '#'.
 * - Serial sequences are enclosed within '{ }'.
 * - Parallel sequences are enclosed within '[ ]'.
 * - Sequence naming: To name a sequence, put '"name"' just after '{' or '['.
* Example - { "ManyAdds" AddDoc } : 1000000 - would name
* the sequence of 1M add docs "ManyAdds", and this name would later appear in statistic
* reports. If you don't specify a name for a sequence, it is given one: you can see it as the
* algorithm is printed just before benchmark execution starts.
 * - Repeating: To repeat sequence tasks N times, add ': N' just after the sequence
 * closing tag - '}' or ']' or '>'.
* Example - [ AddDoc ] : 4 - would do 4 addDoc in
* parallel, spawning 4 threads at once.
* Example - [ AddDoc AddDoc ] : 4 - would do 8 addDoc in
* parallel, spawning 8 threads at once.
* Example - { AddDoc } : 30 - would do addDoc 30 times in
* a row.
* Example - { AddDoc AddDoc } : 30 - would do addDoc 60
* times in a row.
* Exhaustive repeating: use * instead of a number
* to repeat exhaustively. This is sometimes useful, for adding as many files as a doc maker
* can create, without iterating over the same file again, especially when the exact number of
* documents is not known in advance. For instance, TREC files extracted from a zip file.
* Note: when using this, you must also set content.source.forever to false.
* Example - { AddDoc } : * - would add docs until the doc
* maker is "exhausted".
 * - Command parameter: a command can optionally take a single parameter. If a
 * command does not support a parameter, or if the parameter is of the wrong type, reading the
 * algorithm will fail with an exception and the test will not start. Currently the following
 * tasks take optional parameters:
 *
 * - AddDoc takes a numeric parameter, indicating the required size of the added
 * document. Note: if the DocMaker implementation used in the test does not support
 * makeDoc(size), an exception would be thrown and the test would fail.
 * - DeleteDoc takes a numeric parameter, indicating the docid to be deleted. The
 * latter is not very useful for loops, since the docid is fixed, so for deletion in
 * loops it is better to use the doc.delete.step property.
 * - SetProp takes a mandatory name,value parameter, with ',' used as the separator.
 * - SearchTravRetTask and SearchTravTask take a numeric parameter,
 * indicating the required traversal size.
 * - SearchTravRetLoadFieldSelectorTask takes a string parameter: a comma separated
 * list of Fields to load.
 * - SearchTravRetHighlighterTask takes a string parameter: a comma separated list
 * of parameters to define highlighting. See that task's javadocs for more information.
*
*
* Example - AddDoc(2000) - would add a document of size
* 2000 (~bytes).
* See conf/task-sample.alg for how this can be used, for instance, to check which is faster,
* adding many smaller documents, or few larger documents. Next candidates for supporting a
* parameter may be the Search tasks, for controlling the query size.
 * - Statistic recording elimination: a sequence can also end with '>', in which case its child
 * tasks would not store their statistics. This can be useful to avoid exploding stats data,
 * when adding, say, 1M docs.
 * Example - { "ManyAdds" AddDoc > : 1000000 - would
 * add a million docs, measure that total, but not save stats for each addDoc.
 * Notice that the granularity of System.currentTimeMillis() (which is used here) is system
 * dependent, and on some systems an operation that takes 5 ms to complete may show 0 ms
* latency time in performance measurements. Therefore it is sometimes more accurate to look
* at the elapsed time of a larger sequence, as demonstrated here.
 * - Rate: To set a rate (ops/sec or ops/min) for a sequence, add ': N : R' just after
 * the sequence closing tag. This would specify a repetition of N with a rate of R operations/sec.
 * Use 'R/sec' or 'R/min' to explicitly specify whether the rate is per second or
 * per minute. The default is per second.
* Example - [ AddDoc ] : 400 : 3 - would do 400 addDoc in
* parallel, starting up to 3 threads per second.
* Example - { AddDoc } : 100 : 200/min - would do 100
* addDoc serially, waiting before starting next add, if otherwise rate would exceed 200
* adds/min.
* - Disable Counting: Each task executed contributes to the records count. This count is
* reflected in reports under recs/s and under recsPerRun. Most tasks count 1, some count 0,
* and some count more. (See Results record counting clarified for
* more details.) It is possible to disable counting for a task by preceding it with -.
* Example - -CreateIndex - would count 0 while the
* default behavior for CreateIndex is to count 1.
* - Command names: Each class "AnyNameTask" in the package
* org.apache.lucene.benchmark.byTask.tasks, that extends PerfTask, is supported as command
* "AnyName" that can be used in the benchmark "algorithm" description. This allows to add new
* commands by just adding such classes.
*
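 * Putting several of these constructs together, an illustrative (not prescriptive) fragment
 * could read:
 *
 * ResetSystemErase
 * -CreateIndex
 * { "BuildIndex" AddDoc > : 20000 : 500/sec
 * ForceMerge(1)
 * CloseIndex
 * RepSumByName
 *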
*
*
*
* Supported tasks/commands
*
* Existing tasks can be divided into a few groups: regular index/search work tasks, report
* tasks, and control tasks.
*
*
* - Report tasks: There are a few Report commands for generating reports. Only task runs
* that were completed are reported. (The 'Report tasks' themselves are not measured and not
* reported.)
 *
 * - RepAll - all (completed) task runs.
 * - RepSumByName - all statistics, aggregated by
 * name. So, if AddDoc was executed 2000 times, only 1 report line would be created for
 * it, aggregating all those 2000 statistic records.
 * - RepSelectByPref prefixWord - all records
 * for tasks whose name starts with prefixWord.
 * - RepSumByPref prefixWord - all records for
 * tasks whose name starts with prefixWord,
 * aggregated by their full task name.
 * - RepSumByNameRound - all statistics, aggregated by
 * name and by Round. So, if AddDoc was executed
 * 2000 times in each of 3 rounds, 3 report lines
 * would be created for it, aggregating all those 2000 statistic records in each round.
 * See more about rounds in the NewRound command
 * description below.
 * - RepSumByPrefRound prefixWord - similar to
 * RepSumByNameRound, just that only tasks whose
 * name starts with prefixWord are included.
 *
* If needed, additional reports can be added by extending the abstract class ReportTask, and
* by manipulating the statistics data in Points and TaskStats.
 * - Control tasks: A few of the tasks control the benchmark algorithm overall:
*
* - ClearStats - clears the entire statistics.
* Further reports would only include task runs that would start after this call.
*
 * - NewRound - virtually start a new round of
* performance test. Although this command can be placed anywhere, it mostly makes sense
* at the end of an outermost sequence.
* This increments a global "round counter". All task runs that would start now would
* record the new, updated round counter as their round number. This would appear in
* reports. In particular, see RepSumByNameRound
* above.
 * An additional effect of NewRound is that numeric and boolean properties defined (at
 * the head of the .alg file) as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, would increment (cyclically) to the next
 * value. Note: this would also be reflected in the reports, in this case under a column
 * that would be named "mrg".
 * - ResetInputs - DocMaker and the various
 * QueryMakers would reset their counters to start. The way these Maker interfaces work,
 * each call for makeDocument() or makeQuery() creates the next document or query that
 * it "knows" to create. If that pool is "exhausted", the "maker" starts over again. The
 * ResetInputs command therefore allows making the rounds comparable. It is therefore
 * useful to invoke ResetInputs together with NewRound.
*
 * - ResetSystemErase - reset all index and input data
 * and call gc. Does NOT reset statistics. This contains ResetInputs. All
 * writers/readers are nullified, deleted, closed. Index is erased. Directory is erased.
 * You would have to call CreateIndex once this was called.
*
 * - ResetSystemSoft - reset all index and input data
 * and call gc. Does NOT reset statistics. This contains ResetInputs. All
 * writers/readers are nullified, closed. Index is NOT erased. Directory is NOT erased.
 * This is useful for testing performance on an existing index, for instance if the
 * construction of a large index took a very long time and now you would like to test its
 * search or update performance.
*
 * - Other existing tasks are quite straightforward and are just briefly described here.
*
* - CreateIndex and OpenIndex both leave the index open for later update operations.
* CloseIndex would close it.
*
 * - OpenReader, similarly, would leave an index
 * reader open for later search operations. But this has further semantics. If a Read
 * operation is performed, and an open reader exists, it would be used. Otherwise, the
 * read operation would open its own reader and close it when the read operation is
 * done. This allows testing various scenarios - sharing a reader, searching with a "cold"
 * reader, with a "warmed" reader, etc. (see the short fragment after this list). The read
 * operations affected by this are: Warm, Search, SearchTrav (search and traverse), and
 * SearchTravRet (search and traverse and retrieve).
 * Notice that each of the 3 search task types maintains its own queryMaker instance.
*
 * - CommitIndex and ForceMerge can be used to commit changes to the index then merge the
* index segments. The integer parameter specifies how many segments to merge down to
* (default 1).
*
 * - WriteLineDoc prepares a 'line' file where each
 * line holds a document with title, date and body elements,
 * separated by [TAB]. A line file is useful if one wants to measure pure indexing
 * performance, without the overhead of parsing the data.
 * You can use LineDocSource as a ContentSource over a 'line' file (see the fragment after this list).
 * - ConsumeContentSource consumes a ContentSource.
 * Useful, e.g., for testing a ContentSource's performance, without the overhead of
 * preparing a Document out of it.
*
*
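 * The reader-sharing behavior mentioned above can be exercised with a fragment like the
 * following (sequence names are arbitrary):
 *
 * # all searches share the reader opened here
 * OpenReader
 * { "SrchSameRdr" Search } : 500
 * CloseReader
 *
 * # each of these searches opens (and closes) its own reader
 * { "SrchNewRdr" Search } : 50
 *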
*
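 * Likewise, a tiny algorithm that turns the Reuters collection into a line file could look like
 * this (the output path is illustrative; line.file.out names the file to write):
 *
 * content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
 * content.source.forever=false
 * line.file.out=work/reuters.lines.txt
 *
 * { WriteLineDoc } : *
 *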
*
*
* Benchmark properties
*
* Properties are read from the header of the .alg file, and define several parameters of the
* performance test. As mentioned above for the NewRound task,
 * numeric and boolean properties that are defined as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, would increment (cyclically) to the next value when
* NewRound is called, and would also appear as a named column in the reports (column name would be
* "mrg" in this example).
*
 *
 * Some of the currently defined properties are:
*
*
* - analyzer - full class name for the analyzer to use.
* Same analyzer would be used in the entire test.
 *
 * - directory - which directory implementation to use for the performance test.
 * - Index work parameters: Multi int/boolean values would be iterated with calls to
 * NewRound. They would also be added as columns in the reports; the first string in the sequence
 * is the column name. (Make sure it is no shorter than any value in the sequence.)
*
* - max.buffered
* Example: max.buffered=buf:10:10:100:100 - this would define using maxBufferedDocs of
* 10 in iterations 0 and 1, and 100 in iterations 2 and 3.
* - merge.factor - which merge factor to use.
*
 * - compound - whether the index is using the
* compound format or not. Valid values are "true" and "false".
*
*
*
* Here is a list of currently defined properties:
*
*
 * - Root directory for data and indexes:
 *   - work.dir (default is System property "benchmark.work.dir" or "work".)
 * - Docs and queries creation:
 *   - analyzer
 *   - doc.maker
 *   - content.source.forever
 *   - html.parser
 *   - doc.stored
 *   - doc.tokenized
 *   - doc.term.vector
 *   - doc.term.vector.positions
 *   - doc.term.vector.offsets
 *   - doc.store.body.bytes
 *   - docs.dir
 *   - query.maker
 *   - file.query.maker.file
 *   - file.query.maker.default.field
 *   - search.num.hits
 * - Logging:
 *   - log.step
 *   - log.step.[class name]Task, e.g. log.step.DeleteDoc (or log.step.Wonderful for the
 *     WonderfulTask example above)
 *   - log.queries
 *   - task.max.depth.log
 * - Index writing:
 *   - compound
 *   - merge.factor
 *   - max.buffered
 *   - directory
 *   - ram.flush.mb
 *   - codec.postingsFormat (e.g. Direct). Note: no codec should be specified through
 *     default.codec.
 * - Doc deletion:
 *   - doc.delete.step
 * - Spatial: Numerous; see spatial.alg
 * - Task alternative packages:
 *   - alt.tasks.packages - comma separated list of additional packages where task classes
 *     will be looked for when not found in the default package (that of PerfTask). If the
 *     same task class appears in more than one package, the package indicated first in this
 *     list will be used.
*
*
*
* For sample use of these properties see the *.alg files under conf.
*
*
 * Example input algorithm and the resulting benchmark report
*
* The following example is in conf/sample.alg:
*
*
* # --------------------------------------------------------
* #
* # Sample: what is the effect of doc size on indexing time?
* #
* # There are two parts in this test:
* # - PopulateShort adds 2N documents of length L
* # - PopulateLong adds N documents of length 2L
* # Which one would be faster?
* # The comparison is done twice.
* #
* # --------------------------------------------------------
*
* # -------------------------------------------------------------------------------------
* # multi val params are iterated by NewRound's, added to reports, start with column name.
* merge.factor=mrg:10:20
* max.buffered=buf:100:1000
* compound=true
*
* analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
* directory=FSDirectory
*
* doc.stored=true
* doc.tokenized=true
* doc.term.vector=false
* doc.add.log.step=500
*
* docs.dir=reuters-out
*
* doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
*
* query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
*
* # task at this depth or less would print when they start
* task.max.depth.log=2
*
* log.queries=false
* # -------------------------------------------------------------------------------------
* {
*
* { "PopulateShort"
* CreateIndex
* { AddDoc(4000) > : 20000
* Optimize
* CloseIndex
* >
*
* ResetSystemErase
*
* { "PopulateLong"
* CreateIndex
* { AddDoc(8000) > : 10000
* Optimize
* CloseIndex
* >
*
* ResetSystemErase
*
* NewRound
*
* } : 2
*
* RepSumByName
* RepSelectByPref Populate
*
*
*
* The command line for running this sample:
* ./gradlew -p lucene/benchmark getReuters run -Ptask.alg=conf/sample.alg
*
*
 * The output report from running this test contains the following:
*
*
 * Operation     round mrg  buf   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
 * PopulateShort     0  10  100        1        20003        119.6      167.26    12,959,120     14,241,792
 * PopulateLong -  - 0  10  100 -  -   1 -  -   10003 -  -  -  74.3 -  -  134.57 -  17,085,208 -   20,635,648
 * PopulateShort     1  20 1000        1        20003        143.5      139.39    63,982,040     94,756,864
 * PopulateLong -  - 1  20 1000 -  -   1 -  -   10003 -  -  -  77.0 -  -  129.92 -  87,309,608 -  100,831,232
*
*
*
*
* Results record counting clarified
*
 * Two columns in the results table indicate record counts: records-per-run and
 * records-per-second. What do they mean?
*
*
 * Almost every task gets 1 in this count just for being executed. Task sequences aggregate the
* counts of their child tasks, plus their own count of 1. So, a task sequence containing 5 other
* task sequences, each running a single other task 10 times, would have a count of 1 + 5 * (1 + 10)
* = 56.
*
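 * As an illustrative fragment, the outer sequence below would contribute 1 + 5 * (1 + 10) = 56
 * to recsPerRun:
 *
 * {
 *   { AddDoc } : 10
 *   { AddDoc } : 10
 *   { AddDoc } : 10
 *   { AddDoc } : 10
 *   { AddDoc } : 10
 * }
 *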
*
 * The traverse and retrieve tasks "count" more: a traverse task would add 1 for each traversed
* result (hit), and a retrieve task would additionally add 1 for each retrieved doc. So, regular
* Search would count 1, SearchTrav that traverses 10 hits would count 11, and a SearchTravRet task
* that retrieves (and traverses) 10, would count 21.
*
*
 * Confusing? This might help: always examine the elapsedSec column, and always
 * compare "apples to apples", i.e. it is interesting to check how the rec/s
 * changed for the same task (or sequence) between two different runs, but it is not very useful
 * to know how the rec/s differs between Search and SearchTrav tasks. For
 * the latter, elapsedSec would bring more insight.
*/
package org.apache.lucene.benchmark.byTask;