/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Benchmarking Lucene By Tasks
*
* This package provides "task based" performance benchmarking of Lucene.
* One can use the predefined benchmarks, or create new ones.
*
 *
 * Contained packages:
 *
 * - stats: Statistics maintained when running benchmark tasks.
 * - tasks: Benchmark tasks.
 * - feeds: Sources for benchmark inputs: documents and queries.
 * - utils: Utilities used for the benchmark, and for the reports.
 * - programmatic: Sample performance test written programmatically.
 *
* Table Of Contents
*
* - Benchmarking By Tasks
* - How to use
* - Benchmark "algorithm"
* - Supported tasks/commands
* - Benchmark properties
 * - Example input algorithm and the result benchmark report
* - Results record counting clarified
*
*
* Benchmarking By Tasks
*
* Benchmark Lucene using task primitives.
*
*
*
* A benchmark is composed of some predefined tasks, allowing for creating an
* index, adding documents,
* optimizing, searching, generating reports, and more. A benchmark run takes an
* "algorithm" file
* that contains a description of the sequence of tasks making up the run, and some
* properties defining a few
* additional characteristics of the benchmark run.
*
*
*
* How to use
*
 * The easiest way to run a benchmark is using the predefined ant task:
*
 * - ant run-task
 *   would run the micro-standard.alg "algorithm".
 *
 * - ant run-task -Dtask.alg=conf/compound-penalty.alg
 *   would run the compound-penalty.alg "algorithm".
 *
 * - ant run-task -Dtask.alg=[full-path-to-your-alg-file]
 *   would run your perf test "algorithm".
 *
 * - java org.apache.lucene.benchmark.byTask.programmatic.Sample
 *   would run a performance test programmatically, without using an alg
 *   file. This is less readable and less convenient, but possible (see the
 *   sketch below).
*
*
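 * As a minimal, hypothetical sketch of driving the framework from your own Java code,
 * you can also hand a small algorithm to the Benchmark class through a Reader instead
 * of assembling task objects the way Sample does (the class name and algorithm below
 * are made up; the sketch relies on the framework's default properties):
 *
 *   import java.io.StringReader;
 *   import org.apache.lucene.benchmark.byTask.Benchmark;
 *
 *   public class TinyBench {
 *     public static void main(String[] args) throws Exception {
 *       String alg =
 *           "# tiny algorithm relying on default properties\n"
 *           + "{ CreateIndex { AddDoc } : 10 CloseIndex }\n"
 *           + "RepSumByName\n";
 *       Benchmark benchmark = new Benchmark(new StringReader(alg));
 *       benchmark.execute(); // parses and runs the algorithm
 *     }
 *   }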
*
*
 * You may find the existing tasks sufficient for defining the benchmark you
 * need; otherwise, you can extend the framework to meet your needs, as explained
 * herein.
*
*
*
* Each benchmark run has a DocMaker and a QueryMaker. These two should usually
* match, so that "meaningful" queries are used for a certain collection.
* Properties set at the header of the alg file define which "makers" should be
* used. You can also specify your own makers, extending DocMaker and implementing
* QueryMaker.
*
* Note: since 2.9, DocMaker is a concrete class which accepts a
* ContentSource. In most cases, you can use the DocMaker class to create
* Documents, while providing your own ContentSource implementation. For
* example, the current Benchmark package includes ContentSource
* implementations for TREC, Enwiki and Reuters collections, as well as
* others like LineDocSource which reads a 'line' file produced by
* WriteLineDocTask.
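 *
 * For instance, a hypothetical .alg header pairing the standard DocMaker with one of
 * these ContentSource implementations might look like the following (the docs.file
 * path is made up; content.source and docs.file are the property names assumed here):
 *
 *   doc.maker=org.apache.lucene.benchmark.byTask.feeds.DocMaker
 *   content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
 *   docs.file=temp/enwiki-pages-articles.xml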
*
*
*
 * The benchmark .alg file contains the benchmark "algorithm". The syntax is described
 * below. Within the algorithm, you can specify groups of commands, assign them
 * names, specify commands that should be repeated,
 * run commands serially or in parallel,
 * and also control the speed of "firing" the commands.
*
*
*
 * This allows, for instance, specifying
 * that an index should be opened for update,
 * documents should be added to it one by one but not faster than 20 docs a minute,
 * and, in parallel with this,
 * some N queries should be searched against that index,
 * again, no more than 2 queries a second.
 * You can have the searches all share an index reader,
 * or have each of them open its own reader and close it afterwards.
*
*
*
* If the commands available for use in the algorithm do not meet your needs,
* you can add commands by adding a new task under
* org.apache.lucene.benchmark.byTask.tasks -
* you should extend the PerfTask abstract class.
* Make sure that your new task class name is suffixed by Task.
* Assume you added the class "WonderfulTask" - doing so also enables the
* command "Wonderful" to be used in the algorithm.
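 *
 * A minimal sketch of such a task (the name and body are hypothetical) could look like:
 *
 *   package org.apache.lucene.benchmark.byTask.tasks;
 *
 *   import org.apache.lucene.benchmark.byTask.PerfRunData;
 *
 *   public class WonderfulTask extends PerfTask {
 *
 *     public WonderfulTask(PerfRunData runData) {
 *       super(runData);
 *     }
 *
 *     // doLogic() performs the measured work; its return value is the number
 *     // of "records" this task contributes to the record counts.
 *     public int doLogic() throws Exception {
 *       // ... the work to be measured goes here (hypothetical) ...
 *       return 1;
 *     }
 *   }
 *
 * Because the class name ends with Task and it lives in the tasks package, the
 * command "Wonderful" becomes usable in algorithm files without further registration.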
*
*
*
 * External classes: It is sometimes useful to invoke the benchmark
 * package with your external alg file that configures the use of your own
 * doc/query maker and/or HTML parser. You can work this out without
 * modifying the benchmark package code, by passing your class path
 * with the benchmark.ext.classpath property:
*
* - ant run-task -Dtask.alg=[full-path-to-your-alg-file]
* -Dbenchmark.ext.classpath=/mydir/classes
* -Dtask.mem=512M
*
*
 * External tasks: When writing your own tasks under a package other than
 * org.apache.lucene.benchmark.byTask.tasks, specify that package through the
 * alt.tasks.packages property.
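 *
 * For example (the package name below is hypothetical):
 *
 *   alt.tasks.packages=com.mycompany.mytasks
 *
 * With this property set, a class com.mycompany.mytasks.WonderfulTask would also be
 * picked up for the "Wonderful" command.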
*
*
*
 * Benchmark "algorithm"
*
*
 * The following is an informal description of the supported syntax; a small combined
 * example fragment appears right after this list.
*
*
*
* -
* Measuring: When a command is executed, statistics for the elapsed
* execution time and memory consumption are collected.
* At any time, those statistics can be printed, using one of the
* available ReportTasks.
*
* -
* Comments start with '#'.
*
* -
* Serial sequences are enclosed within '{ }'.
*
* -
* Parallel sequences are enclosed within
* '[ ]'
*
* -
* Sequence naming: To name a sequence, put
* '"name"' just after
* '{' or '['.
*
 * Example - { "ManyAdds" AddDoc } : 1000000 -
 * would name the sequence of 1M add docs "ManyAdds", and this name would later appear
 * in statistic reports.
* If you don't specify a name for a sequence, it is given one: you can see it as
* the algorithm is printed just before benchmark execution starts.
*
* -
* Repeating:
* To repeat sequence tasks N times, add ': N' just
* after the
* sequence closing tag - '}' or
* ']' or '>'.
*
 * Example - [ AddDoc ] : 4 - would do 4 addDoc
 * in parallel, spawning 4 threads at once.
 *
 * Example - [ AddDoc AddDoc ] : 4 - would do
 * 8 addDoc in parallel, spawning 8 threads at once.
 *
 * Example - { AddDoc } : 30 - would do addDoc
 * 30 times in a row.
 *
 * Example - { AddDoc AddDoc } : 30 - would do
 * addDoc 60 times in a row.
 *
 * Exhaustive repeating: use * instead of
 * a number to repeat exhaustively.
 * This is sometimes useful for adding as many files as a doc maker can create,
 * without iterating over the same file again, especially when the exact
 * number of documents is not known in advance. For instance, TREC files extracted
 * from a zip file. Note: when using this, you must also set
 * content.source.forever to false.
 *
 * Example - { AddDoc } : * - would add docs
 * until the doc maker is "exhausted".
*
* -
 * Command parameter: a command can optionally take a single parameter.
 * If a certain command does not support a parameter, or if the parameter is of
 * the wrong type,
 * reading the algorithm will fail with an exception and the test will not start.
 * Currently the following tasks take optional parameters:
*
* - AddDoc takes a numeric parameter, indicating the required size of
* added document. Note: if the DocMaker implementation used in the test
* does not support makeDoc(size), an exception would be thrown and the test
* would fail.
*
 * - DeleteDoc takes a numeric parameter, indicating the docid to be
 * deleted. The latter is not very useful for loops, since the docid is
 * fixed, so for deletion in loops it is better to use the
 * doc.delete.step property.
 *
 * - SetProp takes a mandatory name,value parameter,
 * with ',' used as the separator.
*
* - SearchTravRetTask and SearchTravTask take a numeric
* parameter, indicating the required traversal size.
*
* - SearchTravRetLoadFieldSelectorTask takes a string
* parameter: a comma separated list of Fields to load.
*
 * - SearchTravRetHighlighterTask takes a string
 * parameter: a comma separated list of parameters to define highlighting. See that
 * task's javadocs for more information.
*
*
*
 * Example - AddDoc(2000) - would add a document
 * of size 2000 (~bytes).
 *
 * See conf/task-sample.alg for how this can be used, for instance, to check
 * which is faster: adding
 * many smaller documents, or fewer larger documents.
* Next candidates for supporting a parameter may be the Search tasks,
* for controlling the query size.
*
* -
 * Statistic recording elimination: a sequence can also end with
 * '>',
 * in which case child tasks would not store their statistics.
 * This can be useful to avoid exploding stats data, for adding say 1M docs.
 *
 * Example - { "ManyAdds" AddDoc > : 1000000 -
 * would add a million docs, measure that total, but not save stats for each addDoc.
 *
 * Notice that the granularity of System.currentTimeMillis() (which is used
 * here) is system dependent,
 * and on some systems an operation that takes 5 ms to complete may show 0 ms
 * latency time in performance measurements.
* Therefore it is sometimes more accurate to look at the elapsed time of a larger
* sequence, as demonstrated here.
*
* -
* Rate:
 * To set a rate (ops/sec or ops/min) for a sequence, add
 * ': N : R' just after the sequence closing tag.
 * This would specify repetition of N with a rate of R operations/sec.
 * Use 'R/sec' or
 * 'R/min'
 * to explicitly specify that the rate is per second or per minute.
 * The default is per second.
 *
 * Example - [ AddDoc ] : 400 : 3 - would do 400
 * addDoc in parallel, starting up to 3 threads per second.
 *
 * Example - { AddDoc } : 100 : 200/min - would
 * do 100 addDoc serially,
 * waiting before starting the next add if otherwise the rate would exceed 200 adds/min.
*
* -
* Disable Counting: Each task executed contributes to the records count.
* This count is reflected in reports under recs/s and under recsPerRun.
* Most tasks count 1, some count 0, and some count more.
* (See Results record counting clarified for more details.)
* It is possible to disable counting for a task by preceding it with -.
*
 * Example - -CreateIndex - would count 0 while
* the default behavior for CreateIndex is to count 1.
*
* -
* Command names: Each class "AnyNameTask" in the
* package org.apache.lucene.benchmark.byTask.tasks,
* that extends PerfTask, is supported as command "AnyName" that can be
* used in the benchmark "algorithm" description.
 * This allows adding new commands by just adding such classes.
*
*
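 * Putting several of the elements above together, a small made-up algorithm fragment
 * might look like the following: it names sequences, limits the rate of the parallel
 * searches, repeats the adds exhaustively, and suppresses per-add statistics:
 *
 *   # header (properties)
 *   content.source.forever=false
 *
 *   # algorithm
 *   CreateIndex
 *   { "BulkAdd" AddDoc > : *
 *   [ "Queries" Search ] : 8 : 2/sec
 *   CloseIndex
 *   RepSumByName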
*
*
*
* Supported tasks/commands
*
*
* Existing tasks can be divided into a few groups:
* regular index/search work tasks, report tasks, and control tasks.
*
*
*
*
* -
* Report tasks: There are a few Report commands for generating reports.
* Only task runs that were completed are reported.
* (The 'Report tasks' themselves are not measured and not reported.)
*
* -
* RepAll - all (completed) task runs.
*
* -
* RepSumByName - all statistics,
* aggregated by name. So, if AddDoc was executed 2000 times,
* only 1 report line would be created for it, aggregating all those
* 2000 statistic records.
*
* -
* RepSelectByPref prefixWord - all
 * records for tasks whose name starts with
* prefixWord.
*
* -
* RepSumByPref prefixWord - all
 * records for tasks whose name starts with
* prefixWord,
* aggregated by their full task name.
*
* -
* RepSumByNameRound - all statistics,
* aggregated by name and by Round.
* So, if AddDoc was executed 2000 times in each of 3
* rounds, 3 report lines would be
* created for it,
* aggregating all those 2000 statistic records in each round.
* See more about rounds in the NewRound
* command description below.
*
* -
* RepSumByPrefRound prefixWord -
* similar to RepSumByNameRound,
* just that only tasks whose name starts with
* prefixWord are included.
*
*
* If needed, additional reports can be added by extending the abstract class
* ReportTask, and by
* manipulating the statistics data in Points and TaskStats.
*
*
 * - Control tasks: A few of the tasks control the benchmark algorithm
 * as a whole:
*
* -
* ClearStats - clears the entire statistics.
* Further reports would only include task runs that would start after this
* call.
*
* -
 * NewRound - virtually start a new round of the
 * performance test.
 * Although this command can be placed anywhere, it mostly makes sense at
 * the end of an outermost sequence.
 *
 * This increments a global "round counter". All task runs that
 * start from this point on
 * record the new, updated round counter as their round number.
 * This would appear in reports.
 * In particular, see RepSumByNameRound above.
 *
 * An additional effect of NewRound is that numeric and boolean
 * properties defined (at the head
 * of the .alg file) as a sequence of values, e.g.
 * merge.factor=mrg:10:100:10:100, would
 * increment (cyclically) to the next value.
 * Note: this would also be reflected in the reports, in this case under a
 * column that would be named "mrg".
*
* -
 * ResetInputs - DocMaker and the
 * various QueryMakers
 * would reset their counters to the start.
 * The way these Maker interfaces work, each call to makeDocument()
 * or makeQuery() creates the next document or query
 * that it "knows" to create.
 * If that pool is "exhausted", the "maker" starts over again.
 * The ResetInputs command
 * therefore allows making the rounds comparable.
 * It is thus useful to invoke ResetInputs together with NewRound.
*
* -
* ResetSystemErase - reset all index
* and input data and call gc.
 * Does NOT reset statistics. This includes ResetInputs.
* All writers/readers are nullified, deleted, closed.
* Index is erased.
* Directory is erased.
 * You would have to call CreateIndex again after this is called.
*
* -
* ResetSystemSoft - reset all
* index and input data and call gc.
 * Does NOT reset statistics. This includes ResetInputs.
* All writers/readers are nullified, closed.
* Index is NOT erased.
* Directory is NOT erased.
* This is useful for testing performance on an existing index,
* for instance if the construction of a large index
 * took a very long time and now you would like to test
* its search or update performance.
*
*
*
*
* -
 * Other existing tasks are quite straightforward and are
 * only briefly described here.
*
* -
* CreateIndex and
* OpenIndex both leave the
* index open for later update operations.
* CloseIndex would close it.
*
 * -
 * OpenReader, similarly, would
 * leave an index reader open for later search operations.
 * But this has further semantics.
* If a Read operation is performed, and an open reader exists,
* it would be used.
* Otherwise, the read operation would open its own reader
* and close it when the read operation is done.
* This allows testing various scenarios - sharing a reader,
* searching with "cold" reader, with "warmed" reader, etc.
* The read operations affected by this are:
* Warm,
* Search,
* SearchTrav (search and traverse),
* and SearchTravRet (search
* and traverse and retrieve).
* Notice that each of the 3 search task types maintains
* its own queryMaker instance.
*
 * -
 * CommitIndex and
 * ForceMerge can be used to commit
 * changes to the index and then merge the index segments. The integer
* parameter specifies how many segments to merge down to (default
* 1).
*
 * -
 * WriteLineDoc prepares a 'line'
 * file where each line holds a document with title,
 * date and body elements, separated by [TAB].
 * A line file is useful if one wants to measure pure indexing
 * performance, without the overhead of parsing the data.
 * You can use LineDocSource as a ContentSource over a 'line'
 * file (see the sketch right after this list).
* -
* ConsumeContentSource consumes
 * a ContentSource. Useful, e.g., for testing a ContentSource's
 * performance, without the overhead of preparing a Document
 * from it.
*
*
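 * As a hypothetical sketch of the WriteLineDoc / LineDocSource combination mentioned
 * above (the file paths are made up, and line.file.out and docs.file are the property
 * names assumed here), one algorithm could first extract a line file:
 *
 *   content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
 *   content.source.forever=false
 *   line.file.out=work/reuters.lines.txt
 *   { WriteLineDoc } : *
 *
 * and a second algorithm could then index from that file without parsing overhead:
 *
 *   content.source=org.apache.lucene.benchmark.byTask.feeds.LineDocSource
 *   content.source.forever=false
 *   docs.file=work/reuters.lines.txt
 *   { CreateIndex { AddDoc } : * CloseIndex }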
*
*
*
* Benchmark properties
*
*
* Properties are read from the header of the .alg file, and
* define several parameters of the performance test.
* As mentioned above for the NewRound task,
 * numeric and boolean properties that are defined as a sequence
 * of values, e.g. merge.factor=mrg:10:100:10:100,
 * would increment (cyclically) to the next value
 * when NewRound is called, and would also
 * appear as a named column in the reports (the column
 * name would be "mrg" in this example).
*
*
*
* Some of the currently defined properties are:
*
*
*
* -
* analyzer - full
* class name for the analyzer to use.
 * The same analyzer would be used for the entire test.
*
*
* -
 * directory - tells which Directory
 * implementation to use for the performance test.
*
*
* -
* Index work parameters:
 * Multi-value int/boolean properties would be iterated over with calls to NewRound.
 * They would also be added as columns in the reports; the first string in the
 * sequence is the column name.
 * (Make sure it is no shorter than any value in the sequence.)
 *
 * - max.buffered
 *
 * Example: max.buffered=buf:10:10:100:100 -
 * this would define using maxBufferedDocs of 10 in rounds 0 and 1,
 * and 100 in rounds 2 and 3.
*
* -
* merge.factor - which
* merge factor to use.
*
* -
* compound - whether the index is
* using the compound format or not. Valid values are "true" and "false".
*
*
*
*
*
* Here is a list of currently defined properties:
*
*
*
 * - Root directory for data and indexes:
 *   - work.dir (default is System property "benchmark.work.dir" or "work".)
 *
 * - Docs and queries creation:
 *   - analyzer
 *   - doc.maker
 *   - content.source.forever
 *   - html.parser
 *   - doc.stored
 *   - doc.tokenized
 *   - doc.term.vector
 *   - doc.term.vector.positions
 *   - doc.term.vector.offsets
 *   - doc.store.body.bytes
 *   - docs.dir
 *   - query.maker
 *   - file.query.maker.file
 *   - file.query.maker.default.field
 *   - search.num.hits
 *
 * - Logging:
 *   - log.step
 *   - log.step.[class name]Task, e.g. log.step.DeleteDoc (or log.step.Wonderful for the WonderfulTask example above)
 *   - log.queries
 *   - task.max.depth.log
 *
 * - Index writing:
 *   - compound
 *   - merge.factor
 *   - max.buffered
 *   - directory
 *   - ram.flush.mb
 *   - codec.postingsFormat (e.g. Direct). Note: no codec should be specified through default.codec.
 *
 * - Doc deletion:
 *   - doc.delete.step
 *
 * - Spatial: numerous; see spatial.alg
 *
 * - Task alternative packages:
 *   - alt.tasks.packages - comma separated list of additional packages where task classes will be looked for
 *     when not found in the default package (that of PerfTask). If the same task class
 *     appears in more than one package, the package indicated first in this list will be used.
*
*
*
*
*
*
* For sample use of these properties see the *.alg files under conf.
*
*
*
* Example input algorithm and the result benchmark report
*
* The following example is in conf/sample.alg:
*
* # --------------------------------------------------------
* #
* # Sample: what is the effect of doc size on indexing time?
* #
* # There are two parts in this test:
* # - PopulateShort adds 2N documents of length L
* # - PopulateLong adds N documents of length 2L
* # Which one would be faster?
* # The comparison is done twice.
* #
* # --------------------------------------------------------
*
* # -------------------------------------------------------------------------------------
* # multi val params are iterated by NewRound's, added to reports, start with column name.
* merge.factor=mrg:10:20
* max.buffered=buf:100:1000
* compound=true
*
* analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
* directory=FSDirectory
*
* doc.stored=true
* doc.tokenized=true
* doc.term.vector=false
* doc.add.log.step=500
*
* docs.dir=reuters-out
*
* doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
*
* query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
*
* # task at this depth or less would print when they start
* task.max.depth.log=2
*
* log.queries=false
* # -------------------------------------------------------------------------------------
* {
*
* { "PopulateShort"
* CreateIndex
* { AddDoc(4000) > : 20000
* Optimize
* CloseIndex
* >
*
* ResetSystemErase
*
* { "PopulateLong"
* CreateIndex
* { AddDoc(8000) > : 10000
* Optimize
* CloseIndex
* >
*
* ResetSystemErase
*
* NewRound
*
* } : 2
*
* RepSumByName
* RepSelectByPref Populate
*
*
*
*
* The command line for running this sample:
*
 * ant run-task -Dtask.alg=conf/sample.alg
*
*
*
* The output report from running this test contains the following:
*
 * Operation     round  mrg   buf  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 * PopulateShort     0   10   100       1       20003  119.6      167.26  12,959,120   14,241,792
 * PopulateLong      0   10   100       1       10003   74.3      134.57  17,085,208   20,635,648
 * PopulateShort     1   20  1000       1       20003  143.5      139.39  63,982,040   94,756,864
 * PopulateLong      1   20  1000       1       10003   77.0      129.92  87,309,608  100,831,232
*
*
*
* Results record counting clarified
*
* Two columns in the results table indicate records counts: records-per-run and
* records-per-second. What does it mean?
*
* Almost every task gets 1 in this count just for being executed.
* Task sequences aggregate the counts of their child tasks,
* plus their own count of 1.
* So, a task sequence containing 5 other task sequences, each running a single
* other task 10 times, would have a count of 1 + 5 * (1 + 10) = 56.
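 *
 * As a made-up illustration of that arithmetic, in the fragment below each inner
 * { AddDoc } : 10 counts 1 + 10 = 11, and the outer sequence adds its own 1,
 * for a total of 1 + 5 * 11 = 56 records:
 *
 *   {
 *     { AddDoc } : 10
 *     { AddDoc } : 10
 *     { AddDoc } : 10
 *     { AddDoc } : 10
 *     { AddDoc } : 10
 *   }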
*
* The traverse and retrieve tasks "count" more: a traverse task
* would add 1 for each traversed result (hit), and a retrieve task would
* additionally add 1 for each retrieved doc. So, regular Search would
* count 1, SearchTrav that traverses 10 hits would count 11, and a
* SearchTravRet task that retrieves (and traverses) 10, would count 21.
*
 * Confusing? This might help: always examine the elapsedSec column,
 * and always compare "apples to apples", i.e. it is interesting to check how the
 * rec/s changed for the same task (or sequence) between two
 * different runs, but it is not very useful to know how the rec/s
 * differs between Search and SearchTrav tasks. For
 * the latter, elapsedSec would bring more insight.
*
*/
package org.apache.lucene.benchmark.byTask;