package ai.platon.pulsar.dom.select
import ai.platon.pulsar.common.brief
import ai.platon.pulsar.common.concurrent.ConcurrentExpiringLRUCache
import ai.platon.pulsar.common.urls.UrlUtils
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
import org.jsoup.select.Evaluator
import org.slf4j.LoggerFactory
import java.time.Duration
import java.util.*
import java.util.concurrent.atomic.AtomicInteger
class PowerSelectorParseException(msg: String, vararg params: Any) : IllegalArgumentException(String.format(msg, *params))
/**
* CSS element selector, that finds elements matching a query.
*
* Selector syntax
*
* A selector is a chain of simple selectors, separated by combinators. Selectors are **case insensitive** (including against
* elements, attributes, and attribute values).
*
* The universal selector (*) is implicit when no element selector is supplied (i.e. `*.header` and `.header`
* are equivalent).
*
*
* Pattern Matches Example
* `*` any element `*`
* `tag` elements with the given tag name `div`
* `*|E` elements of type E in any namespace `*|name` finds `<fb:name>` elements
* `ns|E` elements of type E in the namespace *ns* `fb|name` finds `<fb:name>` elements
* `#id` elements with attribute ID of "id" `div#wrap`, `#logo`
* `.class` elements with a class name of "class" `div.left`, `.result`
* `[attr]` elements with an attribute named "attr" (with any value) `a[href]`, `[title]`
* `[^attrPrefix]` elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets `[^data-]`, `div[^data-]`
* `[attr=val]` elements with an attribute named "attr", and value equal to "val" `img[width=500]`, `a[rel=nofollow]`
* `[attr="val"]` elements with an attribute named "attr", and value equal to "val" `span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]`
* `[attr^=valPrefix]` elements with an attribute named "attr", and value starting with "valPrefix" `a[href^=http:]`
* `[attr$=valSuffix]` elements with an attribute named "attr", and value ending with "valSuffix" `img[src$=.png]`
* `[attr*=valContaining]` elements with an attribute named "attr", and value containing "valContaining" `a[href*=/search/]`
* `[attr~=*regex*]` elements with an attribute named "attr", and value matching the regular expression `img[src~=(?i)\\.(png|jpe?g)]`
* The above may be combined in any order `div.header[title]`
* Combinators
* `E F` an F element descended from an E element `div a`, `.logo h1`
* `E > F` an F direct child of E `ol > li`
* `E + F` an F element immediately preceded by sibling E `li + li`, `div.head + div`
* `E ~ F` an F element preceded by sibling E `h1 ~ p`
* `E, F, G` all matching elements E, F, or G `a[href], div, h3`
* Pseudo selectors
* `:lt(*n*)` elements whose sibling index is less than *n* `td:lt(3)` finds the first 3 cells of each row
* `:gt(*n*)` elements whose sibling index is greater than *n* `td:gt(1)` finds cells after skipping the first two
* `:eq(*n*)` elements whose sibling index is equal to *n* `td:eq(0)` finds the first cell of each row
* `:has(*selector*)` elements that contain at least one element matching the *selector* `div:has(p)` finds divs that contain p elements
* `:not(*selector*)` elements that do not match the *selector*. See also [Elements.not] `div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs.
* `:contains(*text*)` elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. `p:contains(dom)` finds p elements containing the text "dom".
* `:matches(*regex*)` elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. `td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively.
* `:containsOwn(*text*)` elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. `p:containsOwn(dom)` finds p elements with own text "dom".
* `:matchesOwn(*regex*)` elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. `td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively.
* `:containsData(*data*)` elements that contain the specified *data*. The contents of `script` and `style` elements, and `comment` nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants. `script:containsData(dom)` finds script elements containing the data "dom".
* The above may be combined in any order and with other selectors `.light:contains(name):eq(0)`
* Structural pseudo selectors
* `:root` The element that is the root of the document. In HTML, this is the `html` element `:root`
* `:nth-child(*a*n+*b*)` elements that have `*a*n+*b*-1` siblings **before** them in the document tree, for any positive integer or zero value of `n`, and that have a parent element. For values of `a` and `b` greater than zero, this effectively divides the element's children into groups of *a* elements (the last group taking the remainder) and selects the *b*th element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The `a` and `b` values must be integers (positive, negative, or zero). The index of the first child of an element is 1.
* In addition to this, `:nth-child()` can take `odd` and `even` as arguments instead. `odd` has the same meaning as `2n+1`, and `even` has the same meaning as `2n`. `tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` the 9th, 19th, 29th, etc, element. `li:nth-child(5)` the 5th li
* `:nth-last-child(*a*n+*b*)` elements that have `*a*n+*b*-1` siblings **after** it in the document tree. Otherwise like `:nth-child()` `tr:nth-last-child(-n+2)` the last two rows of a table
* `:nth-of-type(*a*n+*b*)` pseudo-class notation represents an element that has `*a*n+*b*-1` siblings with the same expanded element name *before* it in the document tree, for any zero or positive integer value of n, and has a parent element `img:nth-of-type(2n+1)`
* `:nth-last-of-type(*a*n+*b*)` pseudo-class notation represents an element that has `*a*n+*b*-1` siblings with the same expanded element name *after* it in the document tree, for any zero or positive integer value of n, and has a parent element `img:nth-last-of-type(2n+1)`
* `:first-child` elements that are the first child of some other element. `div > p:first-child`
* `:last-child` elements that are the last child of some other element. `ol > li:last-child`
* `:first-of-type` elements that are the first sibling of its type in the list of children of its parent element `dl dt:first-of-type`
* `:last-of-type` elements that are the last sibling of its type in the list of children of its parent element `tr > td:last-of-type`
* `:only-child` elements that have a parent element and whose parent element has no other element children
* `:only-of-type` an element that has a parent element and whose parent element has no other element children with the same expanded element name
* `:empty` elements that have no children at all
*
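* Usage
*
* A minimal usage sketch, assuming jsoup's `Jsoup.parse` is on the classpath and that the queries below are
* accepted by `PowerQueryParser`; the HTML snippet and the variable names are illustrative only.
*
* ```kotlin
* import org.jsoup.Jsoup
*
* val doc = Jsoup.parse("<ul><li class=\"a\">one</li><li>two</li><li>three</li></ul>")
* // All matching elements; empty if nothing matches or the parser rejects the query
* // with a PowerSelectorParseException
* val items = PowerSelector.select("ul > li", doc)
* // First match only, or null if there is none
* val first = PowerSelector.selectFirst("li.a", doc)
* ```
*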
* @author Jonathan Hedley
* @see Element.select
*/
object PowerSelector {
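    private val logger = LoggerFactory.getLogger(PowerSelector::class.java)
    // Parsed evaluators, keyed by "<baseUri> <normalized query>"
    private val cache = ConcurrentExpiringLRUCache<String, Evaluator?>(Duration.ofMinutes(10))
    // Parse failure counters, keyed by "<host> <query>" plus the brief exception message
    private val parseExceptions = ConcurrentExpiringLRUCache<String, AtomicInteger>(Duration.ofMinutes(10))
    // Parse failure counters, keyed by the query alone
    private val totalParseExceptions = ConcurrentExpiringLRUCache<String, AtomicInteger>(Duration.ofMinutes(10))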
    /**
     * Find elements matching selector.
     *
     * @param cssQuery A CSS query
     * @param root root element to descend into
     * @return matching elements, empty if none
     */
    fun select(cssQuery: String, root: Element): Elements {
        val cssQuery0 = cssQuery.trim()
        if (cssQuery0.isBlank()) {
            return Elements()
        }
        val evaluator = parseOrNullCached(cssQuery0, root.baseUri()) ?: return Elements()
        return select(evaluator, root)
    }
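    /**
     * Find elements matching selector and return a window of the result.
     *
     * `offset` is 1-based and `limit` caps the number of returned elements. The windowing is plain sequence
     * slicing; a minimal sketch on an ordinary list (the list contents are illustrative only):
     *
     * ```kotlin
     * val items = listOf("a", "b", "c", "d")
     * // offset = 2, limit = 2: skip offset - 1 items, then keep at most limit items
     * val page = items.asSequence().drop(2 - 1).take(2).toList()
     * check(page == listOf("b", "c"))
     * ```
     *
     * @param cssQuery A CSS query
     * @param root root element to descend into
     * @param offset 1-based index of the first match to return
     * @param limit maximum number of matches to return
     * @return matching elements within the window, empty if none
     */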
    fun select(cssQuery: String, root: Element, offset: Int = 1, limit: Int = Int.MAX_VALUE): Elements {
        checkArguments(cssQuery, offset, limit)
        return select(cssQuery, root).asSequence().drop(offset - 1).take(limit).toCollection(Elements())
    }

    fun <O> select(
        cssQuery: String, root: Element, offset: Int = 1, limit: Int = Int.MAX_VALUE, transformer: (Element) -> O
    ): List<O> {
        checkArguments(cssQuery, offset, limit)
        // TODO: do the filter inside Collector.collect
        return select(cssQuery, root).asSequence().drop(offset - 1).take(limit).map { transformer(it) }.toList()
    }
    /**
     * Find elements matching selector.
     *
     * @param cssQuery CSS query
     * @param roots root elements to descend into
     * @return matching elements, empty if none
     */
    fun select(cssQuery: String, roots: Iterable<Element>): Elements {
        val cssQuery0 = cssQuery.trim()
        if (cssQuery0.isBlank() || !roots.iterator().hasNext()) {
            return Elements()
        }
        val evaluator = parseOrNullCached(cssQuery0, roots.first().baseUri()) ?: return Elements()
        val elements = ArrayList<Element>()
        val seenElements = IdentityHashMap<Element, Boolean>()
        // dedupe elements by identity, not equality
        for (root in roots) {
            val found = select(evaluator, root)
            for (el in found) {
                if (!seenElements.containsKey(el)) {
                    elements.add(el)
                    seenElements[el] = true
                }
            }
        }
        return Elements(elements)
    }
    /**
     * Find the first element that matches the query.
     *
     * @param cssQuery CSS selector
     * @param root root element to descend into
     * @return the matching element, or **null** if none.
     */
    fun selectFirst(cssQuery: String, root: Element): Element? {
        val cssQuery0 = cssQuery.trim()
        if (cssQuery0.isBlank()) {
            return null
        }
        // Parse the trimmed query, consistent with select()
        val evaluator = parseOrNullCached(cssQuery0, root.baseUri()) ?: return null
        return PowerCollector.findFirst(evaluator, root)
    }
    /**
     * Find elements matching a parsed selector.
     *
     * @param evaluator the evaluator parsed from a CSS query
     * @param root root element to descend into
     * @return matching elements, empty if none
     */
    private fun select(evaluator: Evaluator, root: Element): Elements {
        return PowerCollector.collect(evaluator, root)
    }

    private fun parseOrNullCached(cssQuery: String, baseUri: String): Evaluator? {
        val query = normalizeQueryOrNull(cssQuery) ?: return null
        val key = "$baseUri $query"
        return cache.computeIfAbsent(key) { parseOrNull(query, baseUri) }
    }
    private fun parseOrNull(cssQuery: String, baseUri: String): Evaluator? {
        try {
            return PowerQueryParser.parse(cssQuery)
        } catch (e: PowerSelectorParseException) {
            var message = e.brief()
            if (!message.isNullOrBlank()) {
                val host = UrlUtils.getURLOrNull(baseUri)?.host
                val key = "$host $cssQuery"
                message = "$key\n>>>$message<<<"
                val count1 = totalParseExceptions.computeIfAbsent(cssQuery) { AtomicInteger() }.incrementAndGet()
                val count2 = parseExceptions.computeIfAbsent(message) { AtomicInteger() }.incrementAndGet()
                // Throttle the warnings: the more often a query fails, the less often it is logged
                if (count1 > 5000) {
                    if (count1 % 5000 == 0) {
                        logger.warn("Caught $count1 parse exceptions | $cssQuery")
                    }
                } else if (count1 > 3000 && count1 % 1000 == 0) {
                    logger.warn("Caught $count1 parse exceptions | $cssQuery")
                } else if (count1 > 1000 && count1 % 200 == 0) {
                    logger.warn("Caught $count1 parse exceptions | $cssQuery")
                } else {
                    if (count2 == 1) {
                        logger.warn("Failed to parse css query | $cssQuery | $baseUri | ${e.brief()}")
                    } else if (count2 < 50 && count2 % 10 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    } else if (count2 < 1000 && count2 % 100 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    } else if (count2 < 3000 && count2 % 1000 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    } else if (count2 % 5000 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    }
                }
            } else {
                logger.warn("Unexpected exception", e)
            }
        }
        return null
    }
    /**
     * Normalize the CSS query.
     *
     * @return the normalized query, or **null** if the query is blank once surrounding quotes are removed
     */
    fun normalizeQueryOrNull(query: String): String? {
        // JCommander does not remove surrounding quotes, e.g. jcommander.parse("-outlink \"ul li a[href~=item]\"")
        val query0 = query.removeSurrounding("\"").takeIf { it.isNotBlank() } ?: return null
        return PowerEvaluator.encodeQuery(query0)
    }
    private fun checkArguments(cssQuery: String, offset: Int = 1, limit: Int) {
        // require() throws IllegalArgumentException when the condition is false
        require(cssQuery.isNotBlank()) { "cssQuery should not be empty" }
        require(offset >= 1) { "Offset should be >= 1" }
        require(limit >= 0) { "Limit should be >= 0" }
    }
}