
ai.platon.pulsar.dom.select.PowerSelector.kt

package ai.platon.pulsar.dom.select

import ai.platon.pulsar.common.brief
import ai.platon.pulsar.common.concurrent.ConcurrentExpiringLRUCache
import ai.platon.pulsar.common.urls.UrlUtils
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
import org.jsoup.select.Evaluator
import org.slf4j.LoggerFactory
import java.time.Duration
import java.util.*
import java.util.concurrent.atomic.AtomicInteger

class PowerSelectorParseException(msg: String, vararg params: Any) : IllegalArgumentException(String.format(msg, *params))

/**
 * CSS element selector, that finds elements matching a query.
 *
 * ## Selector syntax
 *
 * A selector is a chain of simple selectors, separated by combinators. Selectors are **case insensitive**
 * (including against elements, attributes, and attribute values).
 *
 * The universal selector (`*`) is implicit when no element selector is supplied (i.e. `*.header` and `.header`
 * are equivalent).
 *
 * | Pattern | Matches | Example |
 * |---|---|---|
 * | `*` | any element | `*` |
 * | `tag` | elements with the given tag name | `div` |
 * | `*\|E` | elements of type E in any namespace *ns* | `*\|name` finds `<fb:name>` elements |
 * | `ns\|E` | elements of type E in the namespace *ns* | `fb\|name` finds `<fb:name>` elements |
 * | `#id` | elements with attribute ID of "id" | `div#wrap`, `#logo` |
 * | `.class` | elements with a class name of "class" | `div.left`, `.result` |
 * | `[attr]` | elements with an attribute named "attr" (with any value) | `a[href]`, `[title]` |
 * | `[^attrPrefix]` | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | `[^data-]`, `div[^data-]` |
 * | `[attr=val]` | elements with an attribute named "attr", and value equal to "val" | `img[width=500]`, `a[rel=nofollow]` |
 * | `[attr="val"]` | elements with an attribute named "attr", and value equal to "val" | `span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]` |
 * | `[attr^=valPrefix]` | elements with an attribute named "attr", and value starting with "valPrefix" | `a[href^=http:]` |
 * | `[attr$=valSuffix]` | elements with an attribute named "attr", and value ending with "valSuffix" | `img[src$=.png]` |
 * | `[attr*=valContaining]` | elements with an attribute named "attr", and value containing "valContaining" | `a[href*=/search/]` |
 * | `[attr~=regex]` | elements with an attribute named "attr", and value matching the regular expression | `img[src~=(?i)\\.(png\|jpe?g)]` |
 * | | The above may be combined in any order | `div.header[title]` |

 * ## Combinators
 *
 * | Pattern | Matches | Example |
 * |---|---|---|
 * | `E F` | an F element descended from an E element | `div a`, `.logo h1` |
 * | `E > F` | an F direct child of E | `ol > li` |
 * | `E + F` | an F element immediately preceded by sibling E | `li + li`, `div.head + div` |
 * | `E ~ F` | an F element preceded by sibling E | `h1 ~ p` |
 * | `E, F, G` | all matching elements E, F, or G | `a[href], div, h3` |

 * ## Pseudo selectors
 *
 * | Pattern | Matches | Example |
 * |---|---|---|
 * | `:lt(n)` | elements whose sibling index is less than *n* | `td:lt(3)` finds the first 3 cells of each row |
 * | `:gt(n)` | elements whose sibling index is greater than *n* | `td:gt(1)` finds cells after skipping the first two |
 * | `:eq(n)` | elements whose sibling index is equal to *n* | `td:eq(0)` finds the first cell of each row |
 * | `:has(selector)` | elements that contain at least one element matching the *selector* | `div:has(p)` finds divs that contain p elements |
 * | `:not(selector)` | elements that do not match the *selector*. See also [Elements.not] | `div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs. |
 * | `:contains(text)` | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. | `p:contains(dom)` finds p elements containing the text "dom". |
 * | `:matches(regex)` | elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. | `td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively. |
 * | `:containsOwn(text)` | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. | `p:containsOwn(dom)` finds p elements with own text "dom". |
 * | `:matchesOwn(regex)` | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. | `td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively. |
 * | `:containsData(data)` | elements that contain the specified *data*. The contents of `script` and `style` elements, and `comment` nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants. | `script:containsData(dom)` finds script elements containing the data "dom". |
 * | | The above may be combined in any order and with other selectors | `.light:contains(name):eq(0)` |

 * ## Structural pseudo selectors
 *
 * | Pattern | Matches | Example |
 * |---|---|---|
 * | `:root` | the element that is the root of the document. In HTML, this is the `html` element | `:root` |
 * | `:nth-child(an+b)` | elements that have `an+b-1` siblings **before** them in the document tree, for any positive integer or zero value of `n`, and that have a parent element. For values of `a` and `b` greater than zero, this effectively divides the element's children into groups of *a* elements (the last group taking the remainder) and selects the *b*th element of each group. For example, this allows selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The `a` and `b` values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition, `:nth-child()` can take `odd` and `even` as arguments: `odd` has the same meaning as `2n+1`, and `even` has the same meaning as `2n`. | `tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` the 9th, 19th, 29th, etc, element. `li:nth-child(5)` the 5th li |
 * | `:nth-last-child(an+b)` | elements that have `an+b-1` siblings **after** them in the document tree. Otherwise like `:nth-child()` | `tr:nth-last-child(-n+2)` the last two rows of a table |
 * | `:nth-of-type(an+b)` | elements that have `an+b-1` siblings with the same expanded element name *before* them in the document tree, for any zero or positive integer value of `n`, and that have a parent element | `img:nth-of-type(2n+1)` |
 * | `:nth-last-of-type(an+b)` | elements that have `an+b-1` siblings with the same expanded element name *after* them in the document tree, for any zero or positive integer value of `n`, and that have a parent element | `img:nth-last-of-type(2n+1)` |
 * | `:first-child` | elements that are the first child of some other element | `div > p:first-child` |
 * | `:last-child` | elements that are the last child of some other element | `ol > li:last-child` |
 * | `:first-of-type` | elements that are the first sibling of their type in the list of children of their parent element | `dl dt:first-of-type` |
 * | `:last-of-type` | elements that are the last sibling of their type in the list of children of their parent element | `tr > td:last-of-type` |
 * | `:only-child` | elements that have a parent element and whose parent element has no other element children | |
 * | `:only-of-type` | elements that have a parent element and whose parent element has no other element children with the same expanded element name | |
 * | `:empty` | elements that have no children at all | |
 *
 * @author Jonathan Hedley
 * @see Element.select
 */
object PowerSelector {
    private val logger = LoggerFactory.getLogger(PowerSelector::class.java)
    private val cache = ConcurrentExpiringLRUCache<String, Evaluator?>(Duration.ofMinutes(10))
    private val parseExceptions = ConcurrentExpiringLRUCache<String, AtomicInteger>(Duration.ofMinutes(10))
    private val totalParseExceptions = ConcurrentExpiringLRUCache<String, AtomicInteger>(Duration.ofMinutes(10))

    /**
     * Find elements matching selector.
     *
     * @param cssQuery A CSS query
     * @param root root element to descend into
     * @return matching elements, empty if none
     */
    fun select(cssQuery: String, root: Element): Elements {
        val cssQuery0 = cssQuery.trim()
        if (cssQuery0.isBlank()) {
            return Elements()
        }

        val evaluator = parseOrNullCached(cssQuery0, root.baseUri()) ?: return Elements()

        return select(evaluator, root)
    }

    fun select(cssQuery: String, root: Element, offset: Int = 1, limit: Int = Int.MAX_VALUE): Elements {
        checkArguments(cssQuery, offset, limit)

        return select(cssQuery, root).asSequence().drop(offset - 1).take(limit).toCollection(Elements())
    }

    fun <O> select(
        cssQuery: String, root: Element, offset: Int = 1, limit: Int = Int.MAX_VALUE,
        transformer: (Element) -> O
    ): List<O> {
        checkArguments(cssQuery, offset, limit)

        // TODO: do the filter inside Collector.collect
        return select(cssQuery, root).asSequence().drop(offset - 1).take(limit).map { transformer(it) }.toList()
    }

    /**
     * Find elements matching selector.
     *
     * @param cssQuery CSS query
     * @param roots root elements to descend into
     * @return matching elements, empty if none
     */
    fun select(cssQuery: String, roots: Iterable<Element>): Elements {
        val cssQuery0 = cssQuery.trim()
        if (cssQuery0.isBlank() || !roots.iterator().hasNext()) {
            return Elements()
        }

        val evaluator = parseOrNullCached(cssQuery0, roots.first().baseUri()) ?: return Elements()
        val elements = ArrayList<Element>()
        // dedupe elements by identity, not equality
        val seenElements = IdentityHashMap<Element, Boolean>()

        for (root in roots) {
            val found = select(evaluator, root)
            for (el in found) {
                if (!seenElements.containsKey(el)) {
                    elements.add(el)
                    seenElements[el] = true
                }
            }
        }

        return Elements(elements)
    }

    /**
     * Find the first element that matches the query.
     *
     * @param cssQuery CSS selector
     * @param root root element to descend into
     * @return the matching element, or **null** if none.
     */
    fun selectFirst(cssQuery: String, root: Element): Element? {
        val cssQuery0 = cssQuery.trim()
        if (cssQuery0.isBlank()) {
            return null
        }

        // Use the trimmed query so that surrounding quotes are recognized during normalization
        val evaluator = parseOrNullCached(cssQuery0, root.baseUri()) ?: return null

        return PowerCollector.findFirst(evaluator, root)
    }

    /**
     * Find elements matching selector.
     *
     * @param evaluator the evaluator parsed from a CSS selector
     * @param root root element to descend into
     * @return matching elements, empty if none
     */
    private fun select(evaluator: Evaluator, root: Element): Elements {
        return PowerCollector.collect(evaluator, root)
    }

    private fun parseOrNullCached(cssQuery: String, baseUri: String): Evaluator? {
        val query = normalizeQueryOrNull(cssQuery) ?: return null
        val key = "$baseUri $query"
        return cache.computeIfAbsent(key) { parseOrNull(query, baseUri) }
    }

    private fun parseOrNull(cssQuery: String, baseUri: String): Evaluator? {
        try {
            return PowerQueryParser.parse(cssQuery)
        } catch (e: PowerSelectorParseException) {
            var message = e.brief()
            if (!message.isNullOrBlank()) {
                val host = UrlUtils.getURLOrNull(baseUri)?.host
                val key = "$host $cssQuery"
                message = "$key\n>>>$message<<<"

                // count1: failures per query across all hosts; count2: failures per (host, query, message)
                val count1 = totalParseExceptions.computeIfAbsent(cssQuery) { AtomicInteger() }.incrementAndGet()
                val count2 = parseExceptions.computeIfAbsent(message) { AtomicInteger() }.incrementAndGet()

                // Log less and less often as the same failure keeps repeating
                if (count1 > 5000) {
                    if (count1 % 5000 == 0) {
                        logger.warn("Caught $count1 parse exceptions | $cssQuery")
                    }
                } else if (count1 > 3000 && count1 % 1000 == 0) {
                    logger.warn("Caught $count1 parse exceptions | $cssQuery")
                } else if (count1 > 1000 && count1 % 200 == 0) {
                    logger.warn("Caught $count1 parse exceptions | $cssQuery")
                } else {
                    if (count2 == 1) {
                        logger.warn("Failed to parse css query | $cssQuery | $baseUri | ${e.brief()}")
                    } else if (count2 < 50 && count2 % 10 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    } else if (count2 < 1000 && count2 % 100 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    } else if (count2 < 3000 && count2 % 1000 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    } else if (count2 % 5000 == 0) {
                        logger.warn("Caught $count2 parse exceptions | $cssQuery")
                    }
                }
            } else {
                logger.warn("Unexpected exception", e)
            }
        }

        return null
    }

    /**
     * Normalize the CSS query.
     */
    fun normalizeQueryOrNull(query: String): String? {
        // JCommander does not remove surrounding quotes, e.g. jcommander.parse("-outlink \"ul li a[href~=item]\"")
        val query0 = query.removeSurrounding("\"").takeIf { it.isNotBlank() } ?: return null
        return PowerEvaluator.encodeQuery(query0)
    }

    private fun checkArguments(cssQuery: String, offset: Int = 1, limit: Int) {
        if (cssQuery.isBlank()) {
            throw IllegalArgumentException("cssQuery should not be empty")
        }

        if (offset < 1) {
            throw IllegalArgumentException("Offset should be >= 1")
        }

        if (limit < 0) {
            throw IllegalArgumentException("Limit should be >= 0")
        }
    }
}
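
Two idioms in the listing above can be tried in isolation: the 1-based `offset`/`limit` windowing (`drop(offset - 1).take(limit)`) and the identity-based de-duplication used when selecting from several roots. The following is a minimal standalone sketch using only the Kotlin/JVM standard library. The helper names `window` and `dedupeByIdentity` are hypothetical and are not part of `PowerSelector`; plain lists stand in for `Elements`.

```kotlin
import java.util.IdentityHashMap

/**
 * Hypothetical helper mirroring the 1-based offset/limit windowing:
 * skip the first (offset - 1) items, then keep at most `limit` items.
 */
fun <T> window(items: List<T>, offset: Int = 1, limit: Int = Int.MAX_VALUE): List<T> {
    require(offset >= 1) { "Offset should be >= 1" }
    require(limit >= 0) { "Limit should be >= 0" }
    return items.asSequence().drop(offset - 1).take(limit).toList()
}

/**
 * Hypothetical helper that removes duplicates by reference identity (===)
 * rather than equals(), keeping first-seen order.
 */
fun <T : Any> dedupeByIdentity(items: Iterable<T>): List<T> {
    val seen = IdentityHashMap<T, Boolean>()
    val result = ArrayList<T>()
    for (item in items) {
        // put() returns null only the first time this exact instance is seen
        if (seen.put(item, true) == null) {
            result.add(item)
        }
    }
    return result
}
```

With these definitions, `window(listOf("a", "b", "c", "d"), offset = 2, limit = 2)` returns `[b, c]`. Two instances that are equal but distinct both survive `dedupeByIdentity`, which is presumably why the listing keys on identity: equality-based de-duplication could merge different elements that happen to compare equal.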



