com.itextpdf.styledxmlparser.jsoup.select.Selector Maven / Gradle / Ivy

Go to download

Show more of this group Show more artifacts with this name
Show all versions of styled-xml-parser Show documentation

Styled XML parser is used by iText modules to parse HTML and XML

There is a newer version: 9.0.0

/*
    This file is part of the iText (R) project.
    Copyright (c) 1998-2020 iText Group NV
    Authors: iText Software.

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU Affero General Public License version 3
    as published by the Free Software Foundation with the addition of the
    following permission added to Section 15 as permitted in Section 7(a):
    FOR ANY PART OF THE COVERED WORK IN WHICH THE COPYRIGHT IS OWNED BY
    ITEXT GROUP. ITEXT GROUP DISCLAIMS THE WARRANTY OF NON INFRINGEMENT
    OF THIRD PARTY RIGHTS

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
    or FITNESS FOR A PARTICULAR PURPOSE.
    See the GNU Affero General Public License for more details.
    You should have received a copy of the GNU Affero General Public License
    along with this program; if not, see http://www.gnu.org/licenses or write to
    the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
    Boston, MA, 02110-1301 USA, or download the license from the following URL:
    http://itextpdf.com/terms-of-use/

    The interactive user interfaces in modified source and object code versions
    of this program must display Appropriate Legal Notices, as required under
    Section 5 of the GNU Affero General Public License.

    In accordance with Section 7(b) of the GNU Affero General Public License,
    a covered work must retain the producer line in every PDF that is created
    or manipulated using iText.

    You can be released from the requirements of the license by purchasing
    a commercial license. Buying such a license is mandatory as soon as you
    develop commercial activities involving the iText software without
    disclosing the source code of your own applications.
    These activities include: offering paid services to customers as an ASP,
    serving PDFs on the fly in a web application, shipping iText with a closed
    source product.

    For more information, please contact iText Software Corp. at this
    address: [email protected]
 */
package com.itextpdf.styledxmlparser.jsoup.select;

import com.itextpdf.styledxmlparser.jsoup.helper.Validate;
import com.itextpdf.styledxmlparser.jsoup.nodes.Element;

import com.itextpdf.io.util.MessageFormatUtil;
import java.util.ArrayList;
import java.util.Collection;
import java.util.IdentityHashMap;

/**
 * CSS-like element selector, that finds elements matching a query.
 *
 * Selector syntax
 * 
 * A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against
 * elements, attributes, and attribute values).
 * 

 * The universal selector (*) is implicit when no element selector is supplied (i.e. {@code *.header} and {@code .header}
 * is equivalent).
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * Pattern Matches Example
* any element *
tag elements with the given tag name div
ns|E elements of type E in the namespace ns fb|name finds <fb:name> elements
#id elements with attribute ID of "id" div#wrap, #logo
.class elements with a class name of "class" div.left, .result
[attr] elements with an attribute named "attr" (with any value) a[href], [title]
[^attrPrefix] elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets [^data-], div[^data-]
[attr=val] elements with an attribute named "attr", and value equal to "val" img[width=500], a[rel=nofollow]
[attr="val"] elements with an attribute named "attr", and value equal to "val" span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"]
[attr^=valPrefix] elements with an attribute named "attr", and value starting with "valPrefix" a[href^=http:]
[attr$=valSuffix] elements with an attribute named "attr", and value ending with "valSuffix" img[src$=.png]
[attr*=valContaining] elements with an attribute named "attr", and value containing "valContaining" a[href*=/search/]
[attr~=regex] elements with an attribute named "attr", and value matching the regular expression img[src~=(?i)\\.(png|jpe?g)]
The above may be combined in any order div.header[title]
Combinators
E F an F element descended from an E element div a, .logo h1
E {@literal >} F an F direct child of E ol {@literal >} li
E + F an F element immediately preceded by sibling E li + li, div.head + div
E ~ F an F element preceded by sibling E h1 ~ p
E, F, G all matching elements E, F, or G a[href], div, h3
Pseudo selectors
:lt(n) elements whose sibling index is less than n td:lt(3) finds the first 3 cells of each row
:gt(n) elements whose sibling index is greater than n td:gt(1) finds cells after skipping the first two
:eq(n) elements whose sibling index is equal to n td:eq(0) finds the first cell of each row
:has(selector) elements that contains at least one element matching the selector div:has(p) finds divs that contain p elements 
:not(selector) elements that do not match the selector. See also {@link Elements#not(String)} div:not(.logo) finds all divs that do not have the "logo" class.div:not(:has(div)) finds divs that do not contain divs.
:contains(text) elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. p:contains(jsoup) finds p elements containing the text "jsoup".
:matches(regex) elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. td:matches(\\d+) finds table cells containing digits. div:matches((?i)login) finds divs containing the text, case insensitively.
:containsOwn(text) elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. p:containsOwn(jsoup) finds p elements with own text "jsoup".
:matchesOwn(regex) elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. td:matchesOwn(\\d+) finds table cells directly containing digits. div:matchesOwn((?i)login) finds divs containing the text, case insensitively.
The above may be combined in any order and with other selectors .light:contains(name):eq(0)
Structural pseudo selectors
:root The element that is the root of the document. In HTML, this is the html element :root
:nth-child(an+b) elements that have an+b-1 siblings before it in the document tree, for any positive integer or zero value of n, and has a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1.

 * In addition to this, :nth-child() can take odd and even as arguments instead. odd has the same signification as 2n+1, and even has the same signification as 2n. tr:nth-child(2n+1) finds every odd row of a table. :nth-child(10n-1) the 9th, 19th, 29th, etc, element. li:nth-child(5) the 5h li
:nth-last-child(an+b) elements that have an+b-1 siblings after it in the document tree. Otherwise like :nth-child() tr:nth-last-child(-n+2) the last two rows of a table
:nth-of-type(an+b) pseudo-class notation represents an element that has an+b-1 siblings with the same expanded element name before it in the document tree, for any zero or positive integer value of n, and has a parent element img:nth-of-type(2n+1)
:nth-last-of-type(an+b) pseudo-class notation represents an element that has an+b-1 siblings with the same expanded element name after it in the document tree, for any zero or positive integer value of n, and has a parent element img:nth-last-of-type(2n+1)
:first-child elements that are the first child of some other element. div {@literal >} p:first-child
:last-child elements that are the last child of some other element. ol {@literal >} li:last-child
:first-of-type elements that are the first sibling of its type in the list of children of its parent element dl dt:first-of-type
:last-of-type elements that are the last sibling of its type in the list of children of its parent element tr {@literal >} td:last-of-type
:only-child elements that have a parent element and whose parent element hasve no other element children 
:only-of-type  an element that has a parent element and whose parent element has no other element children with the same expanded element name 
:empty elements that have no children at all 
 *
 * @author Jonathan Hedley, [email protected]
 * @see Element#select(String)
 */
public class Selector {
    private final Evaluator evaluator;
    private final Element root;

    private Selector(String query, Element root) {
        Validate.notNull(query);
        query = query.trim();
        Validate.notEmpty(query);
        Validate.notNull(root);

        this.evaluator = QueryParser.parse(query);

        this.root = root;
    }

    private Selector(Evaluator evaluator, Element root) {
        Validate.notNull(evaluator);
        Validate.notNull(root);

        this.evaluator = evaluator;
        this.root = root;
    }

    /**
     * Find elements matching selector.
     *
     * @param query CSS selector
     * @param root  root element to descend into
     * @return matching elements, empty if none
     * @throws Selector.SelectorParseException (unchecked) on an invalid CSS query.
     */
    public static Elements select(String query, Element root) {
        return new Selector(query, root).select();
    }

    /**
     * Find elements matching selector.
     *
     * @param evaluator CSS selector
     * @param root root element to descend into
     * @return matching elements, empty if none
     */
    public static Elements select(Evaluator evaluator, Element root) {
        return new Selector(evaluator, root).select();
    }

    /**
     * Find elements matching selector.
     *
     * @param query CSS selector
     * @param roots root elements to descend into
     * @return matching elements, empty if none
     */
    public static Elements select(String query, Iterable roots) {
        Validate.notEmpty(query);
        Validate.notNull(roots);
        Evaluator evaluator = QueryParser.parse(query);
        ArrayList elements = new ArrayList();
        IdentityHashMap seenElements = new IdentityHashMap();
        // dedupe elements by identity, not equality

        for (Element root : roots) {
            final Elements found = select(evaluator, root);
            for (Element el : found) {
                if (!seenElements.containsKey(el)) {
                    elements.add(el);
                    seenElements.put(el, Boolean.TRUE);
                }
            }
        }
        return new Elements(elements);
    }

    private Elements select() {
        return Collector.collect(evaluator, root);
    }

    // exclude set. package open so that Elements can implement .not() selector.
    static Elements filterOut(Collection elements, Collection outs) {
        Elements output = new Elements();
        for (Element el : elements) {
            boolean found = false;
            for (Element out : outs) {
                if (el.equals(out)) {
                    found = true;
                    break;
                }
            }
            if (!found)
                output.add(el);
        }
        return output;
    }

    public static class SelectorParseException extends IllegalStateException {
        public SelectorParseException(String msg, Object... params) {
            super(MessageFormatUtil.format(msg, params));
        }
    }
}

Pattern	Matches	Example
`*`	any element	`*`
`tag`	elements with the given tag name	`div`
`ns\|E`	elements of type E in the namespace ns	`fb\|name` finds `<fb:name>` elements
`#id`	elements with attribute ID of "id"	`div#wrap`, `#logo`
`.class`	elements with a class name of "class"	`div.left`, `.result`
`[attr]`	elements with an attribute named "attr" (with any value)	`a[href]`, `[title]`
`[^attrPrefix]`	elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets	`[^data-]`, `div[^data-]`
`[attr=val]`	elements with an attribute named "attr", and value equal to "val"	`img[width=500]`, `a[rel=nofollow]`
`[attr="val"]`	elements with an attribute named "attr", and value equal to "val"	`span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]`
`[attr^=valPrefix]`	elements with an attribute named "attr", and value starting with "valPrefix"	`a[href^=http:]`
`[attr$=valSuffix]`	elements with an attribute named "attr", and value ending with "valSuffix"	`img[src$=.png]`
`[attr*=valContaining]`	elements with an attribute named "attr", and value containing "valContaining"	`a[href*=/search/]`
`[attr~=regex]`	elements with an attribute named "attr", and value matching the regular expression	`img[src~=(?i)\\.(png\|jpe?g)]`
	The above may be combined in any order	`div.header[title]`
Combinators
`E F`	an F element descended from an E element	`div a`, `.logo h1`
`E {@literal >} F`	an F direct child of E	`ol {@literal >} li`
`E + F`	an F element immediately preceded by sibling E	`li + li`, `div.head + div`
`E ~ F`	an F element preceded by sibling E	`h1 ~ p`
`E, F, G`	all matching elements E, F, or G	`a[href], div, h3`
Pseudo selectors
`:lt(n)`	elements whose sibling index is less than n	`td:lt(3)` finds the first 3 cells of each row
`:gt(n)`	elements whose sibling index is greater than n	`td:gt(1)` finds cells after skipping the first two
`:eq(n)`	elements whose sibling index is equal to n	`td:eq(0)` finds the first cell of each row
`:has(selector)`	elements that contains at least one element matching the selector	`div:has(p)` finds divs that contain p elements
`:not(selector)`	elements that do not match the selector. See also {@link Elements#not(String)}	`div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs.
`:contains(text)`	elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants.	`p:contains(jsoup)` finds p elements containing the text "jsoup".
`:matches(regex)`	elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants.	`td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively.
`:containsOwn(text)`	elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants.	`p:containsOwn(jsoup)` finds p elements with own text "jsoup".
`:matchesOwn(regex)`	elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants.	`td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively.
	The above may be combined in any order and with other selectors	`.light:contains(name):eq(0)`
Structural pseudo selectors
`:root`	The element that is the root of the document. In HTML, this is the `html` element	`:root`
`:nth-child(an+b)`	elements that have `an+b-1` siblings before it in the document tree, for any positive integer or zero value of `n`, and has a parent element. For values of `a` and `b` greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The `a` and `b` values must be integers (positive, negative, or zero). The index of the first child of an element is 1. * In addition to this, `:nth-child()` can take `odd` and `even` as arguments instead. `odd` has the same signification as `2n+1`, and `even` has the same signification as `2n`.	`tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` the 9th, 19th, 29th, etc, element. `li:nth-child(5)` the 5h li
`:nth-last-child(an+b)`	elements that have `an+b-1` siblings after it in the document tree. Otherwise like `:nth-child()`	`tr:nth-last-child(-n+2)` the last two rows of a table
`:nth-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name before it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-of-type(2n+1)`
`:nth-last-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name after it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-last-of-type(2n+1)`
`:first-child`	elements that are the first child of some other element.	`div {@literal >} p:first-child`
`:last-child`	elements that are the last child of some other element.	`ol {@literal >} li:last-child`
`:first-of-type`	elements that are the first sibling of its type in the list of children of its parent element	`dl dt:first-of-type`
`:last-of-type`	elements that are the last sibling of its type in the list of children of its parent element	`tr {@literal >} td:last-of-type`
`:only-child`	elements that have a parent element and whose parent element hasve no other element children
`:only-of-type`	an element that has a parent element and whose parent element has no other element children with the same expanded element name
`:empty`	elements that have no children at all