package ai.platon.pulsar.skeleton.crawl.fetch.driver

import ai.platon.pulsar.browser.common.BrowserSettings
import ai.platon.pulsar.browser.driver.chrome.NetworkResourceResponse
import ai.platon.pulsar.common.browser.BrowserType
import ai.platon.pulsar.common.math.geometric.PointD
import ai.platon.pulsar.common.math.geometric.RectD
import ai.platon.pulsar.common.urls.Hyperlink
import ai.platon.pulsar.dom.nodes.GeoAnchor
import com.google.common.annotations.Beta
import org.jsoup.Connection
import java.io.Closeable
import java.time.Duration

/**
 * [WebDriver] defines a concise interface to visit and manipulate webpages.
 *
 * The webpage is rendered to a Document Object Model (DOM) in a real browser, and the interface provides methods to
 * control the browser, select textContent and attributes of Elements, and interact with the webpage.
 *
 * All actions and behaviors are optimized to mimic real people as closely as possible, such as scrolling, clicking,
 * typing text, dragging and dropping, etc.
 *
 * The term `document` here refers to a Document Object Model (DOM) within a browser.
 *
 * The methods in this interface fall into three categories:
 *
 * * Control of the browser itself
 * * Selection of textContent and attributes of Elements
 * * Interaction with the webpage
 *
 * Key methods:
 * * [navigateTo]: navigate to a URL.
 * * [currentUrl]: get the current URL displayed in the address bar.
 * * [scrollDown]: scroll down on a webpage to fully load the page. Most modern webpages lazy-load content with
 * AJAX, so content only starts to load when it is scrolled into view.
 * * [pageSource]: retrieve the source code of the webpage.
 *
 * For each document, there are several properties that represent the URL of the document:
 * * `driver.currentUrl()`: Returns the URL displayed in the address bar; the page may or may not have been navigated to.
 * * `driver.url()`, `document.URL`: Returns the URL of the document.
 * * `driver.documentURI()`, `document.documentURI`: Returns the URI of the document.
 * * `driver.baseURI()`, `document.baseURI`: Returns the base URI of the document.
 * * `document.location`: Represents the location (URL) of the current page and allows you to manipulate the URL.
 *
 * In the Document Object Model (DOM), the relationship between `document.URL`, `document.documentURI`,
 * `document.location`, and the URL displayed in the browser's address bar is as follows:
 * * `driver.currentUrl()`:
 *    - This read-only property returns the URL displayed in the browser's address bar, which is what users see and
 *      can edit directly.
 *    - The URL it returns may or may not have been navigated to.
 *    - When the page is loaded or when `document.location` is modified, the address bar is updated to reflect the new URL.
 *    - It is typically synchronized with `document.URL` and `document.location.href` (a property of `document.location`).
 * * `driver.url()`, `document.URL`:
 *    - This property returns the URL of the document as a string.
 *    - It is a read-only property and reflects the current URL of the document.
 *    - Changes to `document.location` will also update `document.URL`.
 * * `driver.documentURI()`, `document.documentURI`:
 *    - This property returns the URI of the document.
 *    - It is also a read-only property and typically contains the same value as `document.URL`.
 *    - However, `document.documentURI` is defined to be the URI that was provided to the parser, which could
 *      potentially differ from `document.URL` in certain cases, although in practice, this is rare.
 * * `driver.baseURI()`, `document.baseURI`:
 *    - This property returns the base URI of the document.
 *    - The base URI is used to resolve relative URLs within the document.
 *    - It is a read-only property and is typically the URL of the document, unless a `<base>` element is present
 *      in the document, in which case the value of the `href` attribute of the `<base>` element is used.
 *    - If no `<base>` element is present, the base URI is the same as `document.URL`.
 * * `document.location`:
 *    - This property represents the location (URL) of the current page and allows you to manipulate the URL.
 *    - It is a read-write property, which means you can change it to navigate to a different page or to manipulate
 *      query strings, fragments, etc.
 *    - Changes to `document.location` will cause the browser to navigate to the new URL, updating both `document.URL`
 *      and the URL displayed in the address bar.
 *
 * In summary, `document.URL` and `document.documentURI` are read-only properties that reflect the current URL of the
 * document, while `document.location` is a read-write property that not only reflects the current URL but also allows
 * you to navigate to a new one. The URL displayed in the address bar is a user-facing representation of the current
 * document's URL, which is usually in sync with `document.location`.
 *
 * In addition to the above properties, the method `driver.referrer()` returns the document's referrer.
 * The `document.referrer` property returns the URI of the page that linked to the current page. If the user navigated
 * directly to the page (e.g., via a bookmark), the value is an empty string. Inside an `