package ai.platon.pulsar.skeleton.crawl.fetch.driver
import ai.platon.pulsar.browser.common.BrowserSettings
import ai.platon.pulsar.browser.driver.chrome.NetworkResourceResponse
import ai.platon.pulsar.common.browser.BrowserType
import ai.platon.pulsar.common.math.geometric.PointD
import ai.platon.pulsar.common.math.geometric.RectD
import ai.platon.pulsar.common.urls.Hyperlink
import ai.platon.pulsar.dom.nodes.GeoAnchor
import com.google.common.annotations.Beta
import org.jsoup.Connection
import java.io.Closeable
import java.time.Duration
/**
* [WebDriver] defines a concise interface to visit and manipulate webpages.
*
* The webpage is rendered to a Document Object Model (DOM) in a real browser, and the interface provides methods to
* control the browser, select textContent and attributes of Elements, and interact with the webpage.
*
* All actions and behaviors are optimized to mimic real people as closely as possible, such as scrolling, clicking,
* typing text, dragging and dropping, etc.
*
* The term `document` here refers to a Document Object Model (DOM) within a browser.
*
* The methods in this interface fall into three categories:
*
* * Control of the browser itself
* * Selection of textContent and attributes of Elements
* * Interaction with the webpage
*
* Key methods:
* * [navigateTo]: navigate to a URL.
* * [currentUrl]: get the current URL displayed in the address bar.
* * [scrollDown]: scroll down on a webpage to fully load the page. Most modern webpages support lazy loading
* using AJAX, where page content only starts to load when it is scrolled into view.
* * [pageSource]: retrieve the source code of the webpage.
*
* For each document, there are several properties that represent the URL of the document:
* * `driver.currentUrl()`: Returns the URL displayed in the address bar, which may or may not have been navigated to.
* * `driver.url()`, `document.URL`: Returns the URL of the document.
* * `driver.documentURI()`, `document.documentURI`: Returns the URI of the document.
* * `driver.baseURI()`, `document.baseURI`: Returns the base URI of the document.
* * `document.location`: Represents the location (URL) of the current page and allows you to manipulate the URL.
*
* In the Document Object Model (DOM), the relationship between `document.URL`, `document.documentURI`,
* `document.location`, and the URL displayed in the browser's address bar is as follows:
* * `driver.currentUrl()`:
* - This is the URL displayed in the browser's address bar, which users see and can edit directly.
* - Through the driver it is read-only, and it may or may not have been navigated to.
* - When the page is loaded or when `document.location` is modified, the address bar is updated to reflect the new URL.
* - It is typically synchronized with `document.URL` and `document.location.href` (a property of `document.location`).
* * `driver.url()`, `document.URL`:
* - This property returns the URL of the document as a string.
* - It is a read-only property and reflects the current URL of the document.
* - Changes to `document.location` will also update `document.URL`.
* * `driver.documentURI()`, `document.documentURI`:
* - This property returns the URI of the document.
* - It is also a read-only property and typically contains the same value as `document.URL`.
* - However, `document.documentURI` is defined to be the URI that was provided to the parser, which could
* potentially differ from `document.URL` in certain cases, although in practice, this is rare.
* * `driver.baseURI()`, `document.baseURI`:
* - This property returns the base URI of the document.
* - The base URI is used to resolve relative URLs within the document.
* - It is a read-only property and is typically the URL of the document, unless a `<base>` element is present
* in the document, in which case the value of the `href` attribute of the `<base>` element is used.
* - If no `<base>` element is present, the base URI is the same as `document.URL`.
* * `document.location`:
* - This property represents the location (URL) of the current page and allows you to manipulate the URL.
* - It is a read-write property, which means you can change it to navigate to a different page or to manipulate
* query strings, fragments, etc.
* - Changes to `document.location` will cause the browser to navigate to the new URL, updating both `document.URL`
* and the URL displayed in the address bar.
*
* In summary, `document.URL` and `document.documentURI` are read-only properties that reflect the current URL of the
* document, while `document.location` is a read-write property that not only reflects the current URL but also allows
* you to navigate to a new one. The URL displayed in the address bar is a user-facing representation of the current
* document's URL, which is usually in sync with `document.location`.
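*
* The base URI rule described above can be illustrated with a minimal sketch. It assumes only the JDK's
* `java.net.URI` and does not use this interface; `resolveLink` is a hypothetical helper, not part of [WebDriver].
* It resolves a relative link against the `<base href>` value when one is given, and against the document URL
* otherwise.
*
* ```kotlin
* import java.net.URI
*
* // Hypothetical helper for illustration only; not part of WebDriver.
* // Resolves a relative link the way a browser would, given the document URL
* // and the optional href of a <base> element.
* fun resolveLink(documentUrl: String, baseHref: String?, link: String): String {
*     val documentUri = URI(documentUrl)
*     // Without a <base> element, the base URI is the document URL itself.
*     // With one, its href is first resolved against the document URL.
*     val baseUri = if (baseHref == null) documentUri else documentUri.resolve(baseHref)
*     return baseUri.resolve(link).toString()
* }
*
* fun main() {
*     val documentUrl = "https://example.com/docs/page.html"
*     // No <base> element: prints https://example.com/docs/img/a.png
*     println(resolveLink(documentUrl, null, "img/a.png"))
*     // <base href="https://cdn.example.com/assets/">: prints https://cdn.example.com/assets/img/a.png
*     println(resolveLink(documentUrl, "https://cdn.example.com/assets/", "img/a.png"))
* }
* ```
*
* The same relative link therefore points to different resources depending on whether a `<base>` element is
* present, which is why `driver.baseURI()` rather than `driver.url()` is the value to resolve links against.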
*
* In addition to the above properties, the method `driver.referrer()` returns the document's referrer.
* The `document.referrer` property returns the URI of the page that linked to the current page. If the user navigated
* directly to the page (e.g., via a bookmark), the value is an empty string. Inside an `