From b6130745f44545c944ae8b682f8defdf176d7ee9 Mon Sep 17 00:00:00 2001
From: Isaac Muse
Date: Sat, 17 Aug 2019 18:58:02 -0600
Subject: [PATCH] Selector doc improvements (#153)

* Improvements to selector documentation

- Move non-applicable pseudo classes to "Non-Applicable Pseudo Classes"
- Rework some paragraphs
- Add links to developer.mozilla.org selector references
- Start adding usage examples
- Removing spacing after badge icons

* Add more selector examples

* Update escape related info

* Add more examples

* More examples

* More selector docs improvements

* More examples and document enhancements

* Finish pseudo class examples

* Finish up adding examples for namespaces

* Fix some typos
---
 docs/src/dictionary/en-custom.txt             |    1 +
 .../markdown/_snippets/selector_styles.txt    |    1 -
 docs/src/markdown/differences.md              |   11 +-
 docs/src/markdown/selectors.md                | 2541 ++++++++++++-----
 soupsieve/__meta__.py                         |    2 +-
 5 files changed, 1886 insertions(+), 670 deletions(-)

diff --git a/docs/src/dictionary/en-custom.txt b/docs/src/dictionary/en-custom.txt
index cc9c6111..3b7ba6a6 100644
--- a/docs/src/dictionary/en-custom.txt
+++ b/docs/src/dictionary/en-custom.txt
@@ -30,6 +30,7 @@ accessor
 amongst
 boolean
 builtin
+centric
 combinator
 combinators
 deprecations
diff --git a/docs/src/markdown/_snippets/selector_styles.txt b/docs/src/markdown/_snippets/selector_styles.txt
index 039bcb1b..02c88808 100644
--- a/docs/src/markdown/_snippets/selector_styles.txt
+++ b/docs/src/markdown/_snippets/selector_styles.txt
@@ -6,7 +6,6 @@ span.star::after {
     position: relative;
     display: inline-block;
     font: normal normal normal 16px/1 FontAwesome;
-    padding-right: .50rem;
     -moz-osx-font-smoothing: initial;
     -webkit-font-smoothing: initial;
     font-weight: 400;
diff --git a/docs/src/markdown/differences.md b/docs/src/markdown/differences.md
index c6e37935..28d519f2 100644
--- a/docs/src/markdown/differences.md
+++ b/docs/src/markdown/differences.md
@@ -18,13 +18,20 @@ about a malformed attribute, you may
need to quote the value. For instance, if you previously used a selector like this:

 ```py3
-soup.select('[div={}]')
+soup.select('[attr={}]')
 ```

 You would need to quote the value as `{}` is not a valid CSS identifier, so it must be quoted:

 ```py3
-soup.select('[div="{}"]')
+soup.select('[attr="{}"]')
+```
+
+You can also use the [escape](./api.md#soupsieveescape) function to escape dynamic content:
+
+```py3
+import soupsieve
+soup.select('[attr=%s]' % soupsieve.escape('{}'))
 ```

 ## CSS Identifiers
diff --git a/docs/src/markdown/selectors.md b/docs/src/markdown/selectors.md
index 94e05af1..e8ac3401 100644
--- a/docs/src/markdown/selectors.md
+++ b/docs/src/markdown/selectors.md
@@ -1,8 +1,17 @@
 # CSS Selectors

-The CSS selectors are based off of the CSS specification and includes not only stable selectors, but also selectors
-currently under development from the draft specifications. Primarily support has been added for selectors that were
-feasible to implement and most likely to get practical use.
+## Overview
+
+The CSS selectors are based on the CSS specification and include not only stable selectors, but may also include
+selectors currently under development from the draft specifications. Support has primarily been added for selectors that
+were feasible to implement and most likely to get practical use. In addition to the selectors in the specification,
+Soup Sieve also supports a couple of non-standard selectors.
+
+Soup Sieve aims to allow users to target XML/HTML elements with CSS selectors. It implements many pseudo classes, but it
+does not currently implement any pseudo elements and has no plans to do so. Soup Sieve also will not match anything for
+pseudo classes that are only relevant in a live browser environment, but it will gracefully handle them if they've been
+implemented; such pseudo classes are non-applicable in the Beautiful Soup environment and are noted in [Non-Applicable
+Pseudo Classes](#non-applicable-pseudo-classes).
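As a rough illustration of the behavior described for browser-only pseudo classes, the sketch below assumes `beautifulsoup4` 4.7 or later (where `select` is backed by Soup Sieve) and uses Python's built-in `html.parser` rather than `html5lib`. The markup and variable names are invented for the example and are not part of this change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = '<p><a href="#one">One</a> <a href="#two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# A plain type selector matches both links.
links = soup.select("a")

# `:hover` depends on live browser state, so it is accepted by the
# parser but is expected to match nothing outside a browser.
hovered = soup.select("a:hover")

print(len(links), len(hovered))
```

The point of the sketch is only that such a selector is handled without raising an error and simply yields no elements.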
 When speaking about namespaces, they only apply to XML, XHTML, or when dealing with recognized foreign tags in HTML5.
 Currently, Beautiful Soup's `html5lib` parser is the only parser that will return the appropriate namespaces for a HTML5
@@ -20,15 +29,6 @@ any are found.
 Description
-
-
-Some selectors are dependent upon certain states in a web browser or other context which is simply not present outside a
-web browser. An example would be the `:focus` selector. In Soup Sieve, `:focus` will match nothing because elements
-cannot be focused outside of a browser without simulation, or somehow connecting to a browser. These types of selectors,
-that provide no meaningful information in Soup Sieve, will be marked with .
-
-
-
 Some selectors are very specific to HTML and either have no meaningful representation in XML, or such functionality has
@@ -52,6 +52,11 @@ All selectors that are from the current working draft of CSS4 are considered exp
 . Additionally, if there are other immature selectors, they may be marked as experimental as well. Experimental may mean
 we are not entirely sure if our implementation is correct, that things may still be in flux
 as they are part of a working draft, or even both.
+
+If at any time a working draft drops a selector from the current draft, it will most likely also be removed here,
+most likely with a deprecation path, except where there may be a conflict that requires a less graceful transition.
+One exception is in the rare case that the selector is found to be far too useful despite being rejected. In these
+cases, we may adopt them as "custom" selectors.
@@ -73,17 +78,37 @@ as they are part of a working draft, or even both.
 [HTML Living Standard](https://html.spec.whatwg.org/)
 :   The HTML Living Standard document. Defines semantics regarding HTML.

-!!! warning "Working Draft Selectors"
+## Selector Terminology
+
+Certain terminology is used throughout this document when describing selectors.
In order to fully understand the syntax
+a selector may implement, it is important to understand a couple of key terms.
+
+### Selector
+
+Selector is used to describe any selector whether it is a [simple](#simple-selector), [compound](#compound-selector), or
+[complex](#complex-selector) selector.
+
+### Simple Selector

-    If at anytime a working draft drops a selector from the current draft, it will most likely also be removed here,
-    most likely with a deprecation path, except where there may be a conflict that requires a less graceful transition.
-    One exception is in the rare case that the selector is found to be far too useful despite being rejected. In these
-    cases, we may adopt them as "custom" selectors.
+A simple selector represents a single condition on an element. It can be a [type selector](#type-selectors),
+[universal selector](#universal-selectors), [ID selector](#id-selectors), [class selector](#class-selectors),
+[attribute selector](#attribute-selectors), or [pseudo class selector](#pseudo-classes).

-!!! danger "Not Implemented"
-    Pseudo elements are not supported as they do not represent real elements.
+### Compound Selector

-    At-rules (`@page`, etc.) are also not supported.
+A [compound](#compound-selector) selector is a sequence of [simple](#simple-selector) selectors. It does not contain any
+[combinators](#combinators-and-selector-lists). If a universal or type selector is used, it must come first, and only
+one universal or type selector can be used; the two cannot be used at the same time.
+
+### Complex Selector
+
+A complex selector consists of multiple [simple](#simple-selector) or [compound](#compound-selector) selectors joined
+with [combinators](#combinators-and-selector-lists).
+
+### Selector List
+
+A selector list is a list of selectors joined with a comma (`,`). A selector list is used to specify that a match is
+valid if any of the selectors in a list matches.
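The terminology above can be sketched with a small, hedged example. It assumes `beautifulsoup4` 4.7 or later with Python's built-in `html.parser`; the markup and the `texts` helper are hypothetical and exist only to show one selector of each kind.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<div id="main" class="box">
  <p class="note">First</p>
  <span>Second</span>
</div>
<p>Third</p>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by `selector` (helper for this sketch)."""
    return [el.get_text() for el in soup.select(selector)]


simple = texts("p")             # simple: a single type selector
compound = texts("p.note")      # compound: type + class, no combinator
complex_ = texts("div > span")  # complex: compounds joined by a combinator
listed = texts("span, p.note")  # selector list: a match on either selector counts

print(simple, compound, complex_, listed)
```

Results come back in document order, which is why the selector list returns the paragraph before the span even though the span is named first.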
 ## Escapes
@@ -96,6 +121,8 @@ you need to terminate an escape to avoid it accumulating unintended hexadecimal
 character: `#!css \+` --> `+`. The one exception is that you cannot escape the form feed, newline, or carriage return.

+You can always use Soup Sieve's [escape function](./api.md#soupsieveescape) to escape identifiers as well.
+
 ## Basic Selectors

 ### Type Selectors
@@ -106,212 +133,492 @@ If a default namespace is defined in the [namespace dictionary](./api.md#namespa
 [namespace](#namespace-selectors) is explicitly defined, it will be assumed that the element must be in the default
 namespace.

-!!! example "Type Example"
-    The following would select all `#!html
` elements. +```css tab="Syntax" +element +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...
Here is some text.
+...
Here is some more text.
+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('div')) +[
Here is some text.
,
Here is some more text.
] +``` - ```css - div - ``` +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors ### Universal Selectors The Universal selector (`*`) matches elements of any type. -!!! example - The following would match any element: `div`, `a`, `p`, etc. +```css tab="Syntax" +* +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...

Here is some text.

+...
Here is some more text.
+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('*')) +[ + +
Here is some text.
+
Here is some more text.
+ + +, , +
Here is some text.
+
Here is some more text.
+ + +,
Here is some text.
,
Here is some more text.
] +``` - ```css - * - ``` +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/Universal_selectors ### ID Selectors The ID selector matches an element based on its `id` attribute. The ID must match exactly. -!!! example - The following would select the element with the id `some-id`. +```css tab="Syntax" +#id +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...
Here is some text.
+...
Here is some more text.
+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('#some-id')) +[
Here is some text.
] +``` - ```css - #some-id - ``` +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors + +!!! note "XML Support" + While the use of the `id` attribute (in the context of CSS) is a very HTML centric idea, it is supported for XML as + well because Beautiful Soup supported it before Soup Sieve's existence. ### Class Selectors The class selector matches an element based on the values contained in the `class` attribute. The `class` attribute is treated as a whitespace separated list, where each item is a **class**. -!!! example - The following would select the elements with the class `some-class`. +```css tab="Syntax" +.class +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...
Here is some text.
+...
Here is some more text.
+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('.some-class')) +[
Here is some text.
] +``` - ```css - .some-class - ``` +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors + +!!! note "XML Support" + While the use of the `class` attribute (in the context of CSS) is a very HTML centric idea, it is supported for XML + as well because Beautiful Soup supported it before Soup Sieve's existence. ### Attribute Selectors The attribute selector matches an element based on its attributes. When specifying a value of an attribute, if it contains whitespace or special characters, you should quote them with either single or double quotes. +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors + `[attribute]` : Represents elements with an attribute named **attribute**. - !!! example - The following would select all elements with a `target` attribute. + ```css tab="Syntax" + [attr] + ``` - ```css - [target] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href]')) + [Internal link, Example link, Insensitive internal link, Example org link] + ``` `[attribute=value]` : Represents elements with an attribute named **attribute** that also has a value of **value**. - !!! example - The following would select all elements with the `target` attribute whose value was `_blank`. + ```css tab="Syntax" + [attr=value] + [attr="value"] + ``` - ```css - [target=_blank] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href="#internal"]')) + [Internal link] + ``` `[attribute~=value]` : Represents elements with an attribute named **attribute** whose value is a space separated list which contains **value**. - !!! 
example - The following would select all elements with a `title` attribute containing the word `flower`. + ```css tab="Syntax" + [attr~=value] + [attr~="value"] + ``` - ```css - [title~=flower] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[class~=class2]')) + [Internal link] + ``` `[attribute|=value]` : Represents elements with an attribute named **attribute** whose value is a dash separated list that starts with **value**. - !!! example - The following would select all elements with a `lang` attribute value starting with `en`. + ```css tab="Syntax" + [attr|=value] + [attr|="value"] + ``` - ```css - [lang|=en] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ...
Some text
+ ...
Some more text
+ ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('div[lang|="en"]')) + [
Some text
,
Some more text
] + ``` `[attribute^=value]` : Represents elements with an attribute named **attribute** whose value starts with **value**. - !!! example - The following selects every `#!html ` element whose `href` attribute value begins with `https`. + ```css tab="Syntax" + [attr^=value] + [attr^="value"] + ``` - ```css - a[href^="https"] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href^=http]')) + [Example link, Example org link] + ``` `[attribute$=value]` : Represents elements with an attribute named **attribute** whose value ends with **value**. - !!! example - The following would select every `#!html ` element whose `href` attribute value ends with `.pdf`. + ```css tab="Syntax" + [attr$=value] + [attr$="value"] + ``` - ```css - a[href$=".pdf"] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href$=org]')) + [Example org link] + ``` `[attribute*=value]` : Represents elements with an attribute named **attribute** whose value containing the substring **value**. - !!! example - The following would select every `#!html ` element whose `href` attribute value contains the substring - `sometext`. + ```css tab="Syntax" + [attr*=value] + [attr*="value"] + ``` - ```css - a[href*="sometext"] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href*="example"]')) + [Example link, Example org link] + ``` `[attribute!=value]` : Equivalent to `#!css :not([attribute=value])`. - !!! example - Selects all elements who do not have a `target` attribute or do not have one with a value that matches `_blank`. 
+ ```css tab="Syntax" + [attr!=value] + [attr!="value"] + ``` - ```css - [target!=_blank] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('a[href!="#internal"]')) + [Example link, Insensitive internal link, Example org link] + ``` `[attribute operator value i]` : Represents elements with an attribute named **attribute** and whose value, when the **operator** is applied, matches - **value** *without* case sensitivity. + **value** *without* case sensitivity. In general, attribute comparison is insensitive in normal HTML, but not XML. + `i` is most useful in XML documents. - !!! example - The following would select any element with a `title` that equals `flower` regardless of case. + ```css tab="Syntax" + [attr=value i] + [attr="value" i] + ``` - ```css - [title=flower i] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href="#INTERNAL" i]')) + [Internal link] + ``` `[attribute operator value s]` : Represents elements with an attribute named **attribute** and whose value, when the **operator** is applied, matches **value** *with* case sensitivity. - !!! example - The following would select any element with a `type` that equals `submit`. Case sensitivity will be forced. + ```css tab="Syntax" + [attr=value s] + [attr="value" s] + ``` - ```css - [type=submit s] - ``` + ```pycon3 tab="Usage" + >>> from bs4 import BeautifulSoup as bs + >>> html = """ + ... + ... + ... + ... + ... + ... + ... """ + >>> soup = bs(html, 'html5lib') + >>> print(soup.select('[href="#INTERNAL" s]')) + [] + >>> print(soup.select('[href="#internal" s]')) + [Internal link] + ``` ### Namespace Selectors -Namespace selectors are used in conjunction with type selectors. 
They are specified with by declaring the namespace and -the type separated with `|`: `namespace|type`. `namespace` in this context is the prefix defined via the [namespace -dictionary](./api.md#namespaces). The prefix does not need to match the prefix in the document as it is the namespace -that is compared, not the prefix. +Namespace selectors are used in conjunction with type and universal selectors as well as attribute names in attribute +selectors. They are specified by declaring the namespace and the selector separated with `|`: `namespace|selector`. +`namespace`, in this context, is the prefix defined via the [namespace dictionary](./api.md#namespaces). The prefix +defined for the CSS selector does not need to match the prefix name in the document as it is the namespace associated +with the prefix that is compared, not the prefix itself. -The universal selector (`*`) can be used to represent any namespace as it can with type. +The universal selector (`*`) can be used to represent any namespace just as it can with types. -Namespaces can be used with attribute selectors as well except that when `[|attribute`] is used, it is equivalent to -`[attribute]`. - -`*|*` -: - Represents any element with or without a namespace. +By default, type selectors without a namespace selector will match any element whose type matches, regardless of +namespace. But if a CSS default namespace is declared (one with an empty key: `{"": "http://www.w3.org/1999/xhtml"}`), +all type selectors will assume the default namespace unless an explicit namespace selector is specified. For example, +if the default namespace was defined to be `http://www.w3.org/1999/xhtml`, the selector `a` would only match `a` tags that +are within the `http://www.w3.org/1999/xhtml` namespace. The one exception is within pseudo classes (`:not()`, `:has()`, +etc.) as namespaces are not considered within pseudo classes unless one is explicitly specified. - !!! example - The following would select the `#!html
` element with or without a namespace. +If the namespace is omitted (`|element`), any element without a namespace will be matched. In HTML documents that +support namespaces (XHTML and HTML5), HTML elements are counted as part of the `http://www.w3.org/1999/xhtml` namespace, +but attributes usually do not have a namespace unless one is explicitly defined in the markup. - ```css - *|div - ``` - -`namespace|*` -: - Represents any element with a namespace that is associated with the prefix `namespace` as defined in the [namespace - dictionary](./api.md#namespaces). - - !!! example - The following would select the `#!html ` element with the namespace `svg`. - - ```css - svg|circle - ``` - -`|*` -: - Represents any element with no defined namespace. - - !!! example - The following would select a `#!html
` element that has no namespace. +Namespaces can be used with attribute selectors as well except that when `[|attribute]` is used, it is equivalent to +`[attribute]`. - ```css - |div - ``` +```css tab="Syntax" +ns|element +ns|* +*|* +*|element +|element +[ns|attr] +[*|attr] +[|attr] +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...

SVG Example

+...

Soup Sieve Docs

+... +... +... MDN Web Docs +... +... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('svg|a', namespaces={'svg': 'http://www.w3.org/2000/svg'})) +[MDN Web Docs] +>>> print(soup.select('a', namespaces={'svg': 'http://www.w3.org/2000/svg'})) +[Soup Sieve Docs, MDN Web Docs] +>>> print(soup.select('a', namespaces={'': 'http://www.w3.org/1999/xhtml', 'svg': 'http://www.w3.org/2000/svg'})) +[Soup Sieve Docs] +>>> print(soup.select('[xlink|href]', namespaces={'xlink': 'http://www.w3.org/1999/xlink'})) +[MDN Web Docs] +>>> print(soup.select('[|href]', namespaces={'xlink': 'http://www.w3.org/1999/xlink'})) +[Soup Sieve Docs] +``` ## Combinators and Selector Lists @@ -319,105 +626,225 @@ CSS employs a number of tokens in order to represent lists or to provide relatio ### Selector Lists -Selector lists use the comma (`,`) to join multiple selectors in a list. - -!!! example - The following would select both `#!html
` elements and `#!html

` elements. - - ``` - div, h1 - ``` +Selector lists use the comma (`,`) to join multiple selectors in a list. When presented with a selector list, any +selector in the list that matches an element will return that element. + +```css tab="Syntax" +element1, element2 +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...

Title

+...

Paragraph

+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('h1, p')) +[

Title

,

Paragraph

] +``` ### Descendant Combinator Descendant combinators combine two selectors with whitespace ( ) in order to signify that the second element is matched if it has an ancestor that matches the first element. -!!! example - The following would select all `#!html

` elements inside `#!html

` elements. +```css tab="Syntax" +parent descendant +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...

Paragraph 1

+...

Paragraph 2

+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('body p')) +[

Paragraph 1

,

Paragraph 2

] +``` - ```css - div p - ``` +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator ### Child combinator Child combinators combine two selectors with `>` in order to signify that the second element is matched if it has a parent that matches the first element. -!!! example - The following would select all `#!html

` elements where the parent is a `#!html

` element. +```css tab="Syntax" +parent > child +``` + +```pycon3 tab="Usage" +>>> from bs4 import BeautifulSoup as bs +>>> html = """ +... +... +... +...

Paragraph 1

+...
  • Paragraph 2

+... +... +... """ +>>> soup = bs(html, 'html5lib') +>>> print(soup.select('div > p')) +[

Paragraph 1

] +``` - ```css - div > p - ``` +!!! tip "Additional Reading" + https://developer.mozilla.org/en-US/docs/Web/CSS/Child_combinator ### General sibling combinator General sibling combinators combine two selectors with `~` in order to signify that the second element is matched if it has a sibling that precedes it that matches the first element. -!!! example - The following would select every `#!html