
Commit

docs: trailing whitespaces
eitsupi committed Mar 11, 2024
1 parent 5ab94d9 commit d4caebe
Showing 3 changed files with 20 additions and 20 deletions.
12 changes: 6 additions & 6 deletions altdoc/reference_home.Rmd
@@ -26,11 +26,11 @@ to choose between eager and lazy evaluation, that require respectively a
for grouped data).

We can apply functions directly on a `DataFrame` or `LazyFrame`, such as `rename()`
or `drop()`. Most functions that can be applied to `DataFrame`s can also be used
on `LazyFrame`s, but some are specific to one or the other. For example:

* `$equals()` exists for `DataFrame` but not for `LazyFrame`;
* `$collect()` executes a lazy query, which means it can only be applied on
a `LazyFrame`.
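As a minimal sketch of the eager/lazy distinction (the tiny one-column frame here is purely illustrative):

```r
library(polars)

df = pl$DataFrame(x = 1:3) # eager: data lives in memory
lf = df$lazy()             # lazy: only a query plan so far

df$equals(df)  # available for DataFrame only
lf$collect()   # executes the plan and returns a DataFrame
```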

Another common data structure is the `Series`, which can be considered as the
@@ -89,7 +89,7 @@ test$group_by(pl$col("cyl"))$agg(
## Expressions

Expressions are the building blocks that give all the flexibility we need to
modify or create new columns.

Two important expression starters are `pl$col()` (names a column in the context)
and `pl$lit()` (wraps a literal value or vector/series in an Expr). Most other
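For instance, a minimal illustration of these two starters (the column name is made up for the example):

```r
library(polars)

df = pl$DataFrame(x = c(1, 2, 3))
df$with_columns(
  # pl$col("x") refers to the existing column "x";
  # pl$lit(10) wraps the literal value 10 in an Expr
  (pl$col("x") * pl$lit(10))$alias("x10")
)
```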
@@ -118,7 +118,7 @@ when it is applied on binary data or on string data.
To be able to distinguish those usages and to check the validity of a query,
`polars` stores methods in subnamespaces. For each datatype other than numeric
(floats and integers), there is a subnamespace containing the available methods:
`dt` (datetime), `list` (list), `str` (strings), `struct` (structs), `cat`
(categoricals) and `bin` (binary). As a side note, there is also a more exotic
subnamespace called `meta`, rarely needed, that is used to manipulate the expressions
themselves. Each subsection in the "Expressions" section lists all operations
@@ -148,7 +148,7 @@ df$with_columns(
)
```

Similarly, to convert a string column to uppercase, we use the `str` prefix
before using `to_uppercase()`:

```{r}
26 changes: 13 additions & 13 deletions vignettes/performance.Rmd
@@ -17,7 +17,7 @@ options(rmarkdown.html_vignette.check_title = FALSE)


As highlighted by the [DuckDB benchmarks](https://duckdblabs.github.io/db-benchmark/),
`polars` is very efficient at dealing with large datasets. Still, one can make `polars`
even faster by following some good practices.


@@ -100,7 +100,7 @@ will internally check whether it can be optimized, for example by reordering
some operations.

Let's re-use the example above but this time with `polars` syntax and 10M
observations. For the purpose of this vignette, we can create a `LazyFrame`
directly in our session, but if the data was stored in a CSV file for instance,
we would have to scan it first with `pl$scan_csv()`:
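As a hedged sketch of that scanning step (the file name and the filter value are hypothetical), `pl$scan_csv()` returns a `LazyFrame` without reading the whole file into memory:

```r
library(polars)

# "data.csv" is a placeholder path; scanning only builds a plan,
# nothing is read until $collect() is called
lf = pl$scan_csv("data.csv")
lf$filter(pl$col("country") == "France")$collect()
```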

@@ -140,7 +140,7 @@ lazy_query = lf_test$
lazy_query
```

However, this doesn't do anything to the data until we call `collect()` at the
end. We can now compare the two approaches (in the `lazy` timing, calling `collect()`
both reads the data and processes it, so we include the data loading part in the
`eager` timing as well):
@@ -165,11 +165,11 @@ bench::mark(


On this very simple query, using lazy execution instead of eager execution led
to a 1.7-2.2x decrease in execution time.

So what happened? Under the hood, `polars` reorganized the query so that it
filters rows while reading the CSV into memory, and then sorts the remaining
data. This can be seen by comparing the original query (`describe_plan()`) and
the optimized query (`describe_optimized_plan()`):

```{r}
@@ -179,7 +179,7 @@
lazy_query$describe_optimized_plan()
```


Note that the queries must be read from bottom to top, i.e. the optimized query
is "select the dataset where the column 'country' matches these values, then sort
the data by the values of 'country'".

@@ -188,13 +188,13 @@ the data by the values of 'country'.

`polars` comes with a large number of built-in, optimized, basic functions that
should cover most aspects of data wrangling. These functions are designed to be
very memory efficient. Therefore, using R functions or converting data back and
forth between `polars` and R is discouraged as it can lead to a large decrease in
efficiency.

Let's use the test data from the previous section and let's say that we only want
to check whether each country contains "na". This can be done in (at least) two
ways: with the built-in function `contains()` and with the base R function
`grepl()`. However, using the built-in function is much faster:

```r
@@ -207,7 +207,7 @@ bench::mark(
grepl("na", s)
})
),
grepl_nv = df_test$limit(1e6)$with_columns(
pl$col("country")$apply(\(str) {
grepl("na", str)
}, return_type = pl$Boolean)
@@ -221,12 +221,12 @@ bench::mark(
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 contains 387.02ms 432.12ms 2.27 401.86KB 0
#> 2 grepl 2.06s 2.11s 0.466 114.79MB 0.512
#> 3 grepl_nv 6.42s 6.52s 0.153 7.65MB 10.3
```

Using custom R functions can be useful, but when possible, you should use the
functions provided by `polars`. See the Reference tab for a complete list of
functions.

@@ -236,7 +236,7 @@ functions.
Finally, quoting [Polars User Guide](https://pola-rs.github.io/polars-book/user-guide/concepts/streaming/):

> One additional benefit of the lazy API is that it allows queries to be executed
> in a streaming manner. Instead of processing the data all-at-once Polars can
> execute the query in batches allowing you to process datasets that are
> larger-than-memory.
2 changes: 1 addition & 1 deletion vignettes/polars.Rmd
@@ -319,7 +319,7 @@ column. See the section below for more details on data types.
## Reshape

Polars supports data reshaping, going both from long to wide (a.k.a. "pivoting",
or `pivot_wider()` in `tidyr`) and from wide to long (a.k.a. "unpivoting",
"melting", or `pivot_longer()` in `tidyr`).
Let's switch to the `Indometh` dataset to demonstrate some basic examples.
Note that the data are currently in long format.
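A small sketch of both directions, assuming the `$pivot()`/`$melt()` method names and their argument names (`values`, `index`, `columns`, `id_vars`), which may differ between `polars` versions:

```r
library(polars)

df = pl$DataFrame(Indometh)

# long -> wide: one concentration column per subject
wide = df$pivot(values = "conc", index = "time", columns = "Subject")

# wide -> long again
long = wide$melt(id_vars = "time")
```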
