Implement unnest() for LazyFrame #397

Merged
merged 7 commits into from
Sep 26, 2023


etiennebacher
Collaborator

This works when we specify names, but not when we pass nothing, in which case it should unnest all Struct columns. The problem is that:

  • I don't know how to check on the R side which columns are Struct, because Struct has a special DataType format compared to, for example, Float64:
library(polars)
pl$set_options(do_not_repeat_call = TRUE)

test <- pl$DataFrame(iris[, 1:2])$
  with_columns(pl$col("Sepal.Width")$to_struct())

test$dtypes[[1]] == pl$Float64
#> [1] TRUE

test$dtypes[[2]] == pl$Struct()
#> Error: Execution halted with the following contexts
#>    0: In R: in pl$Struct:
#>    1: subscript out of bounds
  • I don't know how to check on the Rust side, because that needs some dtype or columns methods that themselves require an iter method.

@sorhawell could you take a look at the case when names = NULL? Would a helper is_struct() on the R side be possible?


Works fine when we specify names:

library(polars)

df = pl$LazyFrame(a = 1:5, b = c("one", "two", "three", "four", "five"))$
  select(
    pl$all()$to_struct()$alias("mystruct")
  )

df$collect()
#> shape: (5, 1)
#> ┌─────────────┐
#> │ mystruct    │
#> │ ---         │
#> │ struct[2]   │
#> ╞═════════════╡
#> │ {1,"one"}   │
#> │ {2,"two"}   │
#> │ {3,"three"} │
#> │ {4,"four"}  │
#> │ {5,"five"}  │
#> └─────────────┘
df$unnest("mystruct")$collect()
#> shape: (5, 2)
#> ┌─────┬───────┐
#> │ a   ┆ b     │
#> │ --- ┆ ---   │
#> │ i32 ┆ str   │
#> ╞═════╪═══════╡
#> │ 1   ┆ one   │
#> │ 2   ┆ two   │
#> │ 3   ┆ three │
#> │ 4   ┆ four  │
#> │ 5   ┆ five  │
#> └─────┴───────┘

@sorhawell
Collaborator

sorhawell commented Sep 25, 2023


Yes, that is useful to have.

"I don't know how to check on Rust side because it needs some dtype or columns methods that themselves require an iter method" + "Is it possible a helper is_struct() on the R side?"

There is pl$same_outer_dt() to check whether the outer DataType of two dtypes is equal, regardless of the inner types:

library(polars)
pl$set_options(do_not_repeat_call = TRUE)
dtypes_are_struct = \(dtypes) sapply(dtypes, \(dt) pl$same_outer_dt(dt, pl$Struct(pl$UInt8))) # or whatever inner type

test <- pl$DataFrame(iris[, 1:2])$
  with_columns(pl$col("Sepal.Width")$to_struct())

test$dtypes |> dtypes_are_struct()
> FALSE  TRUE

Diving into the error "1: subscript out of bounds" you got: it is due to a bug.

# we currently cannot make an empty Struct datatype, i.e. one with no inner Fields = (stringname, DataType);
# however, although rare, it is a valid type, and once the bug is fixed this is what would happen
pl$Struct()

> DataType: Struct(
    [],
)

# in py-polars it is possible to create an empty Struct DataType, so we should be able to do that too.
pl.Struct()
> Struct
pl.Struct([pl.Int64]) # just to show how inner types are printed in py-polars
> Struct([Int64]) 

# but it is still not possible to create an empty struct in the same way
pl.struct()
> ... PanicException: index out of bounds: the len is 0 but the index is 0

# note
# struct, spelled with a lowercase s, is the "struct" as a lazy Expr or eager Series
# Struct, with a capital S, is the datatype of a struct
# similarly, List is the datatype of a list
# DataType is the class name of a polars datatype

# probably the same error in R
pl$struct(list())
 > polars Expr: thread '<unnamed>' panicked at 'index out of bounds: the len is 0 but the index is 0'

# or this error in R. As py-polars has not defined what pl.struct() should do, I guess any error
# will do for now
> pl$struct()
Error in pl$struct() : argument "exprs" is missing, with no default
In addition: Warning message:
In str(x) : restarting interrupted promise evaluation

could you take a look for the case when names = NULL?

Not sure I understood; it would not be meaningful to have a struct or a DataFrame with undefined names.

# in py-polars
pl.struct([1,2],eager = True)
> DuplicateError: multiple fields with name 'literal' found
pl$struct(list(1, 2),eager = TRUE)
Error: Execution halted with the following contexts
0: In R: in pl$struct:
0: During function call [pl$struct(list(1, 2), eager = TRUE)]
1: Encountered the following error in Rust-Polars:
duplicate: multiple fields with name 'literal' found

Maybe you mean unnest(names = NULL)? What about it?

"Works fine when we specify names: ..."

I don't quite follow what you are aiming for here. If you do not use alias the

@sorhawell
Collaborator

Ohh, I read this page as an issue, not a PR, sorry @etiennebacher. It makes a lot more sense now :)

@etiennebacher
Collaborator Author

Thank you for the explanation @sorhawell. I'll wait for #398 to be fixed before completing this one.

@etiennebacher etiennebacher marked this pull request as draft September 26, 2023 06:55
@etiennebacher etiennebacher marked this pull request as ready for review September 26, 2023 08:41
@etiennebacher
Collaborator Author

@sorhawell I took this opportunity to move the detection of struct columns to the R side rather than the Rust side, for both DataFrame and LazyFrame. I don't think we lose much speed or memory, and the Rust code is cleaner.
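
For illustration, the R-side detection described here could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code merged in this PR: dtypes_are_struct() follows the helper suggested earlier in this thread, unnest_all_structs() is a hypothetical wrapper name, and the sketch assumes that $unnest() accepts a character vector with several column names.

```r
library(polars)

# Helper modelled on the one suggested earlier in this thread:
# TRUE for every dtype whose outer DataType is Struct, whatever the inner fields.
dtypes_are_struct = function(dtypes) {
  vapply(
    dtypes,
    function(dt) pl$same_outer_dt(dt, pl$Struct(pl$UInt8)),
    logical(1)
  )
}

# Hypothetical wrapper (not part of polars): unnest every Struct column
# of an eager DataFrame, mimicking what unnest(names = NULL) should do.
unnest_all_structs = function(df) {
  struct_cols = df$columns[dtypes_are_struct(df$dtypes)]
  if (length(struct_cols) == 0L) {
    return(df) # nothing to unnest
  }
  # Assumption: $unnest() takes a character vector of column names.
  df$unnest(struct_cols)
}

test = pl$DataFrame(iris[, 1:2])$
  with_columns(pl$col("Sepal.Width")$to_struct())

unnest_all_structs(test)
```

The detection runs once per column on the dtypes only, so its cost depends on the number of columns and not on the number of rows.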

@sorhawell
Collaborator

I don't think so either, unless there are 10k columns, and even then it would probably take less than 1s :)

Collaborator

@sorhawell sorhawell left a comment


This looks good to me 👨‍🎤

@etiennebacher etiennebacher merged commit 8090f6e into main Sep 26, 2023
11 checks passed
@etiennebacher etiennebacher deleted the unnest-lazyframe branch September 26, 2023 12:58