Implement $write_csv() for DataFrame #414

Merged 30 commits on Oct 18, 2023
Changes from 8 commits
Commits
c0be6aa
start work on write_csv()
etiennebacher Oct 6, 2023
41397e4
add some checks and tests
etiennebacher Oct 6, 2023
fd27a98
add some docs
etiennebacher Oct 6, 2023
a05e98f
simplify rust side
etiennebacher Oct 6, 2023
769ee80
simplify rust side
etiennebacher Oct 6, 2023
296cec1
remove null_value check on R side
etiennebacher Oct 9, 2023
7a60d23
Merge branch 'main' into write_csv
etiennebacher Oct 9, 2023
72726a9
Merge branch 'main' into write_csv
etiennebacher Oct 9, 2023
5f7b95a
Merge branch 'main' into write_csv
etiennebacher Oct 16, 2023
7d48692
Merge branch 'main' into write_csv
eitsupi Oct 16, 2023
3861b30
auto formatting by `make all`
eitsupi Oct 16, 2023
6c221fe
test: use snapshot tests
eitsupi Oct 16, 2023
ab4253f
docs: fix typo
eitsupi Oct 16, 2023
d4ab52c
refactor(test): use helper function
eitsupi Oct 16, 2023
a0b661d
test: tests for quote_style
eitsupi Oct 16, 2023
ada992b
remove check for path on the R side
etiennebacher Oct 16, 2023
5025017
update tests
etiennebacher Oct 16, 2023
dee4402
bump news
etiennebacher Oct 16, 2023
460bf3b
add some tests for date_format, time_format and datetime_format
etiennebacher Oct 17, 2023
8a55449
test float_precision
etiennebacher Oct 17, 2023
3229471
add examples
etiennebacher Oct 17, 2023
3539a04
remove time_format test for now
etiennebacher Oct 17, 2023
53cfc9c
more robj_to!(x,y)?, add QuoteStyle, Utf8Byte
sorhawell Oct 17, 2023
d8949bb
unit test new robj_to! conversions and errors
sorhawell Oct 17, 2023
66f43f9
merge main solve NEWS conflict
sorhawell Oct 17, 2023
1e5bcba
do not export
etiennebacher Oct 18, 2023
5e5473a
remove unused utils
etiennebacher Oct 18, 2023
55a8358
uncomment a test
etiennebacher Oct 18, 2023
d53f912
try to fix test
etiennebacher Oct 18, 2023
2e2a4a8
refactor: auto formatting
eitsupi Oct 18, 2023
1 change: 1 addition & 0 deletions NAMESPACE
@@ -151,6 +151,7 @@ S3method(tail,LazyFrame)
S3method(unique,DataFrame)
S3method(unique,LazyFrame)
export(.pr)
export(DataFrame_write_csv)
export(as_polars_series)
export(knit_print.DataFrame)
export(pl)
80 changes: 80 additions & 0 deletions R/dataframe__frame.R
@@ -1680,3 +1680,83 @@ DataFrame_sample = function(
) |>
unwrap("in $sample():")
}



#' Write to comma-separated values (CSV) file
#'
#' @param path File path to which the result should be written.
#' @param has_header Whether to include header in the CSV output.
#' @param separator Separate CSV fields with this symbol.
#' @param line_terminator String used to end each row.
#' @param quote Byte to use as quoting character.
#' @param batch_size Number of rows that will be processed per thread.
#' @param datetime_format A format string, with the specifiers defined by the
#' chrono Rust crate. If no format is specified, the default fractional-second
#' precision is inferred from the maximum timeunit found in the frame's Datetime
#' columns (if any).
#' @param date_format A format string, with the specifiers defined by the chrono
#' Rust crate.
#' @param time_format A format string, with the specifiers defined by the chrono
#' Rust crate.
#' @param float_precision Number of decimal places to write, applied to both
#' Float32 and Float64 datatypes.
#' @param null_values A string representing null values (defaulting to the empty
#' string).
#' @param quote_style Determines the quoting strategy used.
#' * `"necessary"` (default): This puts quotes around fields only when necessary.
#'   They are necessary when fields contain a quote, delimiter or record
#'   terminator. Quotes are also necessary when writing an empty record (which
#'   is indistinguishable from a record with one empty field).
#' * `"always"`: This puts quotes around every field.
#' * `"non_numeric"`: This puts quotes around all fields that are non-numeric.
#'   Namely, when writing a field that does not parse as a valid float or integer,
#'   then quotes will be used even if they aren't strictly necessary.

# TODO: include "never" when bumping rust-polars to 0.34
# * `"never"`: This never puts quotes around fields, even if that results in
# invalid CSV data (e.g.: by not quoting strings containing the separator).

#' @return
#' This doesn't return anything but creates a CSV file.
#' @export
#' @rdname IO_write_csv
#'
#' @examples
#' dat = pl$DataFrame(mtcars)
#'
#' destination = tempfile(fileext = ".csv")
#' dat$write_csv(destination)

DataFrame_write_csv = function(
    path,
    has_header = TRUE,
    separator = ",",
    line_terminator = "\n",
    quote = '"',
    batch_size = 1024,
    datetime_format = NULL,
    date_format = NULL,
    time_format = NULL,
    float_precision = NULL,
    null_values = "",
    quote_style = "necessary"
) {
  if (file_ext(path) != "csv") {
    stop("Argument `path` must be the path to a CSV file.")
  }

  if (length(quote_style) == 0 ||
    !quote_style %in% c("always", "necessary", "non_numeric")) {
    stop("Argument `quote_style` must be one of 'always', 'necessary', or 'non_numeric'.")
  }

  .pr$DataFrame$write_csv(
    self,
    path, has_header, utf8ToInt(separator), line_terminator, utf8ToInt(quote), batch_size,
    datetime_format, date_format, time_format, float_precision,
    null_values, quote_style
  ) |>
    unwrap("in $write_csv():") |>
    invisible()
}
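
The wrapper above turns `separator` and `quote` into integers with `utf8ToInt()`, and the Rust side then reads each one as a `u8`. A minimal sketch of how such an argument could be validated before it reaches the writer is shown below. The helper `single_byte` is hypothetical and is not part of r-polars or polars; it only illustrates the "exactly one ASCII character" constraint that a `u8` delimiter implies.

```rust
/// Hypothetical helper (not part of r-polars): accept a string only if it is
/// exactly one ASCII character, and return that character as a byte.
fn single_byte(arg_name: &str, value: &str) -> Result<u8, String> {
    let mut chars = value.chars();
    match (chars.next(), chars.next()) {
        // Exactly one character, and it fits in a single byte.
        (Some(c), None) if c.is_ascii() => Ok(c as u8),
        // Exactly one character, but it needs more than one byte in UTF-8.
        (Some(_), None) => Err(format!(
            "`{}` must be an ASCII character, got {:?}",
            arg_name, value
        )),
        // Empty string or more than one character.
        _ => Err(format!(
            "`{}` must be exactly one character, got {:?}",
            arg_name, value
        )),
    }
}
```

Under these assumptions, `single_byte("separator", ",")` yields the byte for a comma, while an empty string, a multi-character string such as `"||"`, or a non-ASCII character yields an error value instead of a panic.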
10 changes: 10 additions & 0 deletions R/extendr-wrappers.R
@@ -189,6 +189,16 @@ DataFrame$sample_n <- function(n, with_replacement, shuffle, seed) .Call(wrap__D

DataFrame$sample_frac <- function(frac, with_replacement, shuffle, seed) .Call(wrap__DataFrame__sample_frac, self, frac, with_replacement, shuffle, seed)

DataFrame$write_csv <- function(path, has_header, separator, line_terminator, quote, batch_size,
datetime_format, date_format, time_format, float_precision,
null_values, quote_style) .Call(wrap__DataFrame__write_csv, self, path,
has_header, separator,
line_terminator, quote,
batch_size,
datetime_format, date_format,
time_format, float_precision,
null_values, quote_style)

#' @export
`$.DataFrame` <- function (self, name) { func <- DataFrame[[name]]; environment(func) <- environment(); func }

6 changes: 6 additions & 0 deletions R/utils.R
@@ -639,3 +639,9 @@ is_bool = function(x) {
dtypes_are_struct = function(dtypes) {
  sapply(dtypes, \(dt) pl$same_outer_dt(dt, pl$Struct()))
}

# from tools::file_ext()
file_ext <- function(x) {
  pos <- regexpr("\\.([[:alnum:]]+)$", x)
  ifelse(pos > -1L, substring(x, pos + 1L), "")
}
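
For comparison, the extension rule used by `file_ext()` can be sketched on the Rust side as well. This is an illustration only, not code from the PR: the function below is a hypothetical port, and it assumes that `[[:alnum:]]` means ASCII letters and digits, which in R depends on the locale.

```rust
/// Hypothetical port of the R helper: return the run of ASCII alphanumerics
/// that follows the last '.', or "" when the path has no such suffix.
fn file_ext(path: &str) -> &str {
    match path.rfind('.') {
        Some(pos) => {
            let ext = &path[pos + 1..];
            // The suffix must be non-empty and purely alphanumeric, which
            // mirrors the `\.([[:alnum:]]+)$` pattern.
            if !ext.is_empty() && ext.chars().all(|c| c.is_ascii_alphanumeric()) {
                ext
            } else {
                ""
            }
        }
        None => "",
    }
}
```

With this rule a path produced by `tempfile()` without `fileext` has an empty extension, so a check such as `file_ext(path) != "csv"` rejects it, while `"archive.tar.gz"` reports only the final `gz` component.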
74 changes: 74 additions & 0 deletions man/IO_write_csv.Rd

Some generated files are not rendered by default.

52 changes: 52 additions & 0 deletions src/rust/src/rdataframe/mod.rs
@@ -11,6 +11,8 @@ use crate::rdatatype::RPolarsDataType;
use crate::robj_to;
use crate::rpolarserr::*;

use polars::prelude::{CsvWriter, QuoteStyle, SerWriter};

pub use lazy::dataframe::*;

use crate::conversion_s_to_r::pl_series_to_list;
@@ -442,7 +444,57 @@ impl DataFrame {
            .map_err(polars_to_rpolars_err)
            .map(DataFrame)
    }

    pub fn write_csv(
        &mut self,
        path: Robj,
        has_header: Robj,
        separator: Robj,
        line_terminator: Robj,
        quote: Robj,
        batch_size: Robj,
        datetime_format: Robj,
        date_format: Robj,
        time_format: Robj,
        float_precision: Robj,
        null_value: Robj,
        quote_style: Robj,
    ) -> RResult<()> {
        let null = robj_to!(String, null_value).unwrap();
        let path = robj_to!(str, path).unwrap();
        let f = std::fs::File::create(path).unwrap();
        let qs = parse_quote_style(quote_style);

        CsvWriter::new(f)
            .has_header(robj_to!(bool, has_header).unwrap())
            .with_delimiter(robj_to!(u8, separator).unwrap())
            .with_line_terminator(robj_to!(String, line_terminator).unwrap())
            .with_quoting_char(robj_to!(u8, quote).unwrap())
            .with_batch_size(robj_to!(usize, batch_size).unwrap())
            .with_datetime_format(robj_to!(Option, String, datetime_format).unwrap())
            .with_date_format(robj_to!(Option, String, date_format).unwrap())
            .with_time_format(robj_to!(Option, String, time_format).unwrap())
            .with_float_precision(robj_to!(Option, usize, float_precision).unwrap())
            .with_null_value(null)
            .with_quote_style(qs)
            .finish(&mut self.0)
            .map_err(polars_to_rpolars_err)
    }
}

pub fn parse_quote_style(x: Robj) -> QuoteStyle {
    match robj_to!(Option, String, x).unwrap_or_default().unwrap().as_str() {
        "always" => QuoteStyle::Always,
        "necessary" => QuoteStyle::Necessary,
        "non_numeric" => QuoteStyle::NonNumeric,
        // "never" is available in rust-polars devel only for now (will be added in 0.34)
        // "never" => QuoteStyle::Never,
        _ => panic!("polars internal error: `quote_style` must be 'always', 'necessary' or 'non_numeric'."),
    }
}
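
`parse_quote_style` panics on an unknown value, and `write_csv` calls `.unwrap()` on every conversion, so bad input aborts rather than surfacing as an R error. One possible alternative, sketched here under assumptions, is to return a `Result` that the caller can propagate. The `QuoteStyle` enum below is a local stand-in so the example compiles on its own, and `try_parse_quote_style` is a hypothetical name; neither is the polars or r-polars API.

```rust
/// Local stand-in for the polars `QuoteStyle` enum (assumption: only the
/// three variants handled in this PR).
#[derive(Debug, PartialEq)]
enum QuoteStyle {
    Always,
    Necessary,
    NonNumeric,
}

/// Hypothetical variant of `parse_quote_style` that reports bad input as an
/// error value instead of panicking.
fn try_parse_quote_style(x: &str) -> Result<QuoteStyle, String> {
    match x {
        "always" => Ok(QuoteStyle::Always),
        "necessary" => Ok(QuoteStyle::Necessary),
        "non_numeric" => Ok(QuoteStyle::NonNumeric),
        other => Err(format!(
            "`quote_style` must be 'always', 'necessary' or 'non_numeric', got '{}'",
            other
        )),
    }
}
```

A caller that itself returns a `Result` could then use the `?` operator on this value, after mapping the `String` into whatever error type the crate uses.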


impl DataFrame {
    pub fn to_list_result(&self) -> Result<Robj, pl::PolarsError> {
        // convert DataFrame to a Result of R vectors; error if a DataType is not supported
66 changes: 66 additions & 0 deletions tests/testthat/test-csv.R
@@ -22,3 +22,69 @@ test_that("csv read iris", {
iris
)
})


dat = mtcars
dat[c(1, 3, 9, 12), c(3, 4, 5)] = NA
dat_pl = pl$DataFrame(dat)
temp_noext = tempfile()
temp_out = tempfile(fileext = ".csv")

test_that("write_csv: path works", {
  expect_error(
    dat_pl$write_csv(temp_noext),
    "path to a CSV file"
  )

  dat_pl$write_csv(temp_out)
  expect_identical(
    pl$read_csv(temp_out)$to_data_frame(),
    dat,
    ignore_attr = TRUE # rownames are lost when writing / reading from CSV
  )
})

test_that("write_csv: null_values works", {
  expect_error(
    dat_pl$write_csv(temp_out, null_values = NULL)
  )
  dat_pl$write_csv(temp_out, null_values = "hello")
  tmp = pl$read_csv(temp_out)$to_data_frame()
  expect_true(is.character(tmp$disp) && is.character(tmp$hp) && is.character(tmp$drat))
  expect_equal(tmp[1:2, "disp"], c("hello", "160.0"))
})


test_that("write_csv: separator works", {
  dat_pl$write_csv(temp_out, separator = "|")
  expect_identical(
    pl$read_csv(temp_out, sep = "|")$to_data_frame(),
    dat,
    ignore_attr = TRUE # rownames are lost when writing / reading from CSV
  )
})

test_that("write_csv: quote_style and quote work", {
  dat_pl2 = pl$DataFrame(iris)

  expect_error(
    dat_pl2$write_csv(temp_out, quote_style = "foo"),
    "must be one of"
  )

  dat_pl2$write_csv(temp_out, quote_style = "always", quote = "+")
  expect_identical(
    head(pl$read_csv(temp_out)$to_data_frame()[["+Sepal.Length+"]], n = 2),
    c("+5.1+", "+4.9+")
  )

  dat_pl2$write_csv(temp_out, quote_style = "non_numeric", quote = "+")
  expect_identical(
    head(pl$read_csv(temp_out)$to_data_frame()[["+Sepal.Length+"]], n = 2),
    c(5.1, 4.9)
  )
  expect_identical(
    head(pl$read_csv(temp_out)$to_data_frame()[["+Species+"]], n = 2),
    c("+setosa+", "+setosa+")
  )
})
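
The quoting tests above rely on the three strategies described in the `quote_style` documentation. The sketch below restates those rules for a single field, purely as an illustration: it is not the polars writer. The `QuoteStyle` enum and `write_field` function are hypothetical, the sketch assumes the common CSV convention of doubling embedded quote characters, and it treats "numeric" as "parses as `f64`", which also accepts spellings such as `inf` and `NaN`.

```rust
/// Local stand-in for the three strategies handled in this PR.
#[derive(Clone, Copy)]
enum QuoteStyle {
    Always,
    Necessary,
    NonNumeric,
}

/// Hypothetical single-field writer that applies the documented rules.
fn write_field(field: &str, style: QuoteStyle, sep: char, quote: char) -> String {
    // A field must be quoted when it contains the separator, the quote
    // character, or a record terminator.
    let special = field.contains(sep)
        || field.contains(quote)
        || field.contains('\n')
        || field.contains('\r');
    // Assumption: "numeric" means the text parses as a float (integers do too).
    let numeric = field.parse::<f64>().is_ok();

    let quoted = match style {
        QuoteStyle::Always => true,
        QuoteStyle::Necessary => special,
        QuoteStyle::NonNumeric => special || !numeric,
    };
    if !quoted {
        return field.to_string();
    }

    // Assumption: embedded quote characters are escaped by doubling them.
    let doubled = format!("{}{}", quote, quote);
    let escaped = field.replace(quote, &doubled);
    format!("{}{}{}", quote, escaped, quote)
}
```

With `quote = '+'`, this reproduces the shapes the tests expect: `"always"` wraps `5.1` as `+5.1+`, while `"non_numeric"` leaves `5.1` bare and wraps `setosa` as `+setosa+`. The empty-record case mentioned in the documentation is a row-level rule and is not modelled here.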