stringdot text classification input format #11

Closed
frankiethull opened this issue Oct 2, 2024 · 4 comments
Labels
engine (engine topic), enhancement (New feature or request), kernel (kernel binding)

Comments

@frankiethull
Owner

Based on the documentation, the S4 method for kernlab::ksvm with stringdot requires "list" input: the predictors are passed as a list, with the labels supplied separately, instead of a formula and data.frame.

Inputs for text classification will require additional steps to remain tidy.

These steps will be tested in the popcorn_garland branch (string kernel pun intended).

@frankiethull added the enhancement (New feature or request) and kernel (kernel binding) labels Oct 2, 2024
@frankiethull
Owner Author

#9 (comment)
@simonpcouch - hollering now related to a list method 😉

This is one of the kernels I had initially skipped. The example below works, but I'm not sure how to bind it in parsnip in a tidy way.

Using the underlying package, I have to pass two separate objects, a list of descriptions and the labels, instead of a data.frame and formula.


library(kernlab)

# Create the descriptions (a list) and the labels (a factor) as separate objects
descriptions <- list(
  "Yellow kernels on a cob",
  "Grows in tall stalks in fields",
  "Sweet vegetable with husks",
  "Golden corn ready for harvest",
  "Juicy corn kernels on the cob",
  "Corn silk hanging from the husk",
  "Rows of kernels on a green stalk",
  "Corn ears wrapped in leaves",
  
  "Red apple growing on a tree",
  "Green leaves on a bush",
  "Orange carrot in the ground",
  "Purple grapes on a vine",
  "Brown potato from the soil",
  "Yellow banana in a bunch",
  "Red tomato on the vine",
  "Green broccoli florets"
)

labels <- factor(c(rep("corn", 8), rep("not corn", 8)))

# Train the SVM model using ksvm with the stringdot kernel; the S4 list method expects x as a list of strings
svm_model <- ksvm(descriptions, labels,
                  kernel = "stringdot",
                  kpar = list(length = 4, lambda = 0.5),
                  C = 1)

The issue is the non-tidy input format for text: the descriptions have to be a list, with the labels passed separately. This kernel doesn't seem to work with data.frames or formulas.

Are there any engines already bound to parsnip that require this type of (x, y) list input? I sifted through the source of a few engines but didn't see any. I was hoping to handle the lists in the model registration, even if this kernel only works with fit_xy. I'd appreciate any guidance before I go in the wrong direction by wrapping ksvm in another function that converts the formula and data.frame into lists for this kernel.

@frankiethull added the engine (engine topic) label Oct 2, 2024
@simonpcouch

Ahhh, hm. The interface element of the value argument to set_fit() is what comes to mind here, where "data.frame" might be able to handle x as a list, but parsnip wasn't designed to handle that input format and may still trip up.

What we do in some situations where fit functions don't have an interface that aligns with parsnip's expectations is write our own wrappers that take either formula + data or x + y (where x is a data.frame) and then do minimal conversions to interface with the modeling engine itself. So, in this example, you'd write a wrapper (say, k_svm()) around ksvm() that takes x as a data.frame, extracts the list of interest, and passes it to ksvm(), and then register k_svm() with parsnip!
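A minimal sketch of the kind of wrapper described above, for illustration only. The name k_svm() comes from the suggestion above, but its arguments and body are assumptions: it assumes x is a data.frame with a single text column and y is a factor of labels. Registering it with parsnip is not shown.

```r
library(kernlab)

# Hypothetical wrapper (arguments and body are assumptions, not parsnip or
# kernlab API): takes a one-column data.frame of text plus a label vector,
# converts the text column to a list, and calls ksvm()'s list method.
k_svm <- function(x, y, length = 4, lambda = 0.5, ...) {
  stopifnot(is.data.frame(x), ncol(x) == 1)

  # ksvm()'s stringdot method expects a list with one string per element
  text_list <- as.list(as.character(x[[1]]))

  kernlab::ksvm(
    x = text_list,
    y = y,
    kernel = "stringdot",
    kpar = list(length = length, lambda = lambda),
    ...
  )
}

# Usage with a subset of the descriptions from the example earlier in the thread
toy <- data.frame(
  description = c(
    "Yellow kernels on a cob",
    "Grows in tall stalks in fields",
    "Sweet vegetable with husks",
    "Golden corn ready for harvest",
    "Red apple growing on a tree",
    "Green leaves on a bush",
    "Orange carrot in the ground",
    "Purple grapes on a vine"
  ),
  stringsAsFactors = FALSE
)
toy_labels <- factor(c(rep("corn", 4), rep("not corn", 4)))

fit <- k_svm(toy, toy_labels, C = 1)
```

Keeping the conversion inside the wrapper means the calling code can stay on the data.frame side, which is the shape parsnip's "data.frame" interface passes along.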

@frankiethull
Owner Author

Thanks for the feedback @simonpcouch! Will do some testing. I was en route to the second option (a minimal wrapper), but glad you shared the lightgbm example to reference!

