Rewrite of spacy_install #243

Merged
merged 26 commits into from
Dec 6, 2023
Conversation

Collaborator

@JBGruber JBGruber commented Aug 29, 2023

Addresses #236. I use some newer reticulate functions that were introduced after spacyr was first developed. The installation process is now significantly easier, although fewer options are available to users. By default, spaCy is installed in a virtual environment managed by reticulate, after checking whether a suitable Python binary is available and installing one if not.

I tested it on two Linux machines (Arch and Debian) and on Windows 10 and 11. On the Arch machine I also installed the GPU version. Apple silicon can also be used easily. The installation worked without hiccups (once C dependencies and reticulate were updated).

I assume that most people can now run spacy_install() without prior knowledge or any specific setup. I still want to document how to use a manual install by setting either SPACY_PYTHON or RETICULATE_PYTHON; that documentation should also serve as a troubleshooting guide. I think this should be a vignette rather than the landing page of https://spacyr.quanteda.io (but that might be a different PR).

Let me know what you think 😁

@JBGruber
Collaborator Author

I changed the installation.rmd. Now all that is left is to adapt the tests.

@JBGruber
Collaborator Author

Unfortunately, it seems that remove_symbols no longer works correctly:

library(spacyr)
txt <- "This: £ = GBP! 15% not! > 20 percent?"
spacy_tokenize(txt, remove_symbols = TRUE, padding = FALSE)
#> successfully initialized (spaCy Version: 3.6.1, language model: en_core_web_sm)
#> $text1
#>  [1] "This"    ":"       "="       "GBP"     "!"       "15"      "%"      
#>  [8] "not"     "!"       ">"       "20"      "percent" "?"

Created on 2023-08-31 with reprex v2.0.2

This is due to an upstream change in spaCy though.

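import spacy

# Load a pipeline first; this assumes the same model as in the reprex above
nlp = spacy.load("en_core_web_sm")
doc = nlp("This: £ = GBP! 15% not! > 20 percent?")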

for t in doc:
    print(t, t.pos_)
>>>  This PRON
>>>  : PUNCT
>>>  £ PROPN
>>>  = X
>>>  GBP PROPN
>>>  ! PROPN
>>>  15 NUM
>>>  % NOUN
>>>  not PART
>>>  ! PUNCT
>>>  > X
>>>  20 NUM
>>>  percent NOUN
>>>  ? PUNCT

To test whether a token is a symbol, you use w.pos_ == "SYM", which no longer works because = is now classified as X for some reason. I have removed the test.
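To make the failure concrete, here is a minimal sketch that applies an exact-match "SYM" filter to the (token, tag) pairs printed in the Python output above. It runs without spaCy because the tags are copied by hand from that listing. The helper name remove_symbols_sketch is hypothetical and is not spacyr's actual implementation; it only illustrates the w.pos_ == "SYM" check described above.

```python
# (token, coarse POS tag) pairs copied from the spaCy output shown above
tagged = [
    ("This", "PRON"), (":", "PUNCT"), ("£", "PROPN"), ("=", "X"),
    ("GBP", "PROPN"), ("!", "PROPN"), ("15", "NUM"), ("%", "NOUN"),
    ("not", "PART"), ("!", "PUNCT"), (">", "X"), ("20", "NUM"),
    ("percent", "NOUN"), ("?", "PUNCT"),
]


def remove_symbols_sketch(pairs):
    """Hypothetical helper: drop tokens whose tag is exactly "SYM"."""
    return [tok for tok, pos in pairs if pos != "SYM"]


kept = remove_symbols_sketch(tagged)

# Count how many tokens the filter removed
removed = len(tagged) - len(kept)
print(removed)
```

With these tags, no token is labelled SYM, so the filter removes nothing and symbols such as = and > stay in the result.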

@JBGruber
Collaborator Author

There are a few more inconsistencies that lead to failing tests. Where I think they have nothing to do with this PR and are instead caused by upstream changes, I skip them with skip("behaviour changed in spaCy"). Otherwise, the tests now succeed (at least on my machine).

@JBGruber JBGruber marked this pull request as ready for review August 31, 2023 21:52
@JBGruber JBGruber requested a review from kbenoit August 31, 2023 21:52
Collaborator

@kbenoit kbenoit left a comment

This is really welcome, @JBGruber. Sorry it took me so long to get to it. I incremented the version to 1.3. It's full of breaking changes, but only to spacy_install(). So be it.

@kbenoit kbenoit merged commit 28feb2b into quanteda:master Dec 6, 2023
3 of 5 checks passed