[Norwegian] Some pages are not being scraped properly #33
Comments
Seems fixed in 39ba274 |
Seems fixed indeed. Thank you a LOT. Is there anywhere I can send you a few bucks to? Paypal? |
Appreciate it but, it's a hobby project so that's not necessary :D |
And your hobby project is incredibly helpful to me, so if you change your mind and I ever see a donation page/button on the main page, I'll use it ^^ Actually found one more under løsrive, it's missing the inflection part - https://en.wiktionary.org/wiki/l%C3%B8srive#Norwegian_Bokm%C3%A5l
EDIT: And one more in Bokmål - øl - it strips the first inflection line https://en.wiktionary.org/wiki/%C3%B8l#Norwegian_Bokm%C3%A5l
|
Inflections seem to be turning up properly now, although they're a part of the definition text itself |
Amazing, looking forward to a new release ^^ |
That seems to have broken more than it fixed.
and after in 0.0.91:
Here's more that broke for testing (first word of every line is what the entry is for, this is a diff)- |
Whoops, added a fix in another release |
Okay, that looks much better, just a few things. My scripts operate on the assumption that the inflections are before the first line break. Am unsure if that was true for every word in 0.0.8, but it certainly was for 99.9%+ of them. In 0.0.92 this is now not the case with Other than that it seems to have broken a single word -
|
Some of the inflections span multiple lines, so they'll be parsed that way. I'm gonna fix inflection parsing for other words like |
Not sure if same problem as https://en.wiktionary.org/wiki/maldivisk#Norwegian_Bokm%C3%A5l BTW: I rewrote the detection part of my script, it seems to be working great, thanks for the fixes! |
Added some changes in 2ba2eea to fix this. Also, the definition text is now a list so you may have to change your script |
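For a script that relied on the inflections sitting before the first line break, a minimal sketch of coping with both shapes (one string with line breaks, or a list of lines) could look like the following. The helper name `split_inflections` and the sample strings are hypothetical, and the assumption that the first line holds the inflections is taken from the discussion above rather than guaranteed by the parser.

```python
from typing import List, Tuple, Union


def split_inflections(text: Union[str, List[str]]) -> Tuple[str, List[str]]:
    """Return (inflection_line, remaining_definition_lines).

    Hypothetical helper: assumes the first non-empty line carries the
    inflections, which may not hold for every entry.
    """
    if isinstance(text, str):
        # Older shape: one string, lines separated by line breaks.
        lines = text.split("\n")
    else:
        # Newer shape: already a list of lines.
        lines = list(text)

    # Drop empty lines and surrounding whitespace.
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return "", []
    return lines[0], lines[1:]


# Illustrative input only, not actual parser output.
old_style = "øl n (definite singular ølet)\nbeer"
new_style = ["øl n (definite singular ølet)", "beer"]

print(split_inflections(old_style))
print(split_inflections(new_style))
```

Both calls give the same pair, so the rest of a script would not need to care which release produced the data.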
Finally kicked myself to work on my script again, changes look awesome, thanks! |
Okay, I only looked at my inflections output, so that was premature celebration. At some point your changes seem to have added garbage, in the form of the word itself, to some words. For example, https://en.wiktionary.org/wiki/forrevet forrevet has a definition 'forrevet' which really shouldn't be there.
https://en.wiktionary.org/wiki/foreskrevet has the exact same issue, and I'm sure there are a bunch of others |
I haven't encountered multiple subheadings under a definition yet. The subheadings usually contain inflections so the parser adds that to the list of definitions. I guess it should either not include them or separate them out from the definition list, probably in a field called |
Yeah, it should either separate them out or not include them, as I can't simply filter out every case where word X contains definition X, because some words really are that way (best in Bokmål means best). If you need more examples where this happens - støvete, uomskåret, |
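Since those entries can't be dropped automatically, a script could at least flag them for manual review instead. This is only a sketch: the function name `flag_self_definitions` and the word-to-definitions mapping are hypothetical and not part of WiktionaryParser's output format.

```python
from typing import Dict, List


def flag_self_definitions(entries: Dict[str, List[str]]) -> List[str]:
    """Return headwords that have a definition line identical to the headword.

    Hypothetical helper: it only flags candidates, because some hits are
    legitimate (best really does mean best).
    """
    flagged = []
    for word, definitions in entries.items():
        # Compare case-insensitively, ignoring surrounding whitespace.
        if any(d.strip().lower() == word.lower() for d in definitions):
            flagged.append(word)
    return sorted(flagged)


# Illustrative data based on the words mentioned in this thread.
sample = {
    "forrevet": ["forrevet"],  # suspicious: the headword leaked in
    "best": ["best"],          # legitimate
    "øl": ["beer"],
}

print(flag_self_definitions(sample))  # ['best', 'forrevet']
```

The flagged list still has to be checked by hand, which is why separating subheadings in the parser itself would be the better fix.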
It looks like one of the updates also broke nested definitions https://en.wiktionary.org/wiki/v%C3%A6re_glad_i They weren't exactly scraped perfectly in the first place it seems, but now they're not scraped at all. |
Nested definitions and examples have ambiguous formatting so figuring that out is going to take some time |
I've had luck with the Wiktionary contributors willing to redo old formatting and use a newer template for some snowflake definitions I ran into. Not sure if these nested words are the case, I could ask about them, but that'd require me to go through the diff and pick them out, which right now has a lot of "garbage" I mentioned above, and it'd be a pain to go through it in this state. |
Out of all the issues I opened here this one is the most important to me as I've used this project for creation of a Kindle-compatible dictionary, and incomplete/missing entries are the bane of every dictionary project ^^
https://en.wiktionary.org/wiki/seg#Norwegian_Bokm%C3%A5l
Missing completely
https://en.wiktionary.org/wiki/ham#Norwegian_Bokm%C3%A5l
Missing completely
https://en.wiktionary.org/wiki/by#Norwegian_Bokm%C3%A5l
Missing the verb definition
Here's a list of errors from my project for words in Norwegian Bokmål. It is totally possible that some errors are due to a mistake in my own scripts, but all I checked were thrown due to WiktionaryParser not parsing them properly or at all.
https://haste.rys.pw/raw/vevafamiwo
Another half-broken entry -
https://en.wiktionary.org/wiki/for#Norwegian_Bokm%C3%A5l