-
Notifications
You must be signed in to change notification settings - Fork 551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide context #58
base: master
Are you sure you want to change the base?
Provide context #58
Conversation
Thanks for taking a look at this. The way I see it there are a couple of things that need to change before this feature can be added. Trying to get the positions of tokens is a little harder than it first seems. You have to assume that by the time the incoming document has passed through the pipeline it bears little resemblance to the original text of the document. This is due to stemming, stop word removal or any other form of text processing the user has added to the pipeline. This means that, as part of the pipeline, probably the very first step, the start index of the token in a string, must be tracked. Currently tokens have no meta data attached to them as they passthrough the pipeline, this would need to change, and this is potentially incompatible with any pipeline function anyone else has created. There are two ways (that I can think of now) to attach this meta data to a token. Instead of just passing around strings, create a The first way, creating a Just adding extra properties onto strings would allow us to attach the position meta data, however it feels a bit sloppy to me, on the upside it would remain backwards compatible with any end user code in the pipeline. Currently I'm leaning more towards creating a With the correct position of the matching token returned from the search, some additional code, perhaps a plugin for lunr, could then handle storing the original document, and then using the position to generate a useful preview of the search result. I'd be very interested in your thoughts on this, or any other suggestions you might have. |
I think I follow what you're saying. You're basically against polluting the While I think backwards compatibility is important, I think progress is more important (:grinning:), and not mucking around with a language's internals is best. A bump of a major revision should basically guarantee not to brake I'll see if I can sit down and figure out some way to achieve this. Thanks for your feedback! |
This is a start to resolving #25.
Right now it seems to be pretty easy to find the original position of a matching token, as I've done here. What I'm having a hard time doing is getting the original, untokenized document back. It seems to be impossible currently. I'm very new to the code, though, and I maybe be overlooking something.
What I'd like to do is this:
Given a text of:
Professor Plumb has a green plant in his study
And a search of:
green plant
I want to get the index of the matched token (in this case,
24
).From there, I'd like to devise some sort of extraction, so that the result returned to the user is actually something like a summary:
...has a green plant in his...
. This way, the reader can kind of get a "preview" of what the search yielded in the doc. (This is probably best handled outside the core code.)