Documentation search results relevance improvements #1097

Open
boxed opened this issue Jun 7, 2021 · 18 comments

@boxed

boxed commented Jun 7, 2021

Searching for "through" finds nothing:

This search should link at least these:

https://docs.djangoproject.com/en/3.2/topics/db/models/#extra-fields-on-many-to-many-relationships
https://docs.djangoproject.com/en/3.2/ref/models/fields/#django.db.models.ManyToManyField.through

@pauloxnet
Member

In English, the word "through" is a stopword and is ignored in the search against the English dictionary used in PostgreSQL.

From the PostgreSQL documentation:

Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index.

The English word "through" is not a stopword in other dictionaries, for example the Italian one, and the search in that language does show results:

https://docs.djangoproject.com/it/3.2/search/?q=through
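To make the effect concrete, here is a toy sketch of stop-word filtering. It is not PostgreSQL's implementation: `searchable_terms` is a hypothetical helper, and both word sets are small hand-picked samples rather than the real dictionaries.

```python
# Toy model of stop-word filtering; NOT how PostgreSQL implements it.
# Both sets below are small illustrative samples, not the real dictionaries.
SAMPLE_ENGLISH_STOPWORDS = {"a", "the", "through", "where", "now"}
SAMPLE_ITALIAN_STOPWORDS = {"il", "la", "di"}


def searchable_terms(query, stopwords):
    """Return the lowercased query terms that survive stop-word removal."""
    return [term for term in query.lower().split() if term not in stopwords]


# With the English sample, nothing is left to search for.
print(searchable_terms("through", SAMPLE_ENGLISH_STOPWORDS))  # []
# With a list that does not contain "through", the term is kept.
print(searchable_terms("through", SAMPLE_ITALIAN_STOPWORDS))  # ['through']
```

An empty list of terms means the query has nothing to match against, which is consistent with the empty result page reported above.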

@boxed
Author

boxed commented Jun 8, 2021

I figured it might be something like that. Framework function names and stuff should bypass that logic somehow.

@pauloxnet
Member

@boxed I don't think "through" is the only stopword that matters in search.
Perhaps it would be useful to have a list of these words and then think of a way to ensure that they are not discarded.
Could you write a list of words not to be deleted starting from the official PostgreSQL stopwords list?
https://github.com/postgres/postgres/blob/master/src/backend/snowball/stopwords/english.stop

@boxed
Author

boxed commented Jun 8, 2021

Hm... I don't know about a complete list. But "where" is certainly suspicious, as it's a keyword in SQL. This gets a bit tricky, as "where" should probably only be searched when it appears in a code block like a select or similar. I also doubt "now" should be excluded, as it could be something you want to search for, like datetime.now, which I guess the current implementation just interprets as "datetime".

That's what I could find reading through this list. I think one could imagine a solution where the search is run and, if there are no hits, it's re-run ignoring stopwords. This would fix the worst case at least.
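A minimal sketch of that "retry without stopwords" idea, assuming a toy in-memory index. `DOCS`, `find`, and `search_with_fallback` are hypothetical names invented for illustration, and the stop-word set is a small sample; none of this is the site's actual search code.

```python
SAMPLE_STOPWORDS = {"a", "the", "through", "where", "now"}  # illustrative subset

# Hypothetical miniature index: document title -> body text.
DOCS = {
    "ManyToManyField.through": "the through model for extra fields on a relation",
    "QuerySet API": "filter and exclude return a new queryset",
}


def find(query, stopwords):
    """Return titles whose body contains every non-stopword query term."""
    terms = [t for t in query.lower().split() if t not in stopwords]
    if not terms:
        return []
    return [
        title
        for title, body in DOCS.items()
        if all(term in body.split() for term in terms)
    ]


def search_with_fallback(query):
    """Search normally; if nothing is found, retry with stopwords kept."""
    hits = find(query, SAMPLE_STOPWORDS)
    if not hits:
        hits = find(query, stopwords=set())
    return hits


print(search_with_fallback("through"))  # ['ManyToManyField.through']
print(search_with_fallback("filter"))   # ['QuerySet API']
```

The second pass only runs when the first one comes back empty, so queries that already work are unaffected.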

@stale

stale bot commented Oct 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 4, 2022
@pauloxnet
Member

We can try to create a custom English stopword dictionary that leaves out the words that are relevant to Django.
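As a rough sketch of what preparing such a list could look like, the hypothetical helper below filters a stop-word list so that chosen words stay searchable. The sample input and the set of words to keep are assumptions for illustration, not an agreed list.

```python
def build_custom_stopwords(stopword_lines, keep_searchable):
    """Return the stop words minus the ones that should stay searchable."""
    keep = {word.lower() for word in keep_searchable}
    cleaned = (line.strip() for line in stopword_lines)
    return [word for word in cleaned if word and word.lower() not in keep]


# Illustrative sample only; the real list has one word per line in english.stop.
sample_lines = ["the", "through", "where", "now", "and"]
custom = build_custom_stopwords(sample_lines, {"through", "where", "now"})
print(custom)  # ['the', 'and']
```

PostgreSQL text search dictionaries accept a `StopWords` parameter naming a stop-word file, so a filtered list like this could in principle back a custom dictionary. How that would be wired into the docs site is not settled in this thread.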

@stale stale bot removed the stale label Oct 4, 2022
@thibaudcolas thibaudcolas changed the title Search for "through" finds nothing Documentation search for "through" finds nothing Mar 26, 2024
@thibaudcolas thibaudcolas changed the title Documentation search for "through" finds nothing Documentation search results relevance improvements Mar 26, 2024
@thibaudcolas
Member

thibaudcolas commented Mar 26, 2024

Noting the issue with stopwords – and also from #1496, we got the following recommendation:

Optimize the Documentation Search Algorithm: Evaluate the improvement of the internal search algorithm to provide more accurate and relevant results in response to user queries.

I’ve re-titled the issue accordingly so we consider more improvements than just stopwords refinements.

Related: Site-wide search #1499.

@boxed
Author

boxed commented Mar 26, 2024

Considering this simple and limited scope change has seen no improvement in several years, I don't think broadening the scope of the issue is a good idea.

Talking about this issue not moving forward... Could we maybe consider building something simple in front of the current code that does very simple string matching on just the titles in the documentation and shows those first? Maybe other hard-coded searches could be added too, since for example searching "group by" shows nothing of relevance.
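A minimal sketch of the title-matching layer, assuming the titles are available as a plain list and the existing search returns a list of titles. `titles_first` and the sample titles are hypothetical.

```python
def titles_first(query, titles, fulltext_results):
    """Put case-insensitive substring matches on titles ahead of other results."""
    needle = query.lower()
    title_hits = [title for title in titles if needle in title.lower()]
    rest = [result for result in fulltext_results if result not in title_hits]
    return title_hits + rest


# Made-up titles and a made-up full-text result list, for illustration only.
titles = ["Aggregation", "GROUP BY and values()", "Making queries"]
fulltext_results = ["Making queries"]
print(titles_first("group by", titles, fulltext_results))
# ['GROUP BY and values()', 'Making queries']
```

Because the title pass is plain substring matching, it is not affected by stop-word removal in the full-text query.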

@thibaudcolas
Member

thibaudcolas commented Mar 26, 2024

If anyone really wants to fix the issue with stopwords only – that’s still as welcome as it was until now.

This is a volunteer-run project, and this hasn’t been picked up in three years of it being defined as quite a narrow improvement. I think putting this in the broader context of search improvements will make it clearer to potential contributors what the goal here is. Personally what I’d like to see is a more strategic approach to this where we look at analytics on what searches are being made that have 0 results.
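For illustration, here is a sketch of the kind of zero-result report this would need, assuming searches were logged as `(query, result_count)` pairs. That log format and the `zero_result_queries` helper are assumptions; the thread does not describe what data is actually collected.

```python
from collections import Counter


def zero_result_queries(search_log):
    """Count normalized queries that returned no results, most frequent first."""
    counts = Counter(
        query.strip().lower() for query, result_count in search_log if result_count == 0
    )
    return counts.most_common()


# Hypothetical log entries: (query, number of results returned).
log = [("through", 0), ("Through", 0), ("queryset", 12), ("group by", 0)]
print(zero_result_queries(log))  # [('through', 2), ('group by', 1)]
```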

I don’t like the idea of hard-coded searches as we simply don’t have the capacity to maintain that kind of content. I’d rather we set up boosting based on headings (if that’s not already the case).
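A toy sketch of heading-based boosting, assuming each document is a `(title, body)` pair. The weights and sample documents are arbitrary illustrations, not the values the docs search uses.

```python
def score(query, title, body, title_weight=3.0, body_weight=1.0):
    """Weight occurrences in the title more heavily than those in the body."""
    needle = query.lower()
    return (
        title_weight * title.lower().count(needle)
        + body_weight * body.lower().count(needle)
    )


def rank(query, documents):
    """Return titles of matching documents, highest score first."""
    scored = [(score(query, title, body), title) for title, body in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for value, title in scored if value > 0]


# Made-up documents: the release notes mention the term more often in the body,
# but the heading match still wins.
documents = [
    ("Release notes", "aggregation fixes and aggregation changes"),
    ("Aggregation", "how to summarize values"),
]
print(rank("aggregation", documents))  # ['Aggregation', 'Release notes']
```

In Django's PostgreSQL search support, `SearchVector` accepts a `weight` argument that serves a similar purpose. Whether the current docs search already applies it to headings is the open question raised above.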

@boxed
Author

boxed commented Mar 26, 2024

I agree on the statistics being very useful.

@thibaudcolas
Member

We’ve decided the next steps are:

  • Review current common searches as part of Docs search: tweak results ranking so release notes have lower priority #1628. This will help us understand how often we have stopword issues.
  • Document the search algorithm so we can make more informed decisions (see search.py).
  • Decide whether it’s worthwhile to change the Postgres stopwords dictionary for English, or otherwise look at alternatives.

@jacklinke

@thibaudcolas What is the status (if any) on getting data for docs searches over some period of time?

Is the ops team tracking this need? I think we asked someone from ops about it during Sprints at DjangoCon US, but my memory isn't always great, and I'm not sure if there is a formal process for making a request like this or if the working group simply asking ops is sufficient.

@thibaudcolas
Member

@jacklinke yup, see #1628.

@thibaudcolas
Member

thibaudcolas commented Nov 23, 2024

To help us consider this and similar search improvements, I’ve requested help from Algolia to get the Django docs indexed in their Algolia DocSearch program. They provide free access to their Algolia Search product, for projects looking for developer documentation search.

Here’s where you can trial how it works: Trial: Algolia DocSearch.

This page is only intended to try out a different search implementation so we can improve ours, like we also have the Sphinx search setup available on django.readthedocs.io. I’ve only set it up to index Django 5.1 in English at this time.


For through specifically, its first 3 results are about the "through" keyword as it’s used in Django, and the ones after that are cases where through isn’t meaningful. It also works well when searching for "trough", as I’ve configured it to support that kind of typo-friendly matching.

[Screenshot: Algolia DocSearch UI searching for through, 5 results visible]
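As a rough illustration of typo-friendly matching, the sketch below uses Python's standard `difflib` to map a misspelled query onto known terms. This is only a stand-in to show the idea; it says nothing about how Algolia implements it, and the vocabulary is made up.

```python
import difflib

# Hypothetical vocabulary of indexed terms.
VOCABULARY = ["through", "template", "queryset", "migration"]


def suggest(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Return vocabulary terms that closely resemble the (possibly misspelled) query."""
    return difflib.get_close_matches(query.lower(), vocabulary, n=3, cutoff=cutoff)


print(suggest("trough"))  # ['through']
```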

Beyond through, the aspects that are interesting to me are:

  • Its indexed documents are page sections rather than whole pages. This feels immensely useful in the case of the Django developer docs, as the pages tend to be very long (see Friction in finding specific information via a web search #1734). Does that feel doable with the Postgres search?
  • Its UI being a type-ahead widget feels immensely more forgiving. If I don’t like the results, I can type something else and get new ones right away (see Improve the UX/UI in the search results page (to help choosing the right results) #1229). That seems like something we could do with the existing Django search relatively easily?
  • It comes with search analytics, so I can tell our documentation contributors exactly what searches are most common and which ones had no results in the last day / week / month. I can even give them direct access to just that.
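To make the section-level indexing idea concrete, here is a minimal sketch that splits a page into one index entry per heading. It assumes the page has already been parsed into `(kind, value)` blocks, and the anchor scheme is a simplification; `split_into_sections` is a hypothetical helper, not part of the existing search code.

```python
def split_into_sections(page_url, blocks):
    """Turn a parsed page into one index entry per heading.

    blocks is a list of (kind, value) pairs where kind is "heading" or "text".
    """
    sections = []
    current = {"url": page_url, "title": "", "text": []}
    for kind, value in blocks:
        if kind == "heading":
            if current["title"] or current["text"]:
                sections.append(current)
            # Simplified anchor scheme: lowercase, spaces replaced by hyphens.
            anchor = value.lower().replace(" ", "-")
            current = {"url": f"{page_url}#{anchor}", "title": value, "text": []}
        else:
            current["text"].append(value)
    if current["title"] or current["text"]:
        sections.append(current)
    return [{**section, "text": " ".join(section["text"])} for section in sections]


# Made-up page content, for illustration only.
blocks = [
    ("heading", "Many-to-many relationships"),
    ("text", "Use ManyToManyField."),
    ("heading", "Extra fields on many-to-many relationships"),
    ("text", "Use the through argument."),
]
for entry in split_into_sections("/topics/db/models/", blocks):
    print(entry["url"], "|", entry["title"])
```

Each entry carries its own anchor URL, so a result can link straight to the relevant part of a long page.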

If anyone would like access to the behind-the-scenes search admin please let me know. I use DocSearch for other projects so can give you a tour.

@pauloxnet
Member

pauloxnet commented Nov 23, 2024

I don't think the best solution is to use an external engine here. We have spent effort and time to use Django itself for the search in the documentation and to remove a lot of issues caused by the Elasticsearch synchronization.

The search function just needs a little tweaking. There have been many complaints over the years but unfortunately little help. I'm glad to see some interest in this area.

I still think that the Django website is also a showcase to demonstrate its potential as a web framework, and using an external search engine would be like admitting that Django's full-text search is not good enough to be used in a web portal.

I would rather put the effort needed to integrate an external engine into improving the search we already have, and also the documentation.

@boxed
Author

boxed commented Nov 23, 2024

Algolia looks very nice and fast, and it's great to see that it handles the sectioned docs (which Google, readthedocs, and the current system all fail on).

@alexgmin
Contributor

I don't think the best solution is to use an external engine here. We have spent effort and time to use Django itself for the search in the documentation and remove a lot of issues from the elastic search synchronization.

This is a bit of a sunk cost fallacy. This issue has existed for many years and it's still not solved, and it's not exactly a minor issue.

I still think that the Django website is also a showcase to demonstrate its potential as a web framework, and using an external search engine would be like admitting that Django's full-text search is not good enough to be used in a web portal.

Django is a comprehensive framework. Django provides the tools to build a full-text search solution, but in a case like the documentation, which is a pretty complicated one, we don't seem to have the resources to do it.

That doesn't mean it cannot be done in Django, but there are specific issues that need a custom implementation for each use case:

  • Ranking - For example, as Thibaud mentioned, deprioritizing things like release notes
  • Section indexing
  • Frontend UI/UX - This is something that is not Django's job to solve

Can all of these things be solved with Django's tools? Yes! That doesn't mean that, with the limited resources available, we should dedicate them to building and maintaining a fully comprehensive full-text search that is as good as, or close to, Algolia or another third-party option.
As another example, you can also build a forum in Django, but there was a decision to use Discourse because of the amount of resources/energy/time required. This doesn't mean Django is not good enough to build a forum with.

Basically, maybe we should actually consider Algolia or a third party solution rather than building our own.

@thibaudcolas
Member

I shared my trial so we could compare another implementation; I wouldn’t recommend that anyone consider another search engine at this point in time. Once the proposed website working group is up and running, we can ask them whether they’d consider such a big shift, and if so review multiple options, and if not make a plan to fix those long-standing search-related UX issues.

@pauloxnet with the current engine – what do you think of implementing type-ahead search, and changing the index so each entry is a section of a page, rather than the whole page?
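For the type-ahead half of that question, here is a toy sketch of prefix lookup over a sorted list of lowercase terms using the standard `bisect` module. The term list and `prefix_matches` are hypothetical; a real implementation would query the existing index rather than an in-memory list.

```python
import bisect


def prefix_matches(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix`.

    sorted_terms must be lowercase and sorted in ascending order.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Made-up terms, for illustration only.
terms = sorted(["through", "through_fields", "templates", "transactions"])
print(prefix_matches(terms, "thr"))  # ['through', 'through_fields']
```

Each keystroke would trigger one such lookup, which is what makes the results feel immediate.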
