v1.20.0-rc1
Pre-release

⭐ Highlights

🪄 LostInTheMiddleRanker and DiversityRanker
We are excited to introduce two new rankers to Haystack: LostInTheMiddleRanker and DiversityRanker!

LostInTheMiddleRanker is based on the research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. It reorders documents according to the "Lost in the Middle" strategy, which places the most relevant paragraphs at the beginning and end of the context, while less relevant paragraphs are positioned in the middle. This ranker can be used in Retrieval-Augmented Generation (RAG) pipelines. Here is an example of how to use it:
```python
from haystack import Pipeline
from haystack.nodes import WebRetriever, TopPSampler, DiversityRanker, LostInTheMiddleRanker

# `search_key` is your search engine API key; `prompt_node` is an already-configured PromptNode.
web_retriever = WebRetriever(api_key=search_key, top_search_results=5, mode="preprocessed_documents", top_k=50)
sampler = TopPSampler(top_p=0.97)
diversity_ranker = DiversityRanker()
litm_ranker = LostInTheMiddleRanker(word_count_threshold=1024)

pipeline = Pipeline()
pipeline.add_node(component=web_retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=sampler, name="Sampler", inputs=["Retriever"])
pipeline.add_node(component=diversity_ranker, name="DiversityRanker", inputs=["Sampler"])
pipeline.add_node(component=litm_ranker, name="LostInTheMiddleRanker", inputs=["DiversityRanker"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["LostInTheMiddleRanker"])
```
In this example, the LostInTheMiddleRanker is positioned as the last component before the PromptNode. It is designed to be used in combination with other rankers, so it is recommended to place it towards the end of the pipeline (as the last ranker), where it can reorder documents that have already been ranked by the other rankers.
DiversityRanker is a tool that helps increase the diversity of a set of documents. It uses sentence-transformer models to calculate semantic embeddings for each document and then ranks the documents so that each subsequent document is the least similar to the ones that have already been selected. The result is a list where each document contributes the most to the overall diversity of the selected set.

We'll reuse the example above to point out that the DiversityRanker can be used in combination with other rankers: it is recommended to place it in the pipeline after the similarity ranker but before the LostInTheMiddleRanker. Note that DiversityRanker is typically used in generative RAG pipelines to ensure that the generated answer is drawn from a diverse set of documents, a setup typical for Long-Form Question Answering (LFQA) tasks. Check out the Enhancing RAG Pipelines in Haystack: Introducing DiversityRanker and LostInTheMiddleRanker article on the Haystack Blog for details.
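For intuition, here is a minimal sketch of the greedy "most diverse next" ordering that DiversityRanker is based on. It is illustrative only: the model name, the choice of the first document, and the way similarities are aggregated are simplifying assumptions, not the component's exact implementation.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def greedy_diversity_order(texts, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    # Embed all documents once; normalized embeddings turn dot products into cosine similarities.
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, normalize_embeddings=True)

    selected = [0]  # simplifying assumption: start from the first document
    while len(selected) < len(texts):
        remaining = [i for i in range(len(texts)) if i not in selected]
        # Mean similarity of each remaining document to everything already selected.
        mean_sims = (embeddings[remaining] @ embeddings[selected].T).mean(axis=1)
        # Greedily pick the document least similar to the current selection.
        selected.append(remaining[int(np.argmin(mean_sims))])
    return [texts[i] for i in selected]
```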
📰 New release note management
We have implemented a new release note management system, `reno`. From now on, every contributor is responsible for adding release notes for the feature or bugfix they're introducing in Haystack in the same Pull Request containing the code changes. The goal is to encourage detailed and accurate notes for every release, especially when it comes to complex features or breaking changes.
See how to work with the new release notes in our Contribution Guide.
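As a quick illustration, a reno note is a small YAML file that you create with `reno new <slug>` and then edit by hand. The slug, file name, and note text below are made-up examples, and the section names shown are reno's defaults; check the Contribution Guide for Haystack's exact configuration.

```yaml
# Created with `reno new add-my-feature`; reno drops a stub like
# releasenotes/notes/add-my-feature-<hash>.yaml that you then edit:
features:
  - |
    Added MyNewComponent, which does X.
upgrade:
  - |
    Upgrading requires re-indexing because of Y.
```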
⬆️ Upgrade Notes
- If you're a Haystack contributor, you need a new tool called `reno` to manage the release notes. Please run `pip install -e .[dev]` to ensure you have `reno` available in your environment.
- The OpenSearch custom query syntax has changed: the old filter placeholders for `custom_query` are no longer supported. Replace your custom filter expressions with the new `${filters}` placeholder.

  Old:

  ```python
  retriever = BM25Retriever(
      custom_query="""
      {
          "query": {
              "bool": {
                  "should": [
                      {"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                  ],
                  "filter": [
                      {"terms": {"year": ${years}}},
                      {"terms": {"quarter": ${quarters}}},
                      {"range": {"date": {"gte": ${date}}}}
                  ]
              }
          }
      }
      """
  )
  retriever.retrieve(
      query="What is the meaning of life?",
      filters={"years": [2019, 2020], "quarters": [1, 2, 3], "date": "2019-03-01"}
  )
  ```

  New:

  ```python
  retriever = BM25Retriever(
      custom_query="""
      {
          "query": {
              "bool": {
                  "should": [
                      {"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                  ],
                  "filter": ${filters}
              }
          }
      }
      """
  )
  retriever.retrieve(
      query="What is the meaning of life?",
      filters={"year": [2019, 2020], "quarter": [1, 2, 3], "date": {"$gte": "2019-03-01"}}
  )
  ```
- This update impacts only those who have created custom invocation layers by subclassing `PromptModelInvocationLayer`. Previously, the `invoke()` method in your custom layer received all prompt template parameters (such as `query`, `documents`, etc.) as keyword arguments. With this change, these parameters are no longer passed in as keyword arguments, so you may need to update your custom layer accordingly (see the sketch below).
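To make the note above concrete, here is a minimal sketch of what a custom invocation layer might look like after this change. The class name, the model-name check, and the echoed response are made-up placeholders; the key point is that `invoke()` should rely on the rendered `prompt` passed via `kwargs` rather than on template variables such as `query` or `documents`.

```python
from typing import List

from haystack.nodes.prompt.invocation_layer import PromptModelInvocationLayer


class EchoInvocationLayer(PromptModelInvocationLayer):
    """Illustrative layer that simply echoes the rendered prompt."""

    def __init__(self, model_name_or_path: str = "echo-model", **kwargs):
        super().__init__(model_name_or_path)

    def invoke(self, *args, **kwargs) -> List[str]:
        # Prompt template parameters (query, documents, ...) are no longer passed in here;
        # work with the fully rendered prompt instead.
        prompt = kwargs.get("prompt", "")
        return [f"echo: {prompt}"]

    def _ensure_token_limit(self, prompt: str) -> str:
        # Truncation logic would go here; a no-op is enough for this sketch.
        return prompt

    @classmethod
    def supports(cls, model_name_or_path: str, **kwargs) -> bool:
        return model_name_or_path == "echo-model"
```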
🥳 New Features
- The `LostInTheMiddleRanker` can be used like other rankers in Haystack. After initializing it with the desired parameters, you can use it to rank/reorder a list of documents based on the "Lost in the Middle" order: the most relevant documents are located at the top and bottom of the returned list, while the least relevant documents are found in the middle. We advise using this ranker in combination with other rankers and placing it towards the end of the pipeline (see the standalone usage sketch after this list).
- The `DiversityRanker` can be used like other rankers in Haystack and can be particularly helpful when you have highly relevant yet similar sets of documents. By ensuring a diversity of documents, this new ranker facilitates a more comprehensive utilization of the documents and, particularly in RAG pipelines, potentially contributes to more accurate and richer model responses.
- When using `custom_query` in `BM25Retriever` along with OpenSearch or Elasticsearch, we added support for dynamic `filters`, like in regular queries. With this change, you can pass filters at query time without having to modify the `custom_query`: instead of defining filter expressions and field placeholders, all you have to do is set the `${filters}` placeholder, analogous to the `${query}` placeholder, in your `custom_query`. For example:

  ```
  {
      "query": {
          "bool": {
              "should": [
                  {"multi_match": {
                      "query": ${query},          // mandatory query placeholder
                      "type": "most_fields",
                      "fields": ["content", "title"]}}
              ],
              "filter": ${filters}                // optional filters placeholder
          }
      }
  }
  ```
- `DeepsetCloudDocumentStore` supports searching multiple fields in sparse queries. This enables you to search meta fields as well when using `BM25Retriever`. For example, set `search_fields=["content", "title"]` to search the `title` meta field along with the document `content`.
- Rework `DocumentWriter` to remove `DocumentStoreAwareMixin`. Now we require a generic `DocumentStore` when initializing the writer.
- Rework `MemoryRetriever` to remove `DocumentStoreAwareMixin`. Now we require a `MemoryDocumentStore` when initializing the retriever.
- Introduced the `allowed_domains` parameter in `WebRetriever` for domain-specific searches, thus enabling "talk to a website" and "talk to docs" scenarios.
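As referenced above, here is a minimal standalone sketch of the two new rankers outside a pipeline, assuming the usual ranker `predict()` API; the query and document contents are made-up examples.

```python
from haystack import Document
from haystack.nodes import DiversityRanker, LostInTheMiddleRanker

docs = [
    Document(content="Berlin is the capital of Germany."),
    Document(content="Paris is the capital of France."),
    Document(content="Berlin has about 3.7 million inhabitants."),
]

diversity_ranker = DiversityRanker()
litm_ranker = LostInTheMiddleRanker(word_count_threshold=1024)

query = "Tell me about European capitals"
# First diversify, then apply the "Lost in the Middle" ordering, mirroring the pipeline above.
diverse_docs = diversity_ranker.predict(query=query, documents=docs)
reordered_docs = litm_ranker.predict(query=query, documents=diverse_docs)
```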
✨ Enhancements
- The WebRetriever now employs an enhanced caching mechanism that caches web page content based on search engine results rather than the query.
- Upgrade transformers to the latest version 4.32.1 so that Haystack benefits from Llama and T5 bugfixes: https://github.com/huggingface/transformers/releases/tag/v4.32.1
- Upgrade transformers to version 4.32.0. This version adds support for GPTQ quantization and integrates MPT models.
- Add a `top_k` parameter to the DiversityRanker init method.
- Enable setting the `max_length` value when running PromptNodes using local HF text2text-generation models (see the sketch after this list).
- Enable passing `use_fast` to the underlying transformers' pipeline.
- Enhance FileTypeClassifier to detect media file types like mp3, mp4, mpeg, m4a, and similar.
- Minor PromptNode HFLocalInvocationLayer test improvements.
- Several minor enhancements for LinkContentFetcher:
  - Dynamic content handler resolution
  - Custom User-Agent header (optional, minimize blocking)
  - PDF support
  - Register new content handlers
- If LinkContentFetcher encounters a block or receives any response code other than HTTPStatus.OK, it returns the search engine snippet as content, if available.
- Allow loading Tokenizers for prompt models not natively supported by transformers by setting `trust_remote_code` to `True`.
- Refactor and simplify WebRetriever to use the LinkContentFetcher component.
- Remove template variables from invocation layer kwargs.
- Allow WebRetriever users to specify a custom LinkContentFetcher instance.
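For the `max_length` enhancement referenced above, a minimal sketch with a local HF text2text-generation model might look like this; the model name and value are arbitrary examples, and passing `max_length` through the PromptNode constructor is assumed here.

```python
from haystack.nodes import PromptNode

# max_length caps the number of tokens generated by the local text2text-generation model.
prompt_node = PromptNode(model_name_or_path="google/flan-t5-base", max_length=256)
result = prompt_node("Summarize why very long contexts can hurt LLM answers.")
```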
🐛 Bug Fixes
- Fix a bug where the responses of Agents using local HF models contained the prompt text.
- Fix issue 5485: `TransformersImageToText.generate_captions` accepts `str`.
- Fix StopWordsCriteria not checking stop word tokens in a continuous and sequential order.
- Ensure the leading whitespace in the generated text is preserved when using `stop_words` in the Hugging Face invocation layer of the PromptNode.
- Restrict the criteria for identifying an OpenAI model in the PromptNode and in the EmbeddingRetriever. Previously, the criteria were quite loose, leading to more false positives.
- Make the Crawler work properly with Selenium>=4.11.0. The Crawler is also simplified, as the new version of Selenium automatically finds or installs the necessary drivers.
👁️ Haystack 2.0 preview
- Add `FileExtensionClassifier` to preview components.
- Add `SentenceTransformersDocumentEmbedder`. It computes embeddings of Documents; the embedding of each Document is stored in its `embedding` field.
- Add `SentenceTransformersTextEmbedder`, a simple component that embeds strings into vectors.
- Add `Answer` base class for Haystack v2.
- Add `GeneratedAnswer` and `ExtractedAnswer`.
- Improve error messaging in the FileExtensionClassifier constructor to avoid common mistakes.
- Migrate existing v2 components to Canals 0.4.0.
- Fix TextFileToDocument using the wrong Document class.
- Change import paths under the "preview" package to minimize module namespace pollution.
- Migrate all components to Canals==0.7.0.
- Add serialization and deserialization methods for all Haystack components.
- Add new `DocumentWriter` component to Haystack v2 preview so that documents can be written to stores.
- Copy lazy_imports.py to preview.
- Remove `BaseTestComponent` class used to test `Component`s.
- Remove `DocumentStoreAwareMixin` as it's not necessary anymore.
- Remove Pipeline specialisation to support DocumentStores.
- Add Sentence Transformers Embedding Backend. It will be used by Embedder components and is responsible for computing embeddings.
- Add utility `store_class` factory function to create `Store`s for testing purposes.
- Add `from_dict` and `to_dict` methods to the `Store` Protocol.
- Add default `from_dict` and `to_dict` implementations to classes decorated with `@store`.
- Add new TextFileToDocument component to Haystack v2 preview so that text files can be converted to Haystack Documents.