
Releases: deepset-ai/haystack

v2.0.0-beta.6

05 Feb 15:51
c3a9dac
Pre-release

Release Notes

v2.0.0-beta.6

⬆️ Upgrade Notes

  • Upgraded the default converter in PyPDFToDocument to insert page breaks ("\f") between each extracted page. This allows downstream components and applications to better keep track of the original PDF page a portion of text comes from.

  • ⚠️ Breaking change: Update secret handling for components using the Secret type. The following components are affected: RemoteWhisperTranscriber, AzureOCRDocumentConverter, AzureOpenAIDocumentEmbedder, AzureOpenAITextEmbedder, HuggingFaceTEIDocumentEmbedder, HuggingFaceTEITextEmbedder, OpenAIDocumentEmbedder, SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder, AzureOpenAIGenerator, AzureOpenAIChatGenerator, HuggingFaceLocalChatGenerator, HuggingFaceTGIChatGenerator, OpenAIChatGenerator, HuggingFaceLocalGenerator, HuggingFaceTGIGenerator, OpenAIGenerator, TransformersSimilarityRanker, SearchApiWebSearch, SerperDevWebSearch

    The default init parameters for api_key, token, and azure_ad_token have been adjusted to use environment variables wherever possible. The azure_ad_token_provider parameter has been removed from Azure-based components. Hugging Face-based components now require either a token or an environment variable if authentication is needed; the on-disk local token file is no longer supported.

Required actions:
To accommodate this breaking change, check the expected environment variable name for the api_key of each affected component you use and provide your API keys via those environment variables. Alternatively, if that's not an option, use the Secret.from_token function to wrap any bare string API tokens. Note that pipelines using token secrets cannot be serialized/deserialized.
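
For example, a minimal sketch of both options for OpenAIGenerator (the expected environment variable name is component-specific; OPENAI_API_KEY is assumed here):

    from haystack.components.generators import OpenAIGenerator
    from haystack.utils import Secret

    # Preferred: resolve the key from the environment variable the component expects.
    generator = OpenAIGenerator(api_key=Secret.from_env("OPENAI_API_KEY"))

    # Fallback: wrap a bare token. Pipelines holding token secrets cannot be serialized.
    generator = OpenAIGenerator(api_key=Secret.from_token("sk-..."))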

🚀 New Features

  • Expose a Secret type to provide consistent API for any component that requires secrets for authentication. Currently supports string tokens and environment variables. Token-based secrets are automatically prevented from being serialized to disk (to prevent accidental leakage of secrets).

    from typing import Optional

    from haystack import component, default_from_dict, default_to_dict
    from haystack.utils import Secret
    
    @component
    class MyComponent:
      def __init__(self, api_key: Optional[Secret] = None, **kwargs):
        self.api_key = api_key
        self.backend = None
    
      def warm_up(self):
        # Call resolve_value to yield a single result. The semantics of the result is policy-dependent.
        # Currently, all supported policies will return a single string token.
        self.backend = SomeBackend(api_key=self.api_key.resolve_value() if self.api_key else None, ...)
    
      def to_dict(self):
        # Serialize the policy like any other (custom) data. If the policy is token-based, it will
        # raise an error.
        return default_to_dict(self, api_key=self.api_key.to_dict() if self.api_key else None, ...)
    
      @classmethod
      def from_dict(cls, data):
        # Deserialize the policy data before passing it to the generic from_dict function.
        api_key_data = data["init_parameters"]["api_key"]
        api_key = Secret.from_dict(api_key_data) if api_key_data is not None else None
        data["init_parameters"]["api_key"] = api_key
        return default_from_dict(cls, data)
    
    # No authentication.
    component = MyComponent(api_key=None)
    # Token based authentication
    component = MyComponent(api_key=Secret.from_token("sk-randomAPIkeyasdsa32ekasd32e"))
    component.to_dict() # Error! Can't serialize authentication tokens
    # Environment variable based authentication
    component = MyComponent(api_key=Secret.from_env("OPENAI_API_KEY"))
    component.to_dict() # This is fine
  • Adds support for the Exact Match metric to EvaluationResult.calculate_metrics(...):

    from haystack.evaluation.metrics import Metric 
    exact_match_metric = eval_result.calculate_metrics(Metric.EM, output_key="answers")
  • Adds support for the F1 metric to EvaluationResult.calculate_metrics(...):

    from haystack.evaluation.metrics import Metric 
    f1_metric = eval_result.calculate_metrics(Metric.F1, output_key="answers")
  • Adds support for the Semantic Answer Similarity (SAS) metric to EvaluationResult.calculate_metrics(...):

    from haystack.evaluation.metrics import Metric 
    sas_metric = eval_result.calculate_metrics(
        Metric.SAS,
        output_key="answers",
        model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    )
  • Introducing the HuggingFaceLocalChatGenerator, a new chat-based generator designed for leveraging chat models from Hugging Face's (HF) model hub. Users can now perform inference with chat-based models in a local runtime, utilizing familiar HF generation parameters, stop words, and even employing custom chat templates for custom message formatting. This component also supports streaming responses and is optimized for compatibility with a variety of devices.

    Here is an example of how to use the HuggingFaceLocalChatGenerator:

    from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
    from haystack.dataclasses import ChatMessage
    
    generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
    generator.warm_up()
    messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")] 
    print(generator.run(messages))

⚡️ Enhancement Notes

  • Change Pipeline.add_component() to fail if the Component instance has already been added in another Pipeline.
  • Use device_map when loading a TransformersSimilarityRanker and ExtractiveReader. This allows for multi-device inference and for loading quantized models (e.g. load_in_8bit=True)
  • Add a meta parameter to ByteStream.from_file_path() and ByteStream.from_string() (see the sketch after this list).
  • Add query and document prefix options for the TransformersSimilarityRanker
  • The name default_streaming_callback was confusing: it was the go-to helper for quickly printing generated tokens as they arrive, yet it was not used as a default anywhere. The function has been renamed to print_streaming_chunk.
  • Speed up the import of the Document dataclass. Importing Document was slow because the whole pandas and numpy packages were imported; now only the necessary classes and functions are imported.
  • Introduce weighted score normalization for the DocumentJoiner's reciprocal rank fusion, enhancing the relevance of document sorting by allowing customizable influence on the final scores.
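
A minimal sketch of the new ByteStream meta parameter mentioned above (values are illustrative):

    from haystack.dataclasses import ByteStream

    # Attach custom metadata when constructing a ByteStream from a string.
    stream = ByteStream.from_string("hello world", meta={"source": "inline-example"})
    print(stream.meta["source"])  # inline-example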

🐛 Bug Fixes

  • Fix auto-complete never working for any Component
  • Fix Haystack imports failing when using a local development environment that doesn't have haystack-ai installed.
  • Remove all mentions of Canals by renaming some variables. __canals_input__ and __canals_output__ have been renamed to __haystack_input__ and __haystack_output__ respectively. CANALS_VARIADIC_ANNOTATION has been renamed to HAYSTACK_VARIADIC_ANNOTATION, and its value changed from __canals__variadic_t to __haystack__variadic_t. The default Pipeline debug_path has been changed from .canals_debug to .haystack_debug.

v1.24.0

25 Jan 16:17

Release Notes

Highlights

🪨 Amazon Bedrock supports new embedding models (#6406)

You can now use Titan and Cohere embedding models in your pipelines via the Amazon Bedrock integration.

  from haystack.nodes import EmbeddingRetriever

  retriever = EmbeddingRetriever(
      embedding_model="amazon.titan-embed-text-v1",
      document_store=document_store,
      aws_config={"aws_access_key_id": "ACCESS_KEY",
                  "aws_secret_access_key": "SECRET_KEY",
                  "aws_session_token": "SESSION_TOKEN"})

🕸️ Use any WebDriver you want in Crawler (#5465)

The WebDriver that powers Haystack's crawler is no longer limited to Chrome.
Now you can configure it to use whatever WebDriver you'd like.
See our Crawler docs for more info.
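
For example, a minimal sketch using a Selenium Firefox driver (the driver setup is illustrative):

  from selenium import webdriver
  from haystack.nodes import Crawler

  # Pass a pre-configured driver instead of the default Chrome one.
  firefox_driver = webdriver.Firefox()
  crawler = Crawler(urls=["https://haystack.deepset.ai"], webdriver=firefox_driver)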

v1.24.0

🚀 New Features

  • Add a Bedrock Embeddings encoder to use as a retriever.
  • Add an optional webdriver parameter to Crawler. This allows using a pre-configured custom webdriver instead of creating the default Chrome webdriver.

⚡️ Enhancement Notes

  • Add model_kwargs to FARMReader to allow loading the model in fp16 at inference time (see the sketch after this list)
  • Make JoinDocuments sensitive to the weights parameter when join_mode is reciprocal rank fusion, and add score normalization for that mode.
  • Optimize documents upsert in PineconeDocumentStore (write_documents) by enabling asynchronous requests.
  • Add model_kwargs argument to SentenceTransformersRanker to be able to pass through HF transformers loading options
  • Use batching in the predict method, since multiple documents are usually passed at inference time. Allow the model to be loaded in torch.float16 by adding pipeline_kwargs to the init method.
  • Correctly calculate the max token limit for gpt-3.5-turbo-1106
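
A minimal sketch of fp16 loading via model_kwargs (the torch_dtype key is an assumption about what the underlying Hugging Face loader accepts):

  import torch
  from haystack.nodes import FARMReader

  # Hypothetical: forward a half-precision dtype to the model loader.
  reader = FARMReader(
      model_name_or_path="deepset/roberta-base-squad2",
      model_kwargs={"torch_dtype": torch.float16},
  )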

🐛 Bug Fixes

  • Correctly calculate the answer page number for Extractive Answers
  • Fixed a bug that caused the EmbeddingRetriever to return no documents when used with a MongoDBAtlasDocumentStore. MongoDBAtlasDocumentStore now accepts a vector_search_index parameter, which needs to be created beforehand in the MongoDB Atlas web UI, following their documentation.

v1.24.0-rc1

24 Jan 17:17
0c0d538
Pre-release


v2.0.0-beta.5

17 Jan 16:30
d1bdb8c
Pre-release

Release Notes

v2.0.0-beta.5

⬆️ Upgrade Notes

  • Implement framework-agnostic device representations. The main impetus behind this change is to move away from stringified representations of devices that are not portable between different frameworks. It also enables support for multi-device inference in a generic manner.

    Going forward, components can expose a single, optional device parameter in their constructor (Optional[ComponentDevice]):

from typing import Optional

from haystack.utils import ComponentDevice, Device, DeviceMap

class MyComponent(Component):
    def __init__(self, device: Optional[ComponentDevice] = None):
        # If device is None, automatically select a device.
        self.device = ComponentDevice.resolve_device(device)

    def warm_up(self):
        # Call the framework-specific conversion method.
        self.model = AutoModel.from_pretrained(
            "deepset/bert-base-cased-squad2", device=self.device.to_hf()
        )

# Automatically selects a device.
c = MyComponent(device=None)
# Uses the first GPU available.
c = MyComponent(device=ComponentDevice.from_str("cuda:0"))
# Uses the CPU.
c = MyComponent(device=ComponentDevice.from_single(Device.cpu()))
# Allows the component to use multiple devices via a device map.
c = MyComponent(device=ComponentDevice.from_multiple(DeviceMap({
    "layer1": Device.cpu(),
    "layer2": Device.gpu(1),
    "layer3": Device.disk(),
})))
  • Change any occurrence of:
    from haystack.components.routers.document_joiner import DocumentJoiner

    to:
    from haystack.components.joiners.document_joiner import DocumentJoiner

  • Change the imports for in_memory document store and retrievers from:

    from haystack.document_stores import InMemoryDocumentStore
    from haystack.components.retrievers import InMemoryEmbeddingRetriever

    to:

    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

  • Rename the transcriber parameters model_name and model_name_or_path to model. This change affects both the LocalWhisperTranscriber and RemoteWhisperTranscriber classes.

  • Rename the embedder parameters model_name and model_name_or_path to model. This change affects all Embedder classes (see the example after this list).

  • Rename model_name_or_path to model in NamedEntityExtractor.

  • Rename model_name_or_path to model in TransformersSimilarityRanker.

  • Rename the parameter model_name_or_path to model in ExtractiveReader.

  • Rename the generator parameters model_name and model_name_or_path to model. This change affects all Generator classes.
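
For example, for an embedder the rename looks like this (the model name is illustrative):

    from haystack.components.embedders import SentenceTransformersTextEmbedder

    # Before:
    embedder = SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2")
    # After:
    embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")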

🚀 New Features

  • Adds a calculate_metrics() function to EvaluationResult for computing evaluation metrics, a Metric class to store the list of available metrics, and a MetricsResult class to store the metric values computed during the evaluation.

  • Added a new extractor component, NamedEntityExtractor. This component accepts a list of Documents as input; the raw text in the documents is annotated by the extractor, and the annotations are stored in each document's meta dictionary (under the key named_entities).
    The component is designed to support multiple NER backends; two are implemented at the moment, Hugging Face and spaCy, each supporting any HF/spaCy model capable of token classification/NER (see the sketch after this list).

  • Add component.set_input_type() to set a Component input's name, type, and default value.

  • Adds support for single metadata dictionary input in MarkdownToDocument.

  • Adds support for single metadata dictionary input in TikaDocumentConverter.
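
A minimal sketch of the NamedEntityExtractor with the Hugging Face backend (the backend identifier and model name are assumptions):

    from haystack.components.extractors import NamedEntityExtractor
    from haystack.dataclasses import Document

    extractor = NamedEntityExtractor(backend="hugging_face", model="dslim/bert-base-NER")
    extractor.warm_up()
    docs = [Document(content="My name is Clara and I live in Berkeley.")]
    extractor.run(documents=docs)
    # Annotations are stored in each document's meta dictionary.
    print(docs[0].meta["named_entities"])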

⚡️ Enhancement Notes

  • Add a field called default_value to the InputSocket dataclass. Derive the is_mandatory value from the presence of default_value.
  • Added split_by="page" to DocumentSplitter, which splits the document at "\f".
  • Modify the output type of CacheChecker from List[Any] to List to make it possible to connect it in a Pipeline.
  • Highlight optional connections in the Pipeline.draw() output.
  • Improve the URLCacheChecker so that it can work with any type of data in the DocumentStore, not just URL caching. Rename the component to CacheChecker.
  • Prevent the MetaFieldRanker from throwing an error if one or more documents don't contain the specified meta field. Those documents are now ignored for ranking purposes and placed at the end of the ranked list, so they are not thrown away completely. Add a sort_order parameter that can be descending or ascending, and add more runtime parameters (see the sketch after this list).
  • Create a new package called joiners and move DocumentJoiner there for clarity.
  • Stop exposing in_memory package symbols in the haystack.document_stores and haystack.components.retrievers root namespaces.
  • Add example script about how to use Multiplexer to route meta to file converters.
  • Adds support for single metadata dictionary input in AzureOCRDocumentConverter. In this way, additional metadata can be added to all files processed by this component even when the length of the list of sources is unknown.
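
A minimal sketch of the new MetaFieldRanker options (the meta field name is illustrative):

    from haystack.components.rankers import MetaFieldRanker

    # Rank by a "date" meta field in descending order; documents missing the
    # field are placed at the end of the list instead of raising an error.
    ranker = MetaFieldRanker(meta_field="date", sort_order="descending")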

🐛 Bug Fixes

  • Fix ComponentMeta ignoring keyword-only parameters in the run method. ComponentMeta.__call__ handles the creation of InputSockets for the component's inputs when the latter has not explicitly called _Component.set_input_types(). This logic was not correctly handling keyword-only parameters.
  • Fixes the error descriptor '__dict__' for 'ComponentClassX' objects doesn't apply to a 'ComponentClassX' object when calling dir() on a component instance. This fix should allow auto-completion in code editors.
  • Prevent InMemoryBM25Retriever from returning documents with a score of 0.0.
  • Fix pytest breaking in VSCode due to a name collision in the RAG pipeline tests.
  • Correctly handle the serialization and deserialization of torch.dtype. This concerns the following components: ExtractiveReader, HuggingFaceLocalGenerator, and TransformersSimilarityRanker.

v2.0.0-beta.4

08 Jan 11:30
ae96c2e
Pre-release

Release Notes

v2.0.0-beta.4

⬆️ Upgrade Notes

  • If you have a LocalWhisperTranscriber in a pipeline, change the audio_files input name to sources. Similarly for standalone invocation of the component, pass sources instead of audio_files to the run() method.

🚀 New Features

  • Add HuggingFace TEI Embedders - HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder.

    An example using HuggingFaceTEITextEmbedder to embed a string:

    from haystack.components.embedders import HuggingFaceTEITextEmbedder

    text_to_embed = "I love pizza!"
    text_embedder = HuggingFaceTEITextEmbedder(
        model="BAAI/bge-small-en-v1.5",
        url="<your-tei-endpoint-url>",
        token="<your-token>",
    )
    print(text_embedder.run(text_to_embed))
    # {'embedding': [0.017020374536514282, -0.023255806416273117, ...]

    An example using HuggingFaceTEIDocumentEmbedder to create Document embeddings:

    from haystack.dataclasses import Document
    from haystack.components.embedders import HuggingFaceTEIDocumentEmbedder

    doc = Document(content="I love pizza!")
    document_embedder = HuggingFaceTEIDocumentEmbedder(
        model="BAAI/bge-small-en-v1.5",
        url="<your-tei-endpoint-url>",
        token="<your-token>",
    )
    result = document_embedder.run([doc])
    print(result["documents"][0].embedding)
    # [0.017020374536514282, -0.023255806416273117, ...]
  • Adds AzureOpenAIDocumentEmbedder and AzureOpenAITextEmbedder as new embedders. These embedders are very similar to their OpenAI counterparts, but they use the Azure API instead of the OpenAI API.

  • Adds support for Azure OpenAI models with AzureOpenAIGenerator and AzureOpenAIChatGenerator components.

  • Adds RAG OpenAPI services integration.

  • Introduces answer deduplication on the Document level based on an overlap threshold.

  • Add Multiplexer. For an example of its usage, see #6420.

  • Adds support for single metadata dictionary input in TextFileToDocument (see the sketch below).
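
A minimal sketch of the single-dictionary input (file names are placeholders):

    from haystack.components.converters import TextFileToDocument

    converter = TextFileToDocument()
    # One meta dict is applied to all produced Documents.
    converter.run(sources=["a.txt", "b.txt"], meta={"project": "demo"})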

⚡️ Enhancement Notes

  • Add support for ByteStream to LocalWhisperTranscriber and unify its input socket names with the other components in Haystack.
  • Rename metadata to meta. Rename metadata_fields_to_embed to meta_fields_to_embed in all Embedders. Rename metadata_field to meta_field in MetaFieldRanker.
  • Rename all metadata references to meta.
  • Change the DocumentWriter default policy from DuplicatePolicy.FAIL to DuplicatePolicy.NONE. The DocumentStore protocol uses the same default so that different Document Stores can choose the default policy that best fits them (see the sketch after this list).
  • Move serialize_type and deserialize_type into the utils module.
  • The HTMLToDocument converter now allows choosing the boilerpy3 extractor to extract the content from the HTML document. The default extractor has been changed to DefaultExtractor, which is better for generic use cases than the previous default (ArticleExtractor).
  • Adds scale_score, which lets users choose whether their document scores should be raw logits or scaled between 0 and 1 (using the sigmoid function); this feature already existed in Haystack v1 and is being moved over. Adds calibration_factor, following the example of the ExtractiveReader, which gives users better control over the spread of scores when scaling with sigmoid. Adds score_threshold, also copied from the ExtractiveReader, which optionally returns only documents with a score above the threshold.
  • Add a RAG self-correction loop example.
  • Adds support for single metadata dictionary input in HTMLToDocument.
  • Adds support for single metadata dictionary input in PyPDFToDocument.
  • Split DynamicPromptBuilder into DynamicPromptBuilder and DynamicChatPromptBuilder
  • Depend on our own rank_bm25 fork.
  • Add meta_fields_to_embed following the implementation in SentenceTransformersDocumentEmbedder to be able to embed meta fields along with the content of the document.
  • Add a new model_kwargs variable to the TransformersSimilarityRanker so we can pass different loading options supported by Hugging Face. Add device-availability checking when the user passes None to the device init param; the priority is GPU, then MPS, then CPU.
  • Update OpenAIChatGenerator to handle both tool and function calling: the tools and functions generation_kwargs parameters now enable function/tool invocation.
  • Upgrade to OpenAI client version 1.x
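
A minimal sketch of opting back into the old DocumentWriter behavior (the import paths are assumptions for this beta):

    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack.document_stores.types import DuplicatePolicy

    document_store = InMemoryDocumentStore()
    # The writer now defaults to DuplicatePolicy.NONE; fail on duplicates explicitly.
    writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.FAIL)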

⚠️ Deprecation Notes

  • Deprecate GPTGenerator and GPTChatGenerator. Replace them with OpenAIGenerator and OpenAIChatGenerator.

🐛 Bug Fixes

  • Fix Pipeline.connect() so it connects sockets with the same name if multiple sockets with compatible types are found.

v2.0.0-beta.3

15 Dec 16:55
4fdbcfa
Pre-release

Release Notes

v2.0.0-beta.3

⬆️ Upgrade Notes

  • If you are using AzureOCRDocumentConverter or TikaDocumentConverter, you need to change paths to sources in the run method.

    An example:

    from haystack.components.converters import TikaDocumentConverter

    converter = TikaDocumentConverter()
    converter.run(paths=["path/to/file1.pdf", "path/to/file2.pdf"])

    The last line should be changed to:

    converter.run(sources=["path/to/file1.pdf", "path/to/file2.pdf"])

⚡️ Enhancement Notes

  • Adds Markdown MIME type support to the file type router, i.e., the FileTypeRouter class.

  • Refactor the Answer dataclass and the classes that inherited it. Answer is now a Protocol; classes that used to inherit from it now respect that interface. We also added a new ExtractiveTableAnswer to be used for table question answering.

    All classes are now easily serializable using to_dict() and from_dict(), like Document and components.

  • Make all Converters accept meta in the run method so that users can provide their own metadata. The length of this list should match the number of sources (see the sketch after this list).

  • Make all the Converters accept the sources parameter in the run method. sources is a list that can contain str, Path or ByteStream objects.

  • Renamed the confidence_threshold parameter of the ExtractiveReader to score_threshold, as ExtractedAnswers have a score and that is what the threshold applies to. For consistency, the term confidence is no longer used in favor of score.

  • Include 'boilerpy3' in the 'haystack-ai' dependencies.
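
A minimal sketch of the uniform sources/meta interface (file names and bytes are placeholders):

    from pathlib import Path

    from haystack.components.converters import TikaDocumentConverter
    from haystack.dataclasses import ByteStream

    converter = TikaDocumentConverter()
    converter.run(
        sources=["file1.pdf", Path("file2.pdf"), ByteStream(b"%PDF-...")],
        meta=[{"src": "a"}, {"src": "b"}, {"src": "c"}],  # one dict per source
    )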

Known Issues

  • Make connect idempotent, allowing connecting the same components more than once. Especially useful in Jupyter notebooks. Fixes #6359.
  • Fix "TypeError: descriptor '__dict__' for 'XXX' objects doesn't apply to a 'XXX' object" when running pipelines with debug=True by removing the graph image from the debug payload.

🐛 Bug Fixes

  • Make TransformersSimilarityRanker run with a list containing a single document as input.

v1.23.0

14 Dec 13:35

⭐️ Highlights

🪨 Amazon Bedrock support for PromptNode (#6226)

Haystack now supports Amazon Bedrock models, including all existing and previously announced
models, like Llama-2-70b-chat. To use these models, simply pass the model ID in the
model_name_or_path parameter, like you do for any other model. For details, see
Amazon Bedrock Documentation.

For example, the following code loads the Llama 2 Chat 13B model:

from haystack.nodes import PromptNode

prompt_node = PromptNode(model_name_or_path="meta.llama2-13b-chat-v1")

🗺️ Support for MongoDB Atlas Document Store (#6471)

With this release, we introduce support for MongoDB Atlas as a Document Store. Try it with:

from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string="mongodb+srv://USER:PASSWORD@HOST/?retryWrites=true&w=majority",
    database_name="database",
    collection_name="collection",
)
...
document_store.write_documents(...)

Note that you need MongoDB Atlas credentials to fill the connection string. You can get such credentials by registering here: https://www.mongodb.com/cloud/atlas/register

🚀 New Features

  • Add PptxConverter: a node to convert pptx files to Haystack Documents.

  • Add split_length by token in PreProcessor (see the sketch after this list).

  • Support for dense embedding instructions used in retrieval models such as BGE and LLM-Embedder.

  • You can use Amazon Bedrock models in Haystack.

  • Add MongoDBAtlasDocumentStore, providing support for MongoDB Atlas as a document store.
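
A minimal sketch of token-based splitting (parameter values are illustrative; tokenizer defaults are an assumption):

  from haystack.nodes import PreProcessor

  preprocessor = PreProcessor(split_by="token", split_length=256, split_overlap=32)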

⚡️ Enhancement Notes

  • Change the PromptModel constructor parameter invocation_layer_class to also accept a str.
    If a str is used, the invocation layer class will be imported and used.
    This should ease serialization to YAML when using invocation_layer_class with PromptModel.

  • Users can now define the number of pods and pod type directly when creating a PineconeDocumentStore instance.

  • Add batch_size to the init method of FAISS Document Store. This works as the default value for all methods of
    FAISS Document Store that support batch_size.

  • Introduces a new timeout keyword argument in PromptNode, addressing and fixing the issue #5380 for enhanced control over individual calls to OpenAI.

  • Upgrade Transformers to the latest version 4.35.2
    This version adds support for DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, Owl-v2.

  • Upgraded openai-whisper to version 20231106 and simplified installation through re-introduced audio extra.
    The latest openai-whisper version unpins its tiktoken dependency, which resolved a version conflict with Haystack's dependencies.

  • Make it possible to load additional fields from the SQuAD format file into the meta field of the Labels.

  • Add new variable model_kwargs to the ExtractiveReader so we can pass different loading options supported by
    HuggingFace.

  • Add new token limit for gpt-4-1106-preview model.

🐛 Bug Fixes

  • Fix Pipeline.load_from_deepset_cloud to work with the latest version of deepset Cloud.

  • When using JoinDocuments with join_mode=concatenate (default) and
    passing duplicate documents, including some with a null score, this
    node raised an exception.
    Now this case is handled correctly and the documents are joined as expected.

  • Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py and thus
    ensures that they are included in JSON schema generation.


v2.0.0-beta.1

04 Dec 15:20
b25e5e8
Pre-release

Introduction

We are happy to officially share Haystack 2.0-beta with you. The new version is a complete rework of the pipeline, our core concept, with production readiness, ease of use, and customizability in mind.

Haystack 2.0-Beta Documentation.
Check the available features in this Beta release (see section below).
Try out Haystack 2.0-Beta in “Advent of Haystack”.

What does the “Beta” mean for me?

Production readiness also means caring about stability. Therefore, we decided to release a beta version now and test it thoroughly in public over the next weeks. We will add more features, and we might introduce breaking changes, until the stable 2.0 release in late Q1 2024.

We invite you to try this beta version and give candid feedback; it will be heard, and we will change Haystack accordingly. We’ve put together 10 code challenges for you in our “Advent of Haystack” to get your hands on it. We don’t recommend migrating your production pipelines to 2.0 beta yet.

We will support Haystack 1.x with updates and important features being added to the codebase even after the final 2.0.0 release, to give users time to migrate.

⭐️ What’s changed?

For a detailed overview of what’s changed in this Beta release, check out our article “Introducing Haystack 2.0 and Advent of Haystack”.

The bulk of the work in this release introduces changes to the fundamental design of the pipeline and its components.

In the last few months, we've been working with our community members and partners to start adding integrations for Haystack 2.0. Today, along with the beta package, you can also try integrations tagged with Haystack 2.0 in our Integration inventory!

🚀 Getting started

One way to get started with Haystack 2.0 Beta is to participate in the “Advent of Haystack” and give us feedback on how you got along.

To install the new package:

pip install haystack-ai

To use a simple RAG pipeline:

from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.pipeline_utils import build_rag_pipeline

API_KEY = "sk-xxx" # ADD YOUR OPENAI API KEY

# We support many different databases. Here we load a simple and lightweight in-memory document store.
document_store = InMemoryDocumentStore()

# Create some example documents and add them to the document store.
documents = [
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome."),
]
document_store.write_documents(documents)

# Let's now build a simple RAG pipeline that uses a generative model to answer questions.
rag_pipeline = build_rag_pipeline(llm_api_key=API_KEY, document_store=document_store)
answers = rag_pipeline.run(query="Who lives in Rome?")
print(answers.data)

For more details on how to get started see: https://docs.haystack.deepset.ai/v2.0/docs/get_started

🪶 List of Features

✅ Ready in this Beta release

🏗️ Under construction

Feature status in Haystack 2.0-Beta:

Document Stores
InMemoryDocumentStore ✅
ElasticsearchDocumentStore ✅
OpenSearchDocumentStore ✅
ChromaDocumentStore ✅
MarqoDocumentStore ✅
FAISSDocumentStore 🏗️
PineconeDocumentStore 🏗️
WeaviateDocumentStore 🏗️
MilvusDocumentStore 🏗️
QdrantDocumentStore 🏗️
PGVectorDocumentStore 🏗️
MongoDBAtlasDocumentStore 🏗️

Generators
GPTGenerator ✅
HuggingFaceLocalGenerator ✅
HuggingFaceTGIGenerator ✅
GradientGenerator ✅
Anthropic - Claude 🏗️
Cohere - generate ✅
AzureGPT 🏗️
AWS Bedrock 🏗️
AWS SageMaker 🏗️
PromptNode 🏗️
PromptBuilder ✅
AnswerBuilder ✅

Embedders
OpenAI Embedder ✅
SentenceTransformers Embedder ✅
Cohere - embed 🏗️
Gradient Embedder (external) ✅

Retrievers
InMemoryBM25Retriever ✅
InMemoryEmbeddingRetriever ✅
ElasticsearchBM25Retriever ✅
ElasticsearchEmbeddingRetriever ✅
OpensearchBM25Retriever ✅
OpensearchEmbeddingRetriever ✅
SerperDevWebSearch ✅
MultiModalRetriever 🏗️
TableTextRetriever 🏗️
DensePassageRetriever 🏗️

Rankers
TransformersSimilarityRanker ✅
CohereRanker 🏗️
DiversityRanker 🏗️
LostInTheMiddleRanker 🏗️
RecentnessRanker 🏗️
MetaFieldRanker ✅

Readers
ExtractiveReader (successor of both FARMReader and TransformersReader) ✅
TableReader 🏗️

Data Processing
Local + Remote WhisperTranscriber ✅
UrlCacheChecker ✅
LinkContentFetcher ✅
AzureOCRDocumentConverter ✅
HTMLToDocument ✅
PyPDFToDocument ✅
TikaDocumentConverter ✅
TextFileToDocument ✅
MarkdownToDocument ✅
DocumentCleaner ✅
TextDocumentSplitter ✅
TextLanguageClassifier ✅
FileTypeRouter ✅
MetadataRouter ✅
DocumentWriter ✅
DocumentJoiner ✅

Misc
Evaluation 🏗️
Agents 🏗️
Conversational Agent 🏗️
TopPSampler ✅
TransformersSummarizer 🏗️
TransformersTranslator 🏗️

v1.22.1

09 Nov 16:44
d804ac6

Release Notes

v1.22.1

Enhancement Notes

  • Add new token limit for gpt-4-1106-preview model

Bug Fixes

  • When using JoinDocuments with join_mode=concatenate (default) and passing duplicate documents, including some with a null score, this node raised an exception. Now this case is handled correctly and the documents are joined as expected.

v1.22.0

07 Nov 15:02
58fa94c

Release Notes

v1.22.0

⭐️ Highlights

Some additions to Haystack 2.0 preview:

New additions include a ByteStream type for binary data abstraction and the ChatMessage data class to streamline chat LLM component integration. AzureOCRDocumentConverter, HTMLToDocument, and PyPDFToDocument further expand document conversion capabilities. TransformersSimilarityRanker and TopPSampler improve document ranking and query handling. HuggingFaceLocalGenerator adds to the ever-growing set of LLM components. These updates, along with a host of minor fixes and refinements, mark a major step towards the upcoming Haystack 2.0 beta release.

⬆️ Upgrade Notes

  • This update enables all Pinecone index types to be used, including Starter. Previously, the Pinecone Starter index type couldn't be used as a document store. Due to limitations of this index type (https://docs.pinecone.io/docs/starter-environment), fetching documents is currently limited to Pinecone's query vector limit (10,000 vectors). Accordingly, if the number of documents in the index exceeds this limit, some PineconeDocumentStore functions will be limited.
  • Removes the audio, ray, onnx and beir extras from the extra group all.

🚀 New Features

  • Add experimental support for asynchronous Pipeline run

⚡️ Enhancement Notes

  • Added support for Apple Silicon GPU acceleration through PyTorch's "mps" backend, enabling better performance on Apple M1 hardware.
  • Document writer returns the number of documents written.
  • Added support for using on_final_answer through the Agent callback_manager.
  • Add asyncio support to the OpenAI invocation layer.
  • PromptNode can now be run asynchronously by calling the arun method.
  • Add a search_engine_kwargs param to WebRetriever so it can be propagated to WebSearch. This is useful, for example, to pass the engine id when using Google Custom Search (see the sketch after this list).
  • Upgrade Transformers to the latest version 4.34.1. This version adds support for the new Mistral, Persimmon, BROS, ViTMatte, and Nougat models.
  • Make JoinDocuments return only the document with the highest score if there are duplicate documents in the list.
  • Add list_of_paths argument to utils.convert_files_to_docs to allow input of list of file paths to be converted, instead of, or as well as, the current dir_path argument.
  • Optimize particular methods from PineconeDocumentStore (delete_documents and _get_vector_count)
  • Update the deepset Cloud SDK to the new endpoint format for saving new pipeline configs.
  • Add alias names for Cohere embed models for easier mapping between names.
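
A minimal sketch of forwarding engine-specific options via search_engine_kwargs (the engine id key name is an assumption):

  from haystack.nodes import WebRetriever

  retriever = WebRetriever(
      api_key="<search-api-key>",
      search_engine_kwargs={"engine_id": "<your-engine-id>"},
  )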

⚠️ Deprecation Notes

  • Deprecate OpenAIAnswerGenerator in favor of PromptNode. OpenAIAnswerGenerator will be removed in Haystack 1.23.

🐛 Bug Fixes

  • Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py and thus ensures that they are included in JSON schema generation.
  • Fixed the bug that prevented the correct usage of ChatGPT invocation layer in 1.21.1. Added async support for ChatGPT invocation layer.
  • Added a document_store.update_embeddings call to the pipeline examples so that embeddings are calculated for newly added documents.
  • Remove unsupported medium and finance-sentiment models from supported Cohere embed model list

🩵 Haystack 2.0 preview

  • Add AzureOCRDocumentConverter to convert files of different types using Azure's Document Intelligence Service.
  • Add ByteStream type to send binary raw data across components in a pipeline.
  • Introduce ChatMessage data class to facilitate structured handling and processing of message content within LLM chat interactions.
  • Adds ChatMessage templating in PromptBuilder
  • Adds HTMLToDocument component to convert HTML to a Document.
  • Adds SimilarityRanker, a component that ranks a list of Documents based on their similarity to the query.
  • Introduce the StreamingChunk dataclass for efficiently handling chunks of data streamed from a language model, encapsulating both the content and associated metadata for systematic processing.
  • Adds TopPSampler, a component that selects documents based on the cumulative probability of the Document scores using top-p (nucleus) sampling.
  • Add dumps, dump, loads and load methods to save and load pipelines in Yaml format.
  • Adopt Hugging Face token instead of the deprecated use_auth_token. Add this parameter to ExtractiveReader and SimilarityRanker to allow loading private models. Proper handling of token during serialization: if it is a string (a possible valid token) it is not serialized.
  • Add mime_type field to ByteStream dataclass.
  • The Document dataclass checks if id_hash_keys is None or empty in __post_init__. If so, it uses the default factory to set a default valid value.
  • Rework Document.id generation: if an id is not explicitly set, it's generated using all Document fields' values; score is not used.
  • Change Document's embedding field type from numpy.ndarray to List[float]
  • Fixed a bug that caused TextDocumentSplitter and DocumentCleaner to ignore id_hash_keys and create Documents with duplicate ids if the documents differed only in their metadata.
  • Fix TextDocumentSplitter failing when run with an empty list
  • Better management of the API key in GPT Generator. The API key is never serialized. The api_base_url parameter is now actually used (previously it was ignored).
  • Add a minimal version of HuggingFaceLocalGenerator, a component that can run Hugging Face models locally to generate text.
  • Migrate RemoteWhisperTranscriber to OpenAI SDK.
  • Add OpenAI Document Embedder. It computes embeddings of Documents using OpenAI models. The embedding of each Document is stored in the embedding field of the Document.
  • Add the TextDocumentSplitter component for Haystack 2.0 that splits a Document with long text into multiple Documents with shorter texts, so that the texts match the maximum length that the language models in Embedders or other components can process.
  • Refactor OpenAIDocumentEmbedder to enrich documents with embeddings instead of recreating them.
  • Refactor SentenceTransformersDocumentEmbedder to enrich documents with embeddings instead of recreating them.
  • Remove "api_key" from serialization of AzureOCRDocumentConverter and SerperDevWebSearch.
  • Removed implementations of from_dict and to_dict from all components where they had the same effect as the default implementation from Canals: https://github.com/deepset-ai/canals/blob/main/canals/serialization.py#L12-L13 This refactoring does not change the behavior of the components.
  • Remove array field from Document dataclass.
  • Remove id_hash_keys field from Document dataclass. id_hash_keys has been also removed from Components that were using it:
    • DocumentCleaner
    • TextDocumentSplitter
    • PyPDFToDocument
    • AzureOCRDocumentConverter
    • HTMLToDocument
    • TextFileToDocument
    • TikaDocumentConverter
  • Enhanced file routing capabilities with the introduction of ByteStream handling, and improved clarity by renaming the router to FileTypeRouter.
  • Rename MemoryDocumentStore to InMemoryDocumentStore, MemoryBM25Retriever to InMemoryBM25Retriever, and MemoryEmbeddingRetriever to InMemoryEmbeddingRetriever.
  • Renamed ExtractiveReader's input from document to documents to match its type List[Document].
  • Rename SimilarityRanker to TransformersSimilarityRanker, as there will be more similarity rankers in the future.
  • Allow specifying stopwords to stop text generation for HuggingFaceLocalGenerator.
  • Add basic telemetry to Haystack 2.0 pipelines
  • Added DocumentCleaner, which removes extra whitespace, empty lines, headers, etc. from Documents containing text. Useful as a preprocessing step before splitting...