v1.22.0-rc1

Pre-release
@masci released this 30 Oct 14:38 · 1062 commits to main since this release · 0fb3b82

Upgrade Notes

  • This update enables all Pinecone index types to be used, including
    Starter. Previously, the Pinecone Starter index type couldn't be used
    as a document store. Due to the limitations of this index type
    (https://docs.pinecone.io/docs/starter-environment), fetching
    documents is limited to Pinecone's query vector limit (10,000
    vectors). Accordingly, if the number of documents in the index
    exceeds this limit, some PineconeDocumentStore functions will be
    limited (see the sketch after this list).
  • Removes the audio, ray, onnx, and beir extras from the all extras
    group.
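
For illustration, a minimal sketch of pointing the document store at a Starter index. The API key and environment name are placeholders, and the constructor arguments shown are the common ones, not an exhaustive or required set:

```python
from haystack.document_stores import PineconeDocumentStore

# Starter (free-tier) indexes can now back the document store.
# Fetching documents is capped by Pinecone's 10,000-vector query limit.
document_store = PineconeDocumentStore(
    api_key="YOUR_PINECONE_API_KEY",  # placeholder
    environment="gcp-starter",        # Pinecone Starter environment
    index="documents",
    embedding_dim=768,
    similarity="cosine",
)
```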

New Features

  • Add experimental support for asynchronous
    Pipeline run
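
A minimal sketch of what an asynchronous run could look like. Because this feature is experimental, the arun coroutine name below is an assumption that mirrors the new PromptNode.arun listed under Enhancement Notes; check the API reference for the actual entry point. The model name and API key are placeholders:

```python
import asyncio

from haystack import Pipeline
from haystack.nodes import PromptNode

pipeline = Pipeline()
pipeline.add_node(
    component=PromptNode("gpt-3.5-turbo", api_key="YOUR_OPENAI_API_KEY"),
    name="prompt_node",
    inputs=["Query"],
)

async def main():
    # Assumed coroutine counterpart of Pipeline.run (experimental API).
    result = await pipeline.arun(query="What is Haystack?")
    print(result)

asyncio.run(main())
```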

Enhancement Notes

  • Added support for Apple Silicon GPU acceleration through PyTorch's
    "mps" device, enabling better performance on Apple M1 hardware.
  • Document writer returns the number of documents written.
  • Added support for using on_final_answer through the Agent's
    callback_manager.
  • Add asyncio support to the OpenAI invocation layer.
  • PromptNode can now be run asynchronously by calling the arun method
    (see the sketch after this list).
  • Add search_engine_kwargs param to
    WebRetriever so it can be propagated to WebSearch. This is useful,
    for example, to pass the engine id when using Google Custom Search.
  • Upgrade Transformers to the latest version 4.34.1. This version adds
    support for the new Mistral, Persimmon, BROS, ViTMatte, and Nougat
    models.
  • Make JoinDocuments return only the document with the highest score
    if there are duplicate documents in the list.
  • Add a list_of_paths argument to utils.convert_files_to_docs to allow
    passing a list of file paths to convert, instead of, or in addition
    to, the current dir_path argument (see the sketch after this list).
  • Optimize particular methods from PineconeDocumentStore
    (delete_documents and _get_vector_count)
  • Update the deepset Cloud SDK to the new endpoint format for saving
    pipeline configs.
  • Add alias names for Cohere embed models for easier mapping between
    model names.
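
A sketch of the new asynchronous PromptNode call; the model name and API key are placeholders, and the keyword argument assumes arun mirrors the run signature:

```python
import asyncio

from haystack.nodes import PromptNode

prompt_node = PromptNode("gpt-3.5-turbo", api_key="YOUR_OPENAI_API_KEY")

async def main():
    # arun is the asynchronous counterpart of PromptNode.run.
    result = await prompt_node.arun(query="Summarize Haystack in one sentence.")
    print(result)

asyncio.run(main())
```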
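
And a sketch of the new list_of_paths argument; the file paths are placeholders:

```python
from pathlib import Path

from haystack.utils import convert_files_to_docs

# Convert an explicit list of files instead of (or in addition to) a directory.
docs = convert_files_to_docs(list_of_paths=[Path("report.pdf"), Path("faq.txt")])
print(f"Converted {len(docs)} documents")
```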

Deprecation Notes

  • Deprecate OpenAIAnswerGenerator in
    favor of PromptNode.
    OpenAIAnswerGenerator will be removed
    in Haystack 1.23.
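
A hedged migration sketch: replace OpenAIAnswerGenerator with a PromptNode configured for question answering. The model name, API key, and prompt template name below are illustrative assumptions, not values prescribed by this release:

```python
from haystack.nodes import PromptNode

# Instead of OpenAIAnswerGenerator, use a PromptNode with a QA-style prompt template.
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
    default_prompt_template="deepset/question-answering",  # assumed PromptHub template name
)
```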

Bug Fixes

  • Fixed the bug that prevented the correct usage of the ChatGPT
    invocation layer in 1.21.1. Added async support for the ChatGPT
    invocation layer.
  • Added a documents_store.update_embeddings call to pipeline examples
    so that embeddings are calculated for newly added documents (see the
    sketch after this list).
  • Remove the unsupported medium and finance-sentiment models from the
    supported Cohere embed model list.
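
The pattern added to the examples looks roughly like this sketch; the document store, retriever model, and docs list are placeholders:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = InMemoryDocumentStore(embedding_dim=384)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

document_store.write_documents(docs)         # docs: your list of Documents
document_store.update_embeddings(retriever)  # compute embeddings for the newly written documents
```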

Haystack 2.0 preview

  • Add AzureOCRDocumentConverter to convert files of different types
    using Azure's Document Intelligence Service.
  • Add ByteStream type to send binary raw data across components in a
    pipeline.
  • Introduce the ChatMessage data class to facilitate structured
    handling and processing of message content within LLM chat
    interactions (see the sketch after this list).
  • Adds ChatMessage templating in
    PromptBuilder
  • Adds HTMLToDocument component to convert HTML to a Document.
  • Adds SimilarityRanker, a component that ranks a list of Documents
    based on their similarity to the query.
  • Introduce the StreamingChunk dataclass for efficiently handling
    chunks of data streamed from a language model, encapsulating both
    the content and associated metadata for systematic processing.
  • Adds TopPSampler, a component that selects documents based on the
    cumulative probability of the Document scores using top-p (nucleus)
    sampling.
  • Add dumps, dump, loads, and load methods to save and load pipelines
    in YAML format (see the sketch after this list).
  • Adopt the Hugging Face token parameter instead of the deprecated
    use_auth_token. Add this parameter to ExtractiveReader and
    SimilarityRanker to allow loading private models. Proper handling of
    token during serialization: if it is a string (a potentially valid
    token), it is not serialized.
  • Add mime_type field to
    ByteStream dataclass.
  • The Document dataclass checks if
    id_hash_keys is None or empty in
    __post_init__. If so, it uses the default factory to set a
    default valid value.
  • Rework Document.id generation: if an id is not explicitly set, it is
    generated using all Document fields' values; score is not used.
  • Change Document's embedding field type from numpy.ndarray to
    List[float].
  • Fixed a bug that caused TextDocumentSplitter and DocumentCleaner to
    ignore id_hash_keys and create Documents with duplicate ids if the
    documents differed only in their metadata.
  • Fix TextDocumentSplitter failing when run with an empty list
  • Better management of the API key in the GPT Generator. The API key
    is never serialized. Make the api_base_url parameter actually used
    (previously it was ignored).
  • Add a minimal version of HuggingFaceLocalGenerator, a component that
    can run Hugging Face models locally to generate text.
  • Migrate RemoteWhisperTranscriber to OpenAI SDK.
  • Add OpenAI Document Embedder. It computes embeddings of Documents
    using OpenAI models. The embedding of each Document is stored in the
    embedding field of the Document.
  • Add the TextDocumentSplitter component for Haystack 2.0 that splits
    a Document with long text into multiple Documents with shorter
    texts, so that the texts match the maximum length that the language
    models in Embedders or other components can process.
  • Refactor OpenAIDocumentEmbedder to enrich documents with embeddings
    instead of recreating them.
  • Refactor SentenceTransformersDocumentEmbedder to enrich documents
    with embeddings instead of recreating them.
  • Remove "api_key" from serialization of AzureOCRDocumentConverter and
    SerperDevWebSearch.
  • Removed implementations of from_dict and to_dict from all components
    where they had the same effect as the default implementation from
    Canals:
    https://github.com/deepset-ai/canals/blob/main/canals/serialization.py#L12-L13
    This refactoring does not change the behavior of the components.
  • Remove array field from
    Document dataclass.
  • Remove id_hash_keys field from
    Document dataclass.
    id_hash_keys has been also removed
    from Components that were using it:
    • DocumentCleaner
    • TextDocumentSplitter
    • PyPDFToDocument
    • AzureOCRDocumentConverter
    • HTMLToDocument
    • TextFileToDocument
    • TikaDocumentConverter
  • Enhanced file routing capabilities with the introduction of
    ByteStream handling, and improved
    clarity by renaming the router to
    FileTypeRouter.
  • Rename MemoryDocumentStore to InMemoryDocumentStore,
    MemoryBM25Retriever to InMemoryBM25Retriever, and
    MemoryEmbeddingRetriever to InMemoryEmbeddingRetriever.
  • Renamed ExtractiveReader's input from
    document to
    documents to match its type
    List[Document].
  • Rename SimilarityRanker to
    TransformersSimilarityRanker, as
    there will be more similarity rankers in the future.
  • Allow specifying stop words to stop text generation for
    HuggingFaceLocalGenerator (see the sketch after this list).
  • Add basic telemetry to Haystack 2.0 pipelines
  • Added DocumentCleaner, which removes extra whitespace, empty lines,
    headers, etc. from Documents containing text. Useful as a
    preprocessing step before splitting into shorter text documents.
  • Add TextLanguageClassifier component so that an input string, for
    example a query, can be routed to different components based on the
    detected language.
  • Upgrade canals to 0.9.0 to support variadic inputs for Joiner
    components and "/" in connection names like "text/plain".
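
For illustration, a sketch of building a chat prompt with the new ChatMessage data class. The import path and the from_system/from_user factory methods reflect the 2.0 preview package layout at the time and should be treated as assumptions:

```python
from haystack.preview.dataclasses import ChatMessage

# Assumed factory helpers; construct messages for an LLM chat interaction.
messages = [
    ChatMessage.from_system("You are a concise technical assistant."),
    ChatMessage.from_user("What does the TopPSampler component do?"),
]
```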
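
A sketch of the new YAML (de)serialization methods, assuming dumps/loads work on strings and dump/load on file-like objects:

```python
from haystack.preview import Pipeline

pipe = Pipeline()
# ... add and connect components here ...

yaml_str = pipe.dumps()              # serialize the pipeline to a YAML string
restored = Pipeline.loads(yaml_str)  # rebuild an equivalent pipeline from YAML

with open("pipeline.yaml", "w") as f:
    pipe.dump(f)                     # or write the YAML directly to a file object
```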
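
And a sketch of local generation with stop words. The import path, constructor parameters (model name, task, generation_kwargs, stop_words), and the "replies" output key are assumptions based on the preview API, not confirmed by the release text:

```python
from haystack.preview.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model_name_or_path="google/flan-t5-large",  # assumed parameter name
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 100},
    stop_words=["Observation:"],                # stop generating when this string appears
)
generator.warm_up()
result = generator.run(prompt="Explain nucleus sampling in one sentence.")
print(result["replies"])
```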