v2.0.0-beta.5
Pre-release
Release Notes
⬆️ Upgrade Notes
- Implement framework-agnostic device representations. The main impetus behind this change is to move away from stringified representations of devices that are not portable between different frameworks. It also enables support for multi-device inference in a generic manner.
Going forward, components can expose a single, optional device parameter in their constructor (Optional[ComponentDevice]):
from typing import Optional

from transformers import AutoModel

from haystack.utils import ComponentDevice, Device, DeviceMap

class MyComponent(Component):
    def __init__(self, device: Optional[ComponentDevice] = None):
        # If device is None, automatically select a device.
        self.device = ComponentDevice.resolve_device(device)

    def warm_up(self):
        # Call the framework-specific conversion method.
        self.model = AutoModel.from_pretrained("deepset/bert-base-cased-squad2", device=self.device.to_hf())

# Automatically selects a device.
c = MyComponent(device=None)
# Uses the first GPU available.
c = MyComponent(device=ComponentDevice.from_str("cuda:0"))
# Uses the CPU.
c = MyComponent(device=ComponentDevice.from_single(Device.cpu()))
# Allow the component to use multiple devices using a device map.
c = MyComponent(device=ComponentDevice.from_multiple(
    DeviceMap({
        "layer1": Device.cpu(),
        "layer2": Device.gpu(1),
        "layer3": Device.disk()
    })
))
- Change any occurrence of:
from haystack.components.routers.document_joiner import DocumentJoiner
to:
from haystack.components.joiners.document_joiner import DocumentJoiner
- Change the imports for the in-memory document store and retrievers from:
from haystack.document_stores import InMemoryDocumentStore
from haystack.components.retrievers import InMemoryEmbeddingRetriever
to:
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
- Rename the transcriber parameters `model_name` and `model_name_or_path` to `model`. This change affects both `LocalWhisperTranscriber` and `RemoteWhisperTranscriber` classes.
- Rename the embedder parameters `model_name` and `model_name_or_path` to `model`. This change affects all Embedder classes.
- Rename `model_name_or_path` to `model` in `NamedEntityExtractor`.
- Rename `model_name_or_path` to `model` in `TransformersSimilarityRanker`.
- Rename parameter `model_name_or_path` to `model` in `ExtractiveReader`.
- Rename the generator parameters `model_name` and `model_name_or_path` to `model`. This change affects all Generator classes.
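The import moves and parameter renames above are mechanical, so they can be scripted. The following is a minimal sketch and not part of Haystack: `migrate_source` and `migrate_kwargs` are hypothetical helpers, and the text-based rewrite assumes the old imports appear in exactly the forms shown in these notes.

```python
import re

# Old import prefix -> new import prefix, taken from the upgrade notes above.
IMPORT_MOVES = {
    "from haystack.components.routers.document_joiner import":
        "from haystack.components.joiners.document_joiner import",
    "from haystack.document_stores import InMemoryDocumentStore":
        "from haystack.document_stores.in_memory import InMemoryDocumentStore",
}

# Matches e.g. `from haystack.components.retrievers import InMemoryBM25Retriever`.
_RETRIEVER_IMPORT = re.compile(
    r"from haystack\.components\.retrievers import (InMemory\w+Retriever)"
)

# Parameter names that were renamed to `model`.
OLD_MODEL_KEYS = ("model_name", "model_name_or_path")


def migrate_source(source: str) -> str:
    """Rewrite the old import paths in a piece of source text (hypothetical helper)."""
    for old, new in IMPORT_MOVES.items():
        source = source.replace(old, new)
    return _RETRIEVER_IMPORT.sub(
        r"from haystack.components.retrievers.in_memory import \1", source
    )


def migrate_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with the old model parameters renamed to `model`."""
    migrated = dict(kwargs)
    for old_key in OLD_MODEL_KEYS:
        if old_key in migrated:
            value = migrated.pop(old_key)
            if "model" in migrated and migrated["model"] != value:
                raise ValueError(f"Conflicting values for 'model' and '{old_key}'")
            migrated["model"] = value
    return migrated
```

For example, `migrate_kwargs({"model_name_or_path": "my-model", "top_k": 3})` yields a dictionary with the keys `model` and `top_k`. A plain text rewrite like this will miss aliased or multi-line imports, so review the result by hand.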
🚀 New Features
- Adds a `calculate_metrics()` function to `EvaluationResult` for computing evaluation metrics. Adds a `Metric` class to store the list of available metrics. Adds a `MetricsResult` class to store the metric values computed during the evaluation.
- Added a new extractor component, `NamedEntityExtractor`. This component accepts a list of Documents as its input. The raw text in the documents is annotated by the extractor, and the annotations are stored in each document's meta dictionary (under the key `named_entities`).
The component is designed to support multiple NER backends, and the current implementation supports two: Hugging Face and spaCy. These backends support any HF or spaCy model that supports token classification or NER, respectively.
- Add a `component.set_input_type()` function to set a Component input name, type and default value.
- Adds support for single metadata dictionary input in `MarkdownToDocument`.
- Adds support for single metadata dictionary input in `TikaDocumentConverter`.
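With a single metadata dictionary, the same metadata is attached to every source, while a list still supplies one dictionary per source. A rough sketch of that normalization follows, assuming a hypothetical helper named `normalize_meta`; the converters' actual implementation may differ.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def normalize_meta(
    meta: Optional[Union[Meta, List[Meta]]], sources_count: int
) -> List[Meta]:
    """Return one metadata dictionary per source (hypothetical helper)."""
    if meta is None:
        # No metadata given: every source gets an empty dictionary.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError(
            "The length of the metadata list must match the number of sources."
        )
    return [dict(m) for m in meta]
```

Copying the dictionary for each source keeps later per-document changes from leaking into the other documents.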
⚡️ Enhancement Notes
- Add a field called `default_value` to the `InputSocket` dataclass. Derive the `is_mandatory` value from the presence of `default_value`.
- Added `split_by` "page" to `DocumentSplitter`, which splits the document at "\f".
- Modify the output type of `CacheChecker` from `List[Any]` to `List` to make it possible to connect it in a Pipeline.
- Highlight optional connections in the `Pipeline.draw()` output.
- Improve the `URLCacheChecker` so that it can work with any type of data in the DocumentStore, not just URL caching. Rename the component to `CacheChecker`.
- Prevent the `MetaFieldRanker` from throwing an error if one or more of the documents doesn't contain the specified meta data field. Those documents are now ignored for ranking purposes and placed at the end of the ranked list, so they are not discarded. Add a `sort_order` parameter that can be `descending` or `ascending`. Add more runtime parameters.
- Create a new package called `joiners` and move `DocumentJoiner` there for clarity.
- Stop exposing `in_memory` package symbols in the `haystack.document_stores` and `haystack.components.retrievers` root namespaces.
- Add example script about how to use Multiplexer to route meta to file converters.
- Adds support for single metadata dictionary input in `AzureOCRDocumentConverter`. In this way, additional metadata can be added to all files processed by this component even when the length of the list of sources is unknown.
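Two of the behaviours described above, splitting at "\f" and ranking by a meta field with a fallback for documents that lack it, can be sketched in plain Python. This is an illustration only: documents are modelled as plain dictionaries, and `split_by_page` and `rank_by_meta_field` are hypothetical names, not the Haystack API.

```python
from typing import Any, Dict, List


def split_by_page(text: str) -> List[str]:
    """Split text at form-feed characters, one chunk per page."""
    return text.split("\f")


def rank_by_meta_field(
    documents: List[Dict[str, Any]],
    meta_field: str,
    sort_order: str = "descending",
) -> List[Dict[str, Any]]:
    """Sort documents by a meta field; documents without it go last."""
    if sort_order not in ("ascending", "descending"):
        raise ValueError("sort_order must be 'ascending' or 'descending'")
    with_field = [d for d in documents if meta_field in d.get("meta", {})]
    without_field = [d for d in documents if meta_field not in d.get("meta", {})]
    ranked = sorted(
        with_field,
        key=lambda d: d["meta"][meta_field],
        reverse=(sort_order == "descending"),
    )
    # Documents missing the field keep their original relative order at the end.
    return ranked + without_field
```

Appending the documents that lack the field, instead of dropping them, matches the intent stated in the note: they take no part in the ranking but remain in the output.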
🐛 Bug Fixes
- Fix `ComponentMeta` ignoring keyword-only parameters in the `run` method. `ComponentMeta.__call__` handles the creation of InputSockets for the component's inputs when the latter has not explicitly called `_Component.set_input_types()`. This logic was not correctly handling keyword-only parameters.
- Fixes the error "descriptor '__dict__' for 'ComponentClassX' objects doesn't apply to a 'ComponentClassX' object" raised when calling `dir()` on a component instance. This fix should allow auto-completion in code editors.
- Prevent `InMemoryBM25Retriever` from returning documents with a score of 0.0.
- Fix `pytest` breaking in VSCode due to a name collision in the RAG pipeline tests.
- Correctly handle the serialization and deserialization of `torch.dtype`. This concerns the following components: `ExtractiveReader`, `HuggingFaceLocalGenerator`, and `TransformersSimilarityRanker`.
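To illustrate the keyword-only parameter fix, here is a minimal sketch of deriving input sockets from a `run` signature with the standard `inspect` module. It is not Haystack's implementation: `SocketSketch`, `sockets_from_run` and `Greeter` are hypothetical names, and the only behaviour borrowed from these notes is that a socket is mandatory when it has no default value.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Dict

_EMPTY = inspect.Parameter.empty  # sentinel meaning "no default provided"


@dataclass
class SocketSketch:
    """Hypothetical stand-in for an input socket."""
    name: str
    type: Any
    default_value: Any = _EMPTY

    @property
    def is_mandatory(self) -> bool:
        # Mandatory exactly when no default value is present.
        return self.default_value is _EMPTY


def sockets_from_run(run_method: Callable) -> Dict[str, SocketSketch]:
    """Build one socket per named parameter, including keyword-only ones."""
    sockets = {}
    for name, param in inspect.signature(run_method).parameters.items():
        if name == "self":
            continue
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue  # *args and **kwargs carry no fixed name or type
        sockets[name] = SocketSketch(name, param.annotation, param.default)
    return sockets


class Greeter:
    # `greeting` and `punctuation` are keyword-only parameters.
    def run(self, name: str, *, greeting: str = "Hello", punctuation: str):
        return {"text": f"{greeting}, {name}{punctuation}"}
```

Calling `sockets_from_run(Greeter.run)` returns sockets for `name`, `greeting` and `punctuation`. Iterating over `Signature.parameters` covers keyword-only parameters as well, which is the case the original logic mishandled.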