Is your feature request related to a problem? Please describe.
With allennlp 2.7.0 and transformers 4.11.3, LayoutLMv2 is not supported:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/allennlp/allennlp/modules/token_embedders/pretrained_transformer_mismatched_embedder.py", line 80, in __init__
self._matched_embedder = PretrainedTransformerEmbedder(
File "/root/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 123, in __init__
tokenizer = PretrainedTransformerTokenizer(
File "/root/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 79, in __init__
self._reverse_engineer_special_tokens("a", "b", model_name, tokenizer_kwargs)
File "/root/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 112, in _reverse_engineer_special_tokens
dummy_output = tokenizer_with_special_tokens.encode_plus(
File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 430, in encode_plus
return self._encode_plus(
File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 639, in _encode_plus
batched_output = self._batch_encode_plus(
File "/root/anaconda3/envs/alenlayout/lib/python3.8/site-packages/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py", line 493, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
TypeError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]
The error occurs because a `boxes` argument was added as the second argument of the fast LayoutLMv2 tokenizer, which breaks the special-token reverse engineering in allennlp's pretrained_transformer_tokenizer.
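For reference, the failure can be reproduced by instantiating the mismatched embedder with a LayoutLMv2 checkpoint (the checkpoint name here is just an example):

```python
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder

# Constructing the embedder triggers the special-token reverse engineering,
# which calls encode_plus("a", "b", ...) on the underlying fast tokenizer.
embedder = PretrainedTransformerMismatchedEmbedder(
    model_name="microsoft/layoutlmv2-base-uncased"
)
# -> TypeError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]
```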
Describe the solution you'd like
Ideally, passing the arguments to tokenizer_with_special_tokens.encode_plus by keyword in pretrained_transformer_tokenizer should do the trick, but I'm afraid of repercussions on other tokenizers that have different argument names (those not based on BERT, maybe?).
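A minimal sketch of that change, based on the call in _reverse_engineer_special_tokens (the keyword names follow the standard Hugging Face encode_plus signature; the extra keyword arguments shown are illustrative):

```python
# Sketch only, not the actual allennlp code: pass the dummy tokens by keyword
# so that tokenizers whose positional parameters differ from the BERT-style
# signature (e.g. LayoutLMv2's, where `boxes` comes early) don't misread "b".
dummy_output = tokenizer_with_special_tokens.encode_plus(
    text="a",
    text_pair="b",
    add_special_tokens=True,
    return_token_type_ids=True,
    return_attention_mask=False,
)
```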
Moreover, since LayoutLMv2 adds a few extra inputs to the model (images and boxes), modifications would also be needed in _unfold_long_sequences, _fold_long_sequences, and forward of pretrained_transformer_embedder and pretrained_transformer_mismatched_embedder to account for the additional inputs.
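Just to illustrate the kind of change I have in mind for forward (a hypothetical sketch, not the current allennlp signature; bbox and image are the extra LayoutLMv2 inputs):

```python
# Hypothetical sketch: thread the extra LayoutLMv2 inputs through to the
# underlying Hugging Face model, whose forward() accepts `bbox` and `image`.
def forward(self, token_ids, mask, type_ids=None, bbox=None, image=None):
    parameters = {"input_ids": token_ids, "attention_mask": mask}
    if type_ids is not None:
        parameters["token_type_ids"] = type_ids
    if bbox is not None:
        # one bounding box per (sub)token, as expected by LayoutLMv2
        parameters["bbox"] = bbox
    if image is not None:
        # page image, used by LayoutLMv2's visual backbone
        parameters["image"] = image
    transformer_output = self.transformer_model(**parameters)
    return transformer_output.last_hidden_state
```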
If it's okay with you, I'd like to work on it.