-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor the Linker, Ranker, Recogniser and Pipeline #276
Comments
Related: in the Pipeline class there is some duplicated code. Specifically, the code inside # If the linking method is "reldisamb", rank and format candidates,
# and produce a prediction: to the end of the method, is almost identical to that in the The only difference (apart from post-processing being optional) is, in # Run entity linking per mention:
selected_cand = self.mylinker.run(
{
"candidates": wk_cands[mention["mention"]],
"place_wqid": place_wqid,
}
) versus, in # Run entity linking per mention:
selected_cand = self.mylinker.run(
{
"candidates": wk_cands[mention["mention"]],
"place_wqid": "",
}
) (Note that this difference only affects the "mostpopular" & "bydistance" linker methods, not the "reldisamb" method.) Similarly, the methods # Split the text into its sentences:
sentences = split_text_into_sentences(text, language="en")
document_dataset = []
for idx, sentence in enumerate(sentences):
# Get context (prev and next sentence)
context = ["", ""]
if idx - 1 >= 0:
context[0] = sentences[idx - 1]
if idx + 1 < len(sentences):
context[1] = sentences[idx + 1] Then |
Apart from avoiding code duplication, the aim in refactoring the Pipeline will be to make it a thin wrapper around the Ranker and Linker, rather than containing its own "business logic" as it currently does. It should be possible to run the components of the pipeline separately by calling methods on the Ranker and Linker, not the Pipeline, and get exactly the same results as running the full pipeline. Currently that is not possible. |
There may be a similar issue relating to candidate ranking: currently the A good start would be to have a subclass for each ranking method, to replace this block in the
This approach will also improve other |
There's another apparent issue relating to the way the Wikidata resource |
In progress on branch |
On hold, pending #278. A set of canonical resources is needed to achieve consistency of unit/integration test results, which is itself a prerequisite for refactoring. |
The Also, we have Subclasses will help here. |
Merged PR 270 into |
The optional |
TODO: update the notebooks in the |
Given the similarities between the Recogniser, Linker and Ranker, we should aim for a common interface (or at least more consistent). In all three cases the usage looks like:
Consistency can be achieved by having:
For example: in the case of the Note: the Once this is done, it should be possible to write the pipeline using a high level of abstraction (i.e. treating each of the three types identically, after instantiation). However it's only worth doing this if it provides a clear benefit. In any case, the Pipeline class should also have the same interface as the others. |
TODO: module names in Also the variable (and instance attribute) currently named |
TODO: rename the |
TODO: add guard clauses against empty string or |
TODO: create a Similar for Similar for |
TODO: in the Also, both So to tidy this up we shall:
|
TODO: currently the Ranker attaches Wikidata scores in Note: this clarifies an important point, namely, both the |
TODO: get rid of hard-coded selection of resources inside |
TODO: Split the |
TODO: implement and document dataclasses for Recogniser, Ranker, Linker |
Ranker class structure (see udpated version below):Key: see this link |
Pipeline refactoringFor discussion at the PR review stage: these lines in if not len(m["mention"]) == 1 and not m["mention"].islower():
mentions_dataset.append(prediction) have the effect of filtering out (i.e. discarding) any toponym mentions made up of only lowercase characters. Given the issues with OCR quality, I don't think this is desired behaviour. Therefore this filter has been removed in the refactored pipeline. Note: this change impacts an assertion in the |
Example notebooksAll have been updated to work with the refactored codebase, with the exception of:
These notebooks are giving the same errors on the |
Linker refactoringUnlike the other components of the pipeline, the Linker has an extra "business logic" method named The reason for this is that the Previously the disambiguation step was (in the case of a REL model) being done inside the Pipeline, not the Linker. But this meant wasn't ideal because: i) the Pipeline included disambiguation business logic and ii) it had to discriminate between the different linking methods. By moving this logic into the Linker subclasses, both of these problems are avoided. |
Recogniser class structure:The def run(self, sentence: str) -> SentenceMentions: replaces the previous def ner_predict(self, sentence: str) -> List[dict]: |
Ranker class structure:The def run(self, mention: Mention) -> CandidateMatches: replaces the previous methods: def perfect_match(self, queries: List[str]) -> Tuple[dict, dict]: def partial_match(self, queries: List[str], damlev: bool) -> Tuple[dict, dict]: def deezy_on_the_fly(self, queries: List[str]) -> Tuple[dict, dict]: |
Linker class structure:The def run(
self,
matches: CandidateMatches,
place_of_pub_wqid: Optional[str]=None,
place_of_pub: Optional[str]=None,
) -> MentionCandidates: replaces previous methods: def most_popular(self, dict_mention: dict) -> Tuple[str, float, dict]: def by_distance(self, dict_mention: dict, origin_wqid: Optional[str] = "") -> Tuple[str, float, dict]: The def disambiguate(self, candidates: List[SentenceCandidates]) -> Predictions: replacing, in the case of the def rank_candidates(rel_json: dict, wk_cands: dict) -> dict: |
We should consider implementing the different linking algorithms in subclasses of
Linker
.This would avoid redundant logic based on String parameters like:
Instead the
run
method would be implemented differently in two subclasses:MostPopularLinker
andByDistanceLinker
.Also the
train_load_model
, which only makes sense if themethod
isreldisamb
, would then live in (another) subclassRelLinker
.Ideally we would then improve the conditional logic in the
run_disambiguation
method inpipeline.py
, so that the three linker methods (reldisamb
,mostpopular
andbydistance
) are treated consistently.Main tasks:
toponym_resolution.py
module and test adeezymatch
experimentrecogniser.py
toner.py
for consistency with ranker and linker.pipeline.py
andrel_utils.py
andner.py
)Timeline estimate
PR ready for review by 28th Nov+2 weeks for review and update docsThe text was updated successfully, but these errors were encountered: