
Refactor the Linker, Ranker, Recogniser and Pipeline #276

Open
13 of 15 tasks
thobson88 opened this issue Jul 2, 2024 · 28 comments · May be fixed by #282
thobson88 commented Jul 2, 2024

We should consider implementing the different linking algorithms in subclasses of Linker.

This would avoid repetitive dispatch logic based on string parameters, like:

def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
    if self.method == "mostpopular":
        return self.most_popular(dict_mention)

    if self.method == "bydistance":
        return self.by_distance(dict_mention)
    
    raise SyntaxError(f"Unknown method provided: {self.method}")

Instead the run method would be implemented differently in two subclasses: MostPopularLinker and ByDistanceLinker.

Also the train_load_model method, which only makes sense when the method is reldisamb, would then live in a further subclass, RelLinker.

Ideally we would then improve the conditional logic in the run_disambiguation method in pipeline.py, so that the three linker methods (reldisamb, mostpopular and bydistance) are treated consistently.
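
A minimal sketch of that hierarchy, assuming the existing method bodies move unchanged into the new run methods (only the class and method names mentioned above come from the codebase; everything else is illustrative):

from abc import ABC, abstractmethod
from typing import Tuple


class Linker(ABC):
    """Shared resources and helpers; no string-based dispatch."""

    @abstractmethod
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...


class MostPopularLinker(Linker):
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...  # former most_popular logic


class ByDistanceLinker(Linker):
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...  # former by_distance logic


class RelLinker(Linker):
    def train_load_model(self) -> None:
        ...  # only meaningful for reldisamb, so it lives here

    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...  # former reldisamb logic

The pipeline then instantiates the appropriate subclass once, and no run method ever needs to inspect a method string.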

Main tasks:

  • get all tests to pass (unit and integration tests)
  • create dataclasses
  • refactor Linker into subclasses
  • refactor Ranker into subclasses
  • refactor Recogniser into subclasses
  • refactor Pipeline to avoid code duplication and make it a thin wrapper around the other three classes
  • move to a common interface for Linker, Ranker, Recogniser and Pipeline
  • update the toponym_resolution.py module and test a deezymatch experiment
  • rename recogniser.py to ner.py for consistency with ranker and linker.
  • update the example notebooks
  • update the app (i.e. the T-Res API)
  • remove deprecated code (from pipeline.py and rel_utils.py and ner.py)
  • open PR #282 (Refactor the Linker, Ranker, Recogniser and Pipeline)
  • update docstrings & comments
  • update the docs

Timeline estimate

  • PR ready for review by 28th Nov
  • +2 weeks for review and update docs
  • code & docs ready and reviewed by end of Dec
  • meeting with Mariona on Dec 19th
thobson88 commented Jul 4, 2024

Related: in the Pipeline class there is some duplicated code.

Specifically, the code inside run_sentence from the comment:

        # If the linking method is "reldisamb", rank and format candidates,
        # and produce a prediction:

to the end of the method, is almost identical to that in the run_disambiguation method from the same comment onwards.

The only difference (apart from post-processing being optional) is, in run_sentence:

                # Run entity linking per mention:
                selected_cand = self.mylinker.run(
                    {
                        "candidates": wk_cands[mention["mention"]],
                        "place_wqid": place_wqid,
                    }
                )

versus, in run_disambiguation:

                # Run entity linking per mention:
                selected_cand = self.mylinker.run(
                    {
                        "candidates": wk_cands[mention["mention"]],
                        "place_wqid": "",
                    }
                )

(Note that this difference only affects the "mostpopular" & "bydistance" linker methods, not the "reldisamb" method.)


Similarly, the methods run_text and run_text_recognition contain some duplicated logic. Both start with this:

        # Split the text into its sentences:
        sentences = split_text_into_sentences(text, language="en")

        document_dataset = []
        for idx, sentence in enumerate(sentences):
            # Get context (prev and next sentence)
            context = ["", ""]
            if idx - 1 >= 0:
                context[0] = sentences[idx - 1]
            if idx + 1 < len(sentences):
                context[1] = sentences[idx + 1]

Then run_text_recognition calls run_sentence_recognition, whereas run_text calls run_sentence; but run_sentence itself begins with a call to run_sentence_recognition. So the logic is semi-duplicated, with some (potentially important) differences. A shared helper could factor out the common loop, as sketched below.
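
For illustration (a sketch; the helper name iter_sentences_with_context is hypothetical, and the import assumes split_text_into_sentences comes from the sentence_splitter package):

from typing import Iterator, List, Tuple

from sentence_splitter import split_text_into_sentences


def iter_sentences_with_context(text: str) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (index, sentence, [prev, next]) for each sentence in the text."""
    sentences = split_text_into_sentences(text, language="en")
    for idx, sentence in enumerate(sentences):
        # Get context (prev and next sentence):
        context = ["", ""]
        if idx - 1 >= 0:
            context[0] = sentences[idx - 1]
        if idx + 1 < len(sentences):
            context[1] = sentences[idx + 1]
        yield idx, sentence, context

Both run_text and run_text_recognition could then iterate over this generator rather than repeating the loop.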

@thobson88

Apart from avoiding code duplication, the aim in refactoring the Pipeline will be to make it a thin wrapper around the Ranker and Linker, rather than containing its own "business logic" as it currently does.

It should be possible to run the components of the pipeline separately by calling methods on the Ranker and Linker, not the Pipeline, and get exactly the same results as running the full pipeline. Currently that is not possible.

thobson88 commented Jul 8, 2024

There may be a similar issue relating to candidate ranking: currently the Ranker class is used when ranking with DeezyMatch, but when the linker method is reldisamb an additional ranking step is applied in the Pipeline class (here and here, which also duplicate code).

A good start would be to have a subclass for each ranking method, to replace this block in the run method:

        if self.method == "perfectmatch":
            return self.perfect_match(queries)
        if self.method == "partialmatch":
            return self.partial_match(queries, damlev=False)
        if self.method == "levenshtein":
            return self.partial_match(queries, damlev=True)
        if self.method == "deezymatch":
            return self.deezy_on_the_fly(queries)
        raise SyntaxError(f"Unknown method: {self.method}")

This approach will also improve other Ranker methods, such as train(), which currently does nothing unless the ranking method is deezymatch. After refactoring into subclasses, this method will only appear on the DeezyMatchRanker subclass.
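
As with the Linker, a sketch of the subclass structure that would replace this block (train appearing only on the DeezyMatch subclass; names other than those above are illustrative):

from abc import ABC, abstractmethod
from typing import List, Tuple


class Ranker(ABC):
    @abstractmethod
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...


class PerfectMatchRanker(Ranker):
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former perfect_match logic


class PartialMatchRanker(Ranker):
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former partial_match(queries, damlev=False) logic


class LevenshteinRanker(Ranker):
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former partial_match(queries, damlev=True) logic


class DeezyMatchRanker(Ranker):
    def train(self) -> None:
        ...  # only DeezyMatch ranking requires training

    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former deezy_on_the_fly logic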

@thobson88

There's another apparent issue relating to the way the Wikidata resource mentions_to_wikidata.json is used in the mostpopular linking method. See this comment for details.

thobson88 self-assigned this Sep 5, 2024
@thobson88

In progress on branch 276-refactor.

thobson88 commented Oct 17, 2024

On hold, pending #278. A set of canonical resources is needed to achieve consistency of unit/integration test results, which is itself a prerequisite for refactoring.

thobson88 reopened this Oct 18, 2024
thobson88 changed the title "Refactor the Linker" to "Refactor the Linker, Ranker and Pipeline" Oct 30, 2024
thobson88 changed the title "Refactor the Linker, Ranker and Pipeline" to "Refactor the Linker, Ranker, Recogniser and Pipeline" Oct 30, 2024
@thobson88

The Recogniser class has a create_pipeline method that creates a (HuggingFace) transformers.Pipeline object and assigns it to self.pipe. It then also returns self.pipe, and when create_pipeline is called in the Pipeline constructor, the returned value is assigned again, to self.myner.pipe.

Also, we have model_path & base_model attributes and a load_from_hub optional boolean attribute which, if True, makes those other attributes - and the train method - redundant (because in that case a pre-trained model is loaded from the HuggingFace model Hub).

Subclasses will help here.

@thobson88

Merged PR 270 into dev and rebased 276-refactor on top.

@thobson88

The optional already_collected_cands argument to the Ranker constructor has default value dict(), but when running all tests there are occasions where this attribute ends up with a non-empty dictionary, causing tests to fail. The temporary fix for this has been to pass an empty dict explicitly to the constructor, but this should be made more robust.
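
The likely root cause is Python's mutable default argument: dict() in the signature is evaluated once, when the function is defined, so the same dictionary object can be shared, and mutated, across Ranker instances constructed with the default. A sketch of the usual fix:

from typing import Optional


class Ranker:
    def __init__(self, already_collected_cands: Optional[dict] = None):
        # Create a fresh dict per instance instead of sharing one default:
        self.already_collected_cands = (
            already_collected_cands if already_collected_cands is not None else {}
        )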

@thobson88

TODO: update the notebooks in the examples folder.

thobson88 commented Nov 6, 2024

Given the similarities between the Recogniser, Linker and Ranker, we should aim for a common interface (or at least a more consistent one). In all three cases the usage looks like:

  • instantiate
  • load resources (perhaps including training a model, depending on the subclass type and specified parameters)
  • execute the main process.

Consistency can be achieved by having:

  • choice of method based on instantiation of a particular subclass
  • common naming convention for the "load resources" method which prepares for execution (e.g. load)
  • common naming convention for the execution method (e.g. run)

For example: in the case of the Linker, the new load method will replace (or incorporate) both the existing load_resources and train_load_model methods.

Note: the load method should not return anything, but should assign an instance attribute (e.g. self.pipe in the case of the Recogniser). Then the run method can check that the attribute is assigned and, if not, either raise an error or make a call to load.

Once this is done, it should be possible to write the pipeline using a high level of abstraction (i.e. treating each of the three types identically, after instantiation). However it's only worth doing this if it provides a clear benefit.

In any case, the Pipeline class should also have the same interface as the others.
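
A sketch of that common interface (the base class name Component is hypothetical):

from abc import ABC, abstractmethod


class Component(ABC):
    """Hypothetical common base for the Recogniser, Ranker and Linker."""

    @abstractmethod
    def load(self) -> None:
        """Prepare for execution: load resources (training a model where
        applicable) and assign instance attributes; return nothing."""

    @abstractmethod
    def run(self, *args, **kwargs):
        """Execute the main process, first checking that the attributes
        assigned by load() exist (raising, or calling load(), if not)."""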

thobson88 commented Nov 6, 2024

TODO: module names in t_res/geoparser. Currently they are linking.py, ranking.py and recogniser.py. For consistency, either the last should be recognising.py (or, less clumsily, ner.py), or the first two should be linker.py and ranker.py.

Also the variable (and instance attribute) currently named myner (or just ner) ought to be recogniser (as the others are linker and ranker).

thobson88 commented Nov 6, 2024

TODO: rename the utils/ner.py module as utils/ner_utils.py (consistent with rel_utils.py) to avoid the name conflict with the ner instance attribute in the Pipeline class. Then rename the attributes: myner -> ner, myranker -> ranker and mylinker -> linker.

@thobson88

TODO: add guard clauses against empty string or None in the resources_path passed to Ranker and Linker constructors. Then simplify the default instantiations in the Pipeline constructor.
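
A sketch of such a guard clause (shown for the Ranker; the Linker constructor would be analogous):

class Ranker:
    def __init__(self, resources_path: str):
        # Guard against None or the empty string before any path handling:
        if not resources_path:
            raise ValueError("resources_path must be a non-empty string")
        self.resources_path = resources_path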

thobson88 commented Nov 6, 2024

TODO: create a Prediction class to represent a prediction from the T-Res pipeline (currently this is just a dictionary) and move the format_prediction method out of the Pipeline class.

Similarly for a Mention class to be returned by the Recogniser, and a Candidate class to be returned by the Ranker find_candidates method.
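
A sketch of what these could look like (all field names are illustrative rather than a final design):

from dataclasses import dataclass


@dataclass
class Mention:
    """Returned by the Recogniser."""
    mention: str
    start: int
    end: int
    ner_label: str


@dataclass
class Candidate:
    """Returned by the Ranker's candidate selection."""
    variation: str
    score: float


@dataclass
class Prediction:
    """A prediction from the T-Res pipeline."""
    mention: str
    wikidata_id: str
    score: float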

thobson88 commented Nov 6, 2024

TODO: in the Ranker, find_candidates is a wrapper around the run method. This confuses the interface, as it's unclear when to use each of these methods. For instance, find_candidates is called in the API's run_candidate_selection endpoint, but it essentially just calls run and then does some post-processing to add the Wikidata candidates.

Also, both run and find_candidates return already_collected_cands, but the caller typically ignores it; when it isn't ignored, it's only used to assign a Ranker attribute, which could be done inside the method anyway. (In fact it's the instance attribute self.already_collected_cands that gets returned, which is redundant because the caller could simply access that attribute after the call.)

So to tidy this up we shall (see the sketch after this list):

  • return a new type Candidate from the run method, that will overcome the distinction between cands and wk_cands in find_candidates by handling the Wikidata logic internally
  • get rid of the find_candidates method, as the above makes it redundant, and only use run
  • stop returning already_collected_cands (in all cases) and instead just assign to the instance attribute
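
A sketch of the tidied-up interface (the Candidate type is as proposed in the dataclasses comment above; the exact return structure is illustrative):

from typing import Dict, List


class Ranker:
    def __init__(self):
        self.already_collected_cands: dict = {}

    def run(self, queries: List[str]) -> Dict[str, List["Candidate"]]:
        """Return ranked candidates per query, handling the Wikidata logic
        internally, and cache results on self.already_collected_cands
        instead of returning the cache to the caller."""
        ...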

thobson88 commented Nov 9, 2024

TODO: currently the Ranker attaches Wikidata scores in find_candidates, but the Ranker's job is to rank string mentions and should be independent of Wikidata (which is used for linking to a Wikidata ID, a completely different "ranking"). So the Ranker should never see the Wikidata resources, instead that logic belongs in the Linker.

Note: this clarifies an important point, namely, both the Ranker and Linker perform a ranking process and pick the best candidate. The Ranker ranks string variations according to the quality of the string match (using methods such as perfect match, Levenshtein, DeezyMatch) while the Linker ranks Wikidata IDs according to some linking metric (such as most popular page in Wikidata, physical proximity, or REL score).

thobson88 commented Nov 9, 2024

TODO: get rid of hard-coded selection of resources inside load_resources methods, e.g. in the Ranker which picks the normalized mentions as opposed to the absolute values.

@thobson88

TODO: Split the DeezyMatchRanker into two subclasses: Pretrained and Custom. Only the latter needs training parameters and resources. The former just needs the path to the trained model.
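
A sketch of the split (constructor parameters are illustrative):

class DeezyMatchRanker:
    """Stub standing in for the existing DeezyMatch ranking logic."""


class PretrainedDeezyMatchRanker(DeezyMatchRanker):
    def __init__(self, model_path: str):
        # Just needs the path to the trained model:
        self.model_path = model_path


class CustomDeezyMatchRanker(DeezyMatchRanker):
    def __init__(self, model_path: str, training_params: dict, resources_path: str):
        # Only the custom variant needs training parameters and resources:
        self.model_path = model_path
        self.training_params = training_params
        self.resources_path = resources_path

    def train(self) -> None:
        ...  # train a DeezyMatch model using self.training_params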

@thobson88

TODO: implement and document dataclasses for Recogniser, Ranker, Linker run method output types. This is a big change so there's a separate ticket for the dataclass structure.

thobson88 commented Nov 21, 2024

Ranker class structure (see updated version below):

[Diagram: T-Res-ranker-classes.drawio]

Key: see this link

@thobson88

Pipeline refactoring

For discussion at the PR review stage: these lines in pipeline.py:

                if not len(m["mention"]) == 1 and not m["mention"].islower():
                    mentions_dataset.append(prediction)

have the effect of filtering out (i.e. discarding) any single-character mentions and any toponym mentions made up entirely of lowercase characters.

Given the issues with OCR quality, I don't think this is desired behaviour. Therefore this filter has been removed in the refactored pipeline.

Note: this change impacts an assertion in the test_modular_deezy_rel test case inside test_pipeline.py.

@thobson88

Example notebooks

All have been updated to work with the refactored codebase, with the exception of:

  • train_use_deezy_model_1.ipynb
  • train_use_deezy_model_2.ipynb
  • train_use_ner_model.ipynb

These notebooks are giving the same errors on the dev branch, so are not broken as a result of the refactoring. Therefore I'm leaving them aside for now.

@thobson88

Linker refactoring

Unlike the other components of the pipeline, the Linker has an extra "business logic" method named disambiguate, in addition to the usual run method.

The reason for this is that the run method identifies candidate links for a specific mention whereas, in general, the disambiguation step requires access to all of the candidates/predictions (for the whole text) at the same time. So the inputs to these two operations are quite different.

Previously the disambiguation step was (in the case of a REL model) done inside the Pipeline, not the Linker. But this wasn't ideal because: i) the Pipeline included disambiguation business logic and ii) it had to discriminate between the different linking methods. By moving this logic into the Linker subclasses, both of these problems are avoided.
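
In outline, the resulting two-phase usage looks like this (a sketch; variable names are illustrative, and the signatures are those shown in the class-structure comments below):

# Phase 1: run() identifies candidate links one mention at a time:
mention_cands = [linker.run(matches) for matches in all_candidate_matches]

# Phase 2: disambiguate() needs the candidates for the whole text at once,
# so it is a separate method:
predictions = linker.disambiguate(sentence_candidates)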

thobson88 commented Dec 19, 2024

Recogniser class structure:

[Diagram: T-Res-recogniser-classes.drawio]

The run method:

def run(self, sentence: str) -> SentenceMentions:

replaces the previous ner_predict method:

def ner_predict(self, sentence: str) -> List[dict]:

thobson88 commented Dec 19, 2024

Ranker class structure:

[Diagram: T-Res-ranker-classes.drawio]

The run method (overridden in each subclass):

def run(self, mention: Mention) -> CandidateMatches:

replaces the previous methods:

def perfect_match(self, queries: List[str]) -> Tuple[dict, dict]:
def partial_match(self, queries: List[str], damlev: bool) -> Tuple[dict, dict]:
def deezy_on_the_fly(self, queries: List[str]) -> Tuple[dict, dict]:

thobson88 commented Dec 19, 2024

Linker class structure:

[Diagram: T-Res-linker-classes.drawio]

The run method:

def run(
    self,
    matches: CandidateMatches,
    place_of_pub_wqid: Optional[str] = None,
    place_of_pub: Optional[str] = None,
) -> MentionCandidates:

replaces previous methods:

def most_popular(self, dict_mention: dict) -> Tuple[str, float, dict]:
def by_distance(self, dict_mention: dict, origin_wqid: Optional[str] = "") -> Tuple[str, float, dict]:

The disambiguate and disambiguation_scores methods (overridden in each subclass) compute disambiguation scores:

def disambiguate(self, candidates: List[SentenceCandidates]) -> Predictions:

replacing, in the case of the reldisamb linking method, logic that was in the Pipeline and in the rank_candidates function in rel_utils.py:

def rank_candidates(rel_json: dict, wk_cands: dict) -> dict:

@thobson88

Pipeline class structure:

[Diagram: T-Res-pipeline-classes.drawio]
