
Refactor the Linker, Ranker, Recogniser and Pipeline #276

Open
13 of 15 tasks
thobson88 opened this issue Jul 2, 2024 · 28 comments · May be fixed by #282
thobson88 commented Jul 2, 2024

We should consider implementing the different linking algorithms in subclasses of Linker.

This would avoid repetitive dispatch logic based on string parameters, like:

def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
    if self.method == "mostpopular":
        return self.most_popular(dict_mention)

    if self.method == "bydistance":
        return self.by_distance(dict_mention)
    
    raise SyntaxError(f"Unknown method provided: {self.method}")

Instead the run method would be implemented differently in two subclasses: MostPopularLinker and ByDistanceLinker.

Also the train_load_model method, which only makes sense when the method is reldisamb, would then live in a further subclass, RelLinker.

Ideally we would then improve the conditional logic in the run_disambiguation method in pipeline.py, so that the three linker methods (reldisamb, mostpopular and bydistance) are treated consistently.
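
A minimal sketch of that hierarchy, assuming the existing method bodies move unchanged into the new run methods (only the class and method names mentioned above come from the codebase; everything else is illustrative):

from abc import ABC, abstractmethod
from typing import Tuple


class Linker(ABC):
    """Shared resources and helpers; no string-based dispatch."""

    @abstractmethod
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...


class MostPopularLinker(Linker):
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...  # former most_popular logic


class ByDistanceLinker(Linker):
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...  # former by_distance logic


class RelLinker(Linker):
    def train_load_model(self) -> None:
        ...  # only meaningful for reldisamb, so it lives here

    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...  # former reldisamb logic

The pipeline then instantiates the appropriate subclass once, and no run method ever needs to inspect a method string.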

Main tasks:

  • get all tests to pass (unit and integration tests)
  • create dataclasses
  • refactor Linker into subclasses
  • refactor Ranker into subclasses
  • refactor Recogniser into subclasses
  • refactor Pipeline to avoid code duplication and make it a thin wrapper around the other three classes
  • move to a common interface for Linker, Ranker, Recogniser and Pipeline
  • update the toponym_resolution.py module and test a deezymatch experiment
  • rename recogniser.py to ner.py for consistency with ranker and linker.
  • update the example notebooks
  • update the app (i.e. the T-Res API)
  • remove deprecated code (from pipeline.py and rel_utils.py and ner.py)
  • open PR #282 (Refactor the Linker, Ranker, Recogniser and Pipeline)
  • update docstrings & comments
  • update the docs

Timeline estimate

  • PR ready for review by 28th Nov
  • +2 weeks for review and update docs
  • code & docs ready and reviewed by end of Dec
  • meeting with Mariona on Dec 19th
thobson88 commented Jul 4, 2024

Related: in the Pipeline class there is some duplicated code.

Specifically, the code inside run_sentence from the comment:

        # If the linking method is "reldisamb", rank and format candidates,
        # and produce a prediction:

to the end of the method, is almost identical to that in the run_disambiguation method from the same comment onwards.

The only difference (apart from post-processing being optional) is, in run_sentence:

                # Run entity linking per mention:
                selected_cand = self.mylinker.run(
                    {
                        "candidates": wk_cands[mention["mention"]],
                        "place_wqid": place_wqid,
                    }
                )

versus, in run_disambiguation:

                # Run entity linking per mention:
                selected_cand = self.mylinker.run(
                    {
                        "candidates": wk_cands[mention["mention"]],
                        "place_wqid": "",
                    }
                )

(Note that this difference only affects the "mostpopular" & "bydistance" linker methods, not the "reldisamb" method.)


Similarly, the methods run_text and run_text_recognition contain some duplicated logic. Both start with this:

        # Split the text into its sentences:
        sentences = split_text_into_sentences(text, language="en")

        document_dataset = []
        for idx, sentence in enumerate(sentences):
            # Get context (prev and next sentence)
            context = ["", ""]
            if idx - 1 >= 0:
                context[0] = sentences[idx - 1]
            if idx + 1 < len(sentences):
                context[1] = sentences[idx + 1]

Then run_text_recognition calls run_sentence_recognition, whereas run_text calls run_sentence; but run_sentence itself begins with a call to run_sentence_recognition. So the logic is semi-duplicated, with some (potentially important) differences. A shared helper could factor out the common loop, as sketched below.
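
For illustration (a sketch; the helper name iter_sentences_with_context is hypothetical, and the import assumes split_text_into_sentences comes from the sentence_splitter package):

from typing import Iterator, List, Tuple

from sentence_splitter import split_text_into_sentences


def iter_sentences_with_context(text: str) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (index, sentence, [prev, next]) for each sentence in the text."""
    sentences = split_text_into_sentences(text, language="en")
    for idx, sentence in enumerate(sentences):
        # Get context (prev and next sentence):
        context = ["", ""]
        if idx - 1 >= 0:
            context[0] = sentences[idx - 1]
        if idx + 1 < len(sentences):
            context[1] = sentences[idx + 1]
        yield idx, sentence, context

Both run_text and run_text_recognition could then iterate over this generator rather than repeating the loop.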

@thobson88

Apart from avoiding code duplication, the aim in refactoring the Pipeline will be to make it a thin wrapper around the Ranker and Linker, rather than containing its own "business logic" as it currently does.

It should be possible to run the components of the pipeline separately by calling methods on the Ranker and Linker, not the Pipeline, and get exactly the same results as running the full pipeline. Currently that is not possible.

thobson88 commented Jul 8, 2024

There may be a similar issue relating to candidate ranking: currently the Ranker class is used when ranking with DeezyMatch, but when the linker method is reldisamb an additional ranking step is applied in the Pipeline class (here and here, which also duplicate code).

A good start would be to have a subclass for each ranking method, to replace this block in the run method:

        if self.method == "perfectmatch":
            return self.perfect_match(queries)
        if self.method == "partialmatch":
            return self.partial_match(queries, damlev=False)
        if self.method == "levenshtein":
            return self.partial_match(queries, damlev=True)
        if self.method == "deezymatch":
            return self.deezy_on_the_fly(queries)
        raise SyntaxError(f"Unknown method: {self.method}")

This approach will also improve other Ranker methods, such as train(), which currently does nothing unless the ranking method is deezymatch. After refactoring into subclasses, this method will only appear on the DeezyMatchRanker subclass.
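
As with the Linker, a sketch of the subclass structure that would replace this block (train appearing only on the DeezyMatch subclass; names other than those above are illustrative):

from abc import ABC, abstractmethod
from typing import List, Tuple


class Ranker(ABC):
    @abstractmethod
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...


class PerfectMatchRanker(Ranker):
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former perfect_match logic


class PartialMatchRanker(Ranker):
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former partial_match(queries, damlev=False) logic


class LevenshteinRanker(Ranker):
    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former partial_match(queries, damlev=True) logic


class DeezyMatchRanker(Ranker):
    def train(self) -> None:
        ...  # only DeezyMatch ranking requires training

    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        ...  # former deezy_on_the_fly logic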

@thobson88

There's another apparent issue relating to the way the Wikidata resource mentions_to_wikidata.json is used in the mostpopular linking method. See this comment for details.

thobson88 self-assigned this Sep 5, 2024
@thobson88

In progress on branch 276-refactor.

thobson88 commented Oct 17, 2024

On hold, pending #278. A set of canonical resources is needed to achieve consistency of unit/integration test results, which is itself a prerequisite for refactoring.

thobson88 reopened this Oct 18, 2024
thobson88 changed the title "Refactor the Linker" to "Refactor the Linker, Ranker and Pipeline" Oct 30, 2024
thobson88 changed the title "Refactor the Linker, Ranker and Pipeline" to "Refactor the Linker, Ranker, Recogniser and Pipeline" Oct 30, 2024
@thobson88

The Recogniser class has a create_pipeline method that creates a (HuggingFace) transformers.Pipeline object and assigns it to self.pipe. It then also returns self.pipe, and when create_pipeline is called in the Pipeline constructor, the returned value is assigned again, to self.myner.pipe.

Also, we have model_path & base_model attributes and a load_from_hub optional boolean attribute which, if True, makes those other attributes - and the train method - redundant (because in that case a pre-trained model is loaded from the HuggingFace model Hub).

Subclasses will help here.

@thobson88

Merged PR 270 into dev and rebased 276-refactor on top.

@thobson88

The optional already_collected_cands argument to the Ranker constructor has default value dict(), but when running all tests there are occasions where this attribute ends up with a non-empty dictionary, causing tests to fail. The temporary fix for this has been to pass an empty dict explicitly to the constructor, but this should be made more robust.
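
The likely root cause is Python's mutable default argument: dict() in the signature is evaluated once, when the function is defined, so the same dictionary object can be shared, and mutated, across Ranker instances constructed with the default. A sketch of the usual fix:

from typing import Optional


class Ranker:
    def __init__(self, already_collected_cands: Optional[dict] = None):
        # Create a fresh dict per instance instead of sharing one default:
        self.already_collected_cands = (
            already_collected_cands if already_collected_cands is not None else {}
        )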

@thobson88

TODO: update the notebooks in the examples folder.

thobson88 commented Nov 6, 2024

Given the similarities between the Recogniser, Linker and Ranker, we should aim for a common interface (or at least a more consistent one). In all three cases the usage looks like:

  • instantiate
  • load resources (perhaps including training a model, depending on the subclass type and specified parameters)
  • execute the main process.

Consistency can be achieved by having:

  • choice of method based on instantiation of a particular subclass
  • common naming convention for the "load resources" method which prepares for execution (e.g. load)
  • common naming convention for the execution method (e.g. run)

For example: in the case of the Linker, the new load method will replace (or incorporate) both the existing load_resources and train_load_model methods.

Note: the load method should not return anything, but should assign an instance attribute (e.g. self.pipe in the case of the Recogniser). Then the run method can check that the attribute is assigned and, if not, either raise an error or make a call to load.

Once this is done, it should be possible to write the pipeline using a high level of abstraction (i.e. treating each of the three types identically, after instantiation). However it's only worth doing this if it provides a clear benefit.

In any case, the Pipeline class should also have the same interface as the others.
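
A sketch of that common interface (the base class name Component is hypothetical):

from abc import ABC, abstractmethod


class Component(ABC):
    """Hypothetical common base for the Recogniser, Ranker and Linker."""

    @abstractmethod
    def load(self) -> None:
        """Prepare for execution: load resources (training a model where
        applicable) and assign instance attributes; return nothing."""

    @abstractmethod
    def run(self, *args, **kwargs):
        """Execute the main process, first checking that the attributes
        assigned by load() exist (raising, or calling load(), if not)."""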

thobson88 commented Nov 6, 2024

TODO: module names in t_res/geoparser. Currently they are linking.py, ranking.py and recogniser.py. For consistency, either the last should be recognising.py (or, less clumsily, ner.py), or the first two should be linker.py and ranker.py.

Also the variable (and instance attribute) currently named myner (or just ner) ought to be recogniser (as the others are linker and ranker).

thobson88 commented Nov 6, 2024

TODO: rename the utils/ner.py module as utils/ner_utils.py (consistent with rel_utils.py) to avoid the name conflict with the ner instance attribute in the Pipeline class. Then rename the attributes: myner -> ner, myranker -> ranker and mylinker -> linker.

@thobson88

TODO: add guard clauses against empty string or None in the resources_path passed to Ranker and Linker constructors. Then simplify the default instantiations in the Pipeline constructor.
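
A sketch of such a guard clause (shown for the Ranker; the Linker constructor would be analogous):

class Ranker:
    def __init__(self, resources_path: str):
        # Guard against None or the empty string before any path handling:
        if not resources_path:
            raise ValueError("resources_path must be a non-empty string")
        self.resources_path = resources_path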

thobson88 commented Nov 6, 2024

TODO: create a Prediction class to represent a prediction from the T-Res pipeline (currently this is just a dictionary) and move the format_prediction method out of the Pipeline class.

Similarly for a Mention class to be returned by the Recogniser, and a Candidate class to be returned by the Ranker find_candidates method.
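
A sketch of what these could look like (all field names are illustrative rather than a final design):

from dataclasses import dataclass


@dataclass
class Mention:
    """Returned by the Recogniser."""
    mention: str
    start: int
    end: int
    ner_label: str


@dataclass
class Candidate:
    """Returned by the Ranker's candidate selection."""
    variation: str
    score: float


@dataclass
class Prediction:
    """A prediction from the T-Res pipeline."""
    mention: str
    wikidata_id: str
    score: float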

thobson88 commented Nov 6, 2024

TODO: in the Ranker, find_candidates is a wrapper around the run method. This confuses the interface, as it's unclear when to use each of these methods. For instance, find_candidates is called in the API's run_candidate_selection endpoint, but it essentially just calls run and then does some post-processing to add the Wikidata candidates.

Also, both run and find_candidates return already_collected_cands, but the caller typically ignores it; when it isn't ignored, it's only used to assign a Ranker attribute, which could be done inside the method anyway. (In fact it's the instance attribute self.already_collected_cands that gets returned, which is redundant because the caller could simply access that attribute after the call.)

So to tidy this up we shall (see the sketch after this list):

  • return a new type Candidate from the run method, that will overcome the distinction between cands and wk_cands in find_candidates by handling the Wikidata logic internally
  • get rid of the find_candidates method, as the above makes it redundant, and only use run
  • stop returning already_collected_cands (in all cases) and instead just assign to the instance attribute
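
A sketch of the tidied-up interface (the Candidate type is as proposed in the dataclasses comment above; the exact return structure is illustrative):

from typing import Dict, List


class Ranker:
    def __init__(self):
        self.already_collected_cands: dict = {}

    def run(self, queries: List[str]) -> Dict[str, List["Candidate"]]:
        """Return ranked candidates per query, handling the Wikidata logic
        internally, and cache results on self.already_collected_cands
        instead of returning the cache to the caller."""
        ...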

thobson88 commented Nov 9, 2024

TODO: currently the Ranker attaches Wikidata scores in find_candidates, but the Ranker's job is to rank string mentions and should be independent of Wikidata (which is used for linking to a Wikidata ID, a completely different "ranking"). So the Ranker should never see the Wikidata resources, instead that logic belongs in the Linker.

Note: this clarifies an important point, namely, both the Ranker and Linker perform a ranking process and pick the best candidate. The Ranker ranks string variations according to the quality of the string match (using methods such as perfect match, Levenshtein, DeezyMatch) while the Linker ranks Wikidata IDs according to some linking metric (such as most popular page in Wikidata, physical proximity, or REL score).

thobson88 commented Nov 9, 2024

TODO: get rid of hard-coded selection of resources inside load_resources methods, e.g. in the Ranker which picks the normalized mentions as opposed to the absolute values.

@thobson88

TODO: Split the DeezyMatchRanker into two subclasses: Pretrained and Custom. Only the latter needs training parameters and resources. The former just needs the path to the trained model.
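
A sketch of the split (constructor parameters are illustrative):

class DeezyMatchRanker:
    """Stub standing in for the existing DeezyMatch ranking logic."""


class PretrainedDeezyMatchRanker(DeezyMatchRanker):
    def __init__(self, model_path: str):
        # Just needs the path to the trained model:
        self.model_path = model_path


class CustomDeezyMatchRanker(DeezyMatchRanker):
    def __init__(self, model_path: str, training_params: dict, resources_path: str):
        # Only the custom variant needs training parameters and resources:
        self.model_path = model_path
        self.training_params = training_params
        self.resources_path = resources_path

    def train(self) -> None:
        ...  # train a DeezyMatch model using self.training_params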

@thobson88

TODO: implement and document dataclasses for Recogniser, Ranker, Linker run method output types. This is a big change so there's a separate ticket for the dataclass structure.

thobson88 commented Nov 21, 2024

Ranker class structure (see updated version below):

[Diagram: T-Res-ranker-classes.drawio]

Key: see this link

@thobson88

Pipeline refactoring

For discussion at the PR review stage: these lines in pipeline.py:

                if not len(m["mention"]) == 1 and not m["mention"].islower():
                    mentions_dataset.append(prediction)

have the effect of filtering out (i.e. discarding) any single-character mentions and any toponym mentions made up entirely of lowercase characters.

Given the issues with OCR quality, I don't think this is desired behaviour. Therefore this filter has been removed in the refactored pipeline.

Note: this change impacts an assertion in the test_modular_deezy_rel test case inside test_pipeline.py.

@thobson88

Example notebooks

All have been updated to work with the refactored codebase, with the exception of:

  • train_use_deezy_model_1.ipynb
  • train_use_deezy_model_2.ipynb
  • train_use_ner_model.ipynb

These notebooks are giving the same errors on the dev branch, so are not broken as a result of the refactoring. Therefore I'm leaving them aside for now.

@thobson88

Linker refactoring

Unlike the other components of the pipeline, the Linker has an extra "business logic" method named disambiguate, in addition to the usual run method.

The reason for this is that the run method identifies candidate links for a specific mention whereas, in general, the disambiguation step requires access to all of the candidates/predictions (for the whole text) at the same time. So the inputs to these two operations are quite different.

Previously the disambiguation step was (in the case of a REL model) done inside the Pipeline, not the Linker. But this wasn't ideal because: i) the Pipeline included disambiguation business logic and ii) it had to discriminate between the different linking methods. By moving this logic into the Linker subclasses, both of these problems are avoided.
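
In outline, the resulting two-phase usage looks like this (a sketch; variable names are illustrative, and the signatures are those shown in the class-structure comments below):

# Phase 1: run() identifies candidate links one mention at a time:
mention_cands = [linker.run(matches) for matches in all_candidate_matches]

# Phase 2: disambiguate() needs the candidates for the whole text at once,
# so it is a separate method:
predictions = linker.disambiguate(sentence_candidates)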

thobson88 commented Dec 19, 2024

Recogniser class structure:

[Diagram: T-Res-recogniser-classes.drawio]

The run method:

def run(self, sentence: str) -> SentenceMentions:

replaces the previous ner_predict method:

def ner_predict(self, sentence: str) -> List[dict]:

thobson88 commented Dec 19, 2024

Ranker class structure:

[Diagram: T-Res-ranker-classes.drawio]

The run method (overridden in each subclass):

def run(self, mention: Mention) -> CandidateMatches:

replaces the previous methods:

def perfect_match(self, queries: List[str]) -> Tuple[dict, dict]:
def partial_match(self, queries: List[str], damlev: bool) -> Tuple[dict, dict]:
def deezy_on_the_fly(self, queries: List[str]) -> Tuple[dict, dict]:

thobson88 commented Dec 19, 2024

Linker class structure:

[Diagram: T-Res-linker-classes.drawio]

The run method:

def run(
    self,
    matches: CandidateMatches,
    place_of_pub_wqid: Optional[str] = None,
    place_of_pub: Optional[str] = None,
) -> MentionCandidates:

replaces previous methods:

def most_popular(self, dict_mention: dict) -> Tuple[str, float, dict]:
def by_distance(self, dict_mention: dict, origin_wqid: Optional[str] = "") -> Tuple[str, float, dict]:

The disambiguate and disambiguation_scores methods (overridden in each subclass) compute disambiguation scores:

def disambiguate(self, candidates: List[SentenceCandidates]) -> Predictions:

replacing, in the case of the reldisamb linking method, logic that was in the Pipeline and in the rank_candidates function in rel_utils.py:

def rank_candidates(rel_json: dict, wk_cands: dict) -> dict:

@thobson88

Pipeline class structure:

[Diagram: T-Res-pipeline-classes.drawio]
