feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces.#550

Open
adam-sutton-1992 wants to merge 34 commits into main from feat(TransformerNER)_trainable_bert_ner
Conversation

@adam-sutton-1992

@adam-sutton-1992 adam-sutton-1992 commented Jun 16, 2026

Contributor

Hihi,

This is WIP so we'll top it off with a TODO here for now:

  • Unit Testing for transformer_NER - DONE!
  • Unit Testing for rawstring_tokenizer - DONE!
  • Testing all variants of components that are viable (aka all except RawStringTokenizer + ContextBasedLinking) - DONE!
  • Parameter testing - DONE!
  • README for transformer_ner - DONE!
  • README for rawstring_tokenizer - DONE!
  • Whatever mypy and linting fails I missed :D - DONE!

This is the trainable MLM transformer model attempting to do NER. It is a binary BIOES NER model where each prediction is one of Beginning-Ent, Inside-Ent, Outside-Ent, End-Ent, or Single-Ent. This is a bit of an advancement compared to BIO models: E signals the end of a multi-token label, and S signals a standalone single-token label. We try to prioritise B and E tokens here for performance (i.e. to ensure we get well-formed predictions). We also have a CRF head after the MLM model to try to encourage more well-formed label predictions (i.e. only I and E after B).
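To make the BIOES scheme concrete, here is a minimal sketch of how token spans map to tags and what "well formed" means for a tag sequence. The function names and the transition table are illustrative assumptions and are not taken from the plugin; the CRF head in this PR learns transition scores rather than applying a hard table like this.

```python
from typing import Dict, List, Set, Tuple


def spans_to_bioes(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert inclusive (start, end) token spans into BIOES tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"  # single-token entity
        else:
            tags[start] = "B"  # first token of a multi-token entity
            for i in range(start + 1, end):
                tags[i] = "I"  # tokens strictly inside the entity
            tags[end] = "E"  # last token of the entity
    return tags


# Hypothetical hard constraint: which tag may follow which.
# After B or I we are still inside an entity, so only I or E may follow.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "O": {"O", "B", "S"},
    "S": {"O", "B", "S"},
    "E": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
}


def is_well_formed(tags: List[str]) -> bool:
    """Check that a tag sequence never opens or closes an entity illegally."""
    if not tags:
        return True
    if tags[0] in {"I", "E"} or tags[-1] in {"B", "I"}:
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))


example = spans_to_bioes(6, [(1, 3), (5, 5)])
print(example)  # ['O', 'B', 'I', 'E', 'O', 'S']
print(is_well_formed(example), is_well_formed(["B", "O"]))
```

A sequence such as `B O` is rejected because the entity opened by B is never closed by an E.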

transformer_ner.py is the main logic for the plugin, while transformer_ner_model is the logic for the model (such as initialisation, loading, and the forward step).

Rawstring_tokenizer is a tokenizer where all tokens are based on whitespace splits, i.e. new lines, tabs, and spaces. It still can't perfectly obtain all entities (e.g. sub-word entities), but it is an improvement where some entities don't have a spacy representation. This mainly improves performance in pipelines that use transformer_ner and embedding_linker.
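As a rough illustration of whitespace-only splitting with character offsets, here is a minimal sketch. It does not reproduce the tokenizer in this PR; the `whitespace_tokens` helper and its return shape are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Any maximal run of non-whitespace characters is one token.
_NON_WS = re.compile(r"\S+")


def whitespace_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Return (token_text, start_char, end_char) for each whitespace-delimited token."""
    return [(m.group(), m.start(), m.end()) for m in _NON_WS.finditer(text)]


for token in whitespace_tokens("chest pain,\tno\nfever"):
    print(token)
```

Note that punctuation stays attached to its neighbour (`"pain,"` is one token), which is one way an entity can end up being a sub-word of a token.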

There are also additional changes to the embedding_linker. I understand these should probably be in a separate PR; however, that slipped through the cracks. Apologies. The changes mainly add functionality and configurability:

  1. Multiple entities per span
  2. Short and long similarity thresholds, along with top_k entities being passed
  3. Appending the pre_inference link candidates to those from vocab based methods (which are both a part of transformer_ner and vocab_ner).
  4. Additional documentation in the code for all of this step by step.
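To illustrate how a similarity threshold and top_k can interact when several candidates are kept per span, here is a minimal sketch. The description above does not define what "short" and "long" refer to; this example assumes they mean short and long mentions measured in characters, and every name and default value below is hypothetical rather than taken from the embedding_linker config.

```python
from typing import Dict, List, Tuple


def select_candidates(
    similarities: Dict[str, float],
    mention: str,
    short_threshold: float = 0.9,  # assumed: stricter cut-off for short mentions
    long_threshold: float = 0.7,   # assumed: looser cut-off for long mentions
    short_max_chars: int = 5,      # assumed boundary between short and long
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Keep at most top_k concepts whose similarity clears the applicable threshold."""
    threshold = short_threshold if len(mention) <= short_max_chars else long_threshold
    kept = [(cui, sim) for cui, sim in similarities.items() if sim >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


sims = {"C1": 0.95, "C2": 0.8, "C3": 0.75, "C4": 0.72, "C5": 0.5}
print(select_candidates(sims, "MI"))
print(select_candidates(sims, "myocardial infarction"))
```

Lowering the thresholds or raising top_k passes more candidates per span, which tends to raise recall and lower precision.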

Performance-wise, you can expect trained models with reasonable configs to look like this (based on training / testing on Distemist & Snomed Entity Linking Benchmark):

  1. Spacy Tokenizer + Vocab based NER + Context Based Linker: Recall 0.7
  2. Spacy Tokenizer + Vocab based NER + Embedding Linking: Recall 0.75
  3. Spacy Tokenizer + Transformer based NER + Context Based Linking: Recall 0.73
  4. Rawstring Tokenizer + Transformer based NER + Embedding Linking: Recall 0.85

There are a few additional caveats with these metrics. The embedding linker is highly configurable, so you can go from a recall of 0.84-ish and a precision of 0.4 to a recall of 0.9 and a precision of 0.05. The metrics I have here are essentially from configurations I think make "sense". One rule of thumb I use is "if the recall goes up and precision remains the same or improves, I'd consider that a solid improvement". I have documented these changes in performance within the config, so hopefully people can make informed decisions.

adam-sutton-1992 and others added 30 commits April 24, 2026 12:39
@adam-sutton-1992 adam-sutton-1992 changed the title WIP: feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. Jun 23, 2026
@adam-sutton-1992

Contributor Author

Hi, just a brief mention on performances. I'll list some permutations of components, tokenisers and just list what I can get. These aren't exhaustive and the performance can always be improved with better configs.

  • Baseline (with training): 0.68 Recall

  • Baseline (without training): 0.34 Recall

  • Rawstring Tokenizer, , Context Based Linker (CBL): Not possible!

  • Rawstring Tokenizer, TRF NER, Trained Embedding Linker (TEL): 0.85 Recall

  • Rawstring Tokenizer, TRF NER, Static Embedding Linker (SEL): 0.54 Recall

  • SpacyTokenizer, TRF NER, SEL: 0.48 Recall

  • SpacyTokenizer, TRF NER, TEL: 0.7 Recall

  • SpacyTokenizer, TRF NER, CBL: 0.69 Recall

  • SpacyTokenizer, Vocab NER, SEL: 0.35 Recall

  • SpacyTokenizer, Vocab NER, TEL: 0.75 Recall

@mart-r

mart-r commented Jun 23, 2026

Collaborator

Hi, just a brief mention on performances. I'll list some permutations of components, tokenisers and just list what I can get. These aren't exhaustive and the performance can always be improved with better configs.

  • Baseline (with training): 0.68 Recall
  • Baseline (without training): 0.34 Recall
  • Rawstring Tokenizer, , Context Based Linker (CBL): Not possible!
  • Rawstring Tokenizer, TRF NER, Trained Embedding Linker (TEL): 0.85 Recall
  • Rawstring Tokenizer, TRF NER, Static Embedding Linker (SEL): 0.54 Recall
  • SpacyTokenizer, TRF NER, SEL: 0.48 Recall
  • SpacyTokenizer, TRF NER, TEL: 0.7 Recall
  • SpacyTokenizer, TRF NER, CBL: 0.69 Recall
  • SpacyTokenizer, Vocab NER, SEL: 0.35 Recall
  • SpacyTokenizer, Vocab NER, TEL: 0.75 Recall

As much as I agree that recall is often more important than precision, on its own I don't see it as particularly valuable. I could just mark all spans for all concepts and get a recall of 1 at the expense of near-zero precision.
So maybe add some idea of precision as well?
Also, how long do these take to run? Are there any throughput changes?
And what dataset are you running this on?
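To make the recall-only concern concrete, here is a small sketch on toy data (the spans and concept IDs are invented for illustration and are not results from this PR). Predicting every concept for every span reaches a recall of 1 while precision collapses.

```python
from itertools import product
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # (start_char, end_char, concept_id)


def precision_recall(predicted: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float]:
    """Exact-match precision and recall over (span, concept) annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy gold standard: two annotated entities.
gold = {(0, 5, "C1"), (10, 15, "C2")}

# "Mark everything": every candidate span paired with every concept.
spans = [(0, 5), (6, 9), (10, 15)]
concepts = ["C1", "C2", "C3", "C4"]
mark_all = {(start, end, cui) for (start, end), cui in product(spans, concepts)}

precision, recall = precision_recall(mark_all, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here 2 of the 12 predictions are correct, so recall is 1.0 and precision is about 0.167.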

PS:
Will look at the PR itself later on, just saw this as a separate thread on Gmail...

@mart-r mart-r left a comment

Collaborator

Massive PR!
This should have been 3 different PRs:

  1. Embedding linker changes
  2. Rawstring tokenizer
  3. Transformers based NER

I really couldn't go through everything in enough detail at this point.
But I've left a few comments.

I think the main questions I had were the following:

  1. What's rawstring tokenizer and how does it differ from the built in regex tokenizer? Why do we need a separate one?
  2. Why can't the rawstring tokenizer be used with the regular / context based linker?

And then there's the fact that I think we need to have workflows for the new things as well.

index: int,
char_index: int,
end_char_index: int) -> None:
# --- BaseToken fields ---

Collaborator

This is fine, but to avoid all the boilerplate, I would just use the .text (and so on) fields rather than the property stuff below.

It would still satisfy the protocol. The only difference is that with properties the user doesn't know that these are writable, and thus type checkers would complain if/when they try to write into these.
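A minimal sketch of the point, using a hypothetical `HasText` protocol as a stand-in for the real token protocol (whose full definition is not shown here): a plain attribute satisfies a read-only property member just as well as an explicit property does.

```python
from typing import Protocol


class HasText(Protocol):
    """Hypothetical stand-in for a token protocol with a read-only member."""

    @property
    def text(self) -> str: ...


class PropertyToken:
    """Boilerplate version: private field plus a read-only property."""

    def __init__(self, text: str) -> None:
        self._text = text

    @property
    def text(self) -> str:
        return self._text


class FieldToken:
    """Plain-field version: less code, and the attribute is also writable."""

    def __init__(self, text: str) -> None:
        self.text = text


def shout(token: HasText) -> str:
    # Accepts anything that exposes a readable `text` attribute.
    return token.text.upper()


print(shout(PropertyToken("abc")), shout(FieldToken("abc")))
```

Both classes are accepted by `shout` under structural typing; the difference is only that `FieldToken.text` can be assigned to, whereas assigning to `PropertyToken.text` fails at runtime and is flagged by type checkers.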


def __init__(self, text: str, start_index: int, end_index: int,
start_char: int, end_char: int, label: str = "") -> None:
# --- BaseEntity fields ---

Collaborator

Same here - I would just use fields instead of properties.

# _WORD_RE = re.compile(r"[^\W_]+(?:[-/][^\W_]+)*", re.UNICODE)


def _iter_word_spans(

Collaborator

This is effectively where the tokenization itself happens :)

Perhaps move this (along with the regex above) to its own module to separate it? i.e. base.py


### Component Registration

Register the tokenizer by name before trying to add the tokenizer to the pipeline. If loading a model with a rawstring tokenizer register it beforehand.

Collaborator

@@ -0,0 +1,80 @@
# MedCAT Embedding Linker

Collaborator

How does it differ from the regex tokenizer?
Just a different regex?
What's the benefit?
Why do we need this separately from the built in regex tokenizer?

Also, would be great to add a workflow that runs some typing/linting and tests on this stuff.


## Limitations

- Can NOT be used with the default `context_based_linker` as, that uses spacy tokens and spacy embeddings for linking. Which are not used with this tokenizer.

Collaborator

I'm not quite sure I understand why it can't be used with the context based linker.
There is nothing in there that is coupled to the spacy tokenizer.

@@ -0,0 +1,100 @@
# MedCAT Embedding Linker

Collaborator

This could also use workflows that run linting/typing/tests.

2 participants