feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. by adam-sutton-1992 · Pull Request #550 · CogStack/cogstack-nlp

adam-sutton-1992 · 2026-06-16T18:34:29Z

Hihi,

This is WIP so we'll top it off with a TODO here for now:

Unit Testing for transformer_NER - DONE!
Unit Testing for rawstring_tokenizer - DONE!
Testing all variants of components that are viable (aka all except RawStringTokenizer + ContextBasedLinking) - DONE!
Parameter testing - DONE!
README for transformer_ner - DONE!
README for rawstring_tokenizer - DONE!
Whatever mypy and linting fails I missed :D - DONE!

This is the trainable MLM transformer model attempting to do NER. It is a binary BIOES NER model where each token prediction is one of Beginning-Ent, Inside-Ent, Outside-Ent, End-Ent, or Single-Ent. This is a step up from BIO models: E signals the end of a multi-token label, and S signals a standalone single-token label. We try to prioritise B and E tokens here for performance (i.e. to ensure we get well-formed predictions). We also have a CRF head after the MLM model to try to encourage more well-formed label predictions (i.e. only I and E after B).
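As a rough illustration of what "well formed" means for BIOES tags, here is a minimal sketch that checks a tag sequence and decodes it into token spans. The names `ALLOWED_NEXT`, `is_well_formed`, and `decode_spans` are hypothetical and are not taken from the plugin; the real model uses a CRF head rather than a hard-coded table.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical transition table: an entity opened by B must continue
# with I or close with E; O, E and S may be followed by a fresh start.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}


def is_well_formed(tags: List[str]) -> bool:
    """Return True if every adjacent pair of tags is an allowed transition."""
    if not tags:
        return True
    # A sequence may not start inside an entity or end with one still open.
    if tags[0] in {"I", "E"} or tags[-1] in {"B", "I"}:
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))


def decode_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Turn BIOES tags into (start, end) token spans, end exclusive.

    Fragments that are never closed by an E tag are dropped.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i + 1))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i + 1))
            start = None
        elif tag == "O":
            start = None
        # An "I" tag leaves the currently open entity untouched.
    return spans


tags = ["O", "B", "I", "E", "O", "S"]
print(is_well_formed(tags), decode_spans(tags))
```

The separate E and S tags are what let the decoder close an entity without looking ahead to the next token, which a plain BIO scheme cannot do.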

transformer_ner.py is the main logic for the plugin, while transformer_ner_model is the logic for the model (such as initialisation, loading, and the forward step).

Rawstring_tokenizer is a tokenizer where all tokens are based on whitespace splits, i.e. new lines, tabs, and spaces. It still can't perfectly obtain all entities (e.g. sub-word entities), but it is an improvement where some entities don't have a spacy representation. This mainly improves performance in pipelines that use transformer_ner and embedding_linker.
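To make the whitespace-splitting idea concrete, here is a minimal sketch that yields each token with its character offsets. The function name `whitespace_tokens` is hypothetical and this is not the plugin's implementation; it only assumes that tokens are maximal runs of non-whitespace characters.

```python
import re
from typing import Iterator, Tuple

# Any maximal run of non-whitespace characters counts as one token.
_NON_WS = re.compile(r"\S+")


def whitespace_tokens(text: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (token_text, start_char, end_char) for each whitespace-separated token."""
    for match in _NON_WS.finditer(text):
        yield match.group(), match.start(), match.end()


for token in whitespace_tokens("chest pain,\tnot\nacute"):
    print(token)
```

Note that punctuation stays attached to the neighbouring word (`"pain,"`), which is one way an entity smaller than a whitespace token can still be missed.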

There are also additional changes to the embedding_linker. I understand these should probably be separate, but that slipped through the cracks. Apologies. The changes mainly add functionality and configurability:

Multiple entities per span
Short and long similarity thresholds, along with top_k entities being passed
Appending the pre_inference link candidates to those from vocab based methods (which are both a part of transformer_ner and vocab_ner).
Additional documentation in the code for all of this step by step.
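The interaction between a similarity threshold and `top_k` can be sketched as follows. This is an assumption-laden illustration, not the embedding linker's code: `cosine` and `top_k_candidates` are hypothetical names, a single threshold is used (the short/long split is not modelled because the PR description does not define it), and vectors are plain Python lists.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_candidates(
    mention_vec: Sequence[float],
    concept_vecs: Dict[str, Sequence[float]],
    threshold: float,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return up to top_k (concept_id, similarity) pairs at or above threshold."""
    scored = [(cui, cosine(mention_vec, vec)) for cui, vec in concept_vecs.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    # Highest similarity first, so truncation keeps the best candidates.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical concept ids and toy 2-d embeddings.
concepts = {"C1": [1.0, 0.0], "C2": [0.9, 0.1], "C3": [0.0, 1.0]}
print(top_k_candidates([1.0, 0.0], concepts, threshold=0.5, top_k=2))
```

Lowering the threshold or raising `top_k` lets more candidates through per span, which tends to raise recall and lower precision.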

Performance-wise, you can expect trained models with reasonable configs to look like this (based on training / testing on Distemist & Snomed Entity Linking Benchmark):

Spacy Tokenizer + Vocab based NER + Context Based Linker: Recall 0.7
Spacy Tokenizer + Vocab based NER + Embedding Linking: Recall 0.75
Spacy Tokenizer + Transformer based NER + Context Based Linking: Recall 0.73
Rawstring Tokenizer + Transformer based NER + Embedding Linking: Recall 0.85

There are a few additional caveats with these metrics. The embedding linker is highly configurable, so you can go from a Recall of 0.84-ish and a Precision of 0.4, to 0.9 recall and 0.05 precision. The metrics I have here are essentially from configurations I think make "sense". One rule of thumb I use: "if the recall goes up, and precision remains the same or improves, I'd consider that a solid improvement". I have documented these changes in performance within the config, so hopefully people can make informed decisions.

adam-sutton-1992 · 2026-06-23T14:13:29Z

Hi, just a brief mention on performance. I'll list some permutations of components and tokenisers along with what I can get. These aren't exhaustive and the performance can always be improved with better configs.

Baseline (with training): 0.68 Recall
Baseline (without training): 0.34 Recall
Rawstring Tokenizer, , Context Based Linker (CBL): Not possible!
Rawstring Tokenizer, TRF NER, Trained Embedding Linker (TEL): 0.85 Recall
Rawstring Tokenizer, TRF NER, Static Embedding Linker (SEL): 0.54 Recall
SpacyTokenizer, TRF NER, SEL: 0.48 Recall
SpacyTokenizer, TRF NER, TEL: 0.7 Recall
SpacyTokenizer, TRF NER, CBL: 0.69 Recall
SpacyTokenizer, Vocab NER, SEL: 0.35 Recall
SpacyTokenizer, Vocab NER, TEL: 0.75 Recall

mart-r · 2026-06-23T14:46:55Z

As much as I agree that recall is often more important than precision, on its own I don't see it as particularly valuable. I could just mark all spans for all concepts and get a recall of 1 at the expense of near-zero precision.
So maybe add some idea of precision as well?
Also, how long do these take to run? Are there any throughput changes?
And what dataset are you running this on?
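The recall-only concern can be illustrated with a small sketch that scores two sets of predictions against the same gold annotations. The `Annotation` alias, the `precision_recall` helper, and the toy data are all hypothetical and unrelated to the benchmark numbers in this thread.

```python
from itertools import product
from typing import Set, Tuple

# (start_char, end_char, concept_id); exact match is required for a hit.
Annotation = Tuple[int, int, str]


def precision_recall(predicted: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float]:
    """Exact-match precision and recall; 0.0 when the denominator is empty."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {(0, 5, "C1"), (10, 14, "C2")}

# A cautious predictor that finds one of the two gold annotations.
careful = {(0, 5, "C1")}

# A degenerate predictor that marks every span with every concept.
spans = [(0, 5), (10, 14), (20, 25)]
cuis = ["C1", "C2", "C3"]
greedy = {(start, end, cui) for (start, end), cui in product(spans, cuis)}

print(precision_recall(careful, gold))
print(precision_recall(greedy, gold))
```

The degenerate predictor reaches full recall while most of its output is wrong, which is why a recall figure is hard to interpret without the matching precision.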

PS:
Will look at the PR itself later on, just saw this as a separate thread on Gmail...

mart-r

Massive PR!
This should have been 3 different PRs:

Embedding linker changes
Rawstring tokenizer
Transformers based NER

I really couldn't go through everything in enough detail at this point.
But I've left a few comments.

I think the main questions I had were the following:

What's rawstring tokenizer and how does it differ from the built in regex tokenizer? Why do we need a separate one?
Why can't the rawstring tokenizer be used with the regular / context based linker?

And then there's the fact that I think we need to have workflows for the new things as well.

mart-r · 2026-06-26T15:25:03Z

+                 index: int, 
+                 char_index: int, 
+                 end_char_index: int) -> None:
+        # --- BaseToken fields ---


This is fine, but to avoid all the boilerplate, I would just use the .text (and so on) fields rather than the property stuff below.

It would still satisfy the protocol. The only difference is that the user doesn't know that these are writable, and thus type checkers would complain if/when they try to write into these.
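A minimal sketch of the point about fields versus properties, using a stand-in protocol. `TokenLike`, `FieldToken`, and `PropertyToken` are hypothetical names; the actual `BaseToken` protocol in MedCAT may declare different members.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TokenLike(Protocol):
    """Stand-in protocol that exposes read-only members."""

    @property
    def text(self) -> str: ...

    @property
    def index(self) -> int: ...


class FieldToken:
    """Plain attributes: no boilerplate, still satisfies the protocol."""

    def __init__(self, text: str, index: int) -> None:
        self.text = text
        self.index = index


class PropertyToken:
    """Equivalent class written with explicit read-only properties."""

    def __init__(self, text: str, index: int) -> None:
        self._text = text
        self._index = index

    @property
    def text(self) -> str:
        return self._text

    @property
    def index(self) -> int:
        return self._index


def describe(token: TokenLike) -> str:
    # Through the protocol type, a static checker treats text and index as
    # read-only, even though FieldToken would allow assignment at runtime.
    return f"{token.index}:{token.text}"


print(describe(FieldToken("pain", 3)), describe(PropertyToken("pain", 3)))
```

Both classes are accepted wherever `TokenLike` is expected; the field-based one is simply shorter.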

mart-r · 2026-06-26T15:25:35Z

+
+    def __init__(self, text: str, start_index: int, end_index: int,
+                 start_char: int, end_char: int, label: str = "") -> None:
+        # --- BaseEntity fields ---


Same here - I would just use fields instead of properties.

mart-r · 2026-06-26T15:32:43Z

+# _WORD_RE = re.compile(r"[^\W_]+(?:[-/][^\W_]+)*", re.UNICODE)
+
+
+def _iter_word_spans(


This is effectively where the tokenization itself happens :)

Perhaps move this (along with the regex above) to its own module to separate it, e.g. base.py?
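For reference, here is a small sketch of what the commented-out pattern from the diff would match if it were used to produce spans. The pattern string is copied from the diff; `WORD_RE` and `iter_word_spans_sketch` are hypothetical names, and since the regex is commented out in the PR, the tokenizer itself may well behave differently.

```python
import re
from typing import Iterator, Tuple

# Pattern copied from the commented-out line in the diff: runs of
# letters/digits (no underscore), optionally joined by "-" or "/".
WORD_RE = re.compile(r"[^\W_]+(?:[-/][^\W_]+)*", re.UNICODE)


def iter_word_spans_sketch(text: str) -> Iterator[Tuple[int, int]]:
    """Yield (start_char, end_char) for each match of WORD_RE in text."""
    for match in WORD_RE.finditer(text):
        yield match.start(), match.end()


text = "non-small cell, stage_2"
print([text[start:end] for start, end in iter_word_spans_sketch(text)])
```

Unlike a pure whitespace split, this pattern drops trailing punctuation and breaks on underscores, while keeping hyphenated and slash-joined words together.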

mart-r · 2026-06-26T15:35:40Z

+
+### Component Registration
+
+Register the tokenizer by name before trying to add the tokenizer to the pipeline. If loading a model with a rawstring tokenizer register it beforehand.


I'd normally expect the extension to take care of that.
See:
https://github.com/CogStack/cogstack-nlp/blob/main/medcat-plugins/embedding-linker/src/medcat_embedding_linker/__init__.py
and
https://github.com/CogStack/cogstack-nlp/blob/main/medcat-plugins/embedding-linker/src/medcat_embedding_linker/registration.py
for example

EDIT:
I see you've already done that for transformers NER as well

mart-r · 2026-06-26T15:36:59Z

@@ -0,0 +1,80 @@
+# MedCAT Embedding Linker


How does it differ from the regex tokenizer?
Just a different regex?
What's the benefit?
Why do we need this separately from the built in regex tokenizer?

Also, would be great to add a workflow that runs some typing/linting and tests on this stuff.

mart-r · 2026-06-26T15:38:28Z

+
+## Limitations
+
+- Can NOT be used with the default `context_based_linker` as, that uses spacy tokens and spacy embeddings for linking. Which are not used with this tokenizer.


I'm not quite sure I understand why it can't be used with the context based linker.
There is nothing in there that is coupled to the spacy tokenizer.

mart-r · 2026-06-26T15:42:27Z

@@ -0,0 +1,100 @@
+# MedCAT Embedding Linker


This could also use workflows that run linting/typing/tests.

adam-sutton-1992 and others added 30 commits April 24, 2026 12:39

fixing the comp name requirement

75f6ff7

initial commit for transformer_ner

a2075be

changed cui embedding method and fixed mention_mask generation

3664576

fixed spacing

65ea764

Merge branch 'embedding_cui_longest_name' into feat(TransformerNER)_t…

07c85be

…rainable_bert_ner

CU-869d9n2rg: Avoid running pipe twice for supervised training

4e3feb9

CU-869d9n2rg: Fix typo

eca18ea

CU-869d9n2rg: Use separate callers for tokenizer and pipe in trainer

08267ab

CU-869d9n2rg: Use separate callers for tokenizer and pipe in trainer …

01c7898

…creation

CU-869d9n2rg: Fix training time instantiation

872a98a

CU-869d9n2rg: Set entity name when preparing for supervised training

69406bf

CU-869d9n2rg: Fix issue with trainable component counting

e83d8b4

CU-869d9n2rg: Add cui to entity for supervised training

c67fb32

CU-869d9n2rg: Using properly processed name when prepping for supervi…

68ee430

…sed training

CU-869d9n2rg: Simplify trainer class - use pipe for different callers

ac94a02

CU-869d9n2rg: Simplify trainer class in init

ad07225

CU-869d9n2rg: Fix trainer utils tests

2f30634

CU-869d9n2rg: Update trainer tests with new trainer object

7c4a071

CU-869d9n2rg: Update trainer utils tests with new trainer object

2afc942

CU-869d9n2rg: Remove unused import

86a155e

Merge branch 'pr-467' into feat(TransformerNER)_trainable_bert_ner

e402458

changes to embedding linker and transformer ner, along with new token…

d90755c

…izer

Merge remote-tracking branch 'origin/main' into feat(TransformerNER)_…

a4251c7

…trainable_bert_ner

progress on linking and ner models

ad658b0

merge with main

99793b7

embedding linker fixes / adaptations and config updates

1c9f0b2

mypy fixes that are never enough

220a734

linting!

83246a9

remove debugging print statements

d3eebe4

fixing remote mypy issues and errors thrown in running

e759245

adam-sutton-1992 added 4 commits June 16, 2026 22:17

fixed tests!

6383db7

testing and related fixes due to tests

1cf6b6a

re-added deid transformer-ner type

6650fd4

fixes from integration testing and READMEs done

aebf9a8

adam-sutton-1992 changed the title ~~WIP: feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces.~~ feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. Jun 23, 2026

mart-r requested changes Jun 26, 2026

adam-sutton-1992 wants to merge 34 commits into main from feat(TransformerNER)_trainable_bert_ner
