feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces. #550
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| # MedCAT Rawstring Tokenizer | ||
|
|
||
| A MedCAT plugin that provides a rawstring tokenizer, which essentially splits text on whitespace characters (" ", "\n", "\t") only. | ||
|
|
||
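As a rough illustration of the splitting behaviour described above, the sketch below tokenizes on the three listed whitespace characters and records character offsets. It is not the plugin's implementation: `RawToken` and `split_on_whitespace` are hypothetical names used only for this example.

```python
import re
from typing import NamedTuple


class RawToken(NamedTuple):
    """Hypothetical token record: the text plus its character span."""
    text: str
    start: int
    end: int


# Match runs of characters that are not a space, newline, or tab.
_NON_WS = re.compile(r"[^ \n\t]+")


def split_on_whitespace(text: str) -> list[RawToken]:
    """Split only on ' ', '\\n' and '\\t'; punctuation stays attached."""
    return [RawToken(m.group(), m.start(), m.end()) for m in _NON_WS.finditer(text)]


tokens = split_on_whitespace("BP 120/80\tmmHg,\nstable.")
print([t.text for t in tokens])
# ['BP', '120/80', 'mmHg,', 'stable.']
```

Note that punctuation such as the comma in `mmHg,` remains part of the token, since nothing other than whitespace acts as a boundary in this sketch.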
| ## Overview | ||
|
|
||
| This plugin replaces MedCAT's default tokenizing component with a rawstring tokenizer, which does not require the spaCy representations that the default linking relies on. | ||
|
|
||
| ## Requirements | ||
|
|
||
| - **MedCAT**: 2.0+ ([PyPI](https://pypi.org/project/medcat/) | [GitHub](https://github.com/CogStack/MedCAT)) | ||
| - Python 3.10+ | ||
|
|
||
| ## Installation | ||
|
|
||
| ```bash | ||
| pip install medcat-rawstring-tokenizer | ||
| ``` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Replacing the current tokenizer with the rawstring tokenizer | ||
|
|
||
| ```python | ||
| from medcat.cat import CAT | ||
| from medcat_rawstring_tokenizer.tokenizer import RawstringTokenizer | ||
| from medcat.tokenizing.tokenizers import register_tokenizer | ||
|
|
||
| MODEL_PACK_PATH = ".." | ||
| TARGET_FOLDER = ".." | ||
| TARGET_PACK_NAME = ".." | ||
| TOKENIZER_NAME = "rawstring_tokenizer" | ||
|
|
||
| # The custom tokenizer must be registered before we rebuild the pipeline. | ||
| register_tokenizer(TOKENIZER_NAME, RawstringTokenizer) | ||
|
|
||
| cat = CAT.load_model_pack(MODEL_PACK_PATH) | ||
| print("Tokenizer provider before:", cat.config.general.nlp.provider) | ||
|
|
||
| # Switch tokenizer provider in config, then recreate pipeline to apply it. | ||
| cat.config.general.nlp.provider = TOKENIZER_NAME | ||
|
|
||
| cat.config.components.addons.clear() | ||
| cat._recreate_pipe() | ||
|
|
||
| print("Tokenizer provider after:", cat.config.general.nlp.provider) | ||
|
|
||
| cat.save_model_pack( | ||
| target_folder=TARGET_FOLDER, | ||
| pack_name=TARGET_PACK_NAME, | ||
| add_hash_to_pack_name=False, | ||
| make_archive=False, | ||
| ) | ||
| print("Saved model pack to:", f"{TARGET_FOLDER.rstrip('/')}/{TARGET_PACK_NAME}") | ||
| ``` | ||
|
|
||
| ## How It Works | ||
|
|
||
| ### Component Registration | ||
|
|
||
| Register the tokenizer by name before adding it to the pipeline. When loading a model pack that was saved with the rawstring tokenizer, register the tokenizer before loading. | ||
|
Collaborator
I'd normally expect the extension to take care of that. EDIT: |
||
|
|
||
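To show why registration has to come first, here is a minimal sketch of a name-based registry. It is a generic stand-in and makes no claim about how MedCAT implements `register_tokenizer`; `register`, `create`, and `DummyTokenizer` are hypothetical names for this example only.

```python
from typing import Callable

# Hypothetical stand-in for a tokenizer registry; not MedCAT's own code.
_REGISTRY: dict[str, Callable[[], object]] = {}


def register(name: str, factory: Callable[[], object]) -> None:
    """Make a tokenizer factory available under the given name."""
    _REGISTRY[name] = factory


def create(name: str) -> object:
    """Look a tokenizer up by name, as a pipeline might when it is built."""
    if name not in _REGISTRY:
        raise KeyError(f"No tokenizer registered as {name!r}")
    return _REGISTRY[name]()


class DummyTokenizer:
    """Placeholder standing in for RawstringTokenizer."""


try:
    create("rawstring_tokenizer")  # lookup before registration fails
except KeyError as err:
    print("lookup failed:", err)

register("rawstring_tokenizer", DummyTokenizer)
tokenizer = create("rawstring_tokenizer")  # succeeds once registered
print(type(tokenizer).__name__)
```

The same ordering applies in the Quick Start: the name stored in the config can only be resolved once a class has been registered under it.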
| ### Embedding Generation | ||
|
|
||
| ## Limitations | ||
|
|
||
| - Cannot be used with the default `context_based_linker`, because that linker relies on spaCy tokens and spaCy embeddings, which are not available with this tokenizer. | ||
|
Collaborator
I'm not quite sure I understand why it can't be used with the context based linker. |
||
|
|
||
| ## Citation | ||
|
|
||
| If you use this plugin, please cite MedCAT: | ||
|
|
||
| ```bibtex | ||
| @article{medcat2021, | ||
| title={Medical Concept Annotation Tool (MedCAT)}, | ||
| author={Kraljevic, Zeljko and others}, | ||
| journal={arXiv preprint arXiv:2010.01165}, | ||
| year={2021} | ||
| } | ||
| ``` | ||
How does it differ from the regex tokenizer?
Just a different regex?
What's the benefit?
Why do we need this separately from the built in regex tokenizer?
Also, would be great to add a workflow that runs some typing/linting and tests on this stuff.