rubert-tiny2
On the site since May 04, 2023 18:25
This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This post in Russian gives more details.
The differences from the previous version include:
a larger vocabulary: 83828 tokens instead of 29564;
longer supported sequences: 2048 tokens instead of 512;
sentence embeddings that approximate LaBSE more closely than before;
meaningful segment embeddings (tuned on the NLI task);
the model is focused only on Russian.
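
To make the "sentence embeddings" point concrete, here is a minimal sketch of how one might obtain them with the Hugging Face `transformers` library. Several things are assumptions rather than statements from this card: the hub id `cointegrated/rubert-tiny2` (inferred by analogy with `cointegrated/rubert-tiny`), the choice of the first (`[CLS]`) token with L2 normalization as the pooling method, and the helper names `load`, `embed_cls`, `cosine` and `demo`, which are purely illustrative.

```python
import math

# Assumed hub id, inferred by analogy with cointegrated/rubert-tiny.
MODEL_NAME = "cointegrated/rubert-tiny2"


def load(model_name=MODEL_NAME):
    """Load the tokenizer and encoder (downloads weights on first use)."""
    # Imported lazily so the pure-Python helper below works without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return tokenizer, model


def embed_cls(text, tokenizer, model):
    """Return an L2-normalized embedding of `text` as a list of floats.

    Assumption: the first-token ([CLS]) hidden state is used as the
    sentence vector; other pooling schemes are possible.
    """
    import torch

    batch = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    cls = output.last_hidden_state[:, 0, :]      # shape: (1, hidden_size)
    cls = torch.nn.functional.normalize(cls)     # unit length along the last dim
    return cls[0].tolist()


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def demo():
    """Embed two Russian sentences and print their cosine similarity."""
    tokenizer, model = load()
    first = embed_cls("привет мир", tokenizer, model)
    second = embed_cls("здравствуй, мир", tokenizer, model)
    print(len(first), cosine(first, second))
```

Calling `demo()` requires `torch` and `transformers` to be installed and network access for the first download. Because the vectors are already normalized, the cosine similarity reduces to a plain dot product; `cosine` is kept general so it also works on unnormalized vectors.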