chiphuyen/lazynlp

lazynlp, on the site since 14 June 2023, 13:56
A straightforward library that lets you crawl, clean up, and deduplicate webpages to create massive monolingual datasets. With it, you should be able to build datasets larger than the one OpenAI used for GPT-2.
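
To make the clean-and-deduplicate idea concrete, here is a minimal standard-library sketch. It does not use lazynlp's own API: the function names `clean_page` and `dedupe` are hypothetical, the regex-based tag stripping is a simplification, and exact-match hashing is swapped in as the simplest possible deduplication step rather than whatever method the library itself applies.

```python
import hashlib
import re
from html import unescape

# Hypothetical helpers for illustration only; these are not lazynlp functions.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def clean_page(html: str) -> str:
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", html)
    text = unescape(text)
    return WS_RE.sub(" ", text).strip()


def dedupe(pages):
    """Return cleaned pages, dropping empty ones and exact duplicates.

    Duplicates are detected by hashing the lowercased cleaned text, so only
    pages that are identical after cleaning (ignoring case) are removed.
    """
    seen = set()
    unique = []
    for page in pages:
        text = clean_page(page)
        key = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if text and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    raw = [
        "<p>Hello  world</p>",
        "<div>hello world</div>",
        "<p>Other &amp; more</p>",
    ]
    for line in dedupe(raw):
        print(line)
```

Exact hashing only catches pages that are identical after cleaning; near-duplicates such as mirrored articles with different footers would need an overlap-based comparison instead.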