chiphuyen/lazynlp

lazynlp, on the site since June 14, 2023 13:56
A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one used by OpenAI for GPT-2.
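To make the clean-up and deduplication steps concrete, here is a minimal sketch using only the Python standard library. It does not use lazynlp's own API: the function names `clean_page` and `dedupe_pages` are hypothetical, crawling is left out, and exact-match hashing is an assumed stand-in for whatever deduplication strategy the library actually applies.

```python
import hashlib
import html
import re


def clean_page(raw_html: str) -> str:
    """Crudely turn an HTML page into plain text (hypothetical helper)."""
    # Drop script and style blocks, including their contents.
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Replace every remaining tag with a space.
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(no_tags)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_pages(pages):
    """Return cleaned texts, keeping the first of each case-insensitive duplicate."""
    seen = set()
    unique = []
    for raw in pages:
        text = clean_page(raw)
        # Hash the lowercased text so identical pages collapse to one entry.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    sample = [
        "<p>Hello   world</p>",
        "<div>hello world</div>",
        "<script>x = 1</script><p>Other &amp; text</p>",
    ]
    print(dedupe_pages(sample))
```

Exact hashing only catches pages whose cleaned text is identical; near-duplicates that differ by a few words would need an overlap-based comparison instead.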