wizenoze/justext-java

Java, on the site since June 14, 2023 13:55
justext-java is a library for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text that contains full sentences, which makes it well suited for creating linguistic resources such as Web corpora. This implementation is a Java port of the Python library jusText: https://github.com/miso-belica/jusText.
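
To make the idea of "text containing full sentences" more concrete, here is a minimal sketch of one way a paragraph could be scored. It is not the justext-java API and not necessarily the algorithm the library uses: the class `BoilerplateSketch`, the tiny stop-word list, and both thresholds are hypothetical and chosen only for illustration. The assumption is that running prose tends to be longer and to contain more function words than navigation links or footers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT the justext-java API.
class BoilerplateSketch {
    // Tiny illustrative stop-word list (assumption); real lists are per-language and far larger.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "is", "it", "of", "and", "to", "in", "for", "that", "this", "are", "as"));

    // Hypothetical thresholds, picked for the example rather than taken from the library.
    private static final int MIN_WORDS = 8;
    private static final double MIN_STOP_WORD_DENSITY = 0.25;

    // Fraction of whitespace-separated tokens that are stop words.
    public static double stopWordDensity(String text) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (words.length == 0 || words[0].isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String word : words) {
            // Strip punctuation so "is," still matches "is".
            String cleaned = word.replaceAll("[^a-z]", "");
            if (STOP_WORDS.contains(cleaned)) {
                hits++;
            }
        }
        return (double) hits / words.length;
    }

    // True when the text is long enough and dense enough in stop words to look like prose.
    public static boolean looksLikeContent(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        int wordCount = trimmed.split("\\s+").length;
        return wordCount >= MIN_WORDS && stopWordDensity(trimmed) >= MIN_STOP_WORD_DENSITY;
    }

    public static void main(String[] args) {
        String navigation = "Home | About | Contact | Login";
        String prose = "It is designed to preserve mainly text containing full sentences "
                + "and it is therefore well suited for creating linguistic resources.";
        System.out.println("navigation looks like content: " + looksLikeContent(navigation));
        System.out.println("prose looks like content: " + looksLikeContent(prose));
    }
}
```

A real boilerplate remover would work on paragraphs extracted from the parsed HTML and would likely weigh further signals, so this sketch only shows the kind of per-paragraph decision involved.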