Xtractor

XTractor is an algorithmic text extractor for web pages, written in Java. It builds upon the "commonly used web design practices" approach (from readability.js, goose and snacktory) to create a set of heuristics for fast article text extraction. It adds several features, such as paragraph preservation, better image detection heuristics, and sibling-score-based enhancements to article detection.
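
To give a feel for what a sibling-score heuristic can look like, here is a minimal sketch in the style of readability-like extractors. It is not XTractor's actual code or API: the class SiblingScoreSketch, the Block type, the scoring formula and the 20% threshold are all assumptions made for illustration. Each block is scored by its comma count and length, penalized by link density, and any sibling scoring close enough to the best block is kept as its own paragraph.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and constants are hypothetical, not taken from XTractor.
class SiblingScoreSketch {

    /** Hypothetical stand-in for a DOM block: its visible text and the share of that text inside links. */
    static final class Block {
        final String text;
        final double linkDensity; // 0.0 = no link text, 1.0 = all link text

        Block(String text, double linkDensity) {
            this.text = text;
            this.linkDensity = linkDensity;
        }
    }

    /** Longer, comma-rich text scores higher; link-heavy text is penalized. */
    static double score(Block b) {
        int commas = (int) b.text.chars().filter(c -> c == ',').count();
        int lengthBonus = Math.min(b.text.length() / 100, 3); // one point per 100 chars, capped at 3
        double base = 1.0 + commas + lengthBonus;
        return base * (1.0 - b.linkDensity);
    }

    /**
     * Keeps every sibling whose score is at least 20% of the best sibling's score
     * (an assumed threshold), in document order, one list entry per paragraph.
     */
    static List<String> extractParagraphs(List<Block> siblings) {
        double top = 0.0;
        for (Block b : siblings) {
            top = Math.max(top, score(b));
        }
        List<String> kept = new ArrayList<>();
        if (top <= 0.0) {
            return kept; // nothing that looks like article text
        }
        double threshold = top * 0.2;
        for (Block b : siblings) {
            if (score(b) >= threshold) {
                kept.add(b.text);
            }
        }
        return kept;
    }
}
```

With this sketch, a navigation block made up entirely of link text scores zero and is dropped, while a short caption next to the main paragraph is kept because its score clears the threshold relative to its best-scoring sibling.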