Amass
amass is a high-throughput, enterprise-grade web crawler that fetches specific URLs. It can be used to
fetch millions of URLs per hour. amass differs from other crawlers such as crawler4j in that it does not
crawl nested pages. Instead, it fetches only the URLs supplied to it. Thus, it amasses specific data from
the internet, hence the name amass.
Features
- Enterprise-grade: crawl millions of URLs without worry
- A priority-based queue for crawling urgent URLs faster
- Support for pre-crawl and post-crawl handlers
- Mechanism to prevent crawling via the pre-crawl handler
- Support for multiple submissions of a URL, each of which increases its priority
- Nano-time accuracy for ordering when priority is the same
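
The queueing behaviour in the list above can be sketched roughly as follows. This is a minimal illustration and not amass's actual API: the class `CrawlQueueSketch`, its methods `submit` and `next`, the injected clock, and the rule that each resubmission raises priority by exactly one are all assumptions made for the example. The clock is injected so the ordering can be tested deterministically; a real caller would likely pass `System::nanoTime`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.PriorityQueue;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Hypothetical sketch: names and behaviour are illustrative, not amass's API.
final class CrawlQueueSketch {

    // A queued URL: the submission count acts as priority, the timestamp breaks ties.
    private static final class Entry {
        final String url;
        final int priority;
        final long firstSeenNanos;

        Entry(String url, int priority, long firstSeenNanos) {
            this.url = url;
            this.priority = priority;
            this.firstSeenNanos = firstSeenNanos;
        }
    }

    // Higher priority first; for equal priority, the earlier submission first.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>((a, b) -> {
        int byPriority = Integer.compare(b.priority, a.priority);
        if (byPriority != 0) {
            return byPriority;
        }
        // Compare by difference, as recommended for System.nanoTime() values.
        return Long.signum(a.firstSeenNanos - b.firstSeenNanos);
    });

    private final Map<String, Entry> byUrl = new HashMap<>();
    private final LongSupplier clock;
    private final Predicate<String> preCrawl;

    // preCrawl returns false to veto a URL before it is crawled (assumed contract).
    CrawlQueueSketch(LongSupplier clock, Predicate<String> preCrawl) {
        this.clock = clock;
        this.preCrawl = preCrawl;
    }

    // Submitting a URL that is already queued raises its priority by one
    // and keeps its original timestamp.
    void submit(String url) {
        Entry old = byUrl.get(url);
        Entry updated;
        if (old == null) {
            updated = new Entry(url, 1, clock.getAsLong());
        } else {
            queue.remove(old); // O(n), acceptable for a sketch
            updated = new Entry(url, old.priority + 1, old.firstSeenNanos);
        }
        byUrl.put(url, updated);
        queue.add(updated);
    }

    // Returns the next URL that the pre-crawl handler allows, if any.
    Optional<String> next() {
        Entry e;
        while ((e = queue.poll()) != null) {
            byUrl.remove(e.url);
            if (preCrawl.test(e.url)) {
                return Optional.of(e.url);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch a URL submitted twice is returned ahead of URLs submitted once, URLs with equal priority come out in submission order, and a URL rejected by the pre-crawl predicate is dropped without being returned.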
License
The library is released under the terms of the Apache License, Version 2.0.