Hi everyone! I want to crawl all of gopherspace and archive it every so often, while respecting robots.txt and users who don't want
to be a part of the archive. What are some good resources on crawling parts of the internet other than the web? After searching a bit, every crawl tutorial
I see is about crawling the web, not other protocols.
Any info is appreciated, thank you!
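For the robots.txt part of the question, here is a minimal sketch of how a crawler might check a selector before fetching it. It uses Python's standard `urllib.robotparser` on text that has already been retrieved. The sample rules, the agent name `example-archiver`, and the selectors are all made up for illustration, and it assumes the server publishes its rules under a selector such as `robots.txt` and that selectors look like paths beginning with `/`, which Gopher does not guarantee.

```python
import urllib.robotparser


def build_robots_checker(robots_text):
    """Parse already-downloaded robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


# Hypothetical robots.txt content; a real crawler would fetch this
# from each server before crawling it.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private
"""

checker = build_robots_checker(SAMPLE_ROBOTS)

# "example-archiver" is an invented agent name; the selectors are invented too.
ALLOWED = checker.can_fetch("example-archiver", "/phlog/2020-12.txt")
BLOCKED = checker.can_fetch("example-archiver", "/private/notes.txt")
```

A crawler built this way would simply skip any selector for which `can_fetch` returns `False`.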
2020-12-16 at 06:26 -0800, Donnie Corbitt wrote:
> Hi everyone! I want to crawl all of gopherspace and archive it every
> so often, while respecting robots.txt and users who don't want to be
> a part of the archive. What are some good resources on crawling parts
> of the internet other than the web? After searching a bit, every
> crawl tutorial I see is about crawling the web, not other protocols.
> Any info is appreciated, thank you! ^^
Instead of reinventing the wheel, you could perhaps extend an
existing project.
OGUP is a simple engine that crawls the gophernet only to keep a list
of active servers (and possibly discover new ones). In the process
it collects the content of directories but throws it away. It would be
relatively easy to make it write the content of directories into some
database (relational or filesystem-based) for archiving purposes.
Then extend the archiving activities to text files. I'd gladly
provide you with pointers if you'd like to work on that.
OGUP is written in C89.
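To make the "write directory contents into a database" idea concrete, here is a rough sketch in Python. It is not OGUP code (OGUP is C89) and is not taken from it. It assumes menu lines in the RFC 1436 layout (a type character, then display string, selector, host, and port separated by tabs, with a lone `.` ending the menu). The function names, the `menu_items` table, and the `gopher.example.org` sample data are invented for illustration.

```python
import socket
import sqlite3


def fetch_selector(host, selector, port=70, timeout=10.0):
    """Send one Gopher request and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(selector.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)


def parse_menu(raw):
    """Turn a raw menu into (type, display, selector, host, port) tuples."""
    items = []
    for raw_line in raw.split(b"\n"):
        line = raw_line.rstrip(b"\r").decode("utf-8", errors="replace")
        if line == ".":  # a lone dot terminates the menu
            break
        fields = line[1:].split("\t")
        if not line or len(fields) < 4:
            continue  # skip blank or malformed lines
        try:
            port = int(fields[3])
        except ValueError:
            continue  # skip lines whose port is not a number
        items.append((line[0], fields[0], fields[1], fields[2], port))
    return items


def store_menu(db, src_host, src_selector, items):
    """Record every item of one directory listing in a SQLite table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS menu_items ("
        "src_host TEXT, src_selector TEXT, item_type TEXT, "
        "display TEXT, selector TEXT, host TEXT, port INTEGER)"
    )
    db.executemany(
        "INSERT INTO menu_items VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(src_host, src_selector) + item for item in items],
    )
    db.commit()


# Invented sample menu, so the sketch can be tried without network access.
SAMPLE_MENU = (
    b"iWelcome\tfake\t(NULL)\t0\r\n"
    b"1Phlog\t/phlog\tgopher.example.org\t70\r\n"
    b"0About\t/about.txt\tgopher.example.org\t70\r\n"
    b".\r\n"
)
```

In a real crawl, `parse_menu(fetch_selector(host, selector))` would replace the sample data, and items of type `1` would be queued as further directories to visit, with type `0` text files archived in a later step.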
2020-12-21 at 08:33 +0100, Mateusz Viste wrote:
> OGUP is written in C89.
gopher://gopher.viste.fr/1/ogup/
On 2020-12-21, Mateusz Viste <mateusz@xyz.invalid> wrote:
> OGUP is written in C89.
> gopher://gopher.viste.fr/1/ogup/
Oh no! I'm not part of the observable universe!
Perhaps some day gopher.brokenco.de will be seen ;)
You might want to search the archives and/or re-post on the gopher-project list:
https://lists.debian.org/gopher-project/
I believe there are already a few subscribers that are doing periodic Gopherspace crawls, mostly for keeping search engine DBs updated.
-M4