• Crawling through gopherspace?

    From Donnie Corbitt@21:1/5 to All on Wed Dec 16 06:26:33 2020
    Hi everyone! I want to crawl through all of gopherspace and archive it every so often, while also respecting robots.txt and users who don't want to be part of the archive. What are some good resources on crawling the internet beyond
    the web? After searching a bit, every crawl tutorial I find is about crawling the web, not other protocols.

    Any info is appreciated, thank you! ^^

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
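The Gopher wire protocol (RFC 1436) is simple enough that fetching a single item is only a few lines of code: open a TCP connection (port 70 by default), send the selector followed by CRLF, and read until the server closes the connection. A minimal sketch in Python follows; the function names are mine, not from any project in this thread.

```python
import socket

def build_request(selector):
    """A gopher request is just the selector followed by CRLF."""
    return selector.encode("ascii") + b"\r\n"

def gopher_fetch(host, selector="", port=70, timeout=10):
    """Fetch one gopher item: send the selector, then read until the
    server closes the connection (gopher has no framing beyond EOF)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(build_request(selector))
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

An empty selector requests the server's root menu, which is the natural starting point for a crawl.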
  • From met@ph.or@21:1/5 to All on Sat Dec 19 19:44:16 2020
    > Hi everyone! I want to crawl through all of gopherspace and archive it
    > every so often, while also respecting robots.txt and users who don't
    > want to be part of the archive. What are some good resources on crawling
    > the internet beyond the web? After searching a bit, every crawl tutorial
    > I find is about crawling the web, not other protocols.
    >
    > Any info is appreciated, thank you!

    You might want to search the archives and/or re-post on the gopher-project list:

    https://lists.debian.org/gopher-project/

    I believe there are already a few subscribers that are doing periodic Gopherspace crawls, mostly for keeping search engine DBs updated.

    -M4

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mateusz Viste@21:1/5 to Mateusz Viste on Mon Dec 21 08:34:13 2020
    2020-12-21 at 08:33 +0100, Mateusz Viste wrote:
    > 2020-12-16 at 06:26 -0800, Donnie Corbitt wrote:
    > > Hi everyone! I want to crawl through all of gopherspace and
    > > archive it every so often, while also respecting robots.txt and
    > > users who don't want to be part of the archive. What are some good
    > > resources on crawling the internet beyond the web? After searching
    > > a bit, every crawl tutorial I find is about crawling the web, not
    > > other protocols.
    > >
    > > Any info is appreciated, thank you! ^^
    >
    > Instead of reinventing the wheel, you could perhaps extend an
    > existing project.
    >
    > OGUP is a simple engine that crawls the gophernet, only to keep a list
    > of active servers (and possibly discover new servers). In the process
    > it collects the content of directories, but throws it away. It would
    > be relatively easy to make it write the content of directories into
    > some database (relational or filesystem-based) for archiving purposes.
    > Then, extend the archiving activities to text files. I'd gladly
    > provide you with pointers if you'd like to work on that.
    >
    > OGUP is written in C89.

    gopher://gopher.viste.fr/1/ogup/

    Mateusz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
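The directories ("menus") a crawler like OGUP walks use a tab-separated format: a one-character item type fused to a display string, then the selector, host, and port, with a lone "." terminating the listing. A crawler that wants to archive directory content or discover new servers could parse menus along these lines; this is a sketch of the format, not OGUP's actual code, and the example hostname is made up.

```python
def parse_menu(menu_bytes):
    """Parse a gopher menu into (type, display, selector, host, port)
    tuples, skipping the terminating "." line and malformed entries."""
    items = []
    for line in menu_bytes.decode("utf-8", errors="replace").splitlines():
        if not line or line == ".":
            continue
        itemtype, rest = line[0], line[1:]
        fields = rest.split("\t")
        if len(fields) < 4 or not fields[3].isdigit():
            continue  # malformed entry; well-formed menus carry 4 fields
        display, selector, host, port = fields[:4]
        items.append((itemtype, display, selector, host, int(port)))
    return items

def discover_servers(items):
    """Collect the distinct (host, port) pairs a menu links to --
    this is how a crawl of one server can discover new ones."""
    return {(host, port) for _, _, _, host, port in items}
```

Item type "1" marks a submenu to recurse into, "0" a text file; archiving "the content of directories" and then "extending to text files", as described above, maps onto following those two types.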
  • From Mateusz Viste@21:1/5 to Donnie Corbitt on Mon Dec 21 08:33:43 2020
    2020-12-16 at 06:26 -0800, Donnie Corbitt wrote:
    > Hi everyone! I want to crawl through all of gopherspace and
    > archive it every so often, while also respecting robots.txt and
    > users who don't want to be part of the archive. What are some good
    > resources on crawling the internet beyond the web? After searching
    > a bit, every crawl tutorial I find is about crawling the web, not
    > other protocols.
    >
    > Any info is appreciated, thank you! ^^

    Instead of reinventing the wheel, you could perhaps extend an
    existing project.

    OGUP is a simple engine that crawls the gophernet, only to keep a list
    of active servers (and possibly discover new servers). In the process
    it collects the content of directories, but throws it away. It would
    be relatively easy to make it write the content of directories into
    some database (relational or filesystem-based) for archiving purposes.
    Then, extend the archiving activities to text files. I'd gladly provide
    you with pointers if you'd like to work on that.

    OGUP is written in C89.


    Mateusz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From rtyler@21:1/5 to Mateusz Viste on Mon Dec 21 19:29:06 2020
    On 2020-12-21, Mateusz Viste <mateusz@xyz.invalid> wrote:
    > 2020-12-21 at 08:33 +0100, Mateusz Viste wrote:
    > > 2020-12-16 at 06:26 -0800, Donnie Corbitt wrote:
    > >
    > > OGUP is written in C89.
    >
    > gopher://gopher.viste.fr/1/ogup/

    Oh no! I'm not part of the observable universe!

    Perhaps some day gopher.brokenco.de will be seen ;)



    --
    GitHub: https://github.com/rtyler

    GPG Key ID: 0F2298A980EE31ACCA0A7825E5C92681BEF6CEA2

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dennis Boone@21:1/5 to All on Mon Dec 21 19:29:10 2020
    > Hi everyone! I want to crawl through all of gopherspace and
    > archive it every so often, while also respecting robots.txt and
    > users who don't want to be part of the archive. What are some good
    > resources on crawling the internet beyond the web? After searching a
    > bit, every crawl tutorial I find is about crawling the web, not
    > other protocols.

    https://en.wikipedia.org/wiki/Veronica_(search_engine)
    https://en.wikipedia.org/wiki/Jughead_(search_engine)

    De

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mateusz Viste@21:1/5 to rtyler on Tue Dec 22 08:19:50 2020
    2020-12-21 at 19:29 -0000, rtyler wrote:
    > On 2020-12-21, Mateusz Viste <mateusz@xyz.invalid> wrote:
    > > 2020-12-21 at 08:33 +0100, Mateusz Viste wrote:
    > > > 2020-12-16 at 06:26 -0800, Donnie Corbitt wrote:
    > > >
    > > > OGUP is written in C89.
    > >
    > > gopher://gopher.viste.fr/1/ogup/
    >
    > Oh no! I'm not part of the observable universe!
    > Perhaps some day gopher.brokenco.de will be seen ;)

    The crawler is temporarily offline due to... me not having much time to
    take care of it (and I need to find a server to run it on since the
    previous server died). I will add brokenco.de as soon as the crawler
    starts running again.

    It would be nice to have a "suggest a new server" feature some day as
    well. And robots.txt support. So many things to do, so little time...

    Mateusz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
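On the robots.txt support mentioned above: gopher servers conventionally expose their policy under the selector "robots.txt", using the same format as on the web, so Python's standard urllib.robotparser can evaluate it against selectors. A sketch, assuming the policy text has already been fetched; the user-agent name is made up.

```python
from urllib.robotparser import RobotFileParser

def selector_allowed(robots_txt, selector, agent="example-gopher-archiver"):
    """Check a gopher selector against a server's robots.txt text.
    The selector is treated as a URL path, which is all that the
    web-style Allow/Disallow rules actually match on."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, selector)
```

A crawler would fetch "robots.txt" once per server, then gate every subsequent selector through this check before requesting it.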
  • From Daniel@21:1/5 to met@ph.or on Mon Jan 18 02:50:51 2021
    met@ph.or writes:

    >> Hi everyone! I want to crawl through all of gopherspace and archive it
    >> every so often, while also respecting robots.txt and users who don't
    >> want to be part of the archive. What are some good resources on
    >> crawling the internet beyond the web? After searching a bit, every
    >> crawl tutorial I find is about crawling the web, not other protocols.
    >>
    >> Any info is appreciated, thank you!
    >
    > You might want to search the archives and/or re-post on the
    > gopher-project list:
    >
    > https://lists.debian.org/gopher-project/
    >
    > I believe there are already a few subscribers that are doing periodic
    > Gopherspace crawls, mostly for keeping search engine DBs updated.
    >
    > -M4

    There's also a Gopher Project IRC channel on Freenode. Here are the
    channels I join regularly:

    #gopher
    #gopherproject
    #gophernicus

    --
    Daniel
    Visit me at: gopher://gcpp.world

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)