• From JoyceUlysses.txt -- words occurring exactly once

    From HenHanna@21:1/5 to All on Thu May 30 13:09:39 2024
    XPost: comp.lang.scheme

    i'd not use Gauche for this, but maybe someone can change my mind.


    _______________________
    From JoyceUlysses.txt -- words occurring exactly once


    Given a text file of a novel (JoyceUlysses.txt) ...

    could someone give me a pretty fast (and simple) program that'd give me
    a list of all words occurring exactly once?

    -- Also, a list of words occurring once, twice or 3 times



    re: hyphenated words (you can treat it anyway you like)

    ideally, i'd treat [editor-in-chief]
    [go-ahead] [pen-knife]
    [know-how] [far-fetched] ...
    as one unit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jeff Barnett@21:1/5 to All on Thu May 30 16:33:30 2024
    XPost: comp.lang.scheme

    On 5/30/2024 2:09 PM, HenHanna wrote:
    >
    > i'd not use Gauche for this, but maybe someone can change my mind.
    >
    > _______________________
    >  From JoyceUlysses.txt -- words occurring exactly once
    >
    > Given a text file of a novel (JoyceUlysses.txt) ...
    >
    > could someone give me a pretty fast (and simple) program that'd give me
    > a list of all words occurring exactly once?
    >
    >     -- Also, a list of words occurring once, twice or 3 times
    >
    > re: hyphenated words (you can treat it anyway you like)
    >
    >     ideally, i'd treat [editor-in-chief]
    >     [go-ahead] [pen-knife]
    >     [know-how] [far-fetched] ...
    >     as one unit.

    Make a list (or array) of the individual words (as strings or symbols in
    a special package) of the original document, then sort the list using the
    Lisp-supplied sort function. You then write a loop using your favorite
    tools and look for interior sequences of the required length. This gives
    you a program that is asymptotically efficient, as the theoretical
    run-time will look something like (* c N (log N)), where N is the length
    of the list produced by the first step and c is some constant.

    Note, any solution resembling this one is not really what you want. For
    example, it would think "Snark" and "Snarks" are different words. Some
    differences such as capitalization can be suppressed by choosing a sort
    predicate that is case insensitive. You can, of course, write your own
    sort predicate. The thing to note is that the predicate (the <= operator
    used by sort) will not access the words or maintain state between
    invocations; otherwise, the complexity can become arbitrarily large.
    --
    Jeff Barnett
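Sorting the word list and then scanning it for runs of equal words can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `words_by_run_length`, the sample sentence, and the use of `str.casefold` as the case-insensitive key are not from the thread, and whitespace splitting stands in for real tokenizing.

```python
from itertools import groupby


def words_by_run_length(words, max_count=1):
    """Return case-folded words whose run in the sorted list is <= max_count long."""
    # Case-insensitive ordering, so "Snark" and "snark" end up adjacent.
    ordered = sorted(words, key=str.casefold)
    result = []
    # After sorting, equal words sit next to each other, so one linear
    # pass over the list is enough to measure every run.
    for key, run in groupby(ordered, key=str.casefold):
        if sum(1 for _ in run) <= max_count:
            result.append(key)
    return result


# Hypothetical sample text, split on whitespace only.
sample = "the Snark was a Boojum you see the snark".split()
print(words_by_run_length(sample))      # words occurring exactly once
print(words_by_run_length(sample, 3))   # words occurring one to three times
```

The sort dominates the cost at roughly N log N comparisons, and the scan afterwards is linear. As the post warns, "Snark" and "Snarks" would still count as different words here.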

  • From Stefan Monnier@21:1/5 to All on Thu May 30 18:45:00 2024
    XPost: comp.lang.scheme

    > Given a text file of a novel (JoyceUlysses.txt) ...
    > could someone give me a pretty fast (and simple) program that'd give me
    > a list of all words occurring exactly once?

    tr ' .;:,?!' '\n' | sort | uniq -u

    ?


    - Stefan

  • From Kaz Kylheku@21:1/5 to Stefan Monnier on Thu May 30 23:20:08 2024
    XPost: comp.lang.scheme

    On 2024-05-30, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    >> Given a text file of a novel (JoyceUlysses.txt) ...
    >> could someone give me a pretty fast (and simple) program that'd give me
    >> a list of all words occurring exactly once?
    >
    > tr ' .;:,?!' '\n' | sort | uniq -u

    Yep, that's pretty much how Doug McIlroy famously shut down Knuth.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Paul Rubin@21:1/5 to All on Fri May 31 00:40:59 2024
    XPost: comp.lang.scheme

    > could someone give me a pretty fast (and simple) program that'd give
    > me a list of all words occurring exactly once?

    To first approximation, this works for me (bash command):

    tr -c "[a-zA-Z-]" "\n" < ulysses.txt |sort|uniq -c|sort -n
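For readers without a Unix shell, roughly the same counting can be sketched in Python with `collections.Counter`. This is an approximation, not a translation of the pipeline: the names `word_counts` and `rare_words`, the sample string, and the final filtering step (the pipeline itself lists every word, sorted by count) are assumptions added for illustration.

```python
import re
from collections import Counter


def word_counts(text):
    """Count maximal runs of ASCII letters and hyphens."""
    # Letters and hyphens form tokens; every other character is a separator.
    return Counter(re.findall(r"[A-Za-z-]+", text))


def rare_words(text, max_count=3):
    """Return (word, count) pairs with count <= max_count, rarest first."""
    counts = word_counts(text)
    rare = [(w, n) for w, n in counts.items() if n <= max_count]
    # Sort by count, then alphabetically so the order is deterministic.
    return sorted(rare, key=lambda pair: (pair[1], pair[0]))


# Hypothetical sample text; a real run would read the novel from a file.
sample = "far-fetched tales, tales of a pen-knife; a tale, a tale, a tale a"
for word, n in rare_words(sample):
    print(n, word)
```

Passing `max_count=1` gives the words occurring exactly once. Case is not folded, so "The" and "the" are counted separately, just as in the pipeline.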

  • From B. Pym@21:1/5 to HenHanna on Fri May 31 10:13:50 2024
    XPost: comp.lang.scheme

    On 5/30/2024, HenHanna wrote:


    > i'd not use Gauche for this, but maybe someone can change my mind.
    >
    > _______________________
    > From JoyceUlysses.txt -- words occurring exactly once
    >
    > Given a text file of a novel (JoyceUlysses.txt) ...
    >
    > could someone give me a pretty fast (and simple) program that'd give me
    > a list of all words occurring exactly once?
    >
    >     -- Also, a list of words occurring once, twice or 3 times
    >
    > re: hyphenated words (you can treat it anyway you like)
    >
    >     ideally, i'd treat [editor-in-chief]
    >     [go-ahead] [pen-knife]
    >     [know-how] [far-fetched] ...
    >     as one unit.

    Gauche Scheme

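    (use file.util) ;; file->string
    (use srfi-13)   ;; string-tokenize
    (use srfi-14)   ;; character sets

    ;; Maps each upper-cased word to its number of occurrences.
    (define h (make-hash-table 'string=?))

    ;; A token is a maximal run of letters and hyphens.
    (dolist
        (s
         (string-tokenize (file->string "Alice.txt")
                          (char-set-adjoin char-set:letter #\-)))
      (hash-table-update! h
        ;; Upcase, then strip leading and trailing hyphens.
        (regexp-replace* (string-upcase s) #/^-+/ "" #/-+$/ "")
        (pa$ + 1) 0))

    ;; Keep the words that occur fewer than 3 times.
    (filter (lambda (kv) (< (cdr kv) 3))
            (hash-table->alist h))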

    ===>

    (("LASTED" . 2) ("WAY--NEVER" . 1) ("VISIT" . 1) ("CHANCED" . 1)
    ("WILDLY" . 2) ("BEHEAD" . 1) ("PROMISE" . 1) ("MEANWHILE" . 1)
    ("ENGAGED" . 1) ("KNIFE" . 2) ("ROARED" . 1) ("RETIRE" . 1)
    ("BLACKING" . 1) ("HATED" . 1) ("BRIGHT-EYED" . 1)
    ("SHEEP-BELLS" . 1) ("PROTECTION" . 1) ("CRIES" . 1) ("ADA" . 1)
    ("ENJOY" . 1) ("WRITHING" . 1) ("RAW" . 1) ("APPEALED" . 1)
    ("RELIEVED" . 1) ("CHILDHOOD" . 1) ("WEPT" . 1) ("RACE-COURSE" . 1)
    ("THEIRS" . 1) ("MAD--AT" . 1) ("SPOKEN" . 1) ("PENCILS" . 1)
    ("CLEAR" . 2) ("TREADING" . 2) ("RETURNED" . 2) ("CHERRY-TART" . 1)
    ("UNEASY" . 1) ("LOW-SPIRITED" . 1) ("BONE" . 1) ("PROMISED" . 1)
    ("HAPPENING" . 1) ("OYSTER" . 1) ("PATIENTLY" . 2) ("NEEDS" . 1)
    ("LESSON-BOOK" . 1) ("PITIED" . 1) ("UNCOMFORTABLY" . 1)
    ("ANTIPATHIES" . 1) ("PICTURED" . 1) ("DESPERATE" . 1)
    ("ENGRAVED" . 1)
    ...
    )
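Entries such as "WAY--NEVER" and "MAD--AT" in the output above are pairs of words joined by a double hyphen standing in for a dash. A Python sketch of the same hash-table count that also splits on runs of two or more hyphens might look like this. It is an illustration only: the names `tokenize` and `words_below` and the sample sentence are made up, and the ASCII-only character class is an assumption that is narrower than Gauche's `char-set:letter`.

```python
import re
from collections import Counter


def tokenize(text):
    """Yield upper-cased words, keeping single interior hyphens only."""
    for token in re.findall(r"[A-Za-z-]+", text):
        # Two or more hyphens in a row are treated as a dash between words.
        for piece in re.split(r"-{2,}", token):
            piece = piece.strip("-")  # drop leading/trailing hyphens
            if piece:
                yield piece.upper()


def words_below(text, limit=3):
    """Return {word: count} for words occurring fewer than `limit` times."""
    counts = Counter(tokenize(text))
    return {w: n for w, n in counts.items() if n < limit}


# Hypothetical sample sentence, not taken from Alice.txt.
sample = "No way--never! The bright-eyed cat went mad--at once, the cat."
print(words_below(sample))
```

With `limit=2` the function returns only the words occurring exactly once; raising the limit to 4 would also cover words occurring three times, which the filter `(< (cdr kv) 3)` leaves out.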

  • From Madhu@21:1/5 to All on Sat Jun 8 22:17:18 2024
    * Kaz Kylheku <20240530161942.627@kylheku.com> :
    Wrote on Thu, 30 May 2024 23:20:08 -0000 (UTC):

    > On 2024-05-30, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    >>> Given a text file of a novel (JoyceUlysses.txt) ...
    >>> could someone give me a pretty fast (and simple) program that'd give me
    >>> a list of all words occurring exactly once?
    >>
    >> tr ' .;:,?!' '\n' | sort | uniq -u
    >
    > Yep, that's pretty much how Doug McIlroy famously shut down Knuth.

    https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf

    (how do you cite this?)

    Knuth didn't invent the "hash trie" data structure for this article;
    it was already there in TeX. In the article, Knuth credits Frank Liang's
    PhD thesis for the data structure.

    This was one of the first things I coded up at the time of the
    article. The fun was in designing how best to modify the structure
    without sacrificing space.

    Phil Bagwell's paper "Ideal Hash Trees" described its invention
    correctly as Hash Array Mapped Tries. However, at some point (probably
    after the arrival of Clojure developers with "functional" pretensions?)
    the term "hash trie" was appropriated to mean something else,
    something "immutable" and all that.

    At least there isn't a wiki page for it.
