• archiving twitter

    From Eli the Bearded@21:1/5 to All on Sun Nov 20 04:09:27 2022
    This is not a super polished method (set of methods), but will likely
    help people out.

    You can download an archive of your own account easily with Twitter's
    own tools. People are reporting that it takes about 48 hours from
    request to completion.

    The completed archive is a ZIP file intended to work as a web page
    in a browser. I have not actually tried that, but I unzipped it and
    started to use the files inside.

    In the zip there's an assets/ directory with stuff to support the "as a
    web page" view, including, apparently, PNG files for every emoji.
    There's also a data/ directory that is personal to your account.

    Of note in the data directory:

    All your Tweets in JSON:
    data/tweets.js

    All the images & video for your tweets (includes retweets):
    data/tweets_media/

    All your Direct Messages in JSON:
    data/direct-messages.js

    All the images & video for your messages:
    data/direct_messages_media/

    List of accounts following you:
    data/follower.js

    List of accounts you follow:
    data/following.js

    List of tweets you have liked:
    data/like.js
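
    One quirk to know before poking at these: the .js files are JSON
    wrapped in a JavaScript assignment (apparently something like
    "window.YTD.tweets.part0 = [ ... ]" on the first line), so strip
    that prefix before handing them to JSON tools. A minimal sketch,
    assuming that wrapper and that each array element nests the tweet
    under a "tweet" key:

    # peek at the text of your tweets
    $ sed '1s/^window\.YTD\.tweets\.part0 = //' data/tweets.js |
        jq -r '.[].tweet.full_text' | head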

    Gotchas / warnings / limitations:

    1. There *does not seem* to be a list of your bookmarks.

    2. The archive does not contain the alt text you may have put on images.
    (Alt text was limited to 1500 characters instead of 280, so it was
    handy sometimes for squeezing more text into a tweet, even if
    partially hidden.)

    3. Images in the media folders might not be the largest size Twitter
    has for your account.

    4. Some JSON files have both Twitter short links (https://t.co/...) and
    expanded URLs, while some just have the short links.

    For point 2: there's an archiver tool here from people who do alt-text
    type stuff in general:

    https://archive.alt-text.org/
    https://github.com/alt-text-org/tweet-alt-archive

    For point 3: There's a tool here you can run to get full size images:

    https://github.com/timhutton/twitter-archive-parser

    For point 4: I've looped over mine with a simple shell script. Basically

    # GNU grep has -o to only include part of line that matches
    for link in $( grep -h -o 'https*://t.co/[a-zA-Z0-9]*' \
                        data/tweets.js data/like.js |
                   sort -u ); do
        printf "\n%s: " "$link"
        curl -w '%{redirect_url}' -o /dev/null -s "$link"
        sleep 5
    done > expanded-tco.links
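
    If you then want the expansions substituted back into the data, a
    crude sketch (assuming the "short: target" line format the loop
    above writes, and GNU sed for -i):

    $ cp data/tweets.js tweets-expanded.js
    $ while read -r line; do
        short=${line%%: *}
        long=${line#*: }
        # skip blank lines and links with no recorded redirect
        [ -n "$line" ] && [ "$long" != "$line" ] || continue
        # escape & and | so sed treats the target literally
        esc=$(printf '%s' "$long" | sed 's/[&|]/\\&/g')
        sed -i "s|$short|$esc|g" tweets-expanded.js
    done < expanded-tco.links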

    For point 1: I haven't found anything better than manual work yet to
    get the bookmarks.

    "Okay, GREAT!" you say, "But what about archiving stuff that is not in
    my account? Like what if I want to save my liked tweets with images and
    video? Or tweets I've posted to Usenet over the years? Or someone else's account's public tweets?"

    Here's a list of tools the data hoarders of Reddit have collected:
    https://www.reddit.com/r/DataHoarder/comments/yy7tig/backup_twitter_now_multiple_critical_infra_teams/

    Personally I like Social Network Scraper, snscrape, from that list.
    It's Python 3 and installable with pip:

    $ sudo apt-get install python3-pip # eg for Ubuntu
    $ pip3 install snscrape
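
    A quick check that pip put snscrape somewhere your shell can find it
    (the ~/.local/bin fallback is an assumption about pip's usual
    per-user install location):

    $ command -v snscrape || export PATH="$HOME/.local/bin:$PATH"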

    Take care that *where* pip installs it is on your $PATH, and then
    you are ready to go. The usage example for snscrape is a bit vague.
    I've found there are two useful modes: entire account and single tweet.

    $ account=NanoRaptor
    $ snscrape --jsonl twitter-user $account > $account.json

    Verify $account.json looks good (for some accounts I'm not getting
    much) then extract media URLs:

    $ jq -r '.media[] | .fullUrl' $account.json 2>/dev/null > image.links
    $ jq -r '.media[] | .variants[] | .url' \
        $account.json 2>/dev/null > video.links

    Use 2>/dev/null because you'll get a ton of "Cannot iterate over null"
    errors for tweets without images or video. The video.links will include
    a lot of alternatives for some tweets, and just a single one for
    others. I don't have a good way of picking "best" automatically.
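
    Alternatively, jq can be told to skip the nulls itself: the []? form
    iterates without erroring and // empty drops missing fields (plain
    jq features, nothing snscrape-specific):

    $ jq -r '.media[]? | .fullUrl // empty' $account.json > image.links
    $ jq -r '.media[]? | .variants[]? | .url' $account.json > video.links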

    For the single tweet mode, I've been using snscrape like this:

    # links.txt is a list of URLs one per line, like
    # https://twitter.com/Uriji1/status/1398430745035747336

    $ for id in $( rev links.txt | cut -f 1 -d / | rev ); do
        # you'll get a Traceback stackdump for deleted
        # links or deleted accounts
        snscrape --jsonl twitter-tweet "$id" > "$id.json"

        jq -r '.media[] | .fullUrl' "$id.json" >> image.links 2>/dev/null
        jq -r '.media[] | .variants[] | .url' "$id.json" \
            >> video.links 2>/dev/null
    done
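
    Assuming the Traceback means snscrape exits non-zero for those
    (which an unhandled Python exception normally does), you can skip
    the jq steps for dead links by changing the snscrape line to:

    snscrape --jsonl twitter-tweet "$id" > "$id.json" || continue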

    Download the images. The links look like:

    # source tweet:
    # https://twitter.com/Uriji1/status/1398430745035747336
    https://pbs.twimg.com/media/E2g5AncXEAQdqcP?format=jpg&name=large
    https://pbs.twimg.com/media/E2g5C_3WUAEBipg?format=jpg&name=large
    https://pbs.twimg.com/media/E2g5EMbWYAQE6ha?format=jpg&name=large

    This finds a suffix and isolates the ID of the file.

    $ for line in $( sort -u image.links ); do
        case "$line" in
            *format=jpg*) suf=jpg ;;
            *format=png*) suf=png ;;
            *) suf=other ;;
        esac

        burl=${line%?format=*}  # ${variable%GLOB} remove from end
        id=${burl#*/media/}     # ${variable#GLOB} remove from start

        curl -o "$id.$suf" "$line"
    done
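
    Related to point 3 above: pbs.twimg.com reportedly also serves an
    original-size rendition if you ask for name=orig instead of
    name=large (an assumption based on the URL shape, but worth trying).
    Swap the curl line in the loop for:

    # only sensible for the jpg/png cases above
    curl -o "$id.$suf" "$burl?format=$suf&name=orig"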

    Download the simple case videos.

    # source tweet:
    # https://twitter.com/silentmoviegifs/status/1517383816884727809
    https://video.twimg.com/tweet_video/FQ7UL8wXwAACEGL.mp4

    $ for line in $( grep /tweet_video/ video.links | sort -u ); do
        curl -O "$line"
    done

    The hard cases look like this, with multiple variants for the same
    video:

    # source tweet:
    # https://twitter.com/AppleIIBot/status/1588678248023871489

    https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/364x270/jid0Xz9s7x4J79mH.mp4?tag=12
    https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/850x630/yYoWWRnA3mXgW0oo.mp4?tag=12
    https://video.twimg.com/ext_tw_video/1588678211571154944/pu/pl/ttqk_5h8PGDIB0IW.m3u8?tag=12&container=fmp4
    https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/484x360/4jU3htO-XaNrR9y7.mp4?tag=12

    I haven't started to deal with those yet. I suspect the /vid/WWWxHHH/
    format will be the easiest to deal with: select the largest width by
    height for a given /ext_tw_video/IDNUMBER/.
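
    A sketch of that idea, assuming the /vid/WWWxHHH/ layout above holds
    in general: pick, for each /ext_tw_video/IDNUMBER/, the variant with
    the largest pixel area (the m3u8 playlists carry no dimensions and
    fall out naturally):

    $ grep '/ext_tw_video/' video.links | sort -u |
      awk -F/ '{
          area = 0
          for (i = 1; i < NF; i++) {
              if ($i == "ext_tw_video") id = $(i + 1)
              # dimensions show up as a WWWxHHH path component
              if ($i ~ /^[0-9]+x[0-9]+$/) {
                  split($i, d, "x")
                  area = d[1] * d[2]
              }
          }
          if (id != "" && area > best[id]) { best[id] = area; url[id] = $0 }
      }
      END { for (id in url) print url[id] }' > best-video.links

    Then best-video.links can go through the same curl -O loop as the
    simple case (you may want to trim the ?tag=... query off the saved
    filenames).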

    Happy archiving, and share tips you may have found.

    Elijah
    ------
    has 1.5G in ~/twitter/ so far

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Spiros Bousbouras@21:1/5 to Eli the Bearded on Sun Nov 20 05:47:53 2022
    On Sun, 20 Nov 2022 04:09:27 -0000 (UTC)
    Eli the Bearded <*@eli.users.panix.com> wrote:
    > This is not a super polished method (set of methods), but will
    > likely help people out.
    >
    > You can download an archive of your own account easily with
    > Twitter's own tools. People are reporting that it takes about 48
    > hours from request to completion.

    You mean it takes 48 hours regardless of how much one has on their
    account? Whether one has 1 tweet or thousands of them?

    > The completed archive is a ZIP file intended to work as a web page
    > in a browser. I have not actually tried that, but I unzipped it
    > and started to use the files inside.

    Is there a way for someone to progressively get newer stuff, or is
    the only way for someone who wants a complete personal copy to
    redownload everything from scratch every now and again?

    --
    vlaho.ninja/prog

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Eli the Bearded@21:1/5 to spibou@gmail.com on Sun Nov 20 08:54:37 2022
    In comp.misc, Spiros Bousbouras <spibou@gmail.com> wrote:
    > You mean it takes 48 hours regardless of how much one has on their
    > account? Whether one has 1 tweet or thousands of them?

    Apparently. Not sure if that's deliberate design, or just a long
    queue with few resources devoted to it.

    > Is there a way for someone to progressively get newer stuff, or is
    > the only way for someone who wants a complete personal copy to
    > redownload everything from scratch every now and again?

    No progressive updates. And I think one request for full backup per
    week.

    I strongly believe Twitter is on borrowed time and will be collapsing
    soon. Musk is bringing Trump back, the World Cup is coming, and legal
    challenges will turn nasty soon (automatic copyright violation
    moderation has stopped working, so DMCA requests will surge).

    I don't believe it will make it to 2023, and getting to December is
    iffy.

    Weekly updates are a minor concern.

    Elijah
    ------
    unsure how well Twitter will even do with end-of-year tax filings for ex-staff

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Eli the Bearded on Mon Nov 21 06:58:46 2022
    Eli the Bearded <*@eli.users.panix.com> wrote:

    > I strongly believe Twitter is on borrowed time and will be
    > collapsing soon. Musk is bringing Trump back, the World Cup is
    > coming, and legal challenges will turn nasty soon (automatic
    > copyright violation moderation has stopped working, so DMCA
    > requests will surge).
    >
    > I don't believe it will make it to 2023, and getting to December
    > is iffy.

    From what I hear it sounds like he's making Twitter somewhat more
    similar to Usenet so far as policies go, so in principle I can't
    object. But they alienated me years ago when their webpages stopped
    displaying in Dillo anyway (not that I ever viewed it often).

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)