This is not a super polished method (or set of methods), but it will
likely help people out.
You can download an archive of your own account easily with Twitter's
own tools. People are reporting that it takes about 48 hours from
request to completion.
The completed archive is a ZIP file intended to work as a web page in a
browser. I have not actually tried that; I just unzipped it and started
using the files inside.
In the zip there's an assets/ directory with stuff to support the "as a
web page" view, including, apparently, PNG files for every emoji.
There's also a data/ directory that is personal to your account.
Of note in the data directory:
All your Tweets in JSON:
data/tweets.js
All the images & video for your tweets (includes retweets):
data/tweets_media/
All your Direct Messages in JSON:
data/direct-messages.js
All the images & video for your messages:
data/direct_messages_media/
List of accounts following you:
data/follower.js
List of accounts you follow:
data/following.js
List of tweets you have liked:
data/like.js
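
One wrinkle with these: the data/*.js files are not bare JSON; each one
opens with a small JavaScript assignment (something like
"window.YTD.tweets.part0 = ") before the JSON array. If you want to feed
one to jq or another JSON tool, strip that prefix first. A rough sed
sketch (the exact variable name is an assumption; check the first line
of your own file):

$ sed 's/^window\.YTD\.[A-Za-z0-9_]*\.part0 = //' data/tweets.js > tweets.json
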
Gotchas / warnings / limitations:
1. There *does not seem* to be a list of your bookmarks.
2. The archive does not contain the alt text you may have put on images.
(Alt text was limited to 1500 characters instead of 280, so it was
handy sometimes for squeezing more text into a tweet, even if
partially hidden.)
3. Images in the media folders might not be the largest size twitter has
for your account.
4. Some JSON files have both Twitter short links (https://t.co/...) and
expanded URLs, while some just have the short links.
For point 2: there's an archiver tool here from people who do alt-text
type stuff in general:
https://archive.alt-text.org/
https://github.com/alt-text-org/tweet-alt-archive
For point 3: There's a tool here you can run to get full size images:
https://github.com/timhutton/twitter-archive-parser
For point 4: I've looped over mine with a simple shell script, basically:
# GNU grep has -o to only include the part of the line that matches
for link in $( grep -h -o 'https*://t.co/[a-zA-Z0-9]*' \
                    data/tweets.js data/like.js |
               sort -u ); do
    printf "\n%s: " "$link"
    curl -w '%{redirect_url}' -o /dev/null -s "$link"
    sleep 5
done > expanded-tco.links
For point 1: I haven't found anything short of manual work to get the bookmarks yet.
"Okay, GREAT!" you say, "But what about archiving stuff that is not in
my account? Like what if I want to save my liked tweets with images and
video? Or tweets I've posted to Usenet over the years? Or someone else's account's public tweets?"
Here's a list of tools the data hoarders of Reddit have collected:
https://www.reddit.com/r/DataHoarder/comments/yy7tig/backup_twitter_now_multiple_critical_infra_teams/
Personally I like Social Network Scraper, snscrape, from that list. It's
Python 3 and installable with pip:
$ sudo apt-get install python3-pip # eg for Ubuntu
$ pip3 install snscrape
Take care that *where* pip installs it ends up on your $PATH (see the
quick check below), and then you are ready to go. The usage example for
snscrape is a bit vague; I've found there are two useful modes: entire
account and single tweet.
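
If the snscrape command isn't found after the install, it is probably a
PATH problem. A minimal check, assuming a pip "user" install on Linux
(those usually land in ~/.local/bin; adjust for your setup):

$ command -v snscrape || echo "snscrape is not on PATH"
$ export PATH="$HOME/.local/bin:$PATH"   # common pip user-install location

First the entire-account mode:
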
$ account=NanoRaptor
$ snscrape --jsonl twitter-user $account > $account.json
Verify $account.json looks good; for some accounts I'm not getting much
back.
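A quick sanity check (the .date and .url field names here are
assumptions about snscrape's JSONL output; compare them with a line of
your own file):

$ wc -l $account.json    # one JSON object per line, roughly one per tweet
$ jq -r '.date + " " + .url' $account.json 2>/dev/null | head

Then extract the media URLs: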
$ jq -r '.media[] | .fullUrl' $account.json 2>/dev/null > image.links
$ jq -r '.media[] | .variants[] | .url' \
      $account.json 2>/dev/null > video.links
Use 2>/dev/null because you'll get a ton of "Cannot iterate over null"
errors for tweets without images or video. The video.links will include
a lot of alternatives for some tweets, and just a single one for
others. I don't have a good way of picking "best" automatically.
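
One idea I haven't vetted for picking a single variant: if the variant
objects in the JSON carry a bitrate field (an assumption -- check a
record or two first), a jq filter along these lines should keep only the
heaviest mp4 per tweet:

$ jq -r '.media[]?
         | select(.variants)
         | [ .variants[] | select(.bitrate) ]
         | max_by(.bitrate) | select(.) | .url' \
      $account.json 2>/dev/null > best-video.links

The select(.bitrate) step also drops the .m3u8 playlist entries, which
usually carry no bitrate.
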
For the single tweet mode, I've been using snscrape like this:
# links.txt is a list of URLs, one per line, like
# https://twitter.com/Uriji1/status/1398430745035747336
$ for id in $( rev links.txt | cut -f 1 -d / | rev ) ; do
      # you'll get a Traceback stackdump for deleted
      # links or deleted accounts
      snscrape --jsonl twitter-tweet $id > $id.json
      jq -r '.media[] | .fullUrl' $id.json >> image.links 2>/dev/null
      jq -r '.media[] | .variants[] | .url' $id.json >> video.links 2>/dev/null
  done
Download the images. The links look like:
# source tweet:
# https://twitter.com/Uriji1/status/1398430745035747336
https://pbs.twimg.com/media/E2g5AncXEAQdqcP?format=jpg&name=large
https://pbs.twimg.com/media/E2g5C_3WUAEBipg?format=jpg&name=large
https://pbs.twimg.com/media/E2g5EMbWYAQE6ha?format=jpg&name=large
This picks a file suffix and isolates the ID of the file:
$ for line in $( sort -u image.links ) ; do
      case "$line" in
          *format=jpg*) suf=jpg ;;
          *format=png*) suf=png ;;
          *)            suf=other ;;
      esac
      burl=${line%?format=*}   # ${variable%GLOB} removes GLOB from the end
      id=${burl#*/media/}      # ${variable#GLOB} removes GLOB from the start
      curl -o "$id.$suf" "$line"
  done
Download the simple-case videos. Their links look like this:
# source tweet:
# https://twitter.com/silentmoviegifs/status/1517383816884727809
https://video.twimg.com/tweet_video/FQ7UL8wXwAACEGL.mp4
$ for line in $( grep /tweet_video/ video.links | sort -u ) ; do
      curl -O "$line"
  done
The hard-case ones look like this, with multiple variants of the same video:
# source tweet:
# https://twitter.com/AppleIIBot/status/1588678248023871489
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/364x270/jid0Xz9s7x4J79mH.mp4?tag=12
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/850x630/yYoWWRnA3mXgW0oo.mp4?tag=12
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/pl/ttqk_5h8PGDIB0IW.m3u8?tag=12&container=fmp4
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/484x360/4jU3htO-XaNrR9y7.mp4?tag=12
I haven't started to deal with those yet. I suspect the /vid/WWWxHHH/
form will be the easiest to deal with: select the largest width by
height for a given /ext_tw_video/IDNUMBER/.
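
An untested sketch of that idea, assuming every hard-case URL keeps the
/vid/WWWxHHH/ layout shown above (the .m3u8 playlist lines never match
the grep and simply get skipped):

$ for vid in $( grep -o 'ext_tw_video/[0-9]*' video.links | sort -u ) ; do
      # keep only the /vid/WxH/ variants for this ID, biggest area last
      best=$( grep "$vid/pu/vid/" video.links | sort -u |
              awk -F/ '{ split($(NF-1), wh, "x"); print wh[1]*wh[2], $0 }' |
              sort -n | tail -1 | cut -d ' ' -f 2- )
      [ -z "$best" ] && continue       # no /vid/WxH/ variant at all
      name=${best##*/}                 # drop the directories
      name=${name%%\?*}                # drop the ?tag=... query string
      curl -o "$name" "$best"
      sleep 5
  done
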
Happy archiving, and share tips you may have found.
Elijah
------
has 1.5G in ~/twitter/ so far