This is not a super polished method (or set of methods), but it will
likely help people out.
You can download an archive of your own account easily with Twitter's
own tools. People are reporting that it takes about 48 hours from
request to completion.
The completed archive is a ZIP file intended to work as a web page in a
browser. I have not actually tried that; I just unzipped it and started
using the files inside.
In the zip there's an assets/ directory with stuff to support the "as a
web page" view, including, apparently, PNG files for every emoji.
There's also a data/ directory that is personal to your account.
Of note in the data directory:
All your Tweets in JSON:
data/tweets.js
All the images & video for your tweets (includes retweets):
data/tweets_media/
All your Direct Messages in JSON:
data/direct-messages.js
All the images & video for your messages:
data/direct_messages_media/
List of accounts following you:
data/follower.js
List of accounts you follow:
data/following.js
List of tweets you have liked:
data/like.js
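
One wrinkle with these: the data/*.js files are not bare JSON; each one
opens with a small JavaScript assignment (something like
"window.YTD.tweets.part0 = ") before the JSON array. If you want to feed
one to jq or another JSON tool, strip that prefix first. A rough sed
sketch (the exact variable name is an assumption; check the first line
of your own file):

$ sed 's/^window\.YTD\.[A-Za-z0-9_]*\.part0 = //' data/tweets.js > tweets.json
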
Gotchas / warnings / limitations:
1. There *does not seem* to be a list of your bookmarks.
2. The archive does not contain the alt text you may have put on images.
(Alt text was limited to 1500 characters instead of 280, so it was
handy sometimes for squeezing more text into a tweet, even if
partially hidden.)
3. Images in the media folders might not be the largest size twitter has
for your account.
4. Some JSON files have both Twitter short links (https://t.co/...) and
expanded URLs, while some just have the short links.
For point 2: there's an archiver tool here from people who do alt-text
type stuff in general:
https://archive.alt-text.org/
https://github.com/alt-text-org/tweet-alt-archive
For point 3: There's a tool here you can run to get full size images:
https://github.com/timhutton/twitter-archive-parser
For point 4: I've looped over mine with a simple shell script, basically:
# GNU grep has -o to only include the part of the line that matches
for link in $( grep -h -o 'https*://t.co/[a-zA-Z0-9]*' \
                    data/tweets.js data/like.js |
               sort -u ); do
    printf "\n%s: " "$link"
    curl -w '%{redirect_url}' -o /dev/null -s "$link"
    sleep 5
done > expanded-tco.links
For point 1: I haven't found anything short of manual work to get the bookmarks yet.
"Okay, GREAT!" you say, "But what about archiving stuff that is not in
my account? Like what if I want to save my liked tweets with images and
video? Or tweets I've posted to Usenet over the years? Or someone else's account's public tweets?"
Here's a list of tools the data hoarders of Reddit have collected:
https://www.reddit.com/r/DataHoarder/comments/yy7tig/backup_twitter_now_multiple_critical_infra_teams/
Personally I like Social Network Scraper, snscrape, from that list. It's
Python 3 and installable with pip:
$ sudo apt-get install python3-pip # eg for Ubuntu
$ pip3 install snscrape
Take care that *where* pip installs it ends up on your $PATH (see the
quick check below), and then you are ready to go. The usage example for
snscrape is a bit vague; I've found there are two useful modes: entire
account and single tweet.
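
If the snscrape command isn't found after the install, it is probably a
PATH problem. A minimal check, assuming a pip "user" install on Linux
(those usually land in ~/.local/bin; adjust for your setup):

$ command -v snscrape || echo "snscrape is not on PATH"
$ export PATH="$HOME/.local/bin:$PATH"   # common pip user-install location

First the entire-account mode:
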
$ account=NanoRaptor
$ snscrape --jsonl twitter-user $account > $account.json
Verify $account.json looks good; for some accounts I'm not getting much
back.
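A quick sanity check (the .date and .url field names here are
assumptions about snscrape's JSONL output; compare them with a line of
your own file):

$ wc -l $account.json    # one JSON object per line, roughly one per tweet
$ jq -r '.date + " " + .url' $account.json 2>/dev/null | head

Then extract the media URLs: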
$ jq -r '.media[] | .fullUrl' $account.json 2>/dev/null > image.links
$ jq -r '.media[] | .variants[] | .url' \
      $account.json 2>/dev/null > video.links
Use 2>/dev/null because you'll get a ton of "Cannot iterate over null"
errors for tweets without images or video. The video.links will include
a lot of alternatives for some tweets, and just a single one for
others. I don't have a good way of picking "best" automatically.
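
One idea I haven't vetted for picking a single variant: if the variant
objects in the JSON carry a bitrate field (an assumption -- check a
record or two first), a jq filter along these lines should keep only the
heaviest mp4 per tweet:

$ jq -r '.media[]?
         | select(.variants)
         | [ .variants[] | select(.bitrate) ]
         | max_by(.bitrate) | select(.) | .url' \
      $account.json 2>/dev/null > best-video.links

The select(.bitrate) step also drops the .m3u8 playlist entries, which
usually carry no bitrate.
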
For the single tweet mode, I've been using snscrape like this:
# links.txt is a list of URLs, one per line, like
# https://twitter.com/Uriji1/status/1398430745035747336
$ for id in $( rev links.txt | cut -f 1 -d / | rev ) ; do
      # you'll get a Traceback stackdump for deleted
      # links or deleted accounts
      snscrape --jsonl twitter-tweet $id > $id.json
      jq -r '.media[] | .fullUrl' $id.json >> image.links 2>/dev/null
      jq -r '.media[] | .variants[] | .url' $id.json >> video.links 2>/dev/null
  done
Download the images. The links look like:
# source tweet:
# https://twitter.com/Uriji1/status/1398430745035747336
https://pbs.twimg.com/media/E2g5AncXEAQdqcP?format=jpg&name=large
https://pbs.twimg.com/media/E2g5C_3WUAEBipg?format=jpg&name=large
https://pbs.twimg.com/media/E2g5EMbWYAQE6ha?format=jpg&name=large
This picks a file suffix and isolates the ID of the file:
$ for line in $( sort -u image.links ) ; do
      case "$line" in
          *format=jpg*) suf=jpg ;;
          *format=png*) suf=png ;;
          *)            suf=other ;;
      esac
      burl=${line%?format=*}   # ${variable%GLOB} removes GLOB from the end
      id=${burl#*/media/}      # ${variable#GLOB} removes GLOB from the start
      curl -o "$id.$suf" "$line"
  done
Download the simple-case videos. Their links look like this:
# source tweet:
# https://twitter.com/silentmoviegifs/status/1517383816884727809
https://video.twimg.com/tweet_video/FQ7UL8wXwAACEGL.mp4
$ for line in $( grep /tweet_video/ video.links | sort -u ) ; do
      curl -O "$line"
  done
The hard-case ones look like this, with multiple variants of the same video:
# source tweet:
# https://twitter.com/AppleIIBot/status/1588678248023871489
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/364x270/jid0Xz9s7x4J79mH.mp4?tag=12
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/850x630/yYoWWRnA3mXgW0oo.mp4?tag=12
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/pl/ttqk_5h8PGDIB0IW.m3u8?tag=12&container=fmp4
https://video.twimg.com/ext_tw_video/1588678211571154944/pu/vid/484x360/4jU3htO-XaNrR9y7.mp4?tag=12
I haven't started to deal with those yet. I suspect the /vid/WWWxHHH/
form will be the easiest to deal with: select the largest width by
height for a given /ext_tw_video/IDNUMBER/.
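
An untested sketch of that idea, assuming every hard-case URL keeps the
/vid/WWWxHHH/ layout shown above (the .m3u8 playlist lines never match
the grep and simply get skipped):

$ for vid in $( grep -o 'ext_tw_video/[0-9]*' video.links | sort -u ) ; do
      # keep only the /vid/WxH/ variants for this ID, biggest area last
      best=$( grep "$vid/pu/vid/" video.links | sort -u |
              awk -F/ '{ split($(NF-1), wh, "x"); print wh[1]*wh[2], $0 }' |
              sort -n | tail -1 | cut -d ' ' -f 2- )
      [ -z "$best" ] && continue       # no /vid/WxH/ variant at all
      name=${best##*/}                 # drop the directories
      name=${name%%\?*}                # drop the ?tag=... query string
      curl -o "$name" "$best"
      sleep 5
  done
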
Happy archiving, and share tips you may have found.
Elijah
------
has 1.5G in ~/twitter/ so far