I saw an idea on Reddit for a nerdy gift for your significant other: a word-cloud of the conversations you have had. That seemed like a quick and easy idea for an anniversary card…
Now that I’m done, I can explain what a massive pain this turned out to be.
Creating a Word-cloud
Actually generating a word-cloud from some text was easy. Using an open source Python implementation, you read a text file, select an image to constrain the word-cloud to, and add stop words to be ignored.
The default shape is a rectangle, but any black silhouette can be used in its place. This is also the easiest way to set the output resolution of the word-cloud, which matches the silhouette image by default.
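As a rough sketch of that step, the snippet below assumes the third-party `wordcloud` package together with `numpy` and Pillow; the post does not say which implementation was used, so treat the library choice as an assumption. The function names and the file paths passed to them are hypothetical.

```python
from pathlib import Path


def load_stop_words(extra_words):
    """Lower-case the extra stop words and drop blank entries."""
    return {w.strip().lower() for w in extra_words if w.strip()}


def build_cloud(text_path, mask_path, out_path, extra_stop_words=()):
    """Render a word-cloud constrained to a silhouette image."""
    # Third-party imports are local so load_stop_words works without them.
    import numpy as np
    from PIL import Image
    from wordcloud import STOPWORDS, WordCloud

    text = Path(text_path).read_text(encoding="utf-8")
    # Black silhouette on a white background: pure white pixels are masked out.
    mask = np.array(Image.open(mask_path).convert("L"))
    stop_words = set(STOPWORDS) | load_stop_words(extra_stop_words)
    cloud = WordCloud(background_color="white", mask=mask, stopwords=stop_words)
    cloud.generate(text)
    # With a mask set, the output takes the mask's dimensions.
    cloud.to_file(out_path)
```

A call might look like `build_cloud("chat.txt", "heart.png", "cloud.png", ["https", "com"])`, where all three file names are placeholders.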
To extract chat messages from Facebook Messenger, use the data download service they provide. You can choose to download only chat messages, which you should request as JSON, not HTML. Once you have the zip archive, extract the chat you want, and parse out the content fields. Dump everything into a text file.
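A minimal sketch of that parsing step follows. It assumes each chat file is a JSON object with a `messages` list whose entries carry the text under a `content` key; check this against your own download, as the layout may differ. The function name is hypothetical.

```python
import json


def extract_contents(export):
    """Return the text of every message that has a string 'content' field."""
    return [
        message["content"]
        for message in export.get("messages", [])
        if isinstance(message.get("content"), str)
    ]


def dump_chat(json_path, text_path):
    """Read one exported chat file and write its messages, one per line."""
    with open(json_path, encoding="utf-8") as source:
        export = json.load(source)
    with open(text_path, "w", encoding="utf-8") as target:
        target.write("\n".join(extract_contents(export)))
```

Messages without text, such as photos or stickers, are skipped because they have no `content` entry under this assumed layout.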
In a WhatsApp chat, tap the ⋮ symbol, select More, then Export chat. If you select “without media” then you get the chat as a text file, but won’t get the captions from media content. Exporting with media never seemed to work for me, possibly due to the large number of images in the chat.
It’s fairly easy to strip the date and name prefix from each message with a regex find and replace, leaving you with a clean text file.
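For illustration, here is one way that clean-up could look. The pattern assumes lines shaped like `12/31/21, 10:15 PM - Name: message`; the export format varies with phone and locale, so the regex will likely need adjusting. The `<Media omitted>` placeholder is likewise an assumption about what the export writes for skipped media.

```python
import re

# Assumed prefix: date, time (optional AM/PM), " - ", sender name, ": ".
PREFIX = re.compile(
    r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[AP]M)? - [^:]+: "
)


def strip_prefixes(lines):
    """Remove the date and name prefix from each line of a chat export."""
    cleaned = []
    for line in lines:
        text = PREFIX.sub("", line.rstrip("\n"))
        # Drop empty lines and the placeholder left for skipped media.
        if text and text != "<Media omitted>":
            cleaned.append(text)
    return cleaned
```

Continuation lines of multi-line messages carry no prefix, so they pass through unchanged.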
Here is where things get tricky. It is not straightforward to get your messages out of Signal. Even though you can decrypt the backups, I found that chat IDs were missing, making it impossible to separate out the messages I wanted.
Here are the steps I eventually came up with:
- Back up Signal (make sure to write down the 30-digit key)
- On a different phone, create a new account (you will need another phone number for this)
- Restore the Signal backup on the new phone
- Delete all chats which you don’t want on the new phone
- Back up Signal again on the new phone (record the key again)
- Copy the encrypted database to a computer
- Use Signal-back to decrypt the backup as a CSV
- Parse the messages out of the CSV
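The last step above can be sketched with the standard `csv` module. The column name `BODY` is an assumption about the decrypted CSV's header, so check the first row of your file and pass the right name; the function name is hypothetical.

```python
import csv


def message_bodies(csv_lines, body_column="BODY"):
    """Return the non-empty message texts from a decrypted CSV export.

    csv_lines can be an open file or any iterable of lines.
    body_column is an assumed header name; adjust it to match your file.
    """
    reader = csv.DictReader(csv_lines)
    return [row[body_column] for row in reader if row.get(body_column)]
```

When reading from disk, open the file with `newline=""` so that messages containing line breaks are parsed correctly.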
Finally, you will probably see some strange results in the final output. Symbols are removed, so one of our top words was “https”, taken from links we had shared.
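One way to avoid this is to strip links from the text before generating the word-cloud. The sketch below is an illustration, not what was done for the card; adding “https” to the stop words would work too.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL = re.compile(r"https?://\S+")


def remove_urls(text):
    """Delete links so fragments such as 'https' are not counted as words."""
    return URL.sub("", text)
```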
Here is an example word-cloud taken from all the previous posts on this site.