Here’s how to get into a lot of trouble:
Suppose you (like me) love the intellectual wealth found in
free-form text on the Internet.
And (like me) are a reasonably competent programmer.
And (like me) have derived value and pleasure searching Twitter.
And (like me) you look at this nifty new Fediverse thing and see that it has nice Web APIs so you could build an app to vacuum up all the stories and laments and cheers and dunks and love letters and index ’em and let everyone search ’em and find wonderful things!
So you lurch into the Mastodon conversation, all excited, and blurt out “Hey folks, I’m gonna index all this stuff and let the world in!”
That’s when you get your face torn off.
Contents · This (too long, sorry) essay does the following:
Surveys the current opposition to Fediverse search.
Tl;dr: Privacy!
Describes the push-back from experienced Web-heads.
Tl;dr: Huh?
Outlines Mastodon’s current search capabilities.
Tl;dr: Not terrible.
Describes my position.
Tl;dr: It’s unethical to ignore privacy concerns.
Criticizes Mastodon’s current privacy capabilities.
Tl;dr: Pretty terrible.
Argues that this is a social/legal problem, not a technology problem.
Offers specific policy and legal recommendations to improve the Fediverse privacy posture.
Paints a picture of what success looks like.
Tl;dr: Good privacy, useful search.
Anti-search · It’s like this: When you post to your blog or your public Twitter account, your words and pictures instantly join your eternal public record, available to everyone who loves or hates you or doesn’t care. Who can build search engines, not to mention ML models and adTech systems and really anything else, to help the world track and follow and analyze and sell things to you.
And, if you’re vulnerable, attack you, shame you, doxx you, SWAT you, try to kill you.
The people who built Mastodon, and the ones operating large parts of it, do not want that to happen again. Full-text search (with limited exceptions) has, as a matter of choice, been left out of the software. Why? Let me give the stage to:
Lead Mastodon author Eugen Rochko: Cage the Mastodon, in particular the section entitled “Design decisions”.
@[email protected]: Hacky folks, please resist finding ways to scrape the fediverse….
The good-bye message from “Fedisearch”, a few folks who ventured forth and retreated hastily:
Due to extreme backlash from the Mastodon community we decided to end the project, it is obviously not wanted by server admins.
While our intention was to provide the end-user with a global search to find information and friends, the concerns of its usage by trolls has been far greater amongst the community.
A couple of GitHub issues, Controlling availability to search and Socially Acceptable Search, both of which have insightful discussion (particularly this note from “nightpool”).
So, should you sally forth as related in the first paragraph above, people will say nasty things to you and tell you to please stop working on your project. Should you proceed anyhow, they will take strong measures to block you and put any instance you seem to represent at risk of de-federation. They’re serious.
I’m not speaking hypothetically. In the dying days of 2022 I watched in real time as this eager young fellow bounced onto the stage and said he had this new full-text thing he was about to launch; it would index all the instances your instance was federated with, and it was carefully built to penetrate various Mastodon blockages. And anyone who didn’t want to be scraped and indexed had to opt out. (He also claimed it was going to be available only to “genuine admins”.)
It did not go over well. The hostility and anger among the admins was palpable, and the next day there were people following up on the thread talking about de-federating the dude’s whole instance if that was the kind of person there.
This Open Letter from the Mastodon Community is another example of eager information-harvesters running into rage.
So, don’t say you weren’t warned.
Pro-search · Perhaps you find this attitude surprising? I did, initially, and many Web veterans’ reactions range from disdainful to hostile.
Here is Alex Stamos: “I find the arguments against officially supported Fediverse search pretty tedious, as you have to be really naive to believe that a bunch of bad-faith actors aren’t already quietly archiving everything…”
Here is Ben Adida: “I probably should then rephrase to: great search is going to happen or Fediverse might well remain a niche app.”
And here is a project called Mastinator slamming the door on its way out: “The Fediverse has some big problem coming.”
There are lots more of these reactions, and they all say more or less the same thing: “Search is good, and you can’t stop it, and people are crawling your data anyhow.”
I’m a bit puzzled by that “But people are already doing it” argument. Yes, Mastodon traffic either is already or soon will be captured and filed permanently as in forever in certain government offices with addresses near Washington DC and Beijing, and quite likely one or two sketchy Peter-Thiel-financed “data aggregation” companies. That’s extremely hard to prevent but isn’t really the problem: The problem would be a public search engine that Gamergaters and Kiwifarmers use to hunt down vulnerable targets.
What Mastodon does now · Just to be clear, Mastodon offers a perfectly decent search capability. You can search hashtags, and what’s even cooler, you can follow them like you do another person. I like this but it does tend to leave too many posts #bulging #with #ugly #hashtags like a crazed corporate SEO vampire.
You can search your own posts and a few other useful things. So it’s not as though there’s blanket condemnation of the idea of search, just a whole lot of concern about what’s allowed and how it’s used.
Where I stand · I think privacy is good and ignoring the issue is unethical. People should be able to converse without their every word landing on a permanent global un-erasable indexed public record. Call me crazy.
Disclosure: I’ve personally been unashamedly exuberantly public on social media since the first time I stumbled onto, um, MySpace? Orkut? Can’t remember.
I like a high-intensity stream full of well-connected voices, and I like being able to get a lot of people’s attention when I have something to say that I think is important.
But my vibe shouldn’t be the only vibe on the menu. Some people just want to talk about stuff with a few people, they don’t want to be influencers or to mainline the zeitgeist.
Some people are from groups endangered by online hate and violence, or experience precarity such that they just can’t afford to have every word on the permanent record. Some people are just shy.
I am a hyperoverentitled thick-skinned white boy who can laugh publicly at online assholes without much concern for consequences. It’s crazy to think that social media should be exclusively optimized for people like me.
There are problems · To start with, notwithstanding all the above, I’d like more search too. That’s not a big problem because I think there’s a path forward that’s useful and still preserves the current privacy-centric Mastodon values.
Then there’s the big problem…
Mastodon’s privacy story is terrible! · Seriously. Unless you take special specific measures, every little snippet you post on Mastodon has a URL and anyone can fetch it with a Web browser or computer program and then… well, do whatever the hell they want with it. Mastodon as it stands today is not built to protect privacy.
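To make “anyone can fetch it” a little more concrete, here is a minimal Python sketch of what such a program's request might look like. The host, post URL, and client name are made-up placeholders, and the sketch assumes the instance hands a public post's JSON form to an unsigned request that asks for application/activity+json. The only point is that nothing in the request identifies a logged-in user or acknowledges any terms.

```python
import json
import urllib.request

# Hypothetical post URL: the host and path are placeholders, not a real post.
EXAMPLE_POST_URL = "https://mastodon.example/@someone/123456789"


def build_fetch_request(post_url: str) -> urllib.request.Request:
    """Build a plain GET for a post, carrying no login or token of any kind."""
    return urllib.request.Request(
        post_url,
        headers={
            # Ask for the machine-readable form rather than the HTML page
            # (assumes the instance answers unsigned requests this way).
            "Accept": "application/activity+json",
            # Made-up client name, purely illustrative.
            "User-Agent": "curious-stranger/0.1",
        },
    )


def fetch_post(post_url: str) -> dict:
    """Send the request and parse the JSON body (needs network access)."""
    with urllib.request.urlopen(build_fetch_request(post_url), timeout=10) as resp:
        return json.load(resp)


# Building the request needs no network and no account.
request = build_fetch_request(EXAMPLE_POST_URL)
```

Calling fetch_post on a real public post URL would actually send the request; it is left uncalled here so the sketch runs offline.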
You can get a sort of weak partial privacy if you:
Post in “Followers only” mode (which can be done per-post or as a default).
Protect your account so you get to approve or deny anyone who wants to follow you.
Get lucky, as in none of your followers republish your posts to the world or gateway them to the alt-right.
This will probably keep you out of some rando’s public-search-engine experiment.
But it doesn’t matter, because the vast majority of people on Mastodon don’t understand the difference between its sharing modes and probably don’t protect their account, because why should anyone have to do that?
And anyhow, we’re all…
Missing the point · The point is, we’re not trying to solve a technical problem here, we’re trying to solve a social problem. We don’t want people to do certain irritating and dangerous things with data scraped off the Fediverse. So, when there are things that people can do but shouldn’t, what tools do we usually apply? Hmm… I guess when I said “social” I meant “legal”.
So here’s a question: When I publish something, who is licensed to fetch it or, having fetched it, store it and process it, or having stored and processed it, share the results with the world, or with an employer or customer?
Mastodon doesn’t help here. When you retrieve a post, you don’t have to log in to Mastodon first, so any terms and conditions you might have agreed to don’t apply. You also don’t have to click through a terms-of-service pop-up. When you follow somebody, at no point (that I’ve seen anyhow) do you get notified of how they’d like their posts to be treated.
So why shouldn’t you feel free to go ahead and share what you’ve received to the world or, if you’re a Search weenie, write a program to follow people and index their posts?
Suggestions · Stated in the most general possible way: The Fediverse needs to get its content-licensing shit together.
I have ideas about how this might be done, which I’m about to offer, but I Am Not A Lawyer and I am especially not a copyright or intellectual-property specialist; so take these as lightweight amateur suggestions designed only to start conversation.
Disclaimers in place, I propose the following. (Note that some of these proposals are not fully compatible with each other.)
A server should deliver posts only to people logged into the instance, or to other instances it is federated with.
Servers should deliver posts only after a click-through acknowledging the license covering those posts.
The Fediverse needs to work with IP lawyers, and maybe Creative Commons, to build a menu of licenses that people can choose to apply to their posts.
When you follow someone, you should be forced to acknowledge their default content license, and re-acknowledge if and when they change the default.
The choice of default content license for an instance is very important and needs to be communicated clearly, in human language rather than legal jargon, at the time of registration.
Many members of the current admin community would like it if the default license were always highly restrictive such that you’d have to explicitly opt in to making your posts eligible for mass harvesting. I can see their point, but if I’m building an instance for people who get paid to be public, for example journalists or DevRel people, I’d probably pick the opposite default.
The content-license menu should have a lot of options. Some line up pretty well with Mastodon’s current categories: “public”, “unlisted”, and “followers only”. But I can imagine finer-grained exclusions, such as allowing full-text indexing but only for accounts on the same instance, or allowing use for search but no other applications. (No ML model building!)
I’m also pretty sure that content licensing should have a temporal component. That is to say “Yes, harvest this and use it, but only for two weeks beginning now.” Mastodon already has optional built-in scheduled post deletion and this would have to be consistent with that.
I’m pretty sure I’m missing important dimensions. And I’m totally sure that creating the dialogues necessary to support this constitutes a UX designer’s nightmare.
Most important, I’m convinced that this is a conversation that the Fediverse leaders need to start having, and start having now.
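For the programmers in the audience, here is a rough Python sketch of how the license-menu idea above might look as data: which uses are allowed, whether use is limited to the author's own instance, and a time window. Every name in it (PostLicense, permits, the "search" and "ml_training" labels) is hypothetical; Mastodon has nothing like this today, and the real work is the legal text behind each option, not the code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Everything here is hypothetical: Mastodon has no such license object today.


@dataclass(frozen=True)
class PostLicense:
    allowed_uses: frozenset          # e.g. {"search"}; a use not listed is refused
    same_instance_only: bool         # limit use to accounts on the author's instance
    granted_at: datetime             # when the permission window opens
    valid_for: Optional[timedelta]   # None means the permission does not expire


def permits(lic: PostLicense, use: str, requester_instance: str,
            author_instance: str, now: datetime) -> bool:
    """Return True if the license allows `use` by the requester at time `now`."""
    if use not in lic.allowed_uses:
        return False
    if lic.same_instance_only and requester_instance != author_instance:
        return False
    if now < lic.granted_at:
        return False
    if lic.valid_for is not None and now > lic.granted_at + lic.valid_for:
        return False
    return True


# "Yes, index this for search, on my own instance, for two weeks beginning now."
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
search_for_two_weeks = PostLicense(
    allowed_uses=frozenset({"search"}),   # no ML model building
    same_instance_only=True,
    granted_at=start,
    valid_for=timedelta(weeks=2),
)
```

A check like this tells a well-behaved indexer what it may do; it does nothing to stop a badly behaved one, which is why the license has to carry legal weight.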
What success looks like · I’d like it if nobody were ever deterred from conversing with people they know for fear that people they don’t know will use their words to attack them. I’d like it to be legally difficult to put everyone’s everyday conversations to work in service to the advertising industry. I’d like to reduce the discomfort people in marginalized groups feel venturing forth into public conversation.
But… I’d also like to search the world’s conversation to find out what’s happening right now. How are things going around Bakhmut? How are people feeling about the latest shows in the Sierra Ferrell tour? What’s being posted about the World Cup semifinal? Are the British Tories about to knife another idiot leader?
And especially this: How are they doing on fixing the winter power outage in Saskatchewan, where my elderly mother lives? Not hypothetical; it happened the evening of December 27th, and being able to track the status with Twitter search meant I didn’t have to organize an emergency intervention from two time zones away.
I’d also be interested in a certain amount of historic search: What exactly were world leaders saying around last February 24th? How did they describe that new AWS feature at re:Invent 2017? And so on; but only if I’m confident the people who posted what I’m searching are comfortable with their posts being used this way.
I think we should be able to get there. But it’s not a technology problem.