Let's get straight to the controversial bit: email address validation. A penny-drop moment during this week's video was that the native browser address validator rejects many otherwise RFC compliant forms. As an example, I asked ChatGTP about the validity of the pipe symbol during the live stream and according to the AI, it's permissible "when properly quoted":
"john|doe"@example.com
Give that a go and see how far you get in an input of type "email". Mind you, that example allows a pipe when not quoted. And the more you read, the more contradictory things seem; try this Stack Overflow question about allowable characters in an address and you'll get a heap of "yeah, that one is allowed but only if quoted"... which means it won't work in an email input box! (Unless you use the "pattern" attribute and a regex that permits it - argh!)
tl;dr - especially for the purpose in question - extracting email addresses from a data dump - I think I'm just going to boilthis down to a handful of permissible characters that are broadly accepted by websites and just stick with those. If you're a unique enough snowflake to be putting a quoted pipe in your alias then you're clearly not signing up to very many websites.
References
- Sponsored by: Report URI: Guarding you from rogue JavaScript! Don’t get pwned; get real-time alerts & prevent breaches #SecureYourSite
- It just went from bad to worse for Onerep with Mozilla cutting ties (it's hard to imagine they really had any choice left)
- Is the alleged AT&T breach really just "alleged"? (read the comments on that blog post and see what you think...)
- MediaWorks in NZ got breached and their data spread all over the place (although the data is pretty benign in the scheme of things)
- But hey, at least MediaWorks had some solid advice around protecting yourself online! (checking if you were included in "other" breaches now needs a bit of a revision...)