Improve your Twitter experience with regular expressions
Mike Lynch 2014
My favourite Twitter client, Tweetbot, allows muting not just by keyword but by regular expressions. Regular expressions, or regexps, are an incredibly handy way to match patterns in text: they date back to the earliest years of computing and are available in most programming languages.
About a month ago, I posted a regexp to filter out tweets about a certain Australian drug smuggler, and I was asked if I had any other examples. It occurred to me that they could form a quick introductory tutorial to regexps in general.
First, an unavoidable jargon intermission: I'm going to use the programming term "string" to refer to any sequence of characters, because non-jargon terms like "word" or "sentence" are not general enough and would just confuse things, and otherwise we'll all get sick of me repeating the phrase "sequence of characters".
This tutorial isn't a complete description of regular expressions - people have written whole books on that. It should give you a basic understanding of the syntax of regexps, explain how a couple of my filters work and allow you to start building your own.
A regexp is a string which is used to test other strings to see if they either match or don't match a pattern. In the context of Tweetbot's mute feature, the regexp behaves like a special keyword filter which is used to test the tweets in your timeline: any tweet that matches is muted. To try to explain what matching means, I'm going to start with an irritatingly opaque definition:
A string matches a regexp if every part of the regexp matches a part of the string in the right order.
We'll flesh this out by building a regexp which will stop me from seeing tweets with hashtags like #auspol.
The simplest kind of "part" is an ordinary character. Letters and numbers are ordinary, as are some punctuation marks. An ordinary character matches itself, so if a regexp is just a sequence of ordinary characters, it will match any tweet containing that entire sequence.
#auspol
> You leftards all suck #auspol
Get bent #nswpol
In the examples, the regexp itself appears in blue, matched tweets are in green (with a > next to them) and tweets that don't match in dark red. The actual characters which the regexp is matching against are highlighted like: this.
Note that although letters and numbers are all ordinary characters, many symbols and punctuation marks are "ordinary" (in the regexp sense) too, like hash "#".
Matching nothing but ordinary characters is exactly the same as a keyword match. Boring. Can we do something better?
Tweetbot's regexp functionality is tucked away. If you use any special characters when typing in a keyword mute, a switch to turn on regular expressions will appear under the keyword field. The special characters are:
.*+?$^\[]{}()
The full stop . is a wildcard: it will match any character. So #...pol matches a "#" followed by any three characters followed by "pol".
#...pol
> Down with things! #auspol
> Feelpinions! #nswpol
> Bikies! #qldpol
Independence now! #scotlandpol
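If you'd like to try this outside Tweetbot, here is a minimal sketch. It uses Python's re module as a stand-in for Tweetbot's matcher, which is an assumption on my part: regexp dialects differ in their details, although the wildcard is common to most of them.

```python
import re

# re.search succeeds if the pattern matches anywhere in the string,
# which is how a mute filter is assumed to behave here.
pattern = re.compile(r"#...pol")

tweets = [
    "Down with things! #auspol",
    "Feelpinions! #nswpol",
    "Bikies! #qldpol",
    "Independence now! #scotlandpol",
]

# Keep only the tweets the filter would mute.
muted = [t for t in tweets if pattern.search(t)]

for tweet in tweets:
    marker = ">" if tweet in muted else " "
    print(marker, tweet)
```

The last tweet gets through because there are more than three characters between "#" and "pol".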
The regexp fails to weed out tweets with my (invented) "#scotlandpol" hashtag because there are too many characters between "#" and "pol". But I don't want to have to make a series of mutes like
#...pol
#....pol
#.....pol
Luckily, I don't need to. It's at this point that regexps start getting genuinely powerful.
If we follow part of a regexp with + it will try to match against one or more repetitions of that part.
I should clarify what "part" means a little here: an ordinary character is a part, but so is a full stop . and other special characters or character combinations. The simplest way to put it is that a part is a portion of a regexp which makes sense as a regexp on its own. (We'll get more precise about this when we get to groups.)
Since . matches any character, .+ is a single pattern which is equivalent to any of ".", "..", "...", "....", and so on. This lets us build a regexp which will match all whatever-pol hashtags:
#.+pol
> You suck #auspol
> I hate people #nswpol
> You are worse than germs #vicpol
> Gold for Australia! woo! #worldpoledancing
We've hit one of the problems with regexps: it's really easy to match more than we intended to. Our new regexp filters out the hashtag of the World Pole Dancing championship which we are keenly watching on Channel 11, because it contains the string "#worldpol", which matches #.+pol.
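To see exactly which characters a pattern has grabbed, it helps to print the matched substring. This sketch again assumes Python's re module as a stand-in for Tweetbot's matcher. The second tweet is made up for illustration.

```python
import re

pattern = re.compile(r"#.+pol")

# The pole dancing tweet from the example above.
hit = pattern.search("Gold for Australia! woo! #worldpoledancing")
matched_text = hit.group(0)  # the exact substring the regexp matched
print(matched_text)

# A made-up tweet: "." also matches spaces, so .+ can run well past
# the end of the hashtag before it finds a "pol".
runaway = pattern.search("#coffee before the police arrive")
runaway_text = runaway.group(0)
print(runaway_text)
```

The second case is worth keeping in mind: since the wildcard matches spaces too, a pattern built on .+ can stretch across several words.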
How can we build a regexp which will match only those hashtags which end in pol? We need to make sure that the character following 'pol' is not another letter or number: that is, we need a part which will match a space.
Regexps have built-in character classes: these are like . but only match certain kinds of characters.
\w - matches "word" characters (a-z, A-Z, 0-9 and _)
\d - matches digits
\s - matches whitespace characters (spaces, tabs)
\n - matches a newline (linefeed)
So, if we add \s to the end of our regexp, we can see our poledancing tweets again:
#.+pol\s
> You guys all suck #auspol is for losers
HOTT #worldpoledancing
Abbott4lyfe #auspol
GLADYS *shakes fist* #nswpol
Bugger. We broke our regexp. Because it expects hashtags to be followed by a space, it doesn't match those tweets with the hashtag at the end of the string.
Luckily, there's another special character designed for exactly this problem:
\b - matches the boundary of a word
So far, all of our special characters have matched a particular character or characters in the target string. \b is different: it matches successfully if it occurs at the "edge" between a word character and either a non-word character or the start or end of the whole string. (Technically it's a "zero-width assertion", which I mention just because I like the phrase.)
#.+pol\b
> Shorten in bread #auspol
Nice legs #worldpoledancing
We've finally found a regexp which matches political hashtags, without also matching hashtags that happen to contain the characters "pol" somewhere other than at the end.
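The difference between the two endings can be checked side by side. As before, this is a sketch which assumes Python's re module behaves like Tweetbot's matcher for these features.

```python
import re

tweets = [
    "You guys all suck #auspol is for losers",
    "Abbott4lyfe #auspol",
    "Nice legs #worldpoledancing",
]

with_space = re.compile(r"#.+pol\s")     # needs a whitespace character after "pol"
with_boundary = re.compile(r"#.+pol\b")  # needs a word boundary after "pol"

space_hits = [t for t in tweets if with_space.search(t)]
boundary_hits = [t for t in tweets if with_boundary.search(t)]

print("\\s version mutes:", space_hits)
print("\\b version mutes:", boundary_hits)
```

The \s version misses the tweet which ends with the hashtag. The \b version catches it, because the end of the string counts as a boundary.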
Two other widely-used boundary characters are ^ and $, which allow you to match at the beginning or the end of the entire string.
I dislike the practice of dotmentioning - when you reply to someone but put a period in front of their name so that your other followers can see your reply. To me, it's a bit like inviting other people to join in on a fight, so I'd rather not see them.
A regexp to match tweets which start with something like ".@username" will need a couple of features which I haven't explained yet, so it's worth going through in detail. It will also be typical of many useful regexps in that it looks a bit like a comic book character swearing.
We want the regexp to match only at the start of a tweet, so we start with the special character for the beginning of the string:
^
The next thing we want to match is the full stop: however, as we've seen, . is a special character which will match anything. To tell the regexp parser to ignore the special meaning and just use it literally, we add a full stop but precede it with a backslash: \.
^\.
Note that you can use this technique to match the literal value of any special character, including the backslash itself. The sequence \\ matches a backslash and is the sort of thing that causes the regexp disease known as "leaning toothpick syndrome".
We now have a regexp which will match all tweets starting with a full stop. To limit this to only dotmentions, we add an "@", which is not a special character, so there's no need for a \ this time:
^\.@
We don't really need to add anything after the "@", unless we are only trying to filter out dotmentions to a particular user:
^\.@TonyAbbottMHR
The problem with the regexp ^\.@ is that some people put a space between the dot and the username they're replying to:
^\.@
> .@TonyAbbottMHR THIS CRIMINAL GOVERNMENT
. @TurnbullMalcolm If only the Czar knew!
. @spikelynch People should use more spaces!
(The second example was shamelessly stolen from @manthatcooks.)
We've already seen that + is a special character which, when placed after part of a regexp, will allow that part to match one or more times. In this case, we want to match both tweets with one or more spaces between "." and "@", and tweets with no spaces at all. To match "none or more" of a pattern, we can use the special character * after the whitespace character class \s:
^\.\s*@
The \s* matches any amount of whitespace, including no whitespace at all, so this will match any tweet that starts with ".@", or a "." and a "@" separated by whitespace.
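Here is the dotmention filter as a quick sketch, assuming Python's re module as a stand-in for Tweetbot. The last tweet is invented to show that ^ really does pin the match to the start of the string.

```python
import re

# ^   start of the tweet
# \.  a literal full stop
# \s* any amount of whitespace, including none
# @   a literal "@"
dotmention = re.compile(r"^\.\s*@")

tweets = [
    ".@TonyAbbottMHR THIS CRIMINAL GOVERNMENT",
    ". @TurnbullMalcolm If only the Czar knew!",
    ". @spikelynch People should use more spaces!",
    "Well said. @spikelynch has a point",  # made up: the dot is not at the start
]

muted = [bool(dotmention.search(t)) for t in tweets]
print(muted)
```

Only the first three tweets are muted. The last one contains ". @" but not at the beginning.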
Another useful repetition modifier is ?, which matches none or one of a pattern. For example:
ca?t
> Architect
> Black cat
gatattcaataca
The combination of the full stop and asterisk - .* - is a common idiom in regexps because it matches none or more of anything: it's useful for when you don't care or know about what goes in between the things you're actually looking for.
For example, this useful regexp mutes all tweets with three or more hashtags:
#.*#.*#
A tweet with one #hashtag
A tweet with #two #hashtags
> A tweet #with #three #hashtags
> A #tweet #with #four #hashtags
Whatever is between the hashes will be matched by .* (and we don't need to worry about matching anything after the third hash).
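The hashtag-counting filter is easy to check with a sketch like the following, which assumes Python's re module in place of Tweetbot's matcher.

```python
import re

# Three "#" characters with anything (or nothing) in between.
too_many_hashtags = re.compile(r"#.*#.*#")

tweets = [
    "A tweet with one #hashtag",
    "A tweet with #two #hashtags",
    "A tweet #with #three #hashtags",
    "A #tweet #with #four #hashtags",
]

# Caveat: in Python, "." does not match a newline unless re.DOTALL is
# passed, so hashes spread over several lines would not be caught by
# this version. Whether Tweetbot behaves the same way is not something
# I have checked.
muted = [t for t in tweets if too_many_hashtags.search(t)]

for tweet in muted:
    print(">", tweet)
```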
Returning to how this whole exercise got started: in early February someone observed that it was difficult to mute a certain Australian drug mule's name because no-one could remember how to spell it. Matching a bunch of variants is exactly where regexps come in handy:
[sS]?[cC]?hap+el+e
In addition to the character classes mentioned above (\w, \d and \s), you can craft your own. Any group of characters in square brackets [] will match a single occurrence of any one of those characters. For example:
[sS] - matches "s" or "S"
[cC] - matches "c" or "C"
This highlights an important point about regexps: they are case-sensitive by default. While keyword mutes in Tweetbot are case-insensitive - "Abbott" will match "ABBOTT" or "Abbott" or "aBboTt" - once you activate the regexp switch, you need to remember that "s" and "S" are now being treated as different characters.
We can now break down the Corby filter:
[sS]? - nothing, or one "S" or "s"
[cC]? - nothing, or one "C" or "c"
ha - exactly "ha"
p+ - one or more occurrences of "p"
e - exactly "e"
l+ - one or more occurrences of "l"
e - exactly "e"
Here's how this pattern will work against some variants of Schappelle:
[sS]?[cC]?hap+el+e
> Schappelle
> Chapele
> SChappppppppppellllllle
> hapele
chapel
Note that because both the [sS] and [cC] parts are optional, "hapele" matches.
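The variants above can be run through the filter in a few lines. This is a sketch which assumes Python's re module matches Tweetbot's behaviour for character classes and repetition.

```python
import re

corby = re.compile(r"[sS]?[cC]?hap+el+e")

variants = [
    "Schappelle",
    "Chapele",
    "SChappppppppppellllllle",
    "hapele",
    "chapel",
]

# Map each spelling to whether the filter would catch it.
results = {name: bool(corby.search(name)) for name in variants}

for name, caught in results.items():
    print(">" if caught else " ", name)
```

"chapel" escapes because the pattern insists on a final "e" after the run of "l" characters.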
Suppose I wanted to filter out tweets with repetitions of something bigger than a single character: for example, I might be OK with people saying "lol", but draw the line at "lolol", much less "lololololol". We can match a repeated group of characters by putting parentheses "()" around the group and then using "+":
(ol)+ - matches "ol", "olol", "ololol", etc.
lol(ol)+ - matches "lolol", "lololol", etc, but not "lol"
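A short sketch of the repeated group, assuming Python's re module as a stand-in for Tweetbot. The sample tweets are invented.

```python
import re

# "lol" followed by one or more extra "ol" groups.
excessive_lol = re.compile(r"lol(ol)+")

tweets = [
    "lol nice one",
    "lolol that is too much",
    "lololololol stop",
]

muted = [t for t in tweets if excessive_lol.search(t)]
print(muted)
```

A plain "lol" survives because the group has to appear at least once after it.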
Earlier I was kind of vague about what a "part" of a regexp was: we can be more definite now. A part is either a character, a "." or a character class, or a subpattern grouped in parentheses. Any of these kinds of part can be followed by one of the repetition operators (+, ?, *).
The grouping parentheses are often used in conjunction with the pipe character |, which is used to match alternatives. Rather than use character classes to capture the variant schapellings of Schapelle, I could have explicitly listed alternatives, as follows:
(Sch|Ch|Sh)ap+ell
> Schappelle
> Chapelle
appellate court
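The alternation version can be tested the same way. This sketch assumes Python's re module in place of Tweetbot's matcher. I have added "Chapele" to the samples to show that this pattern is a little stricter than the character class version: it requires a double "l".

```python
import re

alternatives = re.compile(r"(Sch|Ch|Sh)ap+ell")

samples = [
    "Schappelle",
    "Chapelle",
    "appellate court",
    "Chapele",  # single "l": caught by [sS]?[cC]?hap+el+e but not by this pattern
]

results = {s: bool(alternatives.search(s)) for s in samples}

for text, caught in results.items():
    print(">" if caught else " ", text)
```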
The alternatives inside a group don't have to be all ordinary characters - they can include any of the special characters, character classes or repetition operators we've seen so far, including alternatives and groups. This sort of thing is massive overkill if you're just filtering tweets, though, and leads to the kind of intractably unreadable regexps that have given them a bad reputation in certain circles.
An exercise for the reader: it should be possible to craft a mute filter which would match tweets which @-mention an Australian politician, by looking for all usernames containing parliamentary abbreviations like MP, MHR, MLC, etc. Let me know if you come up with one.
This tutorial only scratches the surface of what regexps can do, although it should let you build a lot of interesting and useful mute filters. Many aspects of regexp syntax and standard functionality, like capturing substrings of matched strings, have been left out entirely as they're not relevant to Twitter filtering. Most dynamic programming languages have a dialect of regexps built in or available in a library, and there are many books and references out there.
Tweetbot's mute function lets you test regexps against the tweets in your current timeline to see which ones match, and there are plenty of regexp testers on the web, like RegexPal.
I've tested all - well, most - of the above patterns in Tweetbot, but if you find any mistakes, or have any other comments or suggestions, let me know on Twitter at @spikelynch.