Super-searches with Regular Expressions
Searching through log files with the standard 'Find'
function in most text editors can be a little limited.
Normally, you can only look for one block at a time or
you have to learn some proprietary language to access the
inner workings of the editor. However, with regular
expressions, you only have to learn how to use a common
'language' that applies to all regular expression
situations, whether that is in a simple text editor, a
word processor such as OpenOffice.org, a file finder or a
programming language such as Perl.
Two real log file extracts are available to play around
with: one with UNIX line ends (suitable for Linux, BSD and
so on) and one with DOS line ends that you can open in
Windows. Once one is open in your text editor, you can
try out the regular expressions on it. The extract itself
(pulled out using 'grep') is from a failed 'script-kiddie'
attack on a web server (wrong OS, server and patch level
for the attack to succeed) and it is interesting to see
how they try to compromise a server (in this case, they
thought it was an unpatched IIS server - if they thought
it was anything).
Say that we are interested in the occurrences of lines
where they try to use 'index' with 'passwd'. If we use a
search string of 'index', we get 69 occurrences. If we
look for 'passwd', we get 40 occurrences. Matching these
up with each other would be time consuming so we use a
better method - this time looking for both strings.
In the box where you would normally type your search
string, you can add a number of special characters that
tie parts of the pattern either to the beginning or the
end of each line (or paragraph if you are using a word
processor) and specify the types of characters that go in
between.
If we were looking at the whole log as opposed to an
extract, we might want to start off with the IP address
in question (this being a standard log file format, the
IP address is at the beginning). The character that ties
a string to the beginning of a line is the caret (^). So,
if we were looking for an IP address 67.176.149.209 at
the beginning of each line, we could say...
^67.176.149.209
and that would find all of the lines with
67.176.149.209 at the beginning. However, in regular
expressions, the full stop stands for any single
character, so this would also catch lines that start with
67317641495209. You wouldn't see these at the beginning
of a log file like this, so we are safe here. To
represent a literal dot for our dotted quad, we need to
escape it using a backslash to make '\.' so, although our
line above would work in a web server log file, it would
be more correct to say
^67\.176\.149\.209
and now, we only find the lines that start with
67.176.149.209.
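As a rough sketch of the difference, the two patterns can be compared using Python's standard re module (any regular expression tool should behave the same way here). The log lines below are made up for illustration and are not taken from the extract.

```python
import re

# Hypothetical log lines, invented for illustration only.
lines = [
    '67.176.149.209 - - "GET /index.html HTTP/1.0" 404',
    '67317641495209 is not an IP address',
    '10.0.0.1 - - "GET /?from=67.176.149.209 HTTP/1.0" 200',
]

loose = re.compile(r"^67.176.149.209")       # '.' matches any character
strict = re.compile(r"^67\.176\.149\.209")   # '\.' matches a literal dot only

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(len(loose_hits))   # 2: the real address and the all-digit line
print(len(strict_hits))  # 1: only the real address
```

The third line is never matched by either pattern because the caret ties the match to the start of the line.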
If we found that they had been using a number of
address values between 67.176.149.0 and 67.176.149.255,
we could truncate our specification to
^67\.176\.149 and that would find them. Note that, as
we have specified that the start of this string should be
at the beginning of the line, it will not pick up on IP
addresses such as 23.67.176.149 as that doesn't match.
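A minimal check of the truncated pattern, again assuming Python's re module and using invented line beginnings:

```python
import re

prefix = re.compile(r"^67\.176\.149")

# Hypothetical line beginnings, made up for illustration.
samples = [
    "67.176.149.7 - -",
    "67.176.149.255 - -",
    "23.67.176.149 - -",
]

matched = [s for s in samples if prefix.search(s)]
print(matched)  # the first two samples only
```

The last sample contains 67.176.149 but not at the start of the line, so the caret rules it out.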
If a range of 67.176.149.200 to 67.176.149.239 was
used and there are other lines in there, say from
67.176.149.143, that you are not interested in, you can
look for the 200 to 239 lines by using the following...
^67\.176\.149\.2[0123]\d
At the end of this, you will see that we look for a
'2' and then, one of the characters in the square
brackets (0 to 3 inclusive) and then any digit '\d'.
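The range pattern can be tried out with a short sketch like this one, assuming Python's re module. The addresses are hypothetical and each is followed by a space, as it would be at the start of a log line.

```python
import re

in_range = re.compile(r"^67\.176\.149\.2[0123]\d")

# Hypothetical addresses and whether we expect them to match.
expected = {
    "67.176.149.200 ": True,
    "67.176.149.239 ": True,
    "67.176.149.240 ": False,   # '4' is not in [0123]
    "67.176.149.143 ": False,   # last part does not start with '2'
}

results = {addr: bool(in_range.search(addr)) for addr in expected}
print(results == expected)  # True
```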
There are a number of these character type
representations using escape codes.
| Character type | Equivalent set | Escape code |
| Numbers | [0-9] | \d |
| Space or equivalent | {space}, {tab} and so on | \s |
| Word characters | [a-zA-Z0-9_] | \w |
If you use the uppercase version, it means the opposite,
so \D is any character other than a digit from 0 to 9
inclusive.
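Here is a small sketch of these escape codes in Python's re module. Not every editor supports all of them, so treat this as an illustration rather than a guarantee for your own tool. The sample text is invented.

```python
import re

text = "GET /a_b 404\tOK"   # hypothetical text containing a tab

numbers = re.findall(r"\d+", text)   # runs of digits
spaces = re.findall(r"\s", text)     # single whitespace characters
words = re.findall(r"\w+", text)     # runs of letters, digits and underscores
not_digits = re.findall(r"\D+", "abc123def")  # runs of anything but digits

print(numbers)     # ['404']
print(spaces)      # [' ', ' ', '\t']
print(words)       # ['GET', 'a_b', '404', 'OK']
print(not_digits)  # ['abc', 'def']
```

Note that in Python \w also matches digits and the underscore, not just letters.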
You can match more than one character as well. In DOS, we
are familiar with using an asterisk (*) to represent a
number of characters. Here, when it follows a
specification for a character, it has a similar meaning:
it repeats that character specification.
| Character | Example use | Results | Meaning |
| * | [0-8]* | '', '5', '15432157832151' | any of the characters, occurring from 0 to infinity times |
| + | [0-9a-f]+ | '0', 'deadbeefcafe5' | any of the characters, occurring from 1 to infinity times |
| ? | [x-z]? | '', 'y' | any of the characters, occurring just once or not at all |
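The three repeat characters can be checked with a sketch like the following, assuming Python's re module. The character ranges are written with a hyphen (such as [0-8]), which is the form that module expects. The helper name `whole` is made up for this example.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

star_empty = whole(r"[0-8]*", "")                  # zero repeats allowed
star_long = whole(r"[0-8]*", "15432157832151")
star_nine = whole(r"[0-8]*", "9")                  # 9 is outside the range

plus_hex = whole(r"[0-9a-f]+", "deadbeefcafe5")
plus_empty = whole(r"[0-9a-f]+", "")               # needs at least one

opt_one = whole(r"[x-z]?", "y")
opt_two = whole(r"[x-z]?", "yy")                   # at most one allowed

print(star_empty, star_long, star_nine)  # True True False
print(plus_hex, plus_empty)              # True False
print(opt_one, opt_two)                  # True False
```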
Instead of just a single character (in square brackets),
you can have a sequence of characters in round brackets
such as (na). So you can look for ba(na)+ which will find
'bana', 'banana' and 'bananana' but not 'ba', in the same
way that '+' is used above.
You can also specify the number of times something should
appear. If you want to find 'banana' using the above
regex, just modify it to ba(na){2,2} which will find
'banana' and also the beginning of 'bananana'. The first
2 is the minimum number of repeats required and the last
2 is the maximum number so, if you wanted between 2 and 4
inclusive, you would use {2,4}.
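A short sketch of grouping and repeat counts, assuming Python's re module, might look like this:

```python
import re

words = ["ba", "bana", "banana", "bananana"]

# (na)+ needs the group 'na' at least once after 'ba'.
one_or_more = [w for w in words if re.search(r"ba(na)+", w)]

# {2,2} takes exactly two repeats, so only the start of 'bananana' is used.
exactly_two = re.search(r"ba(na){2,2}", "bananana").group(0)

# {2,4} allows between two and four repeats inclusive.
two_to_four = re.search(r"ba(na){2,4}", "bananana").group(0)

print(one_or_more)  # ['bana', 'banana', 'bananana']
print(exactly_two)  # banana
print(two_to_four)  # bananana
```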
If you want to limit it to a whole word with two or
three occurrences of 'na', you can use a word boundary
using an escaped b (\b) like so ...
\bba(na){2,3}\b
... which will pick up banana and bananana but none of
the others. If you want 'this' or 'that', you can use a
vertical bar (|) in the round brackets either like ...
(this|that)
... or, if you are more cunning ...
th(is|at)
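Both ideas can be tried in a few lines, assuming Python's re module. The test string is made up for illustration.

```python
import re

# Word boundaries: only whole words with two or three 'na' repeats.
bounded = re.compile(r"\bba(na){2,3}\b")
text = "ba bana banana bananana banananana"
found = [m.group(0) for m in bounded.finditer(text)]
print(found)  # ['banana', 'bananana']

# Alternation: both spellings accept the same two words.
long_form = re.compile(r"(this|that)")
short_form = re.compile(r"th(is|at)")
candidates = ["this", "that", "then"]
long_hits = [w for w in candidates if long_form.fullmatch(w)]
short_hits = [w for w in candidates if short_form.fullmatch(w)]
print(long_hits == short_hits)  # True
```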
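To tie a string to the beginning of the
line/paragraph, we have already seen that you use a caret
(^) but to tie it to the end, you use a dollar sign ($).
So, if we wanted a line that started with '67.176', had
a number of characters, then either 'index' or 'cgi-bin',
then more characters, then 'passwd', we could use the
full stop (any character) followed by a plus sign for the
parts in between. Note that \w+ would not do here, as the
characters in between include dots, spaces and slashes,
which \w does not match.
^67\.176.+(index|cgi-bin).+passwd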
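When a combined pattern like this is tried against real log lines, the choice of 'filler' between the fixed parts matters. The sketch below, assuming Python's re module and some invented log lines, compares \w+ (word characters only) with .+ (any characters) as the filler.

```python
import re

# Hypothetical log lines, invented for illustration; not from the extract.
lines = [
    '67.176.149.209 - - "GET /cgi-bin/view.pl?file=/etc/passwd HTTP/1.0" 404',
    '67.176.149.209 - - "GET /index.html HTTP/1.0" 200',
    '10.1.2.3 - - "GET /index.php?page=/etc/passwd HTTP/1.0" 404',
]

# \w+ cannot cross the dots, spaces and slashes found in a log line.
word_filler = re.compile(r"^67\.176\w+(index|cgi-bin)\w+passwd")
# .+ accepts any characters between the fixed parts.
any_filler = re.compile(r"^67\.176.+(index|cgi-bin).+passwd")

word_hits = [line for line in lines if word_filler.search(line)]
any_hits = [line for line in lines if any_filler.search(line)]

print(len(word_hits))  # 0
print(len(any_hits))   # 1: the first line only
```

The second line has no 'passwd' and the third does not start with 67.176, so neither is picked up.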
See if you can work out what we need for the original
problem.