PC Plus HelpDesk - issue 261
This month, Paul Grosse gives you more
insight into some of the topics dealt with in HelpDesk.
From the pages of HelpDesk, we look at:
- Just what is 'Crippleware';
- Moving Windows in KDE and GNOME;
- Automatic quote replacement in OpenOffice.org
Writer;
- Setting a program's priority;
- Sorting enormous lists;
- Extracting UserAgent strings from log files;
and,
- CGI scripting language.
HelpDesk
Just what is 'Crippleware'
In computing, you come across many different terms -
'freeware', 'shareware', 'postcardware' and so on - but
what about programs that are referred to as
'Crippleware'?
Many of these terms have been around for years but for
those people who are not entirely sure what they mean:
- Freeware is software that is still copyrighted
but costs nothing for the user to install or use;
- Shareware is fully-functional software that
again, is copyrighted. The user is free to
install it and use it but if it has been used for
a certain amount of time (maybe a week, month or
just a number of times) then a payment is
required for the software use to continue
(although true shareware programs will
continue to be fully functional even if no
payment has been made). With shareware, there are
no 'register now' or 'only so many days left'
nag-screens at all;
- Postcardware is the same as shareware except that
payment is in the form of a post card - usually
of a place close to where the end user lives or
uses the software;
- There are also various licenses that come under
Free Software - the 'free' here referring to
freedom rather than cost - the source is open for
anybody to read and improve upon (this is not the
same as Freeware). This is also copyrighted but
it is licensed so that you are free to use it and
modify it. You can pass it on but you must
include the source code. The General Public
License is an example of this; and,
- Crippleware is like shareware only certain
features have been crippled until you make a
payment. These might be that: there is a time
limit, after which the program will cease to
function; you cannot save or print anything with
it; if it is a spreadsheet, you might be limited
to 100 cells; or, with a word processor you can
only print one page. This sounds as though it
might be all right and it could well be in some
cases but it could be used to hide serious bugs -
as the name suggests, it is frowned upon.
If the restriction is with time, it might be that it
accumulates files at an alarming rate and the user has no
way of controlling them. With saving, the files might be
horrendously large or insecure; with printing, it could
be that the writer of the program had no idea how to
interface effectively with printer drivers. With limits
such as 100 spreadsheet cells or one word processor page,
it could be that there are serious issues with memory or
speed.
The rule of thumb is that, if there are features that
are crippled, you need to ask why they have gone to the
trouble of doing it and, why don't they want you to see
or use those features?
Moving Windows in KDE and GNOME
The
multiple-desktop environments in KDE, Gnome and a number
of other GUIs make life a lot easier - allowing the user
to organise their work better, multi-task and increase
productivity. The user is able to switch to another
desktop to handle an immediate job and then just switch
back to the original job when necessary.
In this way, for example, you can be working on a
piece of writing on one desktop, processing images for it
on another and surfing the web on a third, all with your
email client occupying the full screen on a fourth. If
someone comes along and wants you to edit a document (or
find out how high a score you can get in Frozen Bubble),
you just switch to an empty desktop and work there - or
however many you need.
As a computer journalist, I work in this way and when
I switch to a Windows environment, it seems positively
claustrophobic in comparison.
However, one thing you need to be able to do
is to move your windows from one desktop to another. Normally,
right-clicking somewhere on the border or title bar will
make a drop-down menu appear and moving to another
desktop is one of the options.
Alternatively, you can drag a window over the edge of
the desktop into the next one and if you are using the 3D
desktops from Beryl or Compiz (right, on SuSE 10.2), the
desktop cube revolves when this happens.
However, if you have the Launch Pager on
your panel, you can drag a window from any desktop and
drop it into any other - as in the screenshot. You
don't need to have either of them as the current desktop.
Automatic quote replacement in
OpenOffice.org Writer
If you
need your quotes to remain as primes, you need to stop
automatic quote replacement.
If you use OpenOffice.org (or Word, for that matter),
you might find that it has the habit of changing the
standard primes (' and ") for quotes so you end up
with '6's, '9's, '66's and '99's instead of what you
want.
The smart quote substitution in OOo Writer is quite
good at *English and American quotes, even when they are
parenthetic and mixed with apostrophes. However, if you
do a lot of typing that has little speech and includes
angles or times - such as astronomical reports - or
imperial measurements that include feet and inches, you
would probably be better off keeping them the way you
type them.
To edit the defaults for when you are typing, click on
'Tools', 'AutoCorrect...' and on the 'Custom Quotes' tab,
uncheck the 'Replace' checkboxes for each and then click
on 'OK'.
If you want to change other automatic replacements or
custom quotes when editing later, click on the 'Options'
tab before you click on 'OK' and edit the options at your
will.
* English quotes start with
single quotes. If there is a quote within a quote, it
has double quotes (and then back to single quotes for
a quote within that and so on). Hence, a single quote
has single quotation marks and a quote within a quote
(ie, it is double quoted) has double quotes. eg:
Mike said; ‘We’ve
had a lot of rain this afternoon, even though the
weatherman said; “It’s going to be
sunny”, if I remember correctly.’
American quotes start off with
double quotes for single quoted words and single
quotes for double quoted words like so...
Mike said; “We’ve
had a lot of rain this afternoon, even though the
weatherman said; ‘It’s going to be
sunny’, if I remember correctly.”
Setting a program's priority
It is
possible to run a program continuously for several days,
taking all of the processor's available power, without it
getting in the way of other programs that are running on
the computer. This is how it can be done.
The reason that the other programs run as though the
long-running program is not there is because, as far as
the other programs are concerned, it isn't there.
Computers do not possess inertia so you can switch from
one process to another and back so that they appear to
run at the same time.
On the right, you can see the processor is loaded up
to the limit with a lower priority program and then,
around three quarters of the way through, another program
is started. It takes up the processing power that it
needs and the other program simply makes way for it. In
this way, the new, higher priority program runs as though
the other one was never there.
Normally, pretty much all of the programs
that you run will have the same priority and they will
compete with each other for processor time on an even
playing field. However, you can be more flexible than
that.
Some programming languages such as Perl and C will
allow you to set the priority of the program so that when
you run it, it will take all of the processor power there
is but if anything with a higher priority (programs that
aren't as nice) wants the processor, it can use it.
The normal programs that the user runs, such as the
GUI, word processor and so on, will all have priority
over it and therefore run at almost their normal speed -
the lower priority program isn't starved of processor
time completely.
By running the long-term program with a lower
priority, you can ensure that you won't be locked out of
normal work for several days by it.
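As a minimal sketch of this in Perl, the built-in setpriority and getpriority functions can drop a script to a low priority before it starts its long job. The subroutine name 'be_nice' and the value 19 are illustrative assumptions, as is the constant 0 for PRIO_PROCESS (its value on Linux and the BSDs), and the sketch only applies to UNIX-like systems.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: PRIO_PROCESS is 0, as it is on Linux and the BSDs.
# A 'who' of 0 means "this process".
use constant PRIO_PROCESS => 0;

# Raise our own nice value (that is, lower our priority) and
# return the value that the system now reports for us.
sub be_nice {
    my ($nice) = @_;
    setpriority(PRIO_PROCESS, 0, $nice)
        or warn "Could not change priority: $!\n";
    return getpriority(PRIO_PROCESS, 0);
}

# 19 is the lowest priority on most UNIX-like systems.
my $now = be_nice(19);
print "Running with a nice value of $now\n";

# ... the long-running work would go here ...
```

The same effect can usually be had without touching the program by starting it through the 'nice' command, for example 'nice -n 19 perl sorter.pl'.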
Sorting enormous lists
If you collect data for some reason, you can find
yourself in a position of having to sort the data. For
small amounts of data, you might be tempted to use a
spreadsheet - these are graphical and can use sort
routines but there are limits.
Spreadsheets are useful for sorting files but sifting
out duplicates can be an annoyance. The main problem is
that you are usually limited to some power of two lines,
either 8192 or 65536, which is still too few for a large
log file, and even if you did have a larger spreadsheet,
it would be overkill for this job. There is an easier way
and it's free.
Perl is designed, amongst other things, to process
long strings and is used, for example, in chromosome
analysis where you might like to find a certain sequence of
DNA bases in a gene, so it is quite capable of handling a
log file with a million lines in it.
Here, you
will find a couple of programs that demonstrate how easy
it is to do this type of processing (Windows and UNIX
versions).
- The first program - 'filegen' - generates a
'log'-like file that is 1,000,000 lines long.
Each line has a probably-unique 32-bit hex code
and then a string of 100 bytes.
- The second program - 'sorter' - slurps this log
file and sorts the lines according to the hex
code, eliminating any duplicates if there are any
and then saving the sorted log file. In the
screenshot, you can see that the program took
just under 12 seconds to sort out all 1,000,000
lines.
The key lines are where each log file line (in a
scalar called '$a') is split by spaces into an array
(called '@z') and then the first element of this array is
saved in an associative array (called a 'hash') called %h
as follows...
$h{$z[0]} = $a;
...which stores the whole line under its hex code, so a
later line with the same code simply overwrites the
earlier one and any duplicate keys are eliminated. After
this, the keys are all sorted with the line...
@kl = sort (keys (%h));
...which puts the sorted key list into an array that
is then accessed sequentially.
If you just wanted to sort out the list without
eliminating any duplicates, you would just use an array
and sort that instead of using the hash.
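The two approaches can be sketched in a few lines. This is not the 'sorter' program itself, only an illustration of the hash technique described above; the subroutine name 'sort_unique' and the sample lines are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort lines by their first space-separated field, keeping one line per key.
sub sort_unique {
    my @lines = @_;
    my %h;
    foreach my $line (@lines) {
        my @z = split " ", $line;   # $z[0] is the key (the hex code)
        $h{$z[0]} = $line;          # a repeated key overwrites the old line
    }
    return map { $h{$_} } sort keys %h;
}

# Made-up sample data standing in for the generated log file.
my @log = (
    "c0ffee00 third line",
    "0badf00d first line",
    "c0ffee00 duplicate of the third line",
    "7e57ab1e second line",
);

# With duplicate removal: three lines come out.
print "$_\n" foreach sort_unique(@log);

# Without duplicate removal, a plain array sort is enough: four lines.
print "$_\n" foreach sort @log;
```

Note that 'sort' compares the keys as strings, which only matches numeric order when the hex codes all have the same length and letter case.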
This type of programming is useful if you want to look
at things like the number of user agents in your web
server log file and so on.
So, let's look at that...
Extracting UserAgent strings from
log files
A typical log file records a fair amount of data on
each transaction that a web server makes. Each line is
one transaction and the fields for each line are all
separated by spaces. Here is an example...
Field  Contents
1      78.0.84.141
2      -
3      -
4      [05/Aug/2007:09:47:41 +0100]
5      "GET /wrc/wrc.gif HTTP/1.1"
6      200
7      513
8      "http://ourworld.compuserve.com/homepages/pagrosse/h2oRocketIndex.htm"
9      "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"
- Field 1 is the IP address of the client;
- Field 4 is the date and time, along with any
difference from GMT (this is BST so it has one
hour added - to get the GMT time, take off the
+0100 offset) - this field contains a space;
- Field 5 is the request from the browser - this
field contains two spaces;
- Field 6 is the server response code - 200 is
'OK';
- Field 7 is the length of the file sent - here,
513 bytes were sent;
- Field 8 contains the referring page;
- Field 9 contains the User Agent which can include
a lot of spaces as separators.
The part we are interested in is field 9. With the
spaces that have gone before it and field 1 counting as
zero, the first part of field 9 is element 11 like so...
Element  Contents
0        78.0.84.141
1        -
2        -
3        [05/Aug/2007:09:47:41
4        +0100]
5        "GET
6        /wrc/wrc.gif
7        HTTP/1.1"
8        200
9        513
10       "http://ourworld.com ... ocketIndex.htm"
11       "Mozilla/5.0
12       (Windows;
...      ...
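To check the element numbering, here is a small sketch that splits the example log line on whitespace and joins everything from element 11 onwards. The subroutine name 'user_agent' is an assumption made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the UserAgent part of a log line: element 11 to the last element.
sub user_agent {
    my ($line) = @_;
    my @z = split " ", $line;
    return join " ", @z[11 .. $#z];
}

# The example log line from above, put back together as one string.
my $line = '78.0.84.141 - - [05/Aug/2007:09:47:41 +0100] '
         . '"GET /wrc/wrc.gif HTTP/1.1" 200 513 '
         . '"http://ourworld.compuserve.com/homepages/pagrosse/h2oRocketIndex.htm" '
         . '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) '
         . 'Gecko/20070725 Firefox/2.0.0.6"';

print user_agent($line), "\n";
```

The result still carries its surrounding double quotes because they are part of elements 11 and the last element.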
So, let's look at the whole program (written
step by step so you can see what is going on - at the
end, you will see it in only 12 lines).
 1  #!/usr/bin/perl
 2  use warnings;
 3  $n = 0;
 4  open (LOG, "access_log");
 5  while (<LOG>) {
 6    $n++;
 7    chomp;
 8    $a = $_;
 9    @z = split " ", $a;
10    $j = "@z[(11..$#z)]";
11    if (exists $h{$j}) {
12      $h{$j}++;
13    } else {
14      $h{$j} = 1;
15    }
16  }
17  close (LOG);
18  @kl = sort (keys (%h));
19  open (SRT, ">ua.txt");
20  foreach $y (@kl) {
21    print SRT "$h{$y} $y\n";
22  }
23  close (SRT);
24  1 while ($n =~ s/(\d+)(\d\d\d)/$1,$2/);
25  $m = $#kl + 1;
26  1 while ($m =~ s/(\d+)(\d\d\d)/$1,$2/);
27  print "Processed\n $n records.\n $m unique UserAgents.\n";
- Line 3 sets to zero the scalar that will count
the number of lines in the source log file;
- Line 4 opens the log file so that we can start
reading it in;
- Line 5 sets up a loop that will break when we
reach the end of the file unless it is exited
before. Another thing that this line does is to
take the current line in the file and copy it
into the default scalar variable which is called $_;
- Line 6 increments the line counter;
- Line 7 trims the new line character off the end
of the default variable (no variable was passed
explicitly to chomp);
- Line 8 copies the line into the scalar $a
(for the purpose of clarity);
- Line 9 takes the contents of $a
and breaks it up wherever there is a space,
storing the pieces in the array @z;
- Line 10 takes the last elements from @z,
starting at element 11 (the 12th one -
remember that it starts at zero) and going to the
last one (the term '(11..$#z)'
produces a list of numbers so if $#z
equalled 14, it would produce '11 12 13 14').
This is in quotes so the output is a space
separated list of those elements in the array @z
in the same way that it would have been if we
used 'join' but here, we can be more flexible if
needed (we could have specified (1..3 5..9 6 4)
if we wanted to). This list is then stored in $j;
- Lines 11 through to 15 inclusive check for the
existence of a particular hash key (our hash is
called %h). If it doesn't
exist, it creates it and puts a '1' in it (line 14).
If it does exist, it adds one to the number that
is already there (line 12).
In this way, if there is a duplicate, it is
eliminated but, in doing so, it also adds one to
its count;
- Line 16 is the end of the while loop. If there
are more lines to read in the file, it goes back
to line 5,
otherwise, it carries on;
- Line 17 closes the file. Now, we can start
processing the data that we have collected;
- Line 18 takes all of the keys (using the 'keys'
command) and then sorts them (using the 'sort'
command) and then stores that in the key list
array which I've called @kl.
(You could call it anything you like);
- Line 19 sees the start of saving the data. We
open up a file called ua.txt and give it the
handle 'SRT';
- Line 20 places in a variable $y, each of the keys
in the hash, in order;
- Line 21 prints out (to the file) the value in the
hash (which is the frequency for that particular
UserAgent), then a space, then the UserAgent
string;
- Line 23 closes the file and then it is time to
put some information on the terminal. We want to
show how many lines there were in the original
log file and how many unique user agents we
managed to find. We also want them to be human
readable so we will insert commas every third
digit from the right;
- Line 24 takes our count of the total number of
lines, $n, and feeds it
repetitively through a substitution regular
expression. Each time it finds one or more
digits followed by three digits,
it cuts the first chunk into a variable called $1
and the last three digits into a variable called $2.
Then, it puts them back where it found them but
adding a comma between them.
1 while ($n =~ s/(\d+)(\d\d\d)/$1,$2/)
works like this...
'while' will loop as long as the condition is met.
The condition is in the round brackets and the action
is the 1. It would work just as well with a 2 because
the 1 (or whatever you want there) is a constant and
nothing is done with it so it is just removed when
the program is compiled.
If we have a number 12345678 in $n,
we feed it into the left hand half of s/(\d+)(\d\d\d)/$1,$2/
(ie, (\d+)(\d\d\d)) and in $1,
we get 12345 and in $2,
we get 678. The right half
reassembles it so that we then get 12345,678
in $n. This was successful so the
process gets a non-zero value which means that the
while process happens again.
The second time around, we end up with 12,345,678
in $n. Again, this is successful.
However, the third time we go around, the substitution
fails because there are not four or more digits in a row
anywhere in the string. With a failure, the while loop
(whose body, in this case, is just the constant '1') is
exited and the program proceeds with the next statement.
- Line 25 puts the highest index of the key list,
plus one, into a variable called $m. This is the
number of unique UserAgents we have found;
- Line 26 processes $m to make it
readable as well; and,
- Finally, Line 27 prints it out so we can see how
well our program has done.
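The comma-insertion loop used on lines 24 and 26 can be tried on its own. The following is a small sketch that wraps it in a subroutine (the name 'commify' is an assumption) so that it can be reused for any whole number.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert a comma before every group of three digits, working from the right.
sub commify {
    my ($n) = @_;
    # Each successful substitution adds one comma; the loop stops
    # as soon as no run of four or more digits is left.
    1 while ($n =~ s/(\d+)(\d\d\d)/$1,$2/);
    return $n;
}

print commify(12345678), "\n";   # prints 12,345,678
print commify(160000), "\n";     # prints 160,000
print commify(999), "\n";        # prints 999
```

This is only meant for integers: digits after a decimal point would be grouped in the same way.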
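The only place this falls down is where a non-standard
proxy introduces spaces in the string that it sends as a
referrer: the extra spaces push the UserAgent further
along the array so that it no longer starts at element 11.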
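One possible way around this weakness, sketched here as an alternative to the split used in the program, is to pick out the last double-quoted field on the line with a regular expression, so that spaces earlier in the line no longer matter. The subroutine name 'last_quoted_field' and the sample lines are assumptions for the illustration, and this version returns the UserAgent without its surrounding quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the last double-quoted field on the line (the UserAgent),
# or undef if the line does not end with a quoted field.
sub last_quoted_field {
    my ($line) = @_;
    return $line =~ /"([^"]*)"\s*$/ ? $1 : undef;
}

# Made-up log lines; the first has a space inside the referrer.
my @lines = (
    '1.2.3.4 - - [05/Aug/2007:09:47:41 +0100] "GET / HTTP/1.1" 200 513 '
        . '"http://example.com/a b.htm" "Agent/1.0 (one)"',
    '1.2.3.4 - - [05/Aug/2007:09:47:42 +0100] "GET / HTTP/1.1" 200 513 '
        . '"-" "Agent/1.0 (one)"',
);

my %h;
foreach my $line (@lines) {
    my $ua = last_quoted_field($line);
    $h{$ua}++ if defined $ua;   # ++ creates the key at 1 if it is new
}
print "$h{$_} $_\n" foreach sort keys %h;
```

Both sample lines are counted under the same UserAgent even though the first one would have been split into a different number of elements. A UserAgent that itself contains a double quote would still confuse this pattern.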
As for the speed of this program, you can run it with
the 'time' command and as you can see from the
screenshot, with real log data, a 160,000 record log file
produced just over 2,400 unique UserAgents in nearly 5
seconds.
The type of output from this program looks like
this...
138 "-"
4 "Avant Browser (http://www.avantbrowser.com)"
3 "BlackBerry7130e/4.1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/109"
...
1 "DataCha0s/2.0"
140 "FDM 2.x"
6 "Factbot 1.09"
1 "GetRight/6.3"
4644 "Googlebot-Image/1.0"
...
2 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Sky Broadband; InfoPath.2)"
13 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Smart Explorer 6.1; (R1 1.3); .NET CLR 2.0.50727)"
13 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TARGET HQ; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)"
53 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Tablet PC 1.7; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"
...
1 "Mozilla/5.0 (Windows; U; Windows NT 6.0; sv-SE; rv:1.8.1.5) Gecko/20070713 Firefox/2.0.0.5"
29 "Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.8.1.4) Gecko/20070629 Firefox/2.0.0.4"
7 "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.8.1.3) Gecko/20070310 Iceweasel/2.0.0.3 (Debian-2.0.0.3-1)"
1 "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty)"
181 "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.8.1) Gecko/20061023 SUSE/2.0-30 Firefox/2.0"
and so on.
If you don't want the frequency data or the console
output, you can omit lines 3, 6, 11, 12, 13 and 15, the '$h{$y}'
term in line 21, and lines 24, 25, 26 and 27. Also, we can
use the default scalar with split instead of
copying $_ to $a, and
use the default whitespace in split instead of " "
(this cuts out 4 characters - this is
programming), so we can remove line 8 and modify line 9
like so...
 1  #!/usr/bin/perl
 2  use warnings;
 3  open (LOG, "access_log");
 4  while (<LOG>) {
 5    chomp;
 6    @z = split;
 7    $j = "@z[(11..$#z)]";
 8    $h{$j} = 1;
 9  }
10  close (LOG);
11  @kl = sort (keys (%h));
12  open (SRT, ">ua.txt");
13  foreach $y (@kl) {
14    print SRT "$y\n";
15  }
16  close (SRT);
It's now only 16 lines long and some of those
only have a closing brace on them.
Out of curiosity, if this was a competition to see how
few lines we could get this down to (and we
weren't bothered about speed), we could: eliminate line
2, using '-w' on the first line instead; eliminate $j
by condensing lines 6 and 7; and remove the foreach loop
at the end by joining @kl with a
newline escape sequence, and then eliminate @kl
altogether by condensing those two lines like so...
#!/usr/bin/perl -w
open (LOG, "access_log");
while (<LOG>) {
  chomp;
  @z = split;
  $h{"@z[(11..$#z)]"} = 1;
}
close (LOG);
$b = join "\n", sort (keys (%h));
open (SRT, ">ua.txt");
print SRT $b;
close (SRT);
... and now it is down to 12 lines. However,
this does take just over 9 seconds to run using the same
log file.
CGI scripting language
There are a number of languages that will run CGI
(Common Gateway Interface) scripts but some are safer
than others. Whilst most people will be able to write a
script that can take input that is expected and process
it well, the job of hardening a script so that it is
secure enough to allow the public access to it is
another thing altogether.
The public includes honest users, just surfing the
Internet with their browsers, and dishonest users who
might not even be using a browser at all. This latter
group will try to bombard your scripts with values that
are out of range, the wrong type of variable, strings
that are too long, other scripts and so on. They will try
to hack your database and plant stuff on your site.
A language that is outstanding in its ability to
withstand this type of abuse is Perl ( http://www.perl.com/
or, if you are running Windows, click on the 'ActiveState'
link). It is flexible, robust and very easy to learn - it's
even used to teach programming. Apache can run it through
CGI (or embed it using mod_perl) and if
you are building large CGIs, there are ready-made modules
and an extremely large support base on the Web.
Unlike some other languages, Perl won't bail out because
someone sent it a floating point number or a string that was
60,000 bytes long when it was expecting an integer. It is
well worth getting to know Perl because so much of the web
uses it - it also pops up in unexpected places
such as puzzle program generators and the game Frozen
Bubble.
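As a rough sketch of what hardening means in practice, a script can refuse anything that is not exactly what it expects before doing any work with it. The subroutine name 'valid_int', the nine-digit limit and the sample inputs below are assumptions made for the illustration, not part of any particular CGI library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Accept only a short string of digits that falls inside the given range.
sub valid_int {
    my ($value, $min, $max) = @_;
    return 0 unless defined $value;
    return 0 unless $value =~ /\A\d{1,9}\z/;   # digits only, at most 9 of them
    return ($value >= $min && $value <= $max) ? 1 : 0;
}

# Hypothetical form values, as they might arrive from a query string.
foreach my $input ("42", "3.14", "-7", "9" x 60000, "42; rm -rf /") {
    my $shown = length($input) > 20 ? substr($input, 0, 20) . "..." : $input;
    printf "%-25s %s\n", $shown,
        valid_int($input, 1, 100) ? "accepted" : "rejected";
}
```

Only the first value is accepted; the floating point number, the negative number, the 60,000-byte string and the attempt to smuggle in a command are all rejected without the script failing. Perl's taint mode (the '-T' switch) can add a further layer of checking on top of this.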
And yes, I have got past level 70.
|