PC Plus HelpDesk - issue 261

This month, Paul Grosse gives you more insight into some of the topics dealt with in HelpDesk.

From the pages of HelpDesk, we look at:

  • Just what is 'Crippleware';
  • Moving Windows in KDE and GNOME;
  • Automatic quote replacement in OpenOffice.org Writer;
  • Setting a program's priority;
  • Sorting enormous lists;
  • Extracting UserAgent strings from log files; and,
  • CGI scripting language.

HelpDesk

Just what is 'Crippleware'?

In computing, you come across many different terms - 'freeware', 'shareware', 'postcardware' and so on - but you might also find some programs referred to as 'Crippleware'. Just what does that mean?

Many of these terms have been around for years but, for those who are not entirely sure what they mean:

  • Freeware is software that is still copyrighted but costs nothing for the user to install or use;
  • Shareware is fully-functional software that, again, is copyrighted. The user is free to install it and use it but, once it has been used for a certain amount of time (maybe a week, a month or just a number of launches), a payment is required for use of the software to continue. True shareware programs will continue to be fully functional even if no payment has been made - there are no 'register now' or 'only so many days left' nag-screens at all;
  • Postcardware is the same as shareware except that payment is in the form of a postcard - usually of a place close to where the end user lives or uses the software;
  • There are also various licences that come under Free Software - the 'free' here referring to freedom rather than cost. The source is open for anybody to read and improve upon (this is not the same as Freeware). Free Software is also copyrighted, but it is licensed so that you are free to use it and modify it; you can pass it on but you must include the source code. The GNU General Public License is an example of this; and,
  • Crippleware is like shareware, only certain features have been crippled until you make a payment. It might be that there is a time limit after which the program will cease to function; that you cannot save or print anything with it; that a spreadsheet is limited to 100 cells; or that a word processor will only print one page. This sounds as though it might be all right - and in some cases it could well be - but it can also be used to hide serious bugs and, as the name suggests, it is frowned upon.

If the restriction is on time, it might be hiding the fact that the program accumulates files at an alarming rate and gives the user no way of controlling them. With saving, the files might be horrendously large or insecure; with printing, it could be that the writer of the program had no idea how to interface effectively with printer drivers. With limits such as 100 spreadsheet cells or one word processor page, it could be that there are serious issues with memory or speed.

The rule of thumb is that, if there are features that have been crippled, you need to ask why the developers have gone to the trouble of doing it and why they don't want you to see or use those features.


Moving Windows in KDE and GNOME

The multiple-desktop environments in KDE, GNOME and a number of other GUIs make life a lot easier - allowing the user to organise their work better, multi-task and increase productivity. The user is able to switch to another desktop to handle an immediate job and then just switch back to the original job when necessary.

In this way, for example, you can be working on a piece of writing on one desktop, processing images for it on another and surfing the web on a third, all with your email client occupying the full screen on a fourth. If someone comes along and wants you to edit a document (or find out how high a score you can get in Frozen Bubble), you just switch to an empty desktop and work there - using however many desktops you need.

As a computer journalist, I work in this way and when I switch to a Windows environment, it seems positively claustrophobic in comparison.

However, one thing you need to be able to do is to move your windows from one desktop to another.

Normally, right-clicking somewhere on the border or title bar makes a drop-down menu appear, and moving the window to another desktop is one of the options.

Alternatively, you can drag a window over the edge of the desktop into the next one and, if you are using the 3D desktops from Beryl or Compiz (right, on SuSE 10.2), the desktop cube revolves when this happens.

However, if you have the Launch Pager on your panel, you can drag a window from any desktop and drop it into any other - as in the screenshot.

Neither the source nor the destination needs to be the current desktop.


Automatic quote replacement in OpenOffice.org Writer

If you need your quotes to remain as primes, you need to stop automatic quote replacement.

If you use OpenOffice.org (or Word, for that matter), you might find that it has the habit of changing the standard primes (' and ") into curly quotes, so you end up with '6's, '9's, '66's and '99's instead of what you want.

The smart quote substitution in OOo Writer is quite good at *English and American quotes, even when they are parenthetic and mixed with apostrophes. However, if you do a lot of typing that has little speech and includes angles or times - such as astronomical reports - or imperial measurements that include feet and inches, you would probably be better off keeping them the way you type them.

To edit the defaults for when you are typing, click on 'Tools', 'AutoCorrect...' and on the 'Custom Quotes' tab, uncheck the 'Replace' checkboxes for each and then click on 'OK'.

If you want to change other automatic replacements or custom quotes when editing later, click on the 'Options' tab before you click on 'OK' and edit the options at will.

* English quotes start with single quotes. If there is a quote within a quote, it takes double quotes (and then back to single quotes for a quote within that, and so on). Hence, a plain quote has single quotation marks and a quote within a quote has double quotes. eg:

Mike said: ‘We’ve had a lot of rain this afternoon, even though the weatherman said: “It’s going to be sunny,” if I remember correctly.’

American quotes start off with double quotes, using single quotes for a quote within a quote, like so...

Mike said: “We’ve had a lot of rain this afternoon, even though the weatherman said: ‘It’s going to be sunny,’ if I remember correctly.”


Setting a program's priority

It is possible to run a program continuously for several days, taking all of the processor's available power, without it getting in the way of other programs that are running on the computer. This is how it can be done.

The reason that the other programs run as though the long-running program is not there is that, as far as they are concerned, it isn't there. Computers do not possess inertia, so the system can switch from one process to another and back so quickly that they appear to run at the same time.

On the right, you can see the processor is loaded up to the limit with a lower priority program and then, around three quarters of the way through, another program is started. It takes up the processing power that it needs and the other program simply makes way for it. In this way, the new, higher priority program runs as though the other one was never there.

Normally, pretty much all of the programs that you run will have the same priority and they will compete with each other for processor time on an even playing field.

However, you can be more flexible than that.

Some programming languages, such as Perl and C, allow you to set the priority of a program so that, when you run it, it takes all of the processor power that is going spare; but if anything with a higher priority (programs that aren't as 'nice') wants the processor, it gets it.

The normal programs that the user runs, such as the GUI, word processor and so on, will all have priority over it and therefore run at almost their normal speed - the lower priority program isn't starved of processor time completely.

By running the long-term program with a lower priority, you can ensure that you won't be locked out of normal work for several days by it.
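
In Perl, for example, a program can lower its own priority with the built-in setpriority function. Here is a minimal sketch, assuming a Unix-like system...

#!/usr/bin/perl
# A minimal sketch, assuming a Unix-like system: lower this process's
# priority to the 'nicest' setting before starting a long job.
use warnings;

# Arguments: 0 = PRIO_PROCESS, 0 = this process, 19 = lowest priority
setpriority(0, 0, 19) or warn "setpriority failed: $!";

# ...days of number crunching can go here without hogging the machine...

You can achieve the same thing from the shell by starting the program with 'nice' (for example, nice -n 19 ./longjob) or by adjusting a running program with 'renice'.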


Sorting enormous lists

If you collect data for some reason, you can find yourself in the position of having to sort it. For small amounts of data, you might be tempted to use a spreadsheet - they are graphical and have sort routines built in - but there are limits.

Spreadsheets are useful for sorting files but sifting out duplicates can be an annoyance. The main problem is that you are usually limited to some power-of-two number of rows - 8,192 or 65,536, say - which is still short and, even if you did have a larger spreadsheet, it would be overkill for this job. There is an easier way and it's free.

Perl is designed, amongst other things, to process long strings and is used, for example, in chromosome analysis, where you might want to find certain sequences of DNA bases in a gene - so it is quite capable of handling a log file with a million lines in it.
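
As a toy illustration of that sort of string handling (the gene and the motif here are invented for the example), counting the occurrences of a base sequence takes only a few lines...

#!/usr/bin/perl
# A toy example: count how many times a motif occurs in a DNA string.
# Both strings are invented for the illustration.
use warnings;
my $gene  = "ATGGCGTACGATCGATCGTAGCTAGGATCGA";
my $motif = "GATC";
my $count = () = $gene =~ /$motif/g;   # count every match
print "$motif occurs $count times\n";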

Here, you will find a couple of programs that demonstrate how easy it is to do this type of processing (Windows and UNIX versions).

  • The first program - 'filegen' - generates a 'log'-like file that is 1,000,000 lines long. Each line has a probably-unique 32-bit hex code and then a string of 100 bytes (a sketch appears below).
  • The second program - 'sorter' - slurps this log file, sorts the lines according to the hex code, eliminating any duplicates, and then saves the sorted log file (a sketch of the whole program appears after the key lines below). In the screenshot, you can see that the program took just under 12 seconds to sort all 1,000,000 lines.
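
The original listings are not reproduced here, but a minimal sketch of a 'filegen'-style generator might look like this (the output file name and the padding are assumptions)...

#!/usr/bin/perl
# A minimal sketch of a 'filegen'-style program: 1,000,000 lines, each
# with a probably-unique 32-bit hex code and 100 bytes of padding.
# The output file name is an assumption.
use warnings;
open (GEN, ">log.txt") or die "Cannot open log.txt: $!";
for (1..1_000_000) {
    printf GEN "%08x %s\n", int(rand(2**32)), 'x' x 100;
}
close (GEN);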

The key lines are where each log file line (in a scalar called '$a') is split at the spaces into an array (called '@z') and then the whole line is saved in an associative array (called a 'hash') called %h, keyed on the first element of that array, as follows...

$h{$z[0]} = $a;

...which stores the whole line under its hex code, eliminating any duplicate keys (a later line with the same code simply overwrites the earlier one). After this, the keys are all sorted with the line...

@kl = sort (keys (%h));

...which puts the sorted key list into an array that is then accessed sequentially.

If you just wanted to sort the list without eliminating any duplicates, you would use an array and sort that instead of using the hash.
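
Putting those key lines in context, a minimal sketch of the whole 'sorter' program might look like this (the file names are assumptions)...

#!/usr/bin/perl
# A minimal sketch of the 'sorter' program described above.
# The input and output file names are assumptions.
use warnings;
open (LOG, "log.txt") or die "Cannot open log.txt: $!";
  while (<LOG>) {
    chomp;
    $a = $_;
    @z = split " ", $a;
    $h{$z[0]} = $a;        # key on the hex code - duplicates overwrite
  }
close (LOG);
@kl = sort (keys (%h));
open (SRT, ">sorted.txt") or die "Cannot open sorted.txt: $!";
  print SRT "$h{$_}\n" foreach (@kl);
close (SRT);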

This type of programming is useful if you want to look at things like the number of user agents in your web server log file and so on.

So, let's look at that...


Extracting UserAgent strings from log files

A typical log file records a fair amount of data on each transaction that a web server makes. Each line is one transaction and the fields for each line are all separated by spaces. Here is an example...

1  78.0.84.141
2  -
3  -
4  [05/Aug/2007:09:47:41 +0100]
5  "GET /wrc/wrc.gif HTTP/1.1"
6  200
7  513
8  "http://ourworld.compuserve.com/homepages/pagrosse/h2oRocketIndex.htm"
9  "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"

  • Field 1 is the IP address of the client;
  • Field 4 is the date and time, along with any difference from GMT (this is BST, so it has one hour added - to get the GMT time, take off the +0100 offset) - this field contains a space;
  • Field 5 is the request from the browser - this field contains two spaces;
  • Field 6 is the server response code - 200 is 'OK';
  • Field 7 is the length of the file sent - here, 513 bytes were sent;
  • Field 8 contains the referring page;
  • Field 9 contains the User Agent which can include a lot of spaces as separators.

The part we are interested in is field 9. Counting field 1 as element zero and allowing for the extra spaces inside fields 4 and 5, the first part of field 9 is element 11, like so...

0   78.0.84.141
1   -
2   -
3   [05/Aug/2007:09:47:41
4   +0100]
5   "GET
6   /wrc/wrc.gif
7   HTTP/1.1"
8   200
9   513
10  "http://ourworld.com ... ocketIndex.htm"
11  "Mozilla/5.0
12  (Windows;
...

So, let's look at the whole program (written step by step so you can see what is going on - at the end, you will see it in only 12 lines).

 1  #!/usr/bin/perl
 2  use warnings;
 3  $n = 0;
 4  open (LOG, "access_log");
 5    while (<LOG>) {
 6      $n++;
 7      chomp;
 8      $a = $_;
 9      @z = split " ", $a;
10      $j = "@z[(11..$#z)]";
11      if (exists $h{$j}) {
12        $h{$j}++;
13      } else {
14        $h{$j} = 1;
15      }
16    }
17  close (LOG);
18  @kl = sort (keys (%h));
19  open (SRT, ">ua.txt");
20    foreach $y (@kl) {
21      print SRT "$h{$y} $y\n";
22    }
23  close (SRT);
24  1 while ($n =~ s/(\d+)(\d\d\d)/$1,$2/);
25  $m = $#kl + 1;
26  1 while ($m =~ s/(\d+)(\d\d\d)/$1,$2/);
27  print "Processed\n $n records.\n $m unique user agents.\n";
  • Line 3 sets to zero the scalar that will count the number of lines in the source log file;
  • Line 4 opens the log file so that we can start reading it in;
  • Line 5 sets up a loop that ends when we reach the end of the file (unless it is exited before that). Another thing this line does is take the current line of the file and copy it into the default scalar variable, which is called $_;
  • Line 6 increments the line counter;
  • Line 7 trims the new line character off the end of the default variable (no variable was passed explicitly to chomp);
  • Line 8 copies the line into the scalar $a (for the purpose of clarity);
  • Line 9 takes the contents of $a and breaks it up wherever there is a space, storing the pieces in the array @z;
  • Line 10 takes the last elements from @z, starting at element 11 (the 12th one - remember that it starts at zero) and going to the last one (the term '(11..$#z)' produces a list of numbers, so if $#z equalled 14 it would produce '11 12 13 14'). This is in quotes, so the output is a space-separated list of those elements of the array @z, in the same way as if we had used 'join', but here we can be more flexible if needed (we could have specified (1..3, 5..9, 6, 4) if we wanted to). This list is then stored in $j;
  • Lines 11 through to 15 inclusive check for the existence of a particular hash key (our hash is called %h). If it doesn't exist, line 14 creates it and puts a '1' in it. If it does exist, line 12 adds one to the number that is already there. In this way, if there is a duplicate, it is eliminated but, in doing so, it also adds one to its count;
  • Line 16 is the end of the while loop. If there are more lines to read in the file, it goes back to line 5, otherwise, it carries on;
  • Line 17 closes the file. Now, we can start processing the data that we have collected;
  • Line 18 takes all of the keys (using the 'keys' function), sorts them (using 'sort') and stores the result in the key list array, which I've called @kl (you could call it anything you like);
  • Line 19 sees the start of saving the data. We open up a file called ua.txt and give it the handle 'SRT';
  • Line 20 places each of the sorted keys, in turn, into the variable $y;
  • Line 21 prints out (to the file) the value in the hash (which is the frequency for that particular UserAgent), then a space, then the UserAgent string;
  • Line 23 closes the file, and then it is time to put some information on the terminal. We want to show how many lines there were in the original log file and how many unique user agents we managed to find. We also want the numbers to be human-readable, so we will insert commas every third digit from the right;
  • Line 24 takes our count of the total number of lines, $n, and feeds it repeatedly through a substitution regular expression. Each time it finds a run of one or more digits followed by three more digits, it captures the first chunk in the variable $1 and the last three digits in the variable $2. Then, it puts them back where it found them, adding a comma between them.

1 while ($n =~ s/(\d+)(\d\d\d)/$1,$2/) works like this...

'while' will loop as long as the condition is met. The condition is in the round brackets and the loop body is the 1. It would work just as well with a 2, because the 1 (or whatever you put there) is a constant that nothing is done with, so it is simply removed when the program is compiled.

If we have the number 12345678 in $n, it is fed into the left-hand half of s/(\d+)(\d\d\d)/$1,$2/ (ie, (\d+)(\d\d\d)): $1 captures 12345 and $2 captures 678. The right-hand half reassembles them, so we then get 12345,678 in $n. The substitution was successful, so it returns a non-zero value, which means that the while loop runs again.

The second time around, we end up with 12,345,678 in $n. Again, this is successful. The third time around, however, the substitution fails because there are no longer four or more digits at the beginning. With a failure, the while loop is exited and the program proceeds with the next statement.
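
You can watch the idiom at work on its own with a short test script...

#!/usr/bin/perl
# The comma-insertion idiom in isolation.
use warnings;
my $n = 12345678;
1 while ($n =~ s/(\d+)(\d\d\d)/$1,$2/);
print "$n\n";   # prints 12,345,678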

  • Line 25 puts the highest index of the key list, plus one, into a variable called $m. This is the number of unique UserAgents we have found;
  • Line 26 processes $m to make it readable as well; and,
  • Finally, Line 27 prints it out so we can see how well our program has done.

The only place this falls down is where a non-standard proxy introduces spaces into the string that it sends as a referrer, which shifts the element at which the UserAgent starts.

As for the speed of this program, you can run it with the 'time' command and as you can see from the screenshot, with real log data, a 160,000 record log file produced just over 2,400 unique UserAgents in nearly 5 seconds.
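
For example, assuming the script has been saved as ua.pl, you would time it with...

time perl ua.pl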

The type of output from this program looks like this...

138 "-"
4 "Avant Browser (http://www.avantbrowser.com)"
3 "BlackBerry7130e/4.1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/109"
...
1 "DataCha0s/2.0"
140 "FDM 2.x"
6 "Factbot 1.09"
1 "GetRight/6.3"
4644 "Googlebot-Image/1.0"
...
2 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Sky Broadband; InfoPath.2)"
13 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Smart Explorer 6.1; (R1 1.3); .NET CLR 2.0.50727)"
13 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TARGET HQ; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)"
53 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Tablet PC 1.7; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"
...
1 "Mozilla/5.0 (Windows; U; Windows NT 6.0; sv-SE; rv:1.8.1.5) Gecko/20070713 Firefox/2.0.0.5"
29 "Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.8.1.4) Gecko/20070629 Firefox/2.0.0.4"
7 "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.8.1.3) Gecko/20070310 Iceweasel/2.0.0.3 (Debian-2.0.0.3-1)"
1 "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty)"
181 "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.8.1) Gecko/20061023 SUSE/2.0-30 Firefox/2.0"

and so on.

If you don't want the frequency data or the console output, you can omit lines 3, 6, 11, 12, 13, 15 and 24 to 27, along with the '$h{$y}' term in line 21. Also, we can use the default scalar with split instead of copying $_ into $a, and use split's default whitespace instead of " " (this cuts out 4 characters - this is programming), so we can remove line 8 and modify line 9 like so...

 1  #!/usr/bin/perl
 2  use warnings;
 3  open (LOG, "access_log");
 4    while (<LOG>) {
 5      chomp;
 6      @z = split;
 7      $j = "@z[(11..$#z)]";
 8      $h{$j} = 1;
 9    }
10  close (LOG);
11  @kl = sort (keys (%h));
12  open (SRT, ">ua.txt");
13    foreach $y (@kl) {
14      print SRT "$y\n";
15    }
16  close (SRT);

It's now only 16 lines long and some of those only have a closing brace on them.

Out of curiosity, if this were a competition to see how few lines we could get this down to (and we weren't bothered about speed), we could: eliminate line 2 by using '-w' on the first line instead; eliminate $j by condensing lines 6 and 7; and remove the foreach loop at the end by joining @kl with newline characters, then eliminate @kl altogether by condensing those two lines, like so...

 1  #!/usr/bin/perl -w
 2  open (LOG, "access_log");
 3    while (<LOG>) {
 4      chomp;
 5      @z = split;
 6      $h{"@z[(11..$#z)]"} = 1;
 7    }
 8  close (LOG);
 9  $b = join "\n", sort (keys (%h));
10  open (SRT, ">ua.txt");
11    print SRT $b;
12  close (SRT);

... and now it is down to 12 lines. However, this does take just over 9 seconds to run using the same log file.


CGI scripting language

There are a number of languages that will run CGI (Common Gateway Interface) scripts but some are safer than others. Whilst most people will be able to write a script that takes the input it expects and processes it well, the job of hardening a script so that it is secure enough to allow the public access to it is another thing altogether.

The public includes honest users, just surfing the Internet with their browsers, and dishonest users who might not be using a browser at all. This latter group will try to bombard your scripts with values that are out of range, the wrong type of variable, strings that are too long, other scripts and so on. They will try to hack your database and plant things on your site.

A language that is outstanding in its ability to withstand this type of abuse is Perl (http://www.perl.com/ or, if you are running Windows, the ActiveState build). It is flexible, robust and very easy to learn - it is even used to teach programming. Apache can have it built in (via mod_perl) and, if you are building large CGIs, there are ready-made modules and an extremely large support base on the Web.
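
To make that hardening concrete, here is a minimal sketch of the kind of defensive input checking involved, using Perl's standard CGI module - the parameter name 'qty' and its limits are invented for the example...

#!/usr/bin/perl
# A minimal sketch of defensive CGI input handling - the parameter
# name 'qty' and its limits are invented for this example.
use warnings;
use strict;
use CGI;

my $q   = CGI->new;
my $qty = $q->param('qty');

# Never trust the client: insist on a short, purely numeric value
# in a sensible range before using it, and fail politely otherwise.
unless (defined $qty && $qty =~ /^\d{1,4}$/ && $qty >= 1 && $qty <= 1000) {
    print $q->header('text/plain'), "Invalid quantity.\n";
    exit;
}

print $q->header('text/plain'), "You ordered $qty items.\n";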

Unlike some other languages, Perl won't bail out because someone sent it a floating-point number, or a string that was 60,000 bytes long, when it was expecting an integer. It is well worth getting to know Perl because so much of the web uses it - it also pops up in unexpected places, such as puzzle program generators and the game Frozen Bubble.

And yes, I have got past level 70.
