Automatically interacting with websites
here are some examples of how I create scripts to automatically interact with various websites.
not all of these scripts work, and some never have; they are just experiments.
this is a very simple shell script that I scheduled with cron.
I used it a couple of years ago to monitor snow depths in the Czech
Republic.
#!/bin/sh
fn=/home/itsme/prj/sneeuw/logs/`date +%Y%m%d`.$$
GET http://verkeer2.anwb.org/ash/Tsjechie.html >$fn
start of script not shown.
an attempt at getting tv show information from a website.
my $station="46";
my $date="011027";
my $ua =LWP::UserAgent->new();
$ua->agent("Mozilla/4.76 [en] (X11; U; Linux 2.4.9 i686)");
my $jar= HTTP::Cookies->new();
my $rp1= $ua->request(GET "http://www.veronica.nl/cgi-bin/html/multiguide/show");
$jar->extract_cookies($rp1);
for my $page (qw(top complete_data bottom)) {
my $rq= GET "http://www.veronica.nl/cgi-bin/html/multiguide/$page/tv/$station/$date/";
$jar->add_cookie_header($rq);
#$rq->header("Accept" => "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*");
#$rq->header("Accept-Language" => "en");
#$rq->header("Accept-Encoding" => "gzip");
#$rq->header("Accept-Charset" => "iso-8859-1,*,utf-8");
#$rq->header("Pragma" => "no-cache");
my $rp= $ua->request($rq);
print $rp->content, "\n";
}
another unfinished script: an attempt to parse an online database
with information about how various offenses are fined.
script linked here
example scripts showing how to parse google output.
- a sort of brute-force search over many possible combinations of a search term. this example shows how to search for copies of IDA.
- recover data for xda-developers from the google cache: googlexdadev.pl.
here is a generic script that can be used
to interact with webobjects-based servers.
the general layout of webobjects urls is as follows:
http://<hostname>/cgi-bin/WebObjects/<applicationname.woa>/[ <instanceid> / ]<actiontype>
actiontype can be:
wa = WODirectActionRequestHandler
.../wa[/<classname>][/<actionmethod>]
-> calls "<actionmethod>Action" on <classname>
[ or "defaultAction" if no action specified ]
[ or action on class DirectAction ]
wo = WOComponentRequestHandler
.../wo/[<pagename>/]<sessionid>/<contextid>.<elementid>
WebServerResources = WOResourceRequestHandler
- the instanceid is a number specifying which application instance your request is to be executed on.
- the sessionid is a string of letters and digits.
- the context id is a number that increments with each /wo/ request
- the element id is a unique 'dotted' number identifying a specific
component in the application
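as a sketch of the direct-action case, such a url can be put together from the parts described above; the hostname, application, class and method names below are invented examples, not taken from a real server.

```shell
#!/bin/sh
# build a WebObjects direct-action url from its parts.
# hostname, application, class and method names are made-up examples.
host="www.example.com"
app="MyStore.woa"
class="DirectAction"
method="default"     # the server will call "defaultAction" on $class

url="http://$host/cgi-bin/WebObjects/$app/wa/$class/$method"
echo "$url"
# fetch it with:  GET "$url"   ( or wget / curl )
```

from here on, poking at an application is just a matter of varying the class and method names and looking at what comes back.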
and yet another unfinished script to get radio and tv show information.
script to get the current traffic intensities for the west of holland.
I took about two weeks' worth of these pictures and put them together to
form a time-lapse movie of the traffic.
this script was scheduled every 15 minutes using cron.
#!/usr/bin/perl -w
use strict;
use POSIX;
use Time::Local;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST GET);
use LWP::Simple;
use URI;
use Digest::MD5 qw(md5_hex);
chdir "/home/itsme/prj/sites/anwb/archive";
my $homepage=get "http://www.anwb.nl/servlet/Satellite?pagename=OpenMarket/ANWB_verkeer/PopupVerkeer&regio=randstad";
# the following lines were damaged in the original; this is a reconstruction:
# extract the image url from the page, fetch it, and save it under a
# timestamped name. the exact regex is a guess.
my ($imgfile)= ($homepage =~ m{<img[^>]*src="([^"]+)"}i);
if ($imgfile) {
    my $img= get URI->new_abs($imgfile, "http://www.anwb.nl/")->as_string;
    my $filename= strftime("%Y%m%d%H%M", localtime).".".md5_hex($img).".gif";
    open IMG, ">$filename" or die "open: $filename : $!\n";
    binmode IMG;
    print IMG $img;
    close IMG;
}
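the cron schedule mentioned above would look something like this in a crontab ( the script path here is just an example, not the real one ):

```
*/15 * * * * /home/itsme/prj/sites/anwb/getverkeer.pl
```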
see here for my other page on egroups ( or yahoogroups as it is currently called )
this is a script intended to make a copy of a mailinglist archive.
the script was never quite finished; the 'login' part is still missing.
this may be worked around by logging in manually in a browser, and then copying the cookie into this script.
script to archive the fokke+sukke cartoons
first there was this script,
combined with this script to create indexes.
later both were combined in this perl script.
see this page for more information on these scripts.
script to log in to hotmail, and (sort of) list the contents of the mailbox.
I had plans to write an automated hotmail account creator, but this has
become more difficult since microsoft now uses captchas to prevent
scripted registration. there are a few possible ways around this
though:
- visual captchas can be broken
- I noticed that the number of different captchas returned is limited,
as if they keep a small number of valid captchas around for a couple of minutes.
this makes it possible to create many accounts by manually recognizing just one
captcha.
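the dictionary idea can be sketched like this: hash each captcha image, and remember the answer once it has been typed in manually. the file names and the "md5 answer" format below are my own invention.

```shell
#!/bin/sh
# look up a captcha image in a dictionary file with lines of:  <md5> <answer>
lookup_captcha() {
    sum=`md5sum "$1" | cut -d' ' -f1`
    grep "^$sum " "$2" | cut -d' ' -f2
}

# demo with fake data: store one solved captcha, then recognize it again
printf 'fake gif data' > /tmp/captcha.gif
echo "`md5sum /tmp/captcha.gif | cut -d' ' -f1` xq7w" > /tmp/captcha.dict
lookup_captcha /tmp/captcha.gif /tmp/captcha.dict
```

once the server hands out an image whose hash is already in the dictionary, the answer can be replayed without human help.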
there are other, better, more finished scripts, like httpmail.
this is a combination of a simple scheduled job
#!/bin/sh
cd /home/itsme/prj/sites/trafficnet
name=`date +%Y%m%d%H%M.%W.%w`
/usr/local/bin/wget -a trafficnet.log -O "daily/$name.html" -N http://maps.trafficnet.nl/asp/trafficstats.asp
and this script to create an overview of it.
this project started out with an analysis of the protocol used
by the online banking system of the 'postbank'.
later they made their service available over the internet, which led me
to create these scripts
Here I try to predict what movie will be playing here in delft next week.
I don't think I ever guessed correctly.
this script will make it easier to make wrong guesses based on hard data.
here is an unfinished script for parsing parts of the cia world factbook
here are some attempts to create bigger maps from the small maps delivered
by some websites. one problem I encountered is that the big maps are square,
while large parts of the earth are not, so it is impossible to match them up
accurately.
- this script is in vbscript, and uses geodan to display the netherlands
- this script is in perl, and uses routenet to display the netherlands
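to give an idea of how such a script works: the big map is fetched as a grid of small tiles, one url per tile, which are then pasted together row by row. the url scheme below is invented; each site has its own x/y parameters.

```shell
#!/bin/sh
# generate the tile urls for a 3x2 grid; actually fetching them ( with wget )
# and pasting them together with an image tool is left out here.
tile_urls() {
    for y in 0 1; do
        for x in 0 1 2; do
            echo "http://maps.example.com/tile?x=$x&y=$y&zoom=8"
        done
    done
}
tile_urls
```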
here is a script to sort certain items from
my local computer hardware store by price per significant attribute
( like speed for cpu's, mb for ram, and gb for hd's )
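the sorting itself is simple once the pages are parsed. for example, with lines of "name price gigabytes" ( the data below is made up ), price per gb can be computed and sorted like this:

```shell
#!/bin/sh
# sort "name price gigabytes" lines by price per gb, cheapest first
sort_by_price_per_gb() {
    awk '{ printf "%.4f %s\n", $2/$3, $1 }' | sort -n
}

printf 'disk-a 100 120\ndisk-b 90 80\ndisk-c 200 300\n' | sort_by_price_per_gb
```

the same pipeline works for any attribute: just change which column is divided by which.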
here is a script to automatically take part
in the vpro wetenschapskwis. it should get 'smarter' the longer you let it run.
an improved version that keeps track of known answers in a small database is here; this version also identifies itself to the server as 'wetenschapsbot'
this script I wrote to parse phpbb forum articles from html pages recovered from browser caches and google: parsephpbbforum.pl
how to avoid being scriptable
- use variable grammar, by generating it with something like dadaengine
- create a complicated and obfuscated protocol, with lots of hard to reverse engineer code.
- write the software in a language for which there are no easily accessible reverse engineering tools. ( like using many virtual functions in C++, or objective C, or lisp )
- choose a protocol that is hard to decode with something as simple as tcpdump,
for example by using https and verifying site certificates.
to analyse such a protocol I would have to insert code at the application level, to monitor all http requests
- use captchas for user authentication.
I noticed that for hotmail the number of different captchas is limited, so a dictionary of known captchas can be made
- put text in gifs generated on the fly, with a slightly noisy background, so that it becomes more difficult to create a dictionary mapping gifs to their meaning.
this may be circumvented by using graphics software with the right filtering tools
- try to detect overly regular behaviour, and filter it out.
these will not permanently solve scriptability problems, but at least postpone them.