Automatically interacting with websites
here are some examples of how I create scripts to automatically interact with various websites.
not all of these scripts work, and some never have; they are just experiments.
this is a very simple shell script that I scheduled with cron.
I used it a couple of years ago to monitor snow depths in the Czech
Republic.
#!/bin/sh
fn=/home/itsme/prj/sneeuw/logs/`date +%Y%m%d`.$$
GET http://verkeer2.anwb.org/ash/Tsjechie.html >$fn
start of script not shown.
an attempt at getting tv show information from a website.
my $station="46";
my $date="011027";
my $ua =LWP::UserAgent->new();
$ua->agent("Mozilla/4.76 [en] (X11; U; Linux 2.4.9 i686)");
my $jar= HTTP::Cookies->new();
my $rp1= $ua->request(GET "http://www.veronica.nl/cgi-bin/html/multiguide/show");
$jar->extract_cookies($rp1);
for my $page (qw(top complete_data bottom)) {
my $rq= GET "http://www.veronica.nl/cgi-bin/html/multiguide/$page/tv/$station/$date/";
$jar->add_cookie_header($rq);
#$rq->header("Accept" => "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*");
#$rq->header("Accept-Language" => "en");
#$rq->header("Accept-Encoding" => "gzip");
#$rq->header("Accept-Charset" => "iso-8859-1,*,utf-8");
#$rq->header("Pragma" => "no-cache");
my $rp= $ua->request($rq);
print $rp->content, "\n";
}
another unfinished script: an attempt to parse an online database
with information about how various offenses are fined.
script linked here
example scripts showing how to parse google output.
- a sort of brute-force search over many possible combinations of a search term. this example shows how to search for copies of IDA.
- recover data for xda-developers from the google cache: googlexdadev.pl.
here is a generic script that can be used
to interact with webobjects-based servers.
the general layout of webobjects urls is as follows:
http://<hostname>/cgi-bin/WebObjects/<applicationname.woa>/[ <instanceid> / ]<actiontype>
actiontype can be:
wa = WODirectActionRequestHandler
.../wa[/<classname>][/<actionmethod>]
-> calls "<actionmethod>Action" on <classname>
[ or "defaultAction" if no action specified ]
[ or action on class DirectAction ]
wo = WOComponentRequestHandler
.../wo/[<pagename>/]<sessionid>/<contextid>.<elementid>
WebServerResources = WOResourceRequestHandler
- the instanceid is a number specifying which application instance your request is to be executed on.
- the sessionid is a string of letters and digits.
- the context id is a number that increments with each /wo/ request
- the element id is a unique 'dotted' number identifying a specific
component in the application
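as a sketch of the direct-action case, such a url can be put together from the parts described above; the hostname, application, class and method names below are invented examples, not taken from a real server.

```shell
#!/bin/sh
# build a WebObjects direct-action url from its parts.
# hostname, application, class and method names are made-up examples.
host="www.example.com"
app="MyStore.woa"
class="DirectAction"
method="default"     # the server will call "defaultAction" on $class

url="http://$host/cgi-bin/WebObjects/$app/wa/$class/$method"
echo "$url"
# fetch it with:  GET "$url"   ( or wget / curl )
```

from here on, poking at an application is just a matter of varying the class and method names and looking at what comes back.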
and yet another unfinished script to get radio and tv show information.
script to get the current traffic intensities for the west of holland.
I took about two weeks' worth of these pictures and put them together to
form a time-lapse movie of the traffic.
this script was scheduled every 15 minutes using cron.
#!/usr/bin/perl -w
use strict;
use POSIX;
use Time::Local;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST GET);
use LWP::Simple;
use URI;
use Digest::MD5 qw(md5_hex);
chdir "/home/itsme/prj/sites/anwb/archive";
my $homepage=get "http://www.anwb.nl/servlet/Satellite?pagename=OpenMarket/ANWB_verkeer/PopupVerkeer&regio=randstad";
# the following lines were damaged in the original; this is a reconstruction:
# extract the image url from the page, fetch it, and save it under a
# timestamped name. the exact regex is a guess.
my ($imgfile)= ($homepage =~ m{<img[^>]*src="([^"]+)"}i);
if ($imgfile) {
    my $img= get URI->new_abs($imgfile, "http://www.anwb.nl/")->as_string;
    my $filename= strftime("%Y%m%d%H%M", localtime).".".md5_hex($img).".gif";
    open IMG, ">$filename" or die "open: $filename : $!\n";
    binmode IMG;
    print IMG $img;
    close IMG;
}
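the cron schedule mentioned above would look something like this in a crontab ( the script path here is just an example, not the real one ):

```
*/15 * * * * /home/itsme/prj/sites/anwb/getverkeer.pl
```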
see here for my other page on egroups ( or yahoogroups as it is currently called )
this is a script intended to make a copy of a mailinglist archive.
the script was never quite finished; the 'login' part is still missing.
this may be worked around by logging in manually in a browser, and then copying the cookie into this script.
script to archive the fokke+sukke cartoons
first there was this script,
combined with this script to create indexes.
later both were combined in this perl script.
see this page for more information on these scripts.
script to log in to hotmail, and (sort of) list the contents of the mailbox.
I had plans to write an automated hotmail account creator, but this has
become more difficult since microsoft now uses captchas to prevent
scripted registration. there are a few possible ways around this
though:
- visual captchas can be broken
- I noticed that the number of different captchas returned is limited,
as if they keep a small number of valid captchas around for a couple of minutes.
this makes it possible to create many accounts by manually recognizing just one
captcha.
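the dictionary idea can be sketched like this: hash each captcha image, and remember the answer once it has been typed in manually. the file names and the "md5 answer" format below are my own invention.

```shell
#!/bin/sh
# look up a captcha image in a dictionary file with lines of:  <md5> <answer>
lookup_captcha() {
    sum=`md5sum "$1" | cut -d' ' -f1`
    grep "^$sum " "$2" | cut -d' ' -f2
}

# demo with fake data: store one solved captcha, then recognize it again
printf 'fake gif data' > /tmp/captcha.gif
echo "`md5sum /tmp/captcha.gif | cut -d' ' -f1` xq7w" > /tmp/captcha.dict
lookup_captcha /tmp/captcha.gif /tmp/captcha.dict
```

once the server hands out an image whose hash is already in the dictionary, the answer can be replayed without human help.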
there are other, better, more finished scripts, like httpmail.
this is a combination of a simple scheduled job
#!/bin/sh
cd /home/itsme/prj/sites/trafficnet
name=`date +%Y%m%d%H%M.%W.%w`
/usr/local/bin/wget -a trafficnet.log -O "daily/$name.html" -N http://maps.trafficnet.nl/asp/trafficstats.asp
and this script to create an overview of it.
this project started out with an analysis of the protocol used
by the online banking system of the 'postbank'.
later they made their service available over the internet, which led me
to create these scripts
Here I try to predict what movie will be playing here in delft next week.
I don't think I ever guessed correctly.
this script will make it easier to make wrong guesses based on hard data.
here is an unfinished script for parsing parts of the cia world factbook
here are some attempts to create bigger maps from the small maps delivered
by some websites. one problem I encountered is that the big maps are square,
while large parts of the earth are not, so it is impossible to match them up
accurately.
- this script is in vbscript, and uses geodan to display the netherlands
- this script is in perl, and uses routenet to display the netherlands
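to give an idea of how such a script works: the big map is fetched as a grid of small tiles, one url per tile, which are then pasted together row by row. the url scheme below is invented; each site has its own x/y parameters.

```shell
#!/bin/sh
# generate the tile urls for a 3x2 grid; actually fetching them ( with wget )
# and pasting them together with an image tool is left out here.
tile_urls() {
    for y in 0 1; do
        for x in 0 1 2; do
            echo "http://maps.example.com/tile?x=$x&y=$y&zoom=8"
        done
    done
}
tile_urls
```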
here is a script to sort certain items from
my local computer hardware store by price per significant attribute
( like speed for cpu's, mb for ram, and gb for hd's )
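the sorting itself is simple once the pages are parsed. for example, with lines of "name price gigabytes" ( the data below is made up ), price per gb can be computed and sorted like this:

```shell
#!/bin/sh
# sort "name price gigabytes" lines by price per gb, cheapest first
sort_by_price_per_gb() {
    awk '{ printf "%.4f %s\n", $2/$3, $1 }' | sort -n
}

printf 'disk-a 100 120\ndisk-b 90 80\ndisk-c 200 300\n' | sort_by_price_per_gb
```

the same pipeline works for any attribute: just change which column is divided by which.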
here is a script to automatically take part
in the vpro wetenschapskwis. it should get 'smarter' the longer you let it run.
an improved version that keeps track of known answers in a small database is here; this version also identifies itself to the server as 'wetenschapsbot'
this script I wrote to parse phpbb forum articles from html pages recovered from browser caches and google: parsephpbbforum.pl
how to avoid being scriptable
- use variable grammar, by generating it with something like dadaengine
- create a complicated and obfuscated protocol, with lots of hard to reverse engineer code.
- write the software in a language for which there are no easily accessible reverse engineering tools. ( like using many virtual functions in C++, or objective C, or lisp )
- choose a protocol that is hard to decode with something as simple as tcpdump,
for example by using https and verifying site certificates.
to analyse such a protocol I would have to insert code at the application level, to monitor all http requests
- use captchas for user authentication.
I noticed that for hotmail the number of different captchas is limited, so a dictionary of known captchas can be made
- put text in gifs generated on the fly, with a slightly noisy background, so that it becomes more difficult to create a dictionary mapping gifs to their meaning.
this may be circumvented by using graphics software with the right filtering tools
- try to detect overly regular behaviour, and filter it out.
these will not permanently solve scriptability problems, but at least postpone them.