Thursday, 16 May 2013

Raw Data-Mining: using Wget with TOR

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Raw data-mining can be seen as ripping all information from a database. This results in a huge amount of traffic at the server level, causing latency on the web service it provides. Therefore some sysadmins configure their servers to block any IP that starts demanding a huge (abnormal) load within a certain timeframe. To bypass this protection, we should build a system which uses a different IP for each request. For this TOR might come in handy. At least if the server farm is not blocking the TOR exit nodes. :-)

With scripted data-mining, we want to use wget to rip the data and tunnel it through the TOR anonymity network to avoid IP blocking at the server farm. One way to do this is to combine wget, TOR, and Privoxy.

Explanation: Tor exposes a SOCKS proxy through which your data is sent over the network in a pretty anonymous fashion. The problem is that Tor does not offer an HTTP proxy, which is what wget can use. To get around this you can install Privoxy, which lets you connect to TOR via a simple HTTP proxy.

So, let's get started.

Step 1 - Install the stuff
You can install everything you need with the following command:
sudo apt-get install -y tor tor-geoipdb privoxy

Step 2 - Configuration
There are a few things that need to be configured.
1. /etc/wgetrc
Find line starting with: #http_proxy =
Replace whole line with: http_proxy = http://localhost:8118

2. /etc/privoxy/config
Add the following to the top of the file (9050 is Tor's default SOCKS port; the trailing dot is part of the syntax)
listen-address localhost:8118
forward-socks5 / localhost:9050 .
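
If you would rather not change /etc/wgetrc system-wide, the same proxy setting can be passed per invocation. The sketch below is a minimal wrapper, assuming Privoxy listens on localhost:8118 as in the configuration above; the function name torwget and the PRIVOXY_ADDR variable are made up for this example. It relies on wget's -e option, which executes a wgetrc-style command for that run only.

```shell
#!/bin/sh
# torwget: hypothetical wrapper that sends a single wget run through Privoxy.
# Assumes Privoxy listens on localhost:8118 (override with PRIVOXY_ADDR).
torwget() {
    torwget_proxy="http://${PRIVOXY_ADDR:-localhost:8118}"
    # -e runs a wgetrc command for this invocation only;
    # https_proxy is set too so that https:// URLs also go through Privoxy.
    wget -e use_proxy=on \
         -e "http_proxy=$torwget_proxy" \
         -e "https_proxy=$torwget_proxy" \
         "$@"
}
```

For example, torwget --wait=2 --random-wait -i urls.txt would fetch every URL listed in a (hypothetical) urls.txt with a randomized pause between requests, which is gentler on the target server.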

Step 3 - Start everything up
sudo service tor restart; sudo service privoxy restart

Now when you use the wget command, your data will be tunneled through the TOR network. When you run wget you will see lines like the following:
Resolving localhost...
Connecting to localhost||:8118... connected.
The :8118 shows that your connection is going to Privoxy, which in turn forwards it to TOR.
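
To confirm that requests really leave through a TOR exit node, one option is to ask the Tor Project's check service. The sketch below assumes that https://check.torproject.org/api/ip answers with a small JSON document containing an "IsTor" field, and that Privoxy listens on localhost:8118; the function names are made up for this example.

```shell
#!/bin/sh
# is_tor_response: succeeds if the JSON on stdin reports "IsTor": true.
# The response format is an assumption about the check service.
is_tor_response() {
    grep -q '"IsTor"[[:space:]]*:[[:space:]]*true'
}

# check_tor: fetch the check page through Privoxy and inspect the answer.
# https_proxy is passed explicitly because the wgetrc change in step 2
# only sets http_proxy.
check_tor() {
    wget -qO- -e use_proxy=on -e https_proxy=http://localhost:8118 \
        https://check.torproject.org/api/ip | is_tor_response
}
```

Running check_tor && echo "going through TOR" prints the message only when the service reports a TOR exit address.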

Note: Your download speeds will be significantly reduced because your data is tunneled through the TOR network. The configuration of TOR itself is outside the scope of this article.
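
One detail is still worth a sketch, because it relates to the goal of changing IP between requests: TOR keeps reusing a circuit (and thus an exit IP) for a while, and fresh circuits can be requested by sending SIGNAL NEWNYM to Tor's control port. The example below assumes that ControlPort 9051 and a HashedControlPassword have been enabled in /etc/tor/torrc, which is not part of the setup described here, and that nc (netcat) is installed; the function names are made up for this example.

```shell
#!/bin/sh
# newnym_commands: print the control-protocol lines that authenticate
# with the given password and ask Tor for fresh circuits.
newnym_commands() {
    printf 'AUTHENTICATE "%s"\r\nSIGNAL NEWNYM\r\nQUIT\r\n' "$1"
}

# request_new_circuit: send those lines to the control port.
# Assumes ControlPort 9051 on localhost (not enabled by default).
request_new_circuit() {
    newnym_commands "$1" | nc localhost 9051
}
```

A mining loop could then call request_new_circuit with the control password every few downloads. Tor rate-limits this signal, so calling it before every single request is unlikely to yield a new IP each time.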

1 comment:

  1. This is a smart way to do data mining, and the TOR network will act as a throttle on the bandwidth, so server admins really don't have much to complain about. You may also wish to look at the --header option to send custom headers to some sites, like "Referer: /", and at setting the user agent with -U or --header "User-Agent: "

    I also noticed that Ubuntu is frozen (as of 12-26-2014) at version 1.13 of Wget, and this version does not handle SSL very well. Curl is being promoted heavily, yet it does not perform recursion. Here is a link to the resolution:
    Except you have to issue two more commands, make and make install:
    tar -zxvf wget-1.15.tar.gz
    cd wget-1.15
    ./configure --prefix=/usr/local --with-ssl=openssl  # installs the binary to /usr/local/bin/wget
    make ### added this
    make install ### added this
    # replace wget, confirm
    ##cp /usr/local/bin/wget /usr/bin/wget ## this wasn't necessary

    If you are running in a chroot like me, the install will fail as a regular user, but that's okay: the new wget will be in the src/ directory. Just make a new shortcut (symlink) to it and it should run fine from the new location. It's also probably a good idea to keep the source code for all the programs that you like; they seem to disappear or get broken, especially the useful ones that can be considered "force-multipliers".

    On a philosophical note, if GOD has rules about data I wouldn't be surprised if there is a rule that says all the data that you justly possess in your lifetime belongs to you and your Godhead for the rest of your existence.