Middleman filtering proxy server
(c)2002 Jason McLaughlin
<jasonmc@sympatico.ca>
http://www.sourceforge.net/projects/middle-man
Introduction
Middleman is a robust proxy server with many features designed to remove unwanted content, increase privacy, and simply make surfing the Web a more pleasant experience. Highlights include banner and popup blocking, HTTP and FTP content caching, NTLM and Basic authentication when forwarding through another proxy server, regular expression substitution in downloaded files and HTTP headers, regular expression substitution on requested URLs, many URL commands to temporarily change the proxy settings or to view information about a requested file, complete support for HTTP/1.1 including persistent connections and gzip encoding, and an intuitive Web interface for configuring the proxy.
Installation
Installing Middleman should be straightforward. After
extracting the archive, type "./configure && make".
If you're using a BSD operating system you will need to use "gmake"
rather than make; if that's unavailable, as a last resort you can use
BSD's make and then enter the "gcc -o mman *.o -pthread -lz"
command afterward. There are several compile-time options available
for the configure script; type "./configure --help" to see
a complete list.
If you wish to have the proxy server loaded
at boot time, there is a script in the "scripts" directory
called mman.init to assist you with that. Simply edit the paths at
the top, then copy it to the "/etc/rc.#" directory, where #
is your current runlevel (if you're unsure what it is, use the
"runlevel" command). You may need to rename the script; if
you're using a Debian-based distribution, the naming scheme for init.d
scripts is in the form "S##program", where ## is the order
in which the script is loaded, and "program" is the
program's name.
There are several command line options you may
use when loading the proxy server; at the very least you will need to
use the -c option followed by the path to the configuration file. The
-p option can be used to have Middleman check (and create) a file
containing the PID of the proxy server; this can be used to prevent
multiple instances of the proxy server from running concurrently. The
-l option can be used to specify the path to the logfile if the
--enable-syslog option wasn't used during compilation, and -d to
specify the level of detail which should be logged; use -h for a
complete list of loglevels.
Using
Once the proxy server is running, you'll then need to
configure your web browser to use it.
If you're using Mozilla,
open up the edit menu and click on preferences. Expand the "advanced"
options then click on "Proxies". Click on the "Manual
proxy configuration" radio button then fill in the HTTP and
HTTPS fields with the IP address and port of the proxy; if you're
using the default configuration, the port will be 8080.
If
you're using Konqueror, open up the settings menu and click on
"Configure Konqueror". Click on the icon labeled "Proxy"
in the left pane, click the "Use proxy" checkbox and then
the "Manual proxy configuration" checkbox. Click on setup
to the right of that then fill in the HTTP and HTTPS fields with the
IP address and port of the proxy.
URL commands
URL commands can be used to show information about a webpage and to bypass certain features. For proxy requests, URL commands are prefixed onto the hostname of the website; for example, "bypass..www.somesite.com" would bypass all types of filtering. For regular HTTP requests (such as when the proxy is being used to redirect HTTP requests), an extra path element is added to the front of the requested file with the URL command inside; for example, "http://proxyip:port/bypass../somefile". URL commands are not only taken from the request URL, but from the Referer header sent by your browser as well; this allows them to work for images and files loaded from a website a URL command was used on. Additionally, URL commands are automatically prefixed to the Location: header sent back when a 302 redirect is received or when a redirect rule that sends a 302 redirect matches. Below is a list of all available URL commands and a description of what they do. |
|
bypass |
Bypass some or all features; to specify which features to bypass, add a set of square brackets after the URL command with any of the following letters representing what features to bypass: 'f' (URL filter), 'r' (redirecting), 'w' (rewrite), 'h' (header filter), 'm' (MIME filter), 'c' (cookie filter), 'e' (external parser), 'p' (forwarding), 'k' (keyword filtering), 'd' (DNS blacklist), and 'l' (limits). A '+' or '-' can be used to alternate between bypassing and un-bypassing a feature, in case it was already bypassed by the access rule. For example, "bypass[f-rw]..website.com" will bypass filtering, and un-bypass redirecting and rewriting. |
filter |
Show which filter rule, if any, matches the requested URL. |
mime |
Show which mime rule, if any, matches the requested URL. |
score |
Show the keyword score for the requested URL. |
diff |
Display changes made by rewrite rules to the requested URL. |
cache |
Display information about the cached file. |
offline |
Send file only if it's cached, ignoring expiry time. |
fresh |
Fetch a fresh copy of the file from the server. |
headers |
Show client and server headers for requested URL. |
raw |
Show raw contents of file or FTP directory listing. |
proxytest |
Test a forwarding proxy by having it connect back and display headers. |
htmltree |
Display a parsed HTML tree for the specified URL. |
profiles |
Display a list of profiles currently enabled for the requested URL. |
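As a rough illustration of the two prefixing schemes described above, the following Python sketch builds URLs that carry a URL command. The helper names (bypass_command, with_url_command) are invented for this example and are not part of Middleman; the output simply follows the "command..host" and "/command../file" forms shown earlier.

```python
def bypass_command(bypass: str = "", unbypass: str = "") -> str:
    """Build a bypass URL command such as 'bypass[f-rw]'."""
    if not bypass and not unbypass:
        return "bypass"  # no brackets: bypass all features
    flags = bypass
    if unbypass:
        flags += "-" + unbypass  # letters after '-' are un-bypassed
    return f"bypass[{flags}]"


def with_url_command(url: str, command: str, proxy_request: bool = True) -> str:
    """Attach a URL command to a URL.

    proxy_request=True  -> command is prefixed onto the hostname
    proxy_request=False -> command becomes an extra leading path element
    """
    scheme, _, rest = url.partition("://")
    host, _, path = rest.partition("/")
    if proxy_request:
        return f"{scheme}://{command}..{host}/{path}"
    return f"{scheme}://{host}/{command}../{path}"
```

For instance, with_url_command("http://website.com/", bypass_command("f", "rw")) produces the "bypass[f-rw]..website.com" form used in the bypass description.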
Configuration
Most of the configuration is made easy by the Web
interface; however, it may be necessary to manually edit the
configuration file to change network settings if the default is
unusable on your configuration. The snippet of XML below shows what
the configuration section looks like:
<network>
<listen>
<ip>127.0.0.1</ip>
<port>8080</port>
</listen>
</network>
Each <listen>
section inside the <network> section has an <ip> and
<port> option, which hold the IP address
and port number to listen on, respectively. You may leave out the
<ip> option to have Middleman listen on all interfaces.
Middleman, by default, can listen on up to 20 ports at a time.
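If you want to check which addresses a configuration file asks Middleman to listen on before restarting it, a small script can read the network section. The sketch below is only an illustration: it assumes the section looks exactly like the snippet above, and the function name listen_addresses and the "*" placeholder for "all interfaces" are choices made for this example.

```python
import xml.etree.ElementTree as ET

# The network section as shown in this manual.
SNIPPET = """<network>
<listen>
<ip>127.0.0.1</ip>
<port>8080</port>
</listen>
</network>"""


def listen_addresses(xml_text: str):
    """Return a list of (ip, port) pairs, one per <listen> section."""
    root = ET.fromstring(xml_text)
    addresses = []
    for listen in root.iter("listen"):
        ip = (listen.findtext("ip") or "").strip()
        port = int(listen.findtext("port").strip())
        # A missing <ip> means "listen on all interfaces"; shown here as "*".
        addresses.append((ip or "*", port))
    return addresses
```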
The
default configuration also limits access to only allow requests from
127.0.0.1. If you are unable to configure the proxy through the Web
interface on the system the proxy is running on, manual adjustments
will need to be made to the configuration file. Search for the
<access> section; within the <allow> inside it you should
see an <ip> tag. Replace the “127\.0\.0\.1” with a
regular expression matching the IP addresses you wish to access the
proxy from, or just remove the <ip> tag altogether to allow
access from any machine.
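The backslashes in “127\.0\.0\.1” matter because an unescaped dot in a regular expression matches any character. The Python sketch below shows the difference. It uses Python's re module and whole-string matching, both of which are assumptions made for illustration; Middleman's own regular expression engine and anchoring rules may differ, and the 192.168.1.x pattern is just an example network.

```python
import re


def ip_allowed(pattern: str, client_ip: str) -> bool:
    """Return True if the whole client address matches the pattern."""
    return re.fullmatch(pattern, client_ip) is not None


# Escaped dots match only literal dots.
LOCAL_ONLY = r"127\.0\.0\.1"
# Hypothetical pattern allowing any host on a 192.168.1.x network.
LAN = r"192\.168\.1\.[0-9]+"
```

With these patterns, ip_allowed(LAN, "192.168.1.42") is true, while an unescaped pattern such as "127.0.0.1" would also accept strings like "127x0y0z1".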
As mentioned above, all other configuration settings can be modified through the Web interface. To access this, while using the proxy load "http://mman" in your browser.
Once you've loaded the Web interface, you will see a
page with several links available at the top.
The "Active
connections" link will display a page showing all connections
currently being handled by the proxy.
The "DNS cache"
link is for debugging purposes only, and will display entries in the
DNS cache.
The "Show headers" link will bring you
to a page showing all the HTTP headers your browser sends, and what
they look like after being filtered. Note: to avoid confusion, headers
handled by Middleman itself aren't shown.
The "Save
settings" link will bring you to a page with a Filename dialog
where you can save all current settings; by default it will be filled
with the path to the configuration file given when the proxy server
was loaded.
The "Load settings" link will also bring
you to a page with a Filename dialog, as well as an "Overwrite"
option. The overwrite option can be used to select whether the
settings contained in the configuration file will overwrite all
current settings or simply be added to them.
The "View
log entries" link will bring you to a page showing recent
entries made to the logfile, and will allow you to search through
them using regular expressions. The log buffer can also be cleared
from here, and its size adjusted. The level of logging
detail available through the web interface is unaffected by the
options given on the command line, and always includes all log entries
with the exception of debug messages.
The "View cache
entries" link will bring you to a page showing cached files, and
give you the option to search through and selectively delete
them.
The "Connection pool" link will bring you to a
page showing connections currently being held open in the connection
pool awaiting reuse.
The “Prefetch queue” link will bring you to
a page showing all prefetch requests in the queue and an option to
add additional files to it.
The "Config" link will
bring you to a page where all configuration settings can be accessed.
On the main page you will see a dialog with a drop down list
containing the name of each section, as well as a table with a list
of each section and an enable/disable radio button beside it; this
can be used to quickly enable/disable a feature if it's causing
problems with a website.
When you select an item in the drop
down list and click on the submit button, you will be brought to a
page containing a dialog at the top as well as a list of entries for
that section below. The dialog at the top will always contain an
"add" link, which can be used to add an additional entry to
the section, and in some cases will have several other options which
will be explained below. Each entry at the bottom has an "Edit",
"Delete", "Up", "Down", "Top",
and "Bottom" link. The edit link will bring you to a dialog
where you can edit that specific entry; the delete link will remove
it from the section. The "Up" and "Down" links
allow you to change the order of the entries; this is important in
cases where more than one entry can match the same thing. The
"Top" and "Bottom" links can be used to move the
entry to the very top or bottom of the list.
All entries for
all sections have an 'Enabled' option which allows you to
disable a specific entry, a 'Comment' option to describe the purpose
of the entry, and a 'Profiles' option.
The 'Profiles' option
can be used to have separate configuration settings for different
users; a comma separated list is used to specify each configuration
profile that entry belongs to, and that entry will only be enabled
for users in one of those profiles. The 'Profiles' option in Access
entries is used to specify which configuration profiles are enabled
for connections matching that entry. Any profile name starting with a
'!' (exclamation mark) will enable the entry only if the profile is
not enabled.
Several sections follow an allow/deny/policy
model; for these sections, each entry has an action option which will
specify what happens when it is found to match. If no matching entry
is found, the action the policy is set to will be taken. It is
important to remember that all entries with an action opposite to the
policy are searched first, and if nothing is found the entries with
an action the same as the policy are not searched. So, for example,
if the policy for the access section is set to "allow", and
no entries with a "deny" action are found matching the
connection, none of the entries with an "allow" action are
looked at, so any access limitations specified in the allow entry are
ignored.
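The search order described above can be sketched in a few lines of Python. This is an illustration of the documented behaviour only, not Middleman's code: the entry format (a dict with an "action" key) and the matches callback are assumptions made for the example. The text does not say whether entries with the policy's action are consulted once an opposite entry has matched, so the sketch simply stops at the first opposite match.

```python
def decide(policy, entries, matches):
    """Return (action, matching_entry) under the allow/deny/policy model.

    policy  -- "allow" or "deny"
    entries -- list of dicts, each with an "action" key
    matches -- callable deciding whether an entry matches the request
    """
    opposite = "deny" if policy == "allow" else "allow"
    # Only entries whose action is opposite to the policy are searched.
    for entry in entries:
        if entry["action"] == opposite and matches(entry):
            return opposite, entry
    # Nothing opposite matched: the policy applies, and entries with the
    # same action as the policy are never looked at.
    return policy, None
```

Note that with a policy of "allow" and no matching "deny" entry, the function returns no entry at all, which is why any limitations attached to an "allow" entry would be ignored in that situation.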
The tables below will describe all the options
available in each section and the entries within them.
--- Global section ---
Purpose |
|
The global section gives access to configuration options that affect the overall operation of the proxy server. |
|
General subsection |
|
Connection timeout |
The timeout in seconds to wait for a connection to be established before giving up. |
Timeout |
The timeout in seconds to wait for a client to make the initial HTTP request. |
Keepalive timeout |
The timeout in seconds to wait for keepalive requests. |
Maximum buffer size |
The maximum size in bytes of files that are buffered and processed by the rewrite, keyword, and external features. |
Temporary directory |
The directory temporary files are stored in. |
CONNECT ports |
The ports outgoing CONNECT requests are allowed to be made to; each port or port range should be separated by a comma. A port range is a lower and upper port separated by a hyphen; either may be omitted to allow the lowest or highest possible ports. For example, "-1024, 8888, 6660-6669" will allow CONNECT requests to be made on ports 0 to 1024, 8888, and 6660 to 6669. |
Connection pool size |
The number of keep-alive connections to HTTP and FTP servers to keep in the connection pool; these connections will be shared between threads. |
Connection pool timeout |
The time in seconds a connection may remain in the connection pool before being removed. |
Always compress mimetype |
A regular expression matching the MIME-types which should always be buffered and compressed even if they wouldn't be buffered otherwise. |
Compress outgoing |
Toggle gzip or deflate encoding of outgoing processed content if the client supports it. If the proxy server is running locally, it is recommended you disable this feature. |
Compress incoming |
This option will make Middleman attach an Accept-Encoding header that lets the Web server know we can accept gzip and deflate content encodings regardless of whether or not the browser making the request supports it; if the browser doesn't support it, it will be buffered and decompressed before sending. |
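The port list format used by the CONNECT ports option above can be read as follows. The Python sketch is a hedged illustration of the documented syntax, not the parser Middleman uses; the function names and the 0 to 65535 bounds for an omitted lower or upper port are assumptions.

```python
def parse_port_ranges(spec: str, lowest: int = 0, highest: int = 65535):
    """Parse a list such as "-1024, 8888, 6660-6669" into (low, high) pairs."""
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            low, _, high = item.partition("-")
            # An omitted bound falls back to the lowest/highest possible port.
            ranges.append((int(low) if low.strip() else lowest,
                           int(high) if high.strip() else highest))
        else:
            port = int(item)
            ranges.append((port, port))
    return ranges


def port_allowed(port: int, spec: str) -> bool:
    """Return True if the port falls inside any range in the list."""
    return any(low <= port <= high for low, high in parse_port_ranges(spec))
```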
FTP subsection |
|
Passive mode |
Use passive mode for FTP transfers; this is useful if you are behind a firewall that prevents the FTP server from opening a connection to you. |
Timeout |
The timeout to wait for a response to commands sent to the FTP server. |
Anonymous login |
The login to use when none is explicitly given in the URL. |
Anonymous password |
The password to use when none is explicitly given in the URL. |
Sort order |
The order in which FTP directory listings are sorted. |
Sort field |
The field by which FTP directory listings are sorted. |
DNSBL subsection |
|
Template |
The template to send when a domain is found to be blocked. |
Domain |
The DNSBL domain that the domain being checked is prefixed to; e.g. in.dnsbl.org will cause a lookup for bad.com.in.dnsbl.org to be made when a page from the bad.com domain is requested. |
Blocked IP addresses |
A comma separated list of IP addresses that can be returned when doing the DNS lookup which will cause the page to be blocked. |
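To make the DNSBL options concrete, the sketch below builds the lookup name and checks a lookup result against the blocked list. It is a minimal illustration under stated assumptions: the function names are invented, no DNS query is actually performed, and the 127.0.0.x addresses in the usage note are example values rather than anything Middleman prescribes.

```python
def dnsbl_query_name(domain: str, dnsbl_domain: str) -> str:
    """Prefix the domain being checked onto the DNSBL domain."""
    return f"{domain.strip('.')}.{dnsbl_domain.strip('.')}"


def is_blocked(returned_ips, blocked_ips_option: str) -> bool:
    """Return True if any address from the lookup is in the blocked list.

    returned_ips       -- addresses the DNS lookup returned
    blocked_ips_option -- the comma separated "Blocked IP addresses" value
    """
    blocked = {ip.strip() for ip in blocked_ips_option.split(",") if ip.strip()}
    return any(ip in blocked for ip in returned_ips)
```

For example, dnsbl_query_name("bad.com", "in.dnsbl.org") gives the "bad.com.in.dnsbl.org" name from the Domain description, and is_blocked(["127.0.0.2"], "127.0.0.2, 127.0.0.3") reports a block.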
Prefetch subsection |
|
Threads |
The number of threads to run in the background prefetching files. Middleman needs to be restarted for this setting to take effect. |
Queue size |
The size of the prefetch queue. |
Tags |
A comma separated list of HTML tags and their attributes where URLs that should be prefetched are found. Each item should have a tag and attribute separated by a colon. For example, to have all images prefetched, you would use “img:src”. You may also add a third option, separated by another colon, to each tag/attribute set; it sets the maximum level of recursion if the URL in the attribute leads to another HTML page. A value of 0 indicates infinite recursion; the default is 1. |
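The Tags format above can be read as a list of tag:attribute[:depth] items. The following Python sketch parses such a list; it is only an illustration of the documented syntax, and the function name and the tuple layout are choices made for this example.

```python
def parse_prefetch_tags(option: str):
    """Parse e.g. "img:src, a:href:2" into (tag, attribute, depth) tuples.

    depth is the maximum recursion level; 0 means infinite recursion and
    1 is used when no third field is given.
    """
    rules = []
    for item in option.split(","):
        fields = [field.strip() for field in item.split(":")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # skip empty or incomplete items
        depth = int(fields[2]) if len(fields) > 2 and fields[2] else 1
        rules.append((fields[0], fields[1], depth))
    return rules
```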
--- Network section ---
Purpose |
|
The network section is used to configure general network settings. The configuration file must be saved and the proxy server has to be restarted before any changes take effect. |
|
Entry options |
|
IP |
The IP address of the interface to bind to; leave empty to have the proxy listen on all interfaces. |
Port |
The port number to listen on. |
--- Profiles section ---
Purpose |
|
The profiles section allows groups of profiles to be enabled or disabled based on the URL being requested. This is useful to enable or disable groups of related entries in other sections together. |
|
Entry options |
|
Host |
A regular expression matching the hosts this entry applies to; leave empty to match everything. |
File |
A regular expression matching the files this entry applies to; leave empty to match everything. |
Portrange |
A comma separated list of ports or port ranges this entry applies to. |
Added profiles |
A comma separated list of profiles to add when this entry matches. |
Removed profiles |
A comma separated list of profiles to remove when this entry matches. |
URL Command |
A comma separated list of URL commands which will activate this entry. If left empty, this entry is enabled regardless of which URL commands are used. |
--- Access section ---
Purpose |
|
The access feature is used to control who can access the proxy server, and to what extent. |
|
Global options |
|
Policy |
Default action to take when no matching entry is found. |
Entry options |
|
IP Address |
A regular expression matching the IP addresses this entry applies to; leaving it blank will cause the entry to match everything. |
PAM authentication |
If this option is selected, clients will be required to authenticate with the proxy and PAM will be used to authenticate the username and password. |
Username |
If this field is not empty, clients matching this entry will be required to authenticate with the proxy server. There can be more than one entry matching the same IP address, in which case the one matching the username/password sent by the browser is used. This option is a regular expression. |
Password |
The client's password if the username field is used. |
Access |
A list of features that connections matching this entry are allowed to access; the options are: |
Web interface - Access to all of the web interface (access to /mman/template/<template name> is always allowed regardless of this) |
|
Proxy requests - Allowed to make regular proxy requests |
|
CONNECT Requests - Allowed to make CONNECT requests |
|
Transparent proxying - Allowed to make transparent proxy requests (must be allowed to make HTTP requests as well) |
|
HTTP Requests - Allowed to make regular HTTP requests to proxy (for Web interface and redirected requests) |
|
Allow bypassing - Allows features to be bypassed by prefixing with URL command |
|
URL commands - Allows use of URL commands |
|
|
|
Bypass |
A list of features which will by default be bypassed when making requests. |
|
|
--- Cache section ---
Purpose |
|
The caching section is used to configure global cache options and to add/remove cache storage spaces. |
|
Global options |
|
Violate RFC |
This option will cause the proxy server to violate some rules in the HTTP RFC to help improve cache performance; specifically, when a website requests that the file not be cached with the “No-Cache” directive in the Cache-Control header, the proxy will cache it anyway but always validate it with an If-Modified-Since conditional request. |
Memory cache size |
The maximum size in bytes of the memory cache. |
Memory free extra |
The number of additional bytes to free up when the memory is cleaned. |
Minimum age |
The minimum age any file must be according to the Last-Modified header before it is cached. |
Maximum age |
The maximum age of any cached file before it must be revalidated; this overrides any given expiry time. |
Revalidate age |
The maximum age of any cached file which didn't include any headers that indicate when it should expire before it must be revalidated; if set to 0, all cached files whose expiry time is uncertain will be verified. If no "Last-Modified" header is received from which to calculate freshness as a percentage of age, the cached file is always revalidated. |
Last-Modified time factor |
The percentage of the time elapsed between the date given in the Last-Modified header and the current time for which a cached file will be considered fresh after downloading. |
Minimum file size |
The minimum file size in bytes of any cached file. |
Maximum file size |
The maximum file size in bytes of any cached file; if set to 0, no maximum file size is imposed. |
Maximum wait size |
It's not possible with the current design for one thread to download a file being cached by another thread, but that thread can wait for the other to finish and then use it; this option will configure the maximum size the cache file can get before it stops waiting and transfers the file directly from the server. |
Prefetch window |
This option can be used to specify the time period after a file is prefetched in which it will be exempt from any refresh or expiry rules. |
ICP port |
The UDP port to listen for ICP packets on. |
ICP timeout |
The timeout in milliseconds for response ICP packets. |
Store balance method |
This option controls how the storage directory a file goes into is selected. Fill size will select the storage directory with the least total bytes used, Fill percent will select the storage directory with the lowest percentage of space used. |
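The Last-Modified time factor above amounts to a simple calculation: the older a file already was when it was downloaded, the longer it is treated as fresh. The Python sketch below works through that arithmetic. It illustrates the description only; the function name is invented, and how this value interacts with the Minimum age, Maximum age, and Revalidate age options is not modelled here.

```python
from datetime import datetime, timedelta


def heuristic_fresh_for(last_modified: datetime,
                        fetched_at: datetime,
                        factor_percent: float) -> timedelta:
    """Return how long a file is considered fresh after downloading.

    The lifetime is factor_percent percent of the time between the
    Last-Modified date and the moment the file was fetched.
    """
    age_at_fetch = fetched_at - last_modified
    return age_at_fetch * factor_percent / 100
```

For a file last modified ten days before it was fetched, a factor of 10 yields a freshness lifetime of one day.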
Entry options |
|
Path |
The directory where cached files are stored. |
Maximum disk size |
The amount of space that should be used to store cached files in this directory. |
Disk free extra |
When the cache is cleaned, this additional amount will be freed as well. This option can be useful to prevent the cache from being cleaned too often, which can hurt performance. |
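The two Store balance methods can give different answers when storage directories have different sizes. The sketch below compares them in Python. It is an illustration under assumptions: each store is represented as a dict with hypothetical "path", "used", and "max_size" keys, the method names are written as lowercase strings, and the percentage is taken relative to the entry's Maximum disk size, which the text does not state explicitly.

```python
def pick_store(stores, method: str):
    """Choose the storage directory a new cache file should go into."""
    if method == "fill size":
        # Least total bytes used.
        return min(stores, key=lambda s: s["used"])
    if method == "fill percent":
        # Lowest fraction of the configured space used.
        return min(stores, key=lambda s: s["used"] / s["max_size"])
    raise ValueError(f"unknown store balance method: {method}")


# Hypothetical cache directories for the example.
STORES = [
    {"path": "/cache/a", "used": 400, "max_size": 1000},
    {"path": "/cache/b", "used": 500, "max_size": 5000},
]
```

Here "fill size" picks /cache/a because it holds fewer bytes, while "fill percent" picks /cache/b because it is only 10% full.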
--- Templates section ---
Purpose |
|
Templates are used throughout Middleman as a replacement for pages which can't be displayed due to filtering, error, or other conditions. |
|
Global options |
|
Path |
Location to look for templates in if no absolute path is given. |
Entry options |
|
Name |
The name of the template; this is used in other sections to reference it. It may also be one of the following to replace internal error messages: |
blocked - Page blocked |
|
nodns - DNS lookup failed |
|
badrequest - Malformed HTTP header from client |
|
badresponse - Malformed HTTP header from server |
|
nofile - File not found |
|
nocache - Cache file not found when browsing in offline mode |
|
noconnect - Connection failed |
|
noaccess - Access denied |
|
badprotocol - Protocol not implemented |
|
badauth - Authorization failed (when forwarding through SOCKS4) |
|
maxbandwidth - Bandwidth limit exceeded |
|
maxrequests - Request limit exceeded |
|
proxy.pac - A script to configure the browser to use the proxy. |
|
|
|
There are 3 built-in templates that can be used: tinygif (a 1x1 transparent gif image), checkedgif (a 4x4 gray and transparent checkered pattern), and tinyswf (an empty flash animation). |
|
|
|
You can override the content sent by a website for certain response codes by making a template with a numerical name the same as the response code. |
|
|
|
There are several variables that can be used in templates if the parsable option is selected; they will be replaced with information about the request currently being handled: |
|
$HTTP_METHOD - Method used to request file |
|
$HTTP_HOST - Host HTTP request was made to. |
|
$HTTP_FILE - File HTTP request was made for. |
|
$HTTP_PORT - Port HTTP request was made to. |
|
$IP - IP address of client making request. |
|
$INTERFACE - IP address of the interface the client connected to. |
|
$PORT - Port the client connected to. |
|
|
|
Templates can be accessed directly by loading "http://mman/template/<template name>". |
|
|
|
File |
The filename of the template. |
Mimetype |
The MIME-type of the template. When using an executable, this can be set to STDIN to have the MIME-type extracted from a "Content-type" header sent by the program; this will be explained in greater depth below. |
Response code |
The response code to use when sending the template; leave blank to use the internal default. |
Type |
Template type, either File or Executable. If Executable is chosen, the file is executed and whatever it writes on STDOUT is sent as the template. Several environment variables are set for the executable to use; they are explained further below in the external section. |
Parsable |
If this option is selected, all variables in the template will be substituted. |
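To preview what a parsable template might look like once its variables are filled in, the substitution can be imitated with Python's string.Template, which happens to use the same $NAME notation. This is only an approximation for experimenting with template text; it is not Middleman's template parser, and the example template and request values are invented.

```python
from string import Template

# Hypothetical template text using the variables listed above.
EXAMPLE = ("Blocked $HTTP_METHOD request for "
           "http://$HTTP_HOST:$HTTP_PORT$HTTP_FILE from $IP")


def render_template(text: str, request: dict) -> str:
    """Replace $VARIABLE placeholders with values from the request.

    Unknown variables are left untouched rather than raising an error.
    """
    return Template(text).safe_substitute(request)
```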
--- MIME section ---
Purpose |
|
The mime feature allows you to filter content based on its MIME-type. |
|
Global options |
|
Policy |
The action to take when no matching entry is found. |
Default template |
The template to send for blocked MIME-types if the template option is left blank for the matching entry, or if no matching entry is found but the policy is deny. |
Entry options |
|
Host |
A regular expression matching the hosts this entry applies to; leave blank to match everything. |
File |
A regular expression matching the files this entry applies to; leave blank to match everything. |
Mimetype |
A regular expression matching the MIME-types this entry applies to; leave blank to match everything. |
Template |
The template to send when an entry matches; this has no purpose in entries with the action set to allow. |
--- Redirect section ---
Purpose |
|
The redirect feature allows you to redirect requests. |
|
Entry options |
|
URL |
A regular expression matching the URLs you wish to redirect; the URL will always be in the form "protocol://host/file" or "/file" for HTTP requests. |
Redirect |
The URL to redirect to; it may contain back references to
strings captured using parentheses in the URL pattern. This can
be in the form "protocol://host/file", or "/file"
if you wish to send a relative URL when redirecting a URL in the
Location: header. If this option is left blank, no action will
be taken against requests matching the URL. |
Port |
The port to redirect to; if left blank the same port the original request was made to is used. |
302 Redirect |
If yes, a 302 redirect is issued; otherwise the new host is connected to directly and the new file is requested. A 302 redirect should always be used when possible to ensure relative links and images are correct. |
Options |
Several options are available to control how the URL should be handled: |
Encode URL - Encode the new URL. |
|
Decode URL before - Decode the URL before attempting to match it with the regular expression |
|
Decode URL after - Decode the new URL after matching. |
|
|
|
Applies to |
This option chooses whether the redirection applies to requested URLs, the Location header when a remote site sends a 302 redirect, or both. |
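The way a URL pattern with captured groups feeds into the Redirect option can be tried out with Python's re module. The sketch is an approximation for testing patterns: the \1 back reference notation is Python's and may not be the notation Middleman expects, the host names are made up, and the function name is invented for this example.

```python
import re


def redirect_target(url: str, pattern: str, redirect: str):
    """Return the new URL for a matching request, or None for no action."""
    if not redirect:
        return None  # blank Redirect: no action is taken
    match = re.search(pattern, url)
    if match is None:
        return None  # the entry does not apply to this URL
    # Expand back references such as \1 with the captured strings.
    return match.expand(redirect)


# Hypothetical rule: serve everything from another host, keeping the file.
PATTERN = r"^http://ads\.example\.com/(.*)$"
REDIRECT = r"http://static.example.net/\1"
```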
--- Forward section ---
Purpose |
|
The forward feature allows you to selectively forward requests through another proxy or SOCKS4 firewall based on their URL. |
|
Entry options |
|
Proxy |
The hostname or IP address of the proxy to forward through; if this is left blank, and the host or file options aren't, no action will be taken for requests matching the host and file. |
Username |
The username to use if the proxy requires authentication. |
Password |
The password to use if the proxy requires authentication. |
Domain |
The NT domain when using the NTLM authentication protocol. |
Port |
The port number of the proxy to forward through. |
ICP Peer type |
The peering relationship of this proxy; see the “Internet caching protocol” section below for further details. |
ICP Port |
The UDP port ICP packets are sent to. |
Type |
What type of proxy to forward through; can be HTTP, SOCKS4, or CONNECT. |
Applies to |
What type of requests are forwarded; can be HTTP and/or CONNECT (HTTPS) |
--- Header section ---
Purpose |
|
The header feature allows you to control what headers are passed from your browser to websites. In addition to the allow and deny actions found in some other sections, there is an insert action which will add a new header onto the ones sent by your browser; for these entries, the Type and Value options are plain text. |
|
Global options |
|
Policy |
The action to take when no matching entries are found. |
Entry options |
|
Type |
A regular expression matching the header types this entry applies to; leave blank to match everything (headers are in the form "Type: value"). |
Value |
A regular expression matching the header values this entry applies to; leave blank to match everything. |
Applies to |
This option selects whether this entry applies to the server header, the client header, or both. |
--- Rewrite section ---
Purpose |
|
The rewrite feature allows you to use regular expressions to modify the contents of web pages, files, the client header, and server header. |
|
Entry options |
|
Mimetype |
A regular expression matching the MIME-types this entry applies to. This must be filled with something, otherwise the rewrite rule will be applied to every downloaded file, which is almost certainly not what you want. To have it applied to web pages, fill this field with "text/html". |
Size |
The maximum size of the file this entry is allowed to match. -1 matches all files smaller than the maximum buffer size, 0 matches all files, and > 0 matches files up to that size. If a file is larger than the maximum buffer size, it will be partially buffered and the rest will be sent unprocessed. |
Pattern |
A regular expression pattern matching the area of text inside the file to modify; if this is left blank, and the host, file, or mimetype options aren't, this will be the last entry matched for sites matching the host, file, and mimetype. |
Replace |
The replacement text to use in place of the area of text
matching the pattern; it may contain back references to strings
captured using parentheses in the pattern. |
Applies to |
This option is to select what the rewrite rule applies to; the options are: |
Client header - rewrite the client header; this happens before Middleman parses it, so be careful not to remove any headers needed to handle the request properly. The Mimetype option serves no purpose for this. |
|
Server header - rewrite the header from the remote web server; same conditions from client header apply. |
|
Body - rewrite the body of the webpage or file. |
|
POST data - rewrite POST/PUT data sent when submitting a form or uploading a file. |
|
|
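The three cases of the Size option described above (-1, 0, and a positive value) can be summarised as a small decision function. The Python sketch below restates the documented rule; the function and parameter names are invented, and "up to that size" is read as inclusive, which is an assumption.

```python
def size_matches(entry_size: int, file_size: int, max_buffer_size: int) -> bool:
    """Decide whether an entry's Size option lets it match a file.

    entry_size      -- the Size option of the entry
    file_size       -- size of the file being processed, in bytes
    max_buffer_size -- the global "Maximum buffer size" option
    """
    if entry_size == -1:
        # Only files smaller than the maximum buffer size.
        return file_size < max_buffer_size
    if entry_size == 0:
        # All files, regardless of size.
        return True
    # Files up to the given size.
    return file_size <= entry_size
```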
--- Cookies section ---
Purpose |
|
The cookies feature allows you to choose which hosts your browser is allowed to send and receive cookies to and from. |
|
Global options |
|
Policy |
The action to take when no matching entry is found. |
Entry options |
|
Direction |
The direction of the cookie this entry applies to; can be either in (Set-cookie sent by website), out (Cookie sent by browser), or both. |
--- External section ---
Purpose |
|
The external feature allows you to use any program or script to parse the contents of a file. |
|
Entry options |
|
Executable |
The path to the executable; if no absolute path is given, the
directories listed in the PATH environment variable are searched. The following environment variables are set for the executable: |
HTTP_METHOD - Method used to request the file. |
|
HTTP_HOST - Host HTTP request was made to. |
|
HTTP_FILE - File HTTP request was made for. |
|
HTTP_PORT - Port HTTP request was made to. |
|
IP - IP address of client making request. |
|
INTERFACE - IP address of the interface the client connected to. |
|
PORT - Port the client connected to. |
|
|
|
Mimetype |
A regular expression matching the MIME-types this entry applies to; leave blank to match everything. |
Size |
The maximum size of the file this entry is allowed to match. -1 matches all files smaller than the maximum buffer size, 0 matches all files, and > 0 matches files up to that size. If a file is larger than the maximum buffer size, it will be partially buffered and the rest will be sent unprocessed. |
Newmime |
The MIME-type of the content returned from the external
program; leave blank to have the original MIME-type
preserved. |
Type |
The method by which content is passed to the external program; if set to Pipe the content is piped to the program's STDIN, if set to File the content is stored in a temporary file and its path is passed as the last argument. |
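A program used with the Pipe type could be as small as the Python sketch below, which reads the content, consults the environment variables listed above, and writes the result back out. Treat it as a hypothetical starting point: it assumes the processed content is returned on STDOUT (the manual states this for executable templates, not explicitly for external programs), and the comment it prepends is purely for demonstration.

```python
import io


def run(stdin, stdout, environ) -> None:
    """Read content from stdin, tag it, and write it to stdout.

    stdin/stdout are binary streams; environ is a mapping holding the
    variables set by the proxy (HTTP_HOST, HTTP_FILE, and so on).
    """
    host = environ.get("HTTP_HOST", "unknown")
    path = environ.get("HTTP_FILE", "")
    body = stdin.read()
    note = f"<!-- filtered: {host}{path} -->\n".encode("ascii", "replace")
    stdout.write(note + body)


def demo() -> bytes:
    """Run the filter on in-memory data instead of real pipes."""
    out = io.BytesIO()
    run(io.BytesIO(b"<html></html>"), out,
        {"HTTP_HOST": "example.com", "HTTP_FILE": "/index.html"})
    return out.getvalue()
```

When used as a real executable, the script would end with a call such as run(sys.stdin.buffer, sys.stdout.buffer, os.environ) after importing os and sys.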
--- Keyword section ---
Purpose |
|
The keyword filtering feature allows you to block pages which may contain inappropriate content using a scoring system. When the host, file, mimetype, and keyword in an entry match a file, its score is added to the total score; when that total score exceeds the threshold, the page is deemed inappropriate and blocked. |
|
Global options |
|
Template |
The template to send when a page exceeds the threshold. |
Threshold |
The value the total score must equal or exceed for the page to be blocked. |
Entry options |
|
Mimetype |
A regular expression matching the MIME types this entry applies to; it is highly advisable that you set this to something, otherwise all files will be checked. If you're unsure, set this to "text/". |
Size |
The maximum size of the file this entry is allowed to match. -1 matches all files smaller than the maximum buffer size, 0 matches all files, and > 0 matches files up to that size. If a file is larger than the maximum buffer size, it will be partially buffered and the rest will be sent unprocessed. |
Keyword |
A regular expression matching anything in the body of the document considered inappropriate; leave blank to match everything. |
Score |
The score this entry adds to the total score when it matches; this may be a positive or negative integer. |
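The scoring scheme described above can be sketched in a few lines of shell. This is only an illustration, not Middleman's implementation: the entry list, the THRESHOLD value and the function names are made up for the example, and it assumes each entry is counted once per page no matter how often its keyword occurs.

```shell
#!/bin/sh
# Hypothetical entries, one per line: SCORE, a space, then the keyword regex.
ENTRIES='5 casino
3 poker|roulette
-4 addiction (help|support)'
THRESHOLD=8

# page_score FILE: print the sum of the scores of all entries whose
# keyword regex matches somewhere in FILE.
page_score() {
    total=0
    while read -r score regex; do
        if grep -Eq -- "$regex" "$1"; then
            total=$((total + score))
        fi
    done <<EOF
$ENTRIES
EOF
    echo "$total"
}

# is_blocked FILE: succeed when the total equals or exceeds the threshold.
is_blocked() {
    [ "$(page_score "$1")" -ge "$THRESHOLD" ]
}
```

Negative scores let an entry offset others, so a page that mentions a keyword in a harmless context can stay below the threshold.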
--- Limits section ---
Purpose |
|
The limits feature allows access to the proxy server to be restricted based on date/time, bandwidth usage, and requests made. More than one entry may match; in this case the last to match is used. This makes it possible, for example, to limit bandwidth usage between Monday and Friday to 1GB, and not allow more than 200MB per day. |
|
Entry options |
|
Template |
The template to send when this entry results in access being denied. The specified template is only sent if the page was blocked due to the time restrictions; the 'maxbandwidth' and 'maxrequests' templates are used for excessive bandwidth or requests, respectively. |
Limit months |
Whether or not to limit access based on month. |
Month range |
The range of months this entry applies to. |
Limit days |
Whether or not to limit access based on day of month. |
Day range |
The range of days this entry applies to. |
Limit weekdays |
Whether or not to limit access based on the day of the week. |
Weekday range |
The range of weekdays this entry applies to. |
Limit hours |
Whether or not to limit access based on the current hour of the day. |
Hour range |
The hour range this entry applies to. |
Limit minutes |
Whether or not to limit access based on the current minute of the hour. |
Minute range |
The range of minutes this entry applies to. |
Time match mode |
The method used to match the time. Absolute treats the ranges as one continuous span of date/time; All ranges matches if the current time falls into every chosen range. For example, with the weekday and time ranges set to Monday-Friday and 8:00 to 17:00 respectively, Absolute matches any time between 8am Monday and 5pm Friday, while All ranges matches 8am to 5pm on any day between Monday and Friday. |
Byte transfer limit |
The maximum number of bytes that may be downloaded by the proxy during the time span this entry matches. |
Current bytes |
The current number of bytes transferred in this time span; this setting is not saved, reloading the configuration or restarting the proxy will reset it. |
Request limit |
The maximum number of requests that may be made by a client during the time span this entry matches. |
Current requests |
The current number of requests made in this time span; this setting is not saved, reloading the configuration or restarting the proxy will reset it. |
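The difference between the two "Time match mode" settings is easiest to see side by side. The sketch below is an illustration under assumptions, not Middleman's code: the function names are made up, weekdays are numbered 1 (Monday) to 7 (Sunday), only weekday and hour are considered, and the end of each hour range is treated as exclusive.

```shell
#!/bin/sh
# Arguments for both functions:
#   WEEKDAY HOUR DAY_FROM DAY_TO HOUR_FROM HOUR_TO

# "All ranges": weekday and hour must each fall inside their own range.
match_all_ranges() {
    [ "$1" -ge "$3" ] && [ "$1" -le "$4" ] &&
    [ "$2" -ge "$5" ] && [ "$2" -lt "$6" ]
}

# "Absolute": the ranges describe one continuous span running from
# DAY_FROM at HOUR_FROM to DAY_TO at HOUR_TO.
match_absolute() {
    now=$(( $1 * 24 + $2 ))
    start=$(( $3 * 24 + $5 ))
    end=$(( $4 * 24 + $6 ))
    [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
}

# Wednesday 20:00 against Monday-Friday, 8:00 to 17:00:
#   match_absolute   3 20 1 5 8 17   succeeds (inside the Mon 8am - Fri 5pm span)
#   match_all_ranges 3 20 1 5 8 17   fails    (20:00 is outside 8:00-17:00)
```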
Transparent proxying
Middleman can be used to transparently proxy requests; to make use of this feature, you will need firewall software capable of forwarding connections. Configure the firewall to forward connections destined for port 80 to the proxy server; the proxy server will look at the Host header sent by the browser and use it to determine which host the request was originally intended for. This feature may not work with all browsers: sending the Host header is only required in HTTP/1.1, although most HTTP/1.0 clients send it anyway.
If you are using Linux as a firewall, the following iptables command will transparently proxy all outgoing requests on port 80 (replace eth0 and 8080 to match your setup):
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport http -j REDIRECT --to-ports 8080
Middleman as an HTTP server
Middleman is not only a powerful proxy server; it may also be used to accelerate and filter responses from an HTTP server. This is accomplished using the redirect feature: make a redirect entry with a URL pattern that matches only the file portion of a URL, and a redirect URL that points to your Web server. You can use the full power of regular expressions to perform various tricks, such as correcting any 302 redirects the server sends back so that they point back to the proxy. Let's look at an example setup:
The Proxy server is running on address leenux.ath.cx port 80
The Web server is running on address 192.168.0.1 port 8000
First, add a redirect entry with the URL “/(.*)”; the “(.*)” portion captures the file that is requested.
Next, fill in the Redirect field with the URL of the Web server you're going to redirect to, in this case “http://192.168.0.1/$1”; in regular expressions, $1 references the first string captured using parentheses. You will also need to fill in the port option with the Web server's port, in this case 8000.
Uncheck the “302 redirect” option; this will cause Middleman to connect directly to the host and process the content rather than just sending back a 302 redirect instruction to the Web browser.
You may also want to have Middleman alter the Location header in any redirects the Web server sends back so they point to the proxy server. To do this, add another redirect entry and fill in the URL with a pattern which matches the Location header sent back. Let's assume our Web server thinks its hostname is “intranet”; you would fill in the URL field with “http://intranet/(.*)”, where “(.*)” again captures the file. Next, fill in the Redirect field with Middleman's hostname, in this case “http://leenux.ath.cx/$1”. Make sure to select “Location header” in the “Applies to” option.
Now Middleman should be forwarding all HTTP requests to the Web server, and processing the content as per the entries in other sections.
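To see what the two redirect entries in this setup do to a URL, the same substitutions can be tried out with sed. This is only a sketch for experimenting with the patterns: the function names are made up, sed writes the captured group as \1 where the Redirect field uses $1, the Web server is assumed to call itself “intranet”, and anchoring the patterns to the whole string is an assumption about how Middleman applies them.

```shell
#!/bin/sh
# Request entry: URL "/(.*)", Redirect "http://192.168.0.1/$1"
# (the Web server's port is set separately in the port option).
rewrite_request() {
    printf '%s\n' "$1" | sed 's|^/\(.*\)$|http://192.168.0.1/\1|'
}

# Location header entry: URL "http://intranet/(.*)",
# Redirect "http://leenux.ath.cx/$1".
rewrite_location() {
    printf '%s\n' "$1" | sed 's|^http://intranet/\(.*\)$|http://leenux.ath.cx/\1|'
}

# Example:
#   rewrite_request  /docs/index.html
#   rewrite_location http://intranet/login
```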
Caching
The following pseudo-code shows how the cache refresh logic works:
if “Expires” and “Last-Modified” headers are missing then validate file
else if Cache modification time + Maximum age < Current time then validate file
else if “Last-Modified” header is present and “Expires” header is not present and Last-Modified time + Last-Modified time factor < Current time then validate file
else if “Last-Modified” header is present and Last-Modified time + Minimum age > Current time then validate file
else if “Expires” header is not present and Cache modification time + Revalidate age < Current time then validate file
else send file unvalidated
Validation is accomplished by sending an “If-Modified-Since: <timestamp>” header with the request; the Web server will respond with code 304 if the file hasn't been modified.
Since not all Web servers support this, Middleman will also check the Last-Modified header and send a cached response if it matches the cached file's Last-Modified time.
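As a rough illustration, the refresh pseudo-code above can be transcribed into a small shell function. This is a sketch, not Middleman's implementation: the variable names MAX_AGE, MIN_AGE, REVALIDATE_AGE and LM_FACTOR and their values are made up for the example, all times are assumed to be seconds since the epoch, and the Last-Modified time factor is added as a plain number of seconds exactly as the pseudo-code is written.

```shell
#!/bin/sh
# Hypothetical settings, in seconds (illustrative values only).
MAX_AGE=86400
MIN_AGE=60
REVALIDATE_AGE=3600
LM_FACTOR=7200

# needs_validation NOW CACHED LASTMOD EXPIRES
#   NOW     - current time
#   CACHED  - modification time of the cache file
#   LASTMOD - Last-Modified time, or "" if the header is missing
#   EXPIRES - Expires time, or "" if the header is missing
# Returns 0 if the file should be validated, 1 if it can be sent unvalidated.
needs_validation() {
    now=$1; cached=$2; lastmod=$3; expires=$4
    if [ -z "$expires" ] && [ -z "$lastmod" ]; then
        return 0
    elif [ $((cached + MAX_AGE)) -lt "$now" ]; then
        return 0
    elif [ -n "$lastmod" ] && [ -z "$expires" ] &&
         [ $((lastmod + LM_FACTOR)) -lt "$now" ]; then
        return 0
    elif [ -n "$lastmod" ] && [ $((lastmod + MIN_AGE)) -gt "$now" ]; then
        return 0
    elif [ -z "$expires" ] && [ $((cached + REVALIDATE_AGE)) -lt "$now" ]; then
        return 0
    fi
    return 1
}
```

Note that, as in the pseudo-code, only the presence of the Expires header is tested; its value is not compared with the current time.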
Middleman will honor the following directives sent in the Cache-Control header by the client:
no-cache - Don't send a cached copy (Middleman makes an exception and revalidates it instead).
no-store - Same as no-cache.
min-fresh=<seconds> - The cached file will only be sent if it will remain fresh for at least this long.
max-age=<seconds> - The cached file will only be sent if it is no older than this.
max-stale=<seconds> - Send the cached file even if it is stale, provided it has been stale for no longer than this amount of time. If the cached file is in fact stale, a Warning header will be sent with the response indicating this.
and the following directives from Web servers:
no-cache - Don't cache; if the “Violate RFC” option is selected, the file will be cached anyway but always validated.
no-store - Same as no-cache.
must-revalidate - The cached file will be validated on every request.
max-age=<seconds> - The file will expire after being cached for this long.
Internet Cache Protocol (ICP)
ICP is used for communication between caching proxy servers to help optimize bandwidth usage by sharing cached content. When a request for a URL is made, other cache peers are queried using ICP to determine whether they already have the file cached. There are two peer relationships a proxy may have: parent and sibling. A sibling is a proxy on the same hierarchy level; files are only fetched from it if it already has them cached. A parent is a proxy one level up in the hierarchy, from which the file will be requested even if it doesn't have it cached already (therefore causing it to cache the file for other sibling proxies). The peer a file is requested from is selected using the following algorithm:
Request the file from a random sibling that has the file, if any; otherwise:
Request the file from a random parent that has the file, if any; otherwise:
Request the file from a random parent that doesn't have the file, if any allows it; otherwise:
Request the file from a proxy with no peer relationship.
For further information on ICP, see RFC 2186.
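The selection order above can be sketched as a small shell function. This is an illustration under assumptions, not how Middleman is implemented: the input format (one peer per line as NAME, RELATION, and "hit" or "miss" for the ICP reply) and the function name are made up, the check for whether a parent allows misses is left out, and "none" simply signals that the last step of the list applies.

```shell
#!/bin/sh
# select_peer: read "NAME RELATION REPLY" lines on stdin, where RELATION
# is "sibling" or "parent" and REPLY is "hit" or "miss"; print the name
# of the chosen peer, or "none" if no peer qualifies.
select_peer() {
    awk '
        $2 == "sibling" && $3 == "hit"  { sib[++ns] = $1 }
        $2 == "parent"  && $3 == "hit"  { par[++np] = $1 }
        $2 == "parent"  && $3 == "miss" { mis[++nm] = $1 }
        END {
            srand()
            if (ns)      print sib[int(rand() * ns) + 1]   # random sibling with the file
            else if (np) print par[int(rand() * np) + 1]   # random parent with the file
            else if (nm) print mis[int(rand() * nm) + 1]   # random parent without it
            else         print "none"
        }'
}

# Example:
#   printf 'a sibling miss\nb parent hit\nc parent miss\n' | select_peer
```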
Example external parser
This is a trivial example of how to write an external parser; it replaces any page containing the word "sex" with a warning message (shrug).
This should be used with the Type set to "File", Mimetype set to "text/html", and Newmime set to "STDIN".
--- SNIP ---
#!/bin/sh
# The path to the temporary file holding the page is passed as an argument.
if grep -i sex "$1" > /dev/null; then
    echo "Content-Type: text/html"
    echo ""
    echo "<html><head><title>Inappropriate content</title></head>"
    echo "<body><font size=6>$HTTP_HOST$HTTP_FILE contains inappropriate content</font></body>"
    echo "</html>"
    exit 0
fi

# Non-zero exit status returns original content
exit 1

# Alternatively, you can send a Content-type header with the same MIME-type
# as the original document and cat the file (slower)
echo "Content-type: $SERVER_CONTENT_TYPE"
echo ""
cat "$1"
--- SNIP ---
Frequently asked questions
Q: I set up Middleman to use an external parser, but it doesn't always work.
A: Check the "Maximum buffer size" setting in the global section of the web interface; the file may be too large.
Q: Some pages show strange numbers throughout the document, and the browser hangs when loading a page.
A: Middleman is an HTTP/1.1 proxy; some older browsers (such as Netscape 4.x) will not work correctly with the proxy. The only solution is to upgrade your browser.
Q: I keep getting "URL redirection limit
exceeded" errors for a page while using the proxy.
A: The
default configuration includes a redirect entry which bypasses link
tracking scripts by redirecting any request which has a URL within
the URL directly to that URL; for example, requesting
"http://www.somesite.com/redirect.pl?http://someothersite.com"
will cause the proxy to send back a 302 redirect for
"http://someothersite.com". In most cases this works as
expected; however, on some sites, such as ones that make you go
through a login process and have the URL you originally requested
within the URL, this will not work. You can temporarily bypass this
by prefixing "bypass[r].." to the URL, or permanently
bypass it by adding a redirect entry above the link bypassing one
with a URL pattern matching the host and no Redirect field.
Notes
- For security reasons, the web interface is no longer accessible through regular HTTP requests.
- The limits feature doesn't count data transferred by the prefetch feature.
Reporting bugs
If you encounter any problems while using Middleman, please contact me. If the problem results in a crash, please follow these steps to help me debug the problem:
1) Run "make clean" in the middleman directory if you haven't already done so.
2) Recompile middleman using the --enable-debug option in the configure script.
3) Type "ulimit -c unlimited" in your shell before running the proxy; this will cause middleman to dump a core file when it crashes.
4) Email me the compiled binary, core file, and configuration file you were using at the time. The last few log entries would also be helpful.
It also helps if you have Electric Fence installed on your system (it's a dynamic memory debugging library); the configure script will automatically detect its presence and use it if it's available.
Feature requests
If you have any ideas on how Middleman could be improved, please email me (address at top)... I'll do my best to respond.