Wednesday, May 8, 2013

AntiScrape - IIS ASP.NET Http Module

Download Code

AntiScrape is an IIS ASP.NET HTTP module to help in the fight against website scrapers!

How does it work?

AntiScrape hooks into the IIS ASP.NET request/response pipeline.
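For anyone who has not written an HTTP module before, here is a minimal sketch of what hooking into the pipeline looks like. The class name `PipelineSketchModule` and the `/honeypot` path are made up for illustration and are not taken from the AntiScrape source; only `IHttpModule`, `HttpApplication` and the `BeginRequest` event are standard ASP.NET.

```csharp
using System;
using System.Web;

// Hypothetical module, for illustration only; not AntiScrape's actual class.
public class PipelineSketchModule : IHttpModule
{
    // Assumed honeypot path; the real module's URL may differ.
    private const string HoneypotPath = "/honeypot";

    public void Init(HttpApplication application)
    {
        // Subscribe to BeginRequest so the handler runs at the start of each request.
        application.BeginRequest += OnBeginRequest;
    }

    private static void OnBeginRequest(object sender, EventArgs e)
    {
        var application = (HttpApplication)sender;
        HttpRequest request = application.Context.Request;

        if (IsHoneypotRequest(request.Path))
        {
            // A real module would record the caller here,
            // for example using request.UserHostAddress.
        }
    }

    // Pure helper so the path check can be exercised without IIS.
    public static bool IsHoneypotRequest(string path)
    {
        return string.Equals(path, HoneypotPath, StringComparison.OrdinalIgnoreCase);
    }

    public void Dispose()
    {
        // Nothing to clean up in this sketch.
    }
}
```

A module like this only runs once it is registered in web.config (under `system.webServer/modules` for the IIS integrated pipeline).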

When a user requests a page, AntiScrape automatically adds a hidden link to the page that normal users browsing the site won't see. Website scrapers that scan the HTML of the page, however, will see the link and follow it.
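To make the idea concrete, here is a rough sketch of how a hidden link could be spliced into a page's HTML. This is an assumption about the technique, not the module's actual code: the `HoneypotLinkSketch` class, the `display:none` styling and the placement before `</body>` are all my own choices for the example.

```csharp
using System;
using System.Net;

// Hypothetical helper, for illustration only.
public static class HoneypotLinkSketch
{
    // Inserts a link hidden with inline CSS just before the last </body> tag.
    // If no closing body tag is found, the HTML is returned unchanged.
    public static string InjectHiddenLink(string html, string honeypotUrl)
    {
        int index = html.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
        if (index < 0)
        {
            return html;
        }

        // Encode the URL so it cannot break out of the href attribute.
        string link = "<a href=\"" + WebUtility.HtmlEncode(honeypotUrl)
                    + "\" style=\"display:none\">&nbsp;</a>";
        return html.Insert(index, link);
    }
}
```

Inside a module, a function like this would typically be applied to the outgoing markup through a response filter (`HttpResponse.Filter`), so the pages themselves need no changes.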

Once a scraper has followed the hidden link, it is recorded as a scraper, and from then on what happens is entirely up to you!

You can either:

  • Return custom content and/or a custom HTTP status code for every subsequent request the scraper makes (even for legitimate content), or
  • Introduce a configurable random delay for all requests that the scraper makes, slowing the scraper down, or
  • Do nothing, other than log the scraping activity.
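The three options above could be modelled along these lines. This is only a sketch: the `ScraperAction` enum, the `ScraperResponsePlan` type, the 403 status and the delay bounds are hypothetical names and values, not AntiScrape's real settings.

```csharp
using System;

// Hypothetical names; AntiScrape's real settings and types may differ.
public enum ScraperAction
{
    CustomResponse,
    RandomDelay,
    LogOnly
}

public class ScraperResponsePlan
{
    public int? StatusCode;   // set only for CustomResponse
    public string Body;       // set only for CustomResponse
    public TimeSpan Delay;    // non-zero only for RandomDelay
}

public static class ScraperPolicySketch
{
    // Decides what to do with a request from a client already flagged as a scraper.
    public static ScraperResponsePlan Plan(
        ScraperAction action, Random random, int minDelayMs, int maxDelayMs)
    {
        var plan = new ScraperResponsePlan();
        switch (action)
        {
            case ScraperAction.CustomResponse:
                plan.StatusCode = 403;        // assumed status code
                plan.Body = "Access denied."; // assumed content
                break;
            case ScraperAction.RandomDelay:
                // Random.Next's upper bound is exclusive, hence the + 1.
                plan.Delay = TimeSpan.FromMilliseconds(
                    random.Next(minDelayMs, maxDelayMs + 1));
                break;
            case ScraperAction.LogOnly:
                // Serve the page as normal; logging happens elsewhere.
                break;
        }
        return plan;
    }
}
```

Keeping the decision separate from the code that writes the response makes each option easy to check on its own.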

The settings for AntiScrape are integrated into the web application's web.config.

How are the scraping requests logged?

The module comes with a reference implementation of the data persistence layer for SQL Server. Other implementations can be added easily by implementing the IDataStorage interface in an assembly that resides in the web application's bin folder (remember to remove the assembly containing the SQL Server reference implementation). AntiScrape uses Microsoft Unity to resolve the IDataStorage interface at run-time.
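As an illustration of what an alternative implementation might look like, here is an in-memory sketch. Note that the two interface members shown are guesses made for this example (the real IDataStorage members are not listed in this post), and identifying scrapers by IP address is likewise an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape: the real IDataStorage members are not shown in this post.
public interface IDataStorage
{
    void RecordScraper(string ipAddress, DateTime seenAtUtc);
    bool IsKnownScraper(string ipAddress);
}

// Keeps flagged clients in memory; everything is lost when the app pool recycles.
public class InMemoryDataStorage : IDataStorage
{
    private readonly object _sync = new object();
    private readonly Dictionary<string, DateTime> _firstSeen =
        new Dictionary<string, DateTime>(StringComparer.OrdinalIgnoreCase);

    public void RecordScraper(string ipAddress, DateTime seenAtUtc)
    {
        lock (_sync)
        {
            // Keep the first sighting; later hits do not overwrite it.
            if (!_firstSeen.ContainsKey(ipAddress))
            {
                _firstSeen[ipAddress] = seenAtUtc;
            }
        }
    }

    public bool IsKnownScraper(string ipAddress)
    {
        lock (_sync)
        {
            return _firstSeen.ContainsKey(ipAddress);
        }
    }
}
```

With Unity, a mapping such as `container.RegisterType<IDataStorage, InMemoryDataStorage>()` followed by `container.Resolve<IDataStorage>()` would hand back this class; exactly how AntiScrape wires up the implementation it finds in the bin folder is not covered here.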

Development Goals

As part of this project I wanted to ensure that the module could be integrated with an existing web solution as easily as possible. Therefore the sample web project is just a vanilla Web Forms project created with File, New Web Project in Visual Studio, plus the DLLs and some config changes. There are some other minor tweaks to show the results of the scraping requests in a table, but that’s it!

Development Status

This software is currently in active development, and so is not recommended for production environments at present.

Stuff I have yet to deal with:

  • Legitimate scrapers – such as search engine crawlers
  • Randomizing the honeypot url

Once I have something reasonable that addresses these, I shall put together a NuGet package that integrates the module and applies the config transforms.

Live Demo

Live demo is available on Windows Azure here: AntiScrape Demo

Source Code

Source code is available on GitHub here.

Technology Used
