Securing published web content
This is a web application in development. It protects the content of a web site from unwanted crawlers and bots, e-mail harvesters and content harvesters.
A few years ago we found that the product descriptions from the e-commerce site of one of our clients had been published, in identical form, on a competitor's website. It was probably pure luck that we found out about it so quickly, but getting the content removed again took considerable unplanned effort, not to mention the potential loss in sales.
Investigating the situation led us to the conclusion that it is almost impossible for a webmaster to protect the content of a site. Searching for an existing solution turned out to be fruitless too, so none of this helped.
The problem is that a crawler, bot, web site scraper or whatever you want to call it can request pages at such a high speed that, even if you check your site log files very often, you don't have a real chance to do something about it - you will probably always be too late. A medium-sized web site with a few thousand to a few tens of thousands of web pages can be "harvested" in no time. There are plenty of companies offering software or a service to scrape web sites. Their customers are potentially your next unwanted visitors.
About the same time we faced the challenge of protecting the content of another client's web site. That's when we started thinking about an automated solution to end this problem once and for all. The goal: published data should not be harvested by suspicious sources, competitors or possible future competitors. Blocking visitors via the firewall is one approach, but that wasn't good enough.
To make this work we first needed to get all the relevant log data in real time. Then we needed to decide which visitors are desired and which requests are unwanted.
The latter is the reason why this project is still in development: it takes real data, properly analysed, to create an algorithm that works.
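To give an idea of what such a decision can look like in its simplest form, here is a minimal sketch of a sliding-window request counter. It is an illustration only, not the algorithm used in our product. The class name `RateMonitor`, the threshold of 30 requests and the 10-second window are all assumptions chosen for the example.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags a client once it exceeds max_requests within window_seconds.

    Illustrative only: the name and the default thresholds are assumptions.
    """

    def __init__(self, max_requests=30, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client IP -> timestamps of its most recent requests
        self._hits = defaultdict(deque)

    def record(self, ip, timestamp):
        """Record one request; return True if the client now looks unwanted."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the time window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

Each log entry would be passed to `record()` as it arrives, for example `monitor.record("203.0.113.7", time.time())`. A fixed threshold like this is far too crude on its own, because a legitimate search engine can be fast and a careful harvester can be slow, which is exactly why real data is needed.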
After more than 3 years of research and development we are able to detect "harvester"-specific patterns, IP addresses and IP blocks, user agents and a lot more.
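As a rough sketch of the simplest of these checks, the following compares a request against a list of IP blocks and user agent fragments. It is not our actual detection logic. The function name `is_unwanted`, the network `203.0.113.0/24` (a range reserved for documentation) and the user agent names are made-up placeholders.

```python
import ipaddress

# Placeholder data for illustration; a real list would come from analysed logs.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_FRAGMENTS = ("examplescraper", "contentgrabber")


def is_unwanted(ip, user_agent):
    """Return True if the IP is in a blocked block or the user agent matches."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    # Compare case-insensitively, since user agent strings vary in casing.
    agent = user_agent.lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)
```

Static lists like these are easy to evade, since a user agent can be faked and a harvester can change its IP address, so they are only useful in combination with behaviour-based patterns.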
We already have the product in use on client sites and, as far as we can tell, it does a great job. We send thousands of requests to "never-never-land" on a daily basis. So far we have gathered over 85 million requests from over 1,500 web sites and stored them in our research database.
This solution not only protects your web content, it also saves you bandwidth and server resources. It reduces the errors produced by malformed or badly programmed crawlers or bots. It will save you a lot of time if you check your web log files every day. It gives you peace of mind.
The finished product will detect an unwanted visitor while he is still at work on your site, not after he has got everything he was after. We found that, after some time, the unwanted requests became fewer, which is probably because the unwanted visitor found out that his chances of getting the data he was after are slim.
We plan to release the commercial, distributable version in Spring 2011. The application will run on Windows Server 2003 or higher and requires MS SQL Server 2005 or higher.
If you are interested in our solution, please get in touch with us.