Data harvesting, or web scraping, has long been a concern for website operators and data publishers. It is a process in which a small script, often called a malicious bot, automatically extracts large amounts of data from websites for use elsewhere. Because it is a cheap and easy way to collect online data, the technique is often used without permission to steal website content such as text, photos, email addresses, and contact lists.
One method of data harvesting targets databases in particular. The script finds a way to cycle through the records of a database and then download each and every record in the database.
Aside from the obvious consequence of data loss, data harvesting can also be detrimental to businesses in other ways:
Poor SEO Ranking: If your website content is scraped, reproduced, and used on other sites, the duplicated copies can significantly lower your website's ranking and performance on search engines.
Decreased Website Speed: When repeated, scraping attacks can degrade the performance of your website and the user experience.
Lost Market Advantages: Your competitors may use data harvesting to scrape valuable information such as customer lists to gather intelligence about your business.
I think if you are looking into this for research, you should concentrate on techniques rather than tools.
There are also many other data mining techniques, but these seven are considered the most frequently used.
If the website offers an API, as Twitter and Facebook do, you can collect data directly from within your Java, Python, or other application. If there is no API, you can use a web scraper as mentioned before.
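As a rough illustration of the API route, here is a minimal Python sketch that collects records page by page from a JSON endpoint. Everything specific in it is an assumption made for the example: the URL, the `page` query parameter, and the `items` key are hypothetical, and real APIs such as Twitter's or Facebook's require authentication and define their own pagination schemes, so check the provider's documentation and terms of use first.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, page):
    """Fetch one page of JSON from a hypothetical paginated endpoint."""
    # Assumption: the API accepts a 'page' query parameter.
    url = f"{base_url}?{urlencode({'page': page})}"
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_all(base_url, fetch=fetch_page, max_pages=100):
    """Collect items page by page until an empty page comes back."""
    items = []
    for page in range(1, max_pages + 1):
        payload = fetch(base_url, page)
        # Assumption: each response looks like {"items": [...]}.
        batch = payload.get("items", [])
        if not batch:
            break
        items.extend(batch)
    return items
```

The `fetch` argument is there so the paging logic can be tried with a stand-in function before pointing it at a real service, and `max_pages` keeps the loop bounded if the endpoint never returns an empty page.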
You should develop your own program to crawl relevant information from structured text on the internet. You can use any programming language, together with regular expressions, to identify the relevant information in structured text.
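To make the regular-expression idea concrete, here is a small Python sketch that pulls name and price pairs out of a fragment of structured text. The sample markup, the class names, and the `extract_items` helper are all made up for illustration. A pattern like this only works while the markup stays this regular; for messier pages an HTML parser is usually the safer choice.

```python
import re

# Hypothetical sample of structured text, standing in for a downloaded page.
SAMPLE_TEXT = """
<li class="item"><span class="name">Widget</span> <span class="price">$4.50</span></li>
<li class="item"><span class="name">Gadget</span> <span class="price">$12.00</span></li>
"""

# Named groups capture the two fields we care about.
ITEM_RE = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>\d+\.\d{2})</span>'
)


def extract_items(text):
    """Return a list of (name, price) tuples found in the text."""
    return [
        (match.group("name").strip(), float(match.group("price")))
        for match in ITEM_RE.finditer(text)
    ]


if __name__ == "__main__":
    for name, price in extract_items(SAMPLE_TEXT):
        print(name, price)
```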