How to scrape tweets ? – Twitter data scraping using WebHarvy

July 14, 2014, 10:50 pm

WebHarvy can be used to easily scrape tweets from twitter.com. The following demonstration video shows the steps involved.

As shown, scraping tweets with WebHarvy is straightforward. WebHarvy is a point-and-click visual web scraper: you select the data to extract by clicking on it with the mouse.

If you need to scrape tweets while logged in to your Twitter account, follow the steps described at http://www.webharvy.com/articles/sites-requiring-login.html.

To learn more, watch the demonstration videos at http://www.webharvy.com/demo.html

A free 15-day evaluation version of WebHarvy can be downloaded from http://www.webharvy.com/download.html

↧

Scraping data from HTML by applying Regular Expressions

July 15, 2014, 1:22 am

WebHarvy can scrape data from the HTML source code of a selected area of a web page, or of the whole page, by applying Regular Expressions.

During configuration, after you click on an item, the ‘Capture HTML’ option under ‘More Options’ in the Capture window captures the item’s HTML and displays it in the preview area. Regular Expressions can then be applied (More Options > Apply Regular Expression) to select data from a portion of the displayed HTML.

The following video shows how this feature can be applied to scrape URLs from HTML.
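Outside WebHarvy, the same idea can be sketched in a few lines of Python. This is only an illustration: the HTML fragment, the `extract_urls` helper and the pattern are hypothetical, and Python's `re` module stands in for whatever Regular Expression dialect WebHarvy itself uses, which may differ in details.

```python
import re

# Hypothetical HTML fragment, standing in for what 'Capture HTML' might display.
html = '''
<div class="listing">
  <a href="https://www.example.com/item/1">First item</a>
  <a href='https://www.example.com/item/2'>Second item</a>
</div>
'''

# The parenthesised group captures the value of each href attribute,
# whether it is wrapped in single or double quotes.
HREF_PATTERN = re.compile(r'''href=["']([^"']+)["']''')


def extract_urls(fragment):
    """Return every href value found in the HTML fragment, in document order."""
    return HREF_PATTERN.findall(fragment)


urls = extract_urls(html)
for url in urls:
    print(url)
```

The part of the pattern inside the parentheses is what gets extracted; the surrounding text only locates it within the HTML.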

Download and try the 15-day evaluation version

↧

Scraping images : various methods : WebHarvy

July 15, 2014, 1:38 am

WebHarvy lets you scrape images from websites, in addition to text. During configuration, you can click directly on an image to capture it. The resulting Capture window has a ‘Capture Image’ button, which lets you either download the image file or capture its URL. Know More.

An image can also be downloaded from its URL, obtained by applying a Regular Expression to the image’s HTML content. This method is shown in the following demonstration video.

Watch more demonstration videos

Download the free trial version

↧

Scraping hidden details using WebHarvy

July 15, 2014, 2:20 am

WebHarvy allows you to scrape hidden fields in websites which are displayed only when you click on a link or button. The ‘Click’ option in the Capture window can be used to display such ‘click to display’ fields. The following video shows the process.

The video below shows how contact details from Craigslist listing pages can be extracted using this feature.

WebHarvy also allows you to scrape data from the HTML of the page. For example, the following video shows how the geographic location (latitude, longitude) of yellow page listings can be extracted from the HTML of their map details. This data is not visible in the browser.
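As a rough sketch of pulling coordinates out of page HTML, the Python snippet below uses two capture groups and converts them to numbers. The `data-lat` and `data-lng` attribute names and the `extract_coordinates` helper are assumptions made for illustration; real listing pages embed map details in their own, different markup.

```python
import re

# Hypothetical markup; actual listing pages will name and place these values differently.
html = '<div id="map" data-lat="40.7128" data-lng="-74.0060"></div>'

# Two capture groups: an optionally negative decimal number for each coordinate.
COORD_PATTERN = re.compile(
    r'data-lat="(-?\d+(?:\.\d+)?)"\s+data-lng="(-?\d+(?:\.\d+)?)"'
)


def extract_coordinates(fragment):
    """Return (latitude, longitude) as floats, or None if the pattern is absent."""
    match = COORD_PATTERN.search(fragment)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


coords = extract_coordinates(html)
print(coords)
```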

Know More

↧

Web Scraping from Cloud – WebHarvy on Amazon EC2

November 16, 2014, 10:42 pm

WebHarvy requires the Windows operating system. If you do not have access to a Windows PC, or do not want to run WebHarvy on your local PC, you can run WebHarvy from the cloud using the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform. See the following link.

http://aws.amazon.com/ec2/

Amazon EC2 lets you run a remote Windows instance in the cloud. You can access this cloud-based Windows instance via Remote Desktop.

http://aws.amazon.com/windows/

Charges for EC2 are minimal and, more importantly, a free tier is available for 12 months. Details are at the following link.

http://aws.amazon.com/free/

Watch the following video which shows how to launch a Windows instance in Amazon EC2.

You may also watch the following tutorial which explains the same.

Detailed AWS EC2 documentation for managing Windows instances may be viewed at the following link.

http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/concepts.html

Once you connect to the Windows instance via Remote Desktop, you can download and install WebHarvy on it. Make sure that .NET 3.5 is installed on the Windows instance so that WebHarvy can run properly. Please contact us if you need any assistance.

↧

WebHarvy version 3.4 released !

June 9, 2015, 10:43 pm

We’ve just released a new WebHarvy update. The following are the changes in this version.

Major:

Support for pagination where a link/button has to be clicked to load the next set of pages
URL based pagination – automatically increment a numeral in start page URL to load subsequent pages
One-click multiple image extraction from details pages (ex: capture multiple images from product details page)
Human emulation mode support for automatic pause injection
Online license activation introduced to prevent casual piracy

Minor:

‘Click’ option (Capture window > More Options > Click) can be used to navigate to the start page
Bug Fix : Data alignment issue in miner window data table when some records fields do not have a value (blank columns)
Bug Fix : Keyword based scraping when encoding is required
Scheduler option to overwrite or append the export file in case the file already exists
‘Follow this link’ option enabled in details pages (pages reached by following links from starting page).
Bug Fix : Images going blank in some cases while mouse hovers over them during configuration
Bug Fix : New lines and tabs escaped in JSON export
HtmlParser updated to parse elements from <HTML> tag, so META tags can be extracted from the full HTML source of the page
Handles commas in keywords (Keyword Scraping)
Starts with a random proxy address from the proxy list while rotating proxies
In-built browser emulates IE 11 by default.

Download the latest version of WebHarvy Web Data Extraction Software.

↧

WebHarvy : 2 new methods of handling pagination

September 30, 2015, 3:23 am

The latest version of WebHarvy Web Scraper supports 2 new types of pagination styles for scraping data from multiple pages of websites.

Pages where pagination links are shown in sets

In these types of pages the pagination links are provided in sets. For example the first 5 pages will have direct links to load each of them at the bottom of the page. To load pages 6 to 10, an additional link should be clicked. Now each of the pages 6 to 10 will have direct links to load any of them at their page end, and also a link to load the next set of 5 pages.

WebHarvy Online Help : Scraping pages where pagination links are displayed in sets

The following video demonstrates how these types of pages can be configured and mined using WebHarvy.

When each page URL contains the page number

Suppose the pages from which you need to scrape multiple listings of data have the following format.

http://www.example.com/search/listing?keywords&pageNumber=1
http://www.example.com/search/listing?keywords&pageNumber=2
http://www.example.com/search/listing?keywords&pageNumber=3
http://www.example.com/search/listing?keywords&pageNumber=4
etc..

Pagination in this case can be handled easily by following the method below :-

1. Open WebHarvy and load http://www.example.com/search/listing?keywords&pageNumber=1.
2. Start Config
3. Select the required data from the page. Follow links and select data if required.
4. Select Edit menu > Edit Options > Add/Remove URLs from Configuration
5. Paste the following URL and Apply.

http://www.example.com/search/listing?keywords&pageNumber=%%pagenumber%%

Note that the actual page number is replaced by %%pagenumber%% in the above string.

6. Stop Config
7. Start Mine. You should specify the number of pages to mine since ‘Mine all pages’ option will be disabled. WebHarvy will automatically find and load the next pages and extract data.
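To make the placeholder substitution concrete, here is a minimal Python sketch of what expanding such a URL amounts to. It is not WebHarvy's implementation; the `expand_page_urls` helper and its parameters are hypothetical names used only for illustration.

```python
PLACEHOLDER = "%%pagenumber%%"


def expand_page_urls(template, first_page, page_count):
    """Replace the placeholder with consecutive page numbers, one URL per page."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain " + PLACEHOLDER)
    return [
        template.replace(PLACEHOLDER, str(number))
        for number in range(first_page, first_page + page_count)
    ]


template = "http://www.example.com/search/listing?keywords&pageNumber=%%pagenumber%%"
page_urls = expand_page_urls(template, first_page=1, page_count=4)
for url in page_urls:
    print(url)
```

Because the URLs are generated rather than discovered, there is no natural stopping point, which matches the need to specify the number of pages to mine in step 7.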

WebHarvy Online Help : URL page-number based auto pagination

The latest version of WebHarvy Visual Web Scraper can be downloaded from https://www.webharvy.com/download.html. Try it, and if you need any assistance, please do not hesitate to contact our support team.

↧

WebHarvy crashes after installing the latest Windows update for Adobe Flash

December 31, 2015, 11:07 pm

Microsoft released a new security update for Adobe Flash Player for Internet Explorer (IE) a few days ago (Dec 29, 2015). This update has caused many software titles to crash (including Skype; see Skype Crash). See http://borncity.com/win/2015/12/30/windows-10-flash-update-kb3132372-issues/ for a list of other software titles affected by this update.

InfoWorld Article : Win10 Flash patch KB 3132372 breaks Skype, HP Solutions Center, Incredimail, games

KB3132372
https://support.microsoft.com/en-us/kb/3132372

Solution ?

The solution to this problem is to uninstall the security update – KB3132372. See How to remove updates.

Meanwhile, we will see whether we can update WebHarvy to work around this issue. We also hope that Microsoft will release another security update which solves this problem, since many software titles, including their own Skype, seem to be affected.

Update ! (Jan 5, 2016)
Microsoft has released another update to fix the issues created by KB3132372. See https://support.microsoft.com/en-us/kb/3133431 for details. We are yet to test and confirm whether this completely solves the issue.

We are extremely sorry for the inconvenience this has caused our existing customers and trial users. If you have any questions or need assistance, please do not hesitate to contact our support.

↧

WebHarvy 4.0.2.125 – Multi-level Category / Multi-list Keyword scraping

June 20, 2016, 7:49 pm

In this release we have introduced support for scraping multi-level categories (a tree of main categories and subcategories) and for multiple input keyword lists. The main features are:-

True multi-level Category Scraping

WebHarvy now supports automatically navigating category/subcategory lists of a website to extract data from the final listing pages. Know More

Support for multiple input keywords

Any number of input text fields can be populated with lists of strings/keywords during configuration. WebHarvy will automatically apply all combinations of provided keywords during the mining phase. Know More.
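"All combinations" here means the Cartesian product of the keyword lists. A small Python sketch of the idea follows; the two lists and the `keyword_combinations` helper are invented for illustration and say nothing about how WebHarvy orders or applies the combinations internally.

```python
from itertools import product


def keyword_combinations(*keyword_lists):
    """Return every combination that takes one keyword from each list."""
    return list(product(*keyword_lists))


# Hypothetical keyword lists for two input fields (for example 'where' and 'what').
cities = ["Boston", "Chicago"]
trades = ["plumber", "electrician", "roofer"]

combos = keyword_combinations(cities, trades)
for city, trade in combos:
    print(city, trade)
```

The number of searches grows as the product of the list lengths (2 x 3 = 6 here), so long keyword lists multiply mining time accordingly.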

Capture window with new options

Run JavaScript on Page

Run specified JavaScript code on the page (know more). This option can be used to load page elements which cannot be loaded using the default navigation options (link-follow, click) provided by WebHarvy.

Input strings to text input fields

Strings to be input to text fields can now be made a part of the configuration. Know More. Earlier such parameters were automatically taken from the PostData of the configuration. But sometimes, with some websites, the PostData will not contain the input strings submitted and this option helps to correctly load the page displaying data during mining phase.

Extract data from Popups

Helps to extract data by clicking each listing link/button and reading the data from a popup window, or from a view on the same page which is populated with data. Know More. This differs from the ‘Follow this link’ option because the data is loaded on the same page (no page navigation), and from the ‘Click’ option because data has to be extracted from the page after each click, before the next link is clicked.

Option to smoothly scroll page during mining to load all contents (lazy loading)

Smoothly scrolls to the end of the page to load elements (for example lazily loaded images) which load only when they are scrolled into view. Know More.

Select drop-down/list-box options

Select drop-down/list-box/combo-box options during configuration and mining. Again this option allows navigation to result pages when normal configuration is unable to make these selections and load the result page. Know More.

Other Minor Additions Include :-

Improvements in automatic scraping of multiple product images
Support for loading keyword lists directly from file
‘Capture Image’ option automatically enabled via HTML/RegEx method in applicable cases.
Name downloaded image files by value obtained from a column/cell in miner data table. More.
Allows applying ‘Capture More Content’ after selecting ‘Capture HTML’.
Quick access to items under ‘More Options’ in Capture window via toolbar buttons.
Minor bug fixes.

You may download and try the latest version from https://www.webharvy.com/download.html.

↧

WebHarvy 4.0.3.128 (Minor Update)

December 5, 2016, 2:36 am

From this release onwards, WebHarvy targets (depends on) .NET 4.5, which comes pre-installed on the latest Windows editions. This makes installation smoother by doing away with the .NET 3.5 download and install that was previously required. Targeting .NET 4.5 also helps WebHarvy improve performance and resource usage, and solves issues related to crashes while extracting data from certain websites.

The changes in this release are :-

Depends on .NET 4.5
More support for pages where next page link is implemented in JavaScript
Handles pagination where next page link (next link or ‘show more data’ link) contains a number which varies from page to page
Minor bug fixes related to running JavaScript code on page, opening popup and following links by using regular expressions.

As always you may download and try the latest version from https://www.webharvy.com/download.html. Let us know in case you have any questions.

↧

Windows Smartscreen warning while installing WebHarvy

December 5, 2016, 3:54 am

All WebHarvy application files and the installation package are digitally signed (Comodo RSA Code Signing CA) and secured. However, if you get the following Smartscreen warning while trying to install the latest version of WebHarvy, please click the ‘More info’ link and then click the ‘Run anyway’ button to proceed with the installation.

The above popup message is displayed because we recently changed our .NET dependency from 3.5 to 4.5, thereby considerably reducing the installation package size, and, more importantly, because the code signing authority for our digital certificate has changed from GlobalSign to Comodo. The warning may therefore appear until the new WebHarvy installer builds up enough reputation with Microsoft, which will take a few weeks. If you have any questions or require assistance, please do not hesitate to contact our support.

↧

WebHarvy 4.0.3.129 (Installer Update Only)

December 29, 2016, 7:05 pm

This update addresses problems in installing .NET 4.5 on Windows 7 (and earlier Windows versions where .NET 4.5 is not present) during the installation process. Only the installer has been updated in this release; the WebHarvy application files are unchanged from the previous version. If you are already running 4.0.3.128, you can ignore this version.

You may download and try the latest version from https://www.webharvy.com/download.html. Let us know in case you have any questions.

↧

Scraping high resolution images from pinterest.com

January 16, 2017, 4:06 am

In this blog post, we will take a look at how to scrape images from www.pinterest.com in their full sizes. We follow a two-stage extraction process to capture the high-res images from pinterest.com.

In the first extraction stage, we capture the image URLs which are present in the listings page. These URLs actually point to smaller sized images (236 pixels). Then, using any text editor, we replace /236x/ with /564x/ in all the URLs.

For example the URL : https://s-media-cache-ak0.pinimg.com/236x/99/….

is modified to : https://s-media-cache-ak0.pinimg.com/564x/99/….

In the second extraction stage, we use the ‘Add URLs’ method to add the modified URLs and scrape the full sized images (564 pixels) from each of these URLs using a single WebHarvy configuration.
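The find-and-replace between the two stages can also be scripted instead of done by hand in a text editor. The Python sketch below assumes the captured URLs are available as a list of strings; the `to_full_size` helper and the example URLs (on a placeholder host) are hypothetical.

```python
def to_full_size(url, small="/236x/", large="/564x/"):
    """Swap the size segment of an image URL, touching only its first occurrence."""
    return url.replace(small, large, 1)


# Hypothetical URLs standing in for the list captured in the first stage.
small_urls = [
    "https://img.example.com/236x/ab/cd/photo1.jpg",
    "https://img.example.com/236x/ef/gh/photo2.jpg",
]

full_size_urls = [to_full_size(url) for url in small_urls]
for url in full_size_urls:
    print(url)
```

The resulting list is what would be pasted in for the second extraction stage. URLs that do not contain the /236x/ segment pass through unchanged.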

This method is displayed in the following video :

Have any questions ?

↧

WebHarvy 4.1.5.141 released

May 2, 2017, 3:09 am

The main changes in this release are :-

Pagination via JavaScript – see https://www.webharvy.com/tour3.html#JS
This powerful feature is the main highlight of this release. When all other methods of pagination fail, you can use this method to directly provide JavaScript code which, when run, loads the next page.
Increased size of virtual browser used by miner
The dimensions of the miner’s virtual browser have been increased. This solves issues with websites whose layout changes when the browser window is small (mobile layout). It also helps the miner load more items in a single page and scroll, for websites which display data based on the size of the browser window.
Support for ‘Load more content‘ & ‘Scroll to load next page‘ type pagination even when the real listing page is reached by clicking links/buttons from the start page.
In earlier versions, pagination would fail if the listing page loaded more data in the same page via a button/link click or scroll, and the configuration itself needed initial navigation (click, JavaScript etc.) to reach the listing page from another start page. This release removes this limitation.
More support for extracting data from popups.
Popups now handle clicks and JavaScript. This can be used to close the popup window in cases where the currently open popup must be closed before the next one can be opened.
SQL data export encoding issue related to foreign languages fixed.
Encoding issues while exporting text in non-English languages like Chinese fixed.
Other minor bug fixes

As always you may download and install the latest version from https://www.webharvy.com/download.html.

↧

WebHarvy based on Google Chrome Released (version 5.0.1.148)

September 13, 2017, 12:26 am

This release comes with few bells and whistles, since we have not added features or changed the look of the software. Still, this is a major upgrade: the change is all internal.

WebHarvy has used Microsoft’s Internet Explorer (IE) as its internal browser since inception. Microsoft stopped actively developing IE a few years back when it introduced the Edge browser.

So WebHarvy had to switch to another solution to power its internal browser, and we believe using Google’s Chrome Browser Project is the way forward. This makes WebHarvy more stable, faster and more secure. Switching to Chrome also opens up the possibility of porting the software to other platforms like Mac and Linux.

You may download and install the latest version which is based on Chrome browser from the following link.

http://www.webharvy.com/webharvysetup.exe

As mentioned before, the change from IE to Chrome is internal to the software and does not affect the user interface. The configuration process and user interface of WebHarvy remain the same.

Minor Changes

For scraping data from sites which require login, the steps have been simplified. You no longer need to log in to the website separately from IE. See https://www.webharvy.com/articles/sites-requiring-login.html
The ‘Internet Options’ menu option under Edit menu has been removed. Instead a new Browser options tab has been added in Settings window.

Running configuration files created using the older version which was based on IE on this new version based on Chrome

Configuration files created using the old version should normally work fine with the new version which is based on Chrome, but there will be exceptions. In such cases we recommend that you create a new configuration using the latest version.

As always, if you have any questions or need assistance you may contact our support at https://www.webharvy.com/support.html

↧

WebHarvy 5.1 released (Includes direct Excel Export)

January 7, 2018, 11:16 pm

The following are the changes in 5.1.0.152 :

New Features :

Excel export – supports directly saving mined data as an Excel file (details)
Handles page numbers in JavaScript code to load next page data (details)
Updated Chromium engine from V54 to V62

Minor changes :

Default values of ‘Enable Plugins’ and ‘Enable Browser Security’ in Browser Settings set to false (details)
Browser address bar can be used for Google search

Bug fixes :

Fixed issues related to handling headers and post data for HTTP requests
Fixed issue in selecting data using mouse when Zoom-level of browser is not equal to 1 (zoomed in or zoomed out)
Text formatting issues (line-breaks, spaces) in Capture window fixed
Fixed issue where order of applying capture-html and capture-more-content was relevant (for applying regex to follow links or to capture images)
Bug fix in editing keywords. With the previous version changing the first keyword was not possible.
Minimizes memory usage in mining thread by limiting the number of browser instances created

As always, the latest version may be downloaded and installed from the following page :

https://www.webharvy.com/download.html

↧

WebHarvy 5.2 | UI revamp + Oracle db support

March 26, 2018, 6:37 am

Changes in 5.2 are mainly related to user interface and experience. The most visible change is the introduction of the ribbon menu system for providing easy access to most software features.

In addition to the main interface, other windows such as Scheduler and Export have also been updated. An ongoing export (to file or database) can now be canceled by the user.

As with every release, the Chrome browser has been updated as well. Issues related to the URL not updating in the address bar while navigating links in some websites have been fixed with this update.

An important non-UI feature addition in this release is the support added for exporting data to Oracle database. The default file export option is changed from CSV to Excel format.

All main settings are now displayed in snippet format in browser view’s status bar.

Help (videos, articles) related to the website loaded in the configuration browser is automatically loaded and displayed as a smart tip.

Miner Settings can now be opened and changed directly from the Miner window.

JavaScript can now be typed in multi-line code format.

Browser settings now include a new option to share user location to the loaded page.

In addition to the above this release also contains minor bug fixes and improvements as always. You may download and try the latest version from https://www.webharvy.com/download.html

↧

WebHarvy’s new user interface

May 4, 2018, 3:05 am

We have significantly updated the user interface of WebHarvy in the latest version available in our website and the following video explains how the features and options are laid out in the new UI. Existing users of older versions will find this video useful so that they know where to look for specific features and options.

↧

WebHarvy’s new blog at blog.webharvy.com

June 3, 2018, 10:14 pm

We are moving all posts related to WebHarvy from our company blog here to WebHarvy’s own dedicated blog at www.webharvy.com/whblog . All new articles, release updates, tips and tricks and case studies related to web scraping using WebHarvy will be published on the WebHarvy Blog. So please make sure that you subscribe to and bookmark the new blog.

http://webharvy.com/whblog/

http://webharvy.com/whblog/feed/

↧

WebHarvy 5.3 (Parallel Mining, Chrome Developer Tools)

October 23, 2018, 8:54 pm

‘How to increase mining speed ?’ was one of the questions our users asked most often. With previous versions, the main limitation was that when links had to be followed from the starting page to get the details of each listing, the miner took more time to scrape a page full of listings. This is because WebHarvy used to load links sequentially, one after the other, to scrape data.

Parallel Mining

Instead of processing links to be followed and extracted one after the other, the latest update of WebHarvy processes them in bulk, in parallel, using multiple mining threads. You can set the maximum number of parallel mining threads which WebHarvy uses in Advanced Miner Options window as shown below.

Providing a higher value for ‘Maximum number of parallel mining threads’ option in the above window will increase mining speed. But, to run more threads in parallel, WebHarvy will require more memory, processing power and internet-bandwidth. So we recommend that you increase this setting only based on your system’s CPU, installed physical memory (RAM) and internet speed.
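The general technique, a bounded pool of worker threads processing followed links concurrently, can be sketched in Python as below. This illustrates the concept only and is not WebHarvy's code: `scrape_detail_page` is a hypothetical placeholder that fabricates a record instead of loading a page, and `max_threads` plays the role of the ‘Maximum number of parallel mining threads’ setting.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_detail_page(url):
    """Hypothetical placeholder for loading one followed link and extracting its data."""
    return {"url": url, "title": "Listing " + url.rsplit("/", 1)[-1]}


def mine_in_parallel(urls, max_threads):
    """Process the links with at most max_threads workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(scrape_detail_page, urls))


detail_urls = ["http://www.example.com/item/%d" % n for n in range(1, 6)]
records = mine_in_parallel(detail_urls, max_threads=3)
for record in records:
    print(record["url"], record["title"])
```

Raising `max_threads` lets more pages load at once, at the cost of more memory, CPU and bandwidth, which is the trade-off described above.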

Chrome Developer Tools

This feature is for power users who are familiar with web page internals like HTML, DOM structure and JavaScript. We use this tool extensively while supporting our customers with less straightforward scraping scenarios and complex websites.

Chrome Developer Tools allow you to easily inspect the internal structure of a web page, see how the page is organised, view the HTML and the data hidden in the HTML source, and devise methods to extract them. You can also find the JavaScript code that runs when buttons/links are clicked and call it directly using these features.

More Accurate Automatic Sub-Text Selection

To scrape only a portion of the text displayed in the Capture window, you can highlight the required portion with mouse. We have improved the accuracy of this method, especially when the text selected is in between delimiter characters like currency symbols, punctuation/special characters, new line/space etc.

Improvements And Bug Fixes

Improved select dropdown option. This option now reflects the selection (selected item change) on the page. Earlier separate JavaScript code needed to be run by the user to reflect page change upon dropdown list selection.
Miner now scrolls the page before clicking on Load More links. This is done to make sure that the ‘load more’ link is visible and loaded before miner tries to click it.
When text scaling in Windows is not set to 100% (which is the recommended setting on most systems), it was not possible to click and correctly select the required data items during configuration. This issue is fixed in this version. Configuration time data selection works irrespective of text scaling.
Fixed issue related to downloading images behind SSL.
Non-visibility of miner window in multi monitor systems when monitor configuration changes is fixed.
Earlier, the Capture window would become unresponsive for a second or two after applying Regular Expression on HTML. This unresponsive state has been removed.
Added browser zoom level and number of parallel mining threads info in status bar of configuration browser.
Fixed issue with loading and displaying upgrade purchase page in cases where user’s license has expired.
Disabled ‘Mine all pages/Number of pages to mine’ controls while mining is in progress.
Updated internal browser to a more recent version of Chromium.

↧