Sunday, December 25, 2011

Downloading images with DEiXTo and wget

Many people download pictures and photos from websites of interest. Sometimes, though, the number of images to download from certain pages is so large that doing it manually is almost prohibitive. An automation tool can then save users time and repetitive effort, and DEiXTo can help towards this goal.
    Let's suppose that you want to get all images from a specific web page (respecting the site's terms of use). You can easily build a simple extraction rule by pointing at an image, using it as a record instance and setting the IMG rule node as "checked" (right click on the IMG node and select "Match and Extract Content"). The resulting pattern will look like this:
    [Screenshot: the resulting extraction pattern]
    Then, by executing the rule, you can extract the "src" attribute (essentially the URI) of each image found on the page and export the results to a txt file, let's say image_urls.txt. Last, you can use GNU Wget (a great free command line tool) to retrieve the files. A Windows (win32) version of wget is available for download. For example, on Windows you can just open a DOS command prompt window, change the current working directory to the folder where wget is stored (via the 'cd' command) and enter:
wget.exe -i image_urls.txt 
where image_urls.txt is the file containing the URIs of images. And voilà! The wget utility will download all the images of the target page for you!
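    One thing that can trip up this step is that "src" values are sometimes relative (for example /img/a.jpg), while wget needs full addresses. The Python sketch below is only an illustration of how the exported list could be tidied up first. It assumes the export holds one src value per line, and the function names and the output file name are hypothetical, not part of DEiXTo or wget.

```python
from urllib.parse import urljoin


def normalize_image_urls(lines, page_url):
    """Resolve each extracted src value against page_url, dropping blanks and duplicates."""
    seen = set()
    result = []
    for line in lines:
        src = line.strip()
        if not src:
            continue  # skip empty lines in the export
        absolute = urljoin(page_url, src)  # absolute URLs pass through unchanged
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def clean_url_file(in_path, out_path, page_url):
    """Read src values from in_path and write the cleaned list to out_path (assumed one per line)."""
    with open(in_path, encoding="utf-8") as handle:
        urls = normalize_image_urls(handle, page_url)
    with open(out_path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return urls
```

    Calling clean_url_file("image_urls.txt", "image_urls_clean.txt", page_url) with the address of the scraped page would produce a file that can be passed to wget with -i in the same way. Wget also has a --base option for resolving relative links in an input file, which may be enough if duplicates are not a concern.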
    What about getting images from multiple pages? You will have to explicitly provide the target URLs either through an input txt file or via a list. Both ways can be specified in the Project Info tab of the DEiXTo GUI tool.
    Thus, if you have the target URLs at hand or you can extract them with another wrapper (generating a txt file), then you can just pass them as input to the new image wrapper, which will do the laborious work for you.
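    If the target pages follow a predictable numbering scheme, the input txt file can also be generated by a small script. The following Python sketch is just one possible way to do it: the URL template, the page range and the file name are made up for illustration, and it assumes the input file should hold one URL per line.

```python
def build_page_urls(template, first, last):
    """Return one URL per page number by filling {page} in the template."""
    return [template.format(page=number) for number in range(first, last + 1)]


def write_url_list(urls, path):
    """Write the URLs to a text file, one per line (assumed input format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

    For example, write_url_list(build_page_urls("http://www.example.com/gallery?page={page}", 1, 5), "target_urls.txt") would create a five-line file that could then be selected in the Project Info tab.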
    In case all of the above is a bit unclear, we have built a sample wrapper project file (imdb_starwars.wpf) that extracts the URLs of all Star Wars (1977) thumbnail photos from the corresponding IMDb page. Please note that we set the agent to follow the Next page link so as to gather all thumbnails, since they are scattered across multiple pages. However, if you would like to get the large-size photos, you will have to add another scraping layer for extracting the links of the pages containing the full-size pictures.
    Anyway, in order to run the sample wrapper for the thumbnails, you should open the wpf (through the Open button in the Project Info tab) and then press the "Go!" button. Alternatively, you can use the command line executor from a DOS prompt:
deixto_executor.exe imdb_starwars.wpf
Finally, you will have to pass the image_urls.txt output file to wget in order to download all thumbnails and get the job done! May the Force be with you! :)
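    The two command line steps above can also be chained from a script. Below is a minimal Python sketch of that idea, not an official DEiXTo feature. It assumes that deixto_executor.exe and wget.exe are reachable from the working directory or the PATH, and that the wrapper writes its results to image_urls.txt. The function names and the optional output folder are hypothetical.

```python
import subprocess


def build_commands(wpf_file, urls_file, output_dir=None):
    """Return the two command lines from the post as argument lists."""
    extract = ["deixto_executor.exe", wpf_file]
    download = ["wget.exe", "-i", urls_file]
    if output_dir is not None:
        # -P is wget's directory-prefix option: save files under output_dir.
        download += ["-P", output_dir]
    return [extract, download]


def run_pipeline(wpf_file, urls_file, output_dir=None):
    """Run the extraction, then the download; stop if either command fails."""
    for command in build_commands(wpf_file, urls_file, output_dir):
        subprocess.run(command, check=True)
```

    Calling run_pipeline("imdb_starwars.wpf", "image_urls.txt", "thumbnails") would run the wrapper first and then let wget save the images into a thumbnails folder.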
