Web scraping with Scrapy

Post by Crimson King » Fri Jun 28, 2013 4:52 am

Nice tutorial, Setrofim.

I just have a little trouble using the pipelines: with your example as a base, I wanted not only to scrape the information for every comic but also to download its image to a directory (configured in settings.py).

I was able to do that with an example pipeline I found in the Scrapy docs:

Code:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results holds (success, info) pairs; keep paths of successful downloads only.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item


My only issue is the way the files are named: each one is stored under the SHA1 hash of its URL. For instance:

Code:
http://www.example.com/image.jpg

would be stored as

Code:
3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

(this is the example shown in the docs)

Is there any way to change the file's name before it is stored?

Thanks!

PS: for future reference, should I create a new topic instead of asking questions here?

Re: Web scraping with Scrapy

Post by stranac » Fri Jun 28, 2013 6:42 am

From the source code (found on Bitbucket), it looks like the image name is generated in the image_key() method of the ImagesPipeline class.
Implementing it in your subclass should change the name (though I can't test it right now).

And yes, you should post a new topic instead of posting in a tutorial thread.
I would move the post, but it's a pain to do on my phone.

Re: Web scraping with Scrapy

Post by Crimson King » Fri Jun 28, 2013 11:35 pm

Thank you, stranac. I added the image_key() method and it worked perfectly.
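
For anyone landing here later, this is a rough sketch of the kind of naming logic an image_key() override could return. It is not the code actually used in this thread. The helper name readable_image_key, the "full" subdirectory and the "unnamed" fallback are all made up for illustration, and it is written for Python 3 (urllib.parse) with no Scrapy import so it runs on its own. As far as I know, later Scrapy releases moved this hook from image_key() to file_path(), so check the docs for your version.

```python
import posixpath
from urllib.parse import unquote, urlparse


def readable_image_key(url, subdir="full"):
    """Build a storage key from the last path segment of an image URL.

    Hypothetical helper, not part of Scrapy. The "full" subdirectory and
    the "unnamed" fallback are assumptions made for this sketch.
    """
    # Take only the path part, so query strings do not end up in the name.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path) or "unnamed"
    # Normalize the extension to .jpg, assuming the pipeline saves JPEGs.
    stem, _ext = posixpath.splitext(name)
    return posixpath.join(subdir, stem + ".jpg")


# How it might plug into the pipeline subclass (left commented out because
# it needs Scrapy installed, and the hook name depends on the version):
#
# class MyImagesPipeline(ImagesPipeline):
#     def image_key(self, url):
#         return readable_image_key(url)
```

One thing to watch: two different URLs that end in the same file name would map to the same key and overwrite each other, which the SHA1 naming avoids. If that matters, prefixing the name with something unique to the item is one way around it.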