Python source code for scraping the images in an article

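This is a collected piece of source code that shows how to use Python to scrape the images in an article. The script reads content.html from its own directory, downloads every image referenced by an http:// src attribute into a download folder (one thread per image), and writes result.html with the image links rewritten to the local copies.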

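The code uses the following modules:
Python os module
Python time module
Python sys module (imported but not actually used)
Python re module (regular expressions)
Python threading module
Python urllib module
Python hashlib module (with the old md5 module as a fallback)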
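
As a quick illustration of what the re module contributes, the sketch below isolates the URL-extraction step. The pattern is the one used in the script; the sample HTML and the function name find_image_urls are assumptions made for illustration, and the snippet is written for Python 3.

```python
import re

# Same pattern as in the script: "src=", an optional quote, then a
# non-greedy match of an http:// URL up to the next space or quote.
IMG_SRC_PATTERN = re.compile(r"""src=['"]?(http://.*?)[ '"]""")


def find_image_urls(html):
    """Return every http:// URL found in a src attribute, in document order."""
    return IMG_SRC_PATTERN.findall(html)


# Hypothetical sample input, not taken from the article.
sample_html = (
    '<p>Hello</p>'
    '<img src="http://example.com/a.jpg" alt="a">'
    "<img src='http://example.com/b.png'>"
    '<img src="/local/c.gif">'
)

print(find_image_urls(sample_html))
# → ['http://example.com/a.jpg', 'http://example.com/b.png']
```

Note that the pattern only matches absolute http:// URLs, so relative paths and https:// links are left untouched.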

The source code for scraping the images in an article is as follows (for reference; it is written for Python 2):

import os,time,sys,re,threading
import urllib

DOWNLOAD_BASEDIR = os.path.join(os.path.dirname(__file__), 'download')

DOWNLOAD_BASEURL = './download/'

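# Create the download directory only if it does not exist yet,
# so the script can be run more than once.
if not os.path.isdir(DOWNLOAD_BASEDIR):
    os.mkdir(DOWNLOAD_BASEDIR)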

def md5sum(s):
    try:
        import hashlib
        m = hashlib.md5()
        m.update(s)
        return m.hexdigest()
    except ImportError:
        import md5
        m = md5.new()
        m.update(s)
        return m.hexdigest()
    
class Download(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
    def run(self):
##      print "downloading %s " % self.url
        f = urllib.urlopen(self.url)
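        # Drop parameters such as "; charset=utf-8" before splitting the MIME type.
        mime_type = f.headers.get('content-type','image/jpeg').split(';')[0].strip()
        content_type,extention = mime_type.split('/', 1)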
        if extention in ('jpeg','html'):
            extention = 'jpg'
        basename = "%s.%s" %( md5sum(self.url) , extention)
        self.filename = os.path.join(DOWNLOAD_BASEDIR, basename)
        self.local_url = DOWNLOAD_BASEURL + basename
        file(self.filename, 'wb').write(f.read())

content = file(os.path.join(os.path.dirname(__file__), 'content.html')).read()

pt=re.compile(r"""src=['"]?(http://.*?)[ '"]""")

# findall() already returns the list of captured URLs.
urls = pt.findall(content)
print time.ctime()

thread_pools = []

for url in urls:
    current = Download(url)
    thread_pools.append(current)
    current.start()

result_text = content    

for result in thread_pools:
    print "%s threads running" % threading.activeCount() 
    result.join(5)
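    # Rewrite the link only if the download finished and set local_url.
    if not result.isAlive() and hasattr(result, 'local_url'):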
##        print "url %s saved to %s" % (result.url, result.filename)
        result_text = result_text.replace(result.url, result.local_url)

file(os.path.join(os.path.dirname(__file__), 'result.html'), 'wb').write(result_text)
print "%s threads running" % threading.activeCount()

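# activeCount() includes the main thread, so more than 1 means that
# some download threads are still running.
if threading.activeCount() > 1:
    print "Can not stop"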
print time.ctime()
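
The script above targets Python 2 (print statements, file(), urllib.urlopen). For readers on Python 3, the following is a minimal sketch of the same idea, not a definitive port. The names local_basename, fetch_with_urllib and localize_images are hypothetical, the fetcher is passed in as a parameter so the logic can be tried without network access, and marking the threads as daemon threads is a design choice of this sketch rather than something the original does.

```python
import hashlib
import os
import tempfile
import threading
from urllib.request import urlopen


def local_basename(url, content_type="image/jpeg"):
    """Name the file after the MD5 of its URL, with an extension taken
    from the Content-Type value (parameters such as charset are dropped)."""
    subtype = content_type.split(";")[0].strip().split("/")[-1]
    if subtype in ("jpeg", "html"):
        subtype = "jpg"  # same mapping as the original script
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "%s.%s" % (digest, subtype)


def fetch_with_urllib(url):
    """Real fetcher: return (content_type, body_bytes) for a URL."""
    with urlopen(url) as response:
        return response.headers.get("Content-Type", "image/jpeg"), response.read()


class Download(threading.Thread):
    """Download one URL in its own thread, as the original script does."""

    def __init__(self, url, fetch, target_dir, base_url="./download/"):
        super().__init__()
        self.daemon = True  # assumption: a stuck download should not block exit
        self.url = url
        self.fetch = fetch
        self.target_dir = target_dir
        self.base_url = base_url
        self.local_url = None  # stays None if the download fails
        self.error = None

    def run(self):
        try:
            content_type, data = self.fetch(self.url)
            basename = local_basename(self.url, content_type)
            with open(os.path.join(self.target_dir, basename), "wb") as out:
                out.write(data)
            self.local_url = self.base_url + basename
        except Exception as exc:  # keep the failure for later inspection
            self.error = exc


def localize_images(html, urls, fetch, target_dir, timeout=5):
    """Download every URL and return html with finished links rewritten."""
    os.makedirs(target_dir, exist_ok=True)
    workers = [Download(url, fetch, target_dir) for url in urls]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join(timeout)
        # Rewrite only links whose download finished successfully.
        if not worker.is_alive() and worker.local_url:
            html = html.replace(worker.url, worker.local_url)
    return html


# Demo with a stand-in fetcher so that no network access is needed.
def fake_fetch(url):
    return "image/png", b"not really a PNG"


demo_dir = tempfile.mkdtemp()
demo_html = '<img src="http://example.com/pic.png">'
demo_result = localize_images(
    demo_html, ["http://example.com/pic.png"], fake_fetch, demo_dir
)
print(demo_result)
```

To download real images, pass fetch_with_urllib instead of fake_fetch and feed in the URLs extracted from the page. Links whose download fails or times out are left pointing at the original address.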

Originally published by 玩蛇网. When reposting, please credit the article's source and URL: http://www.iplaypython.com/code/graphics/gr2576.html



Published: 2016-03-18 15:04, 玩蛇网 www.iplaypython.com
