Downloading a Douban Group Topic and Showing Only the Original Poster's Replies in Python

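How can you download a Douban group topic and then read only the original poster's replies? The script below has no third-party dependencies and was tested on Python 2.x; it has not been tested on Python 3. The code is rough: it parses the HTML with plain string searches, which is just enough to get the result. It is not pretty and will be improved later.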
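To make the "plain string search" idea concrete, here is a minimal sketch of pulling a value out from between two known markers. The helper name `between` and the sample fragment are hypothetical and only illustrate the technique; the markers are the ones the script relies on, and real Douban pages may look different.

```python
def between(text, start_marker, end_marker):
    """Return the substring between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start < 0:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end < 0:
        return None
    return text[start:end]


# Hypothetical fragment shaped like the markup the script expects.
sample = ('<a href="http://www.douban.com/people/alice/">'
          '<img class="pil" src="a.jpg"></a><p>Hello</p>')

user_id = between(sample, 'people/', '/"><img class="pil"')
body = between(sample, '<p>', '</p>')
print(user_id)
print(body)
```

Using `find` and checking for a negative result avoids the `ValueError` that `index` raises when a marker is absent, so a missing marker can be handled without a `try` block.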

# Download a Douban group topic and keep only the original poster's replies
# Anonymous 2011-12@SZ

import urllib2
import sys
import time

# Basic settings
post_url = "http://www.douban.com/group/topic/23871584/"
post_start = 0
split_prefix = '<li class="clearfix">'  # marks the start of each reply
poster_user_id = ''                     # filled in from the first page
page_size = 100
save_filename = 'douban-post.txt'
log_flag = True

f = open(save_filename, 'w')

print 'Start ... '
html = urllib2.urlopen(post_url + "?start=" + str(post_start)).read()

# find() returns -1 when the marker is missing; index() would raise ValueError
if html.find(split_prefix) < 0:
    print 'This post has no content: url=' + post_url + "?start=" + str(post_start)
    f.close()
    sys.exit(0)

# The original poster's id is read from the first 150 characters after 'topic-content'
cc = html.find('topic-content')
t_html = html[cc:cc+150]
poster_user_id = t_html[t_html.index('people')+7 : t_html.index('img')-4]

c = 0
page = 0
while True:
    page += 1
    if log_flag: print '\npage=%d * %d' % (page, page_size)
    c = (page-1) * page_size  # replies counted before this page

    # posts in current page
    posts = html.split(split_prefix)[1:]
    for p in posts:
        try:
            if p.find('people/') > 1:
                c += 1
                user_id = p[p.index('people/')+7 : p.index('/"><img class="pil"')]
                if user_id == poster_user_id:
                    # the 19 characters after <h4> are kept as a header (presumably the reply timestamp)
                    ss = '\n[' + p[p.index('<h4>')+4 : p.index('<h4>')+23] + " ]" + str(c) + "F " + user_id + " : " + p[p.index('<p>')+3 : p.index('</p>')]
                    if log_flag: print ss
                    f.write(ss)
                    f.flush()
        except ValueError:
            # a marker was missing in this reply; skip it
            print '[error] Parse post error'
            continue

    # next page
    post_start += page_size
    html = urllib2.urlopen(post_url + "?start=" + str(post_start)).read()
    if html.find(split_prefix) < 0:
        if log_flag: print 'Post is over !'
        break

    time.sleep(3)  # wait 3 seconds between requests to go easy on the Douban server


f.close()
print 'Finished !'
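
As one possible direction for the planned cleanup, the parsing can be pulled out of the download loop into a function that takes an HTML string, so it can be tried offline and runs unchanged on Python 2 and Python 3. This is only a sketch: `op_replies`, `_between`, and the sample page are hypothetical names and data, and the markers are simply the ones used in the script above, which may not match Douban's current markup.

```python
SPLIT_PREFIX = '<li class="clearfix">'  # same reply marker as the script above


def _between(text, start_marker, end_marker):
    """Return the substring between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start < 0:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    return text[start:end] if end >= 0 else None


def op_replies(html, poster_user_id, first_floor=0):
    """Return (floor, user_id, text) tuples for replies written by poster_user_id."""
    results = []
    floor = first_floor
    for chunk in html.split(SPLIT_PREFIX)[1:]:
        user_id = _between(chunk, 'people/', '/"><img class="pil"')
        if user_id is None:
            continue  # not a reply block we can parse
        floor += 1
        if user_id != poster_user_id:
            continue
        text = _between(chunk, '<p>', '</p>')
        if text is not None:
            results.append((floor, user_id, text))
    return results


# Hypothetical page fragment for an offline check.
sample_page = (
    '<ul>'
    '<li class="clearfix"><a href="/people/alice/"><img class="pil" src="a.jpg"></a><p>first</p></li>'
    '<li class="clearfix"><a href="/people/bob/"><img class="pil" src="b.jpg"></a><p>second</p></li>'
    '<li class="clearfix"><a href="/people/alice/"><img class="pil" src="a.jpg"></a><p>third</p></li>'
    '</ul>'
)

for floor, user_id, text in op_replies(sample_page, 'alice'):
    print('%dF %s: %s' % (floor, user_id, text))
```

Passing `first_floor` as `(page - 1) * page_size` would reproduce the floor numbering the script uses across pages.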

Original content from 玩蛇网. When reposting, please credit the source and include the URL: http://www.iplaypython.com/code/other/o2372.html



Published: 2016-04-14 11:02 玩蛇网 www.iplaypython.com
