Scraping 51job with Process Pools and Thread Pools
Published: 2020-10-12

Python: learning process pools and thread pools


Scraping with a process pool

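from multiprocessing import Pool  # process pool
import json
import re

import requests

# Search-results URL; {} is replaced by the page number.
URL = "https://search.51job.com/list/010000,000000,0000,00,9,99,%25E9%2594%2580%25E5%2594%25AE,2,{}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="


def run(page):
    """Scrape one results page and print each job's name and salary."""
    print("Start scraping page", page)
    res = requests.get(URL.format(page), headers={'user-agent': "baiduspider"}, timeout=10).text
    # The job data is embedded in the page as JSON assigned to __SEARCH_RESULT__.
    rule = '__SEARCH_RESULT__ = (.*?)</script>'
    matches = re.findall(rule, res)
    if not matches:
        return  # no embedded result data on this page
    job_dict = json.loads(matches[0])
    for job in job_dict['engine_search_result']:
        if not job['providesalary_text']:
            job['providesalary_text'] = "Negotiable"
        print(job['job_name'], job['providesalary_text'])


if __name__ == '__main__':
    pool = Pool(10)  # 10 worker processes
    # Submit one task per page so that no page is scraped twice.
    for i in range(1, 100):
        pool.apply_async(run, (i,))
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all workers to finish
    print("Scraping finished")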
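The parsing step in run can be tried without touching the network. The sketch below applies the same regular expression and JSON decoding to a made-up HTML fragment. SAMPLE_HTML and extract_jobs are hypothetical names introduced for illustration, and the sample data is invented; the real page may be laid out differently.

```python
import json
import re

# Invented sample that mimics the structure the scraper expects.
SAMPLE_HTML = (
    '<script type="text/javascript">window.__SEARCH_RESULT__ = '
    '{"engine_search_result": ['
    '{"job_name": "Sales Rep", "providesalary_text": "8-10k"}, '
    '{"job_name": "Account Manager", "providesalary_text": ""}]}'
    '</script>'
)


def extract_jobs(html):
    """Return (job_name, salary) pairs found in the embedded JSON."""
    matches = re.findall(r'__SEARCH_RESULT__ = (.*?)</script>', html)
    if not matches:
        return []  # nothing embedded in this page
    data = json.loads(matches[0])
    # An empty salary string is replaced by a placeholder, as in the scraper.
    return [(job['job_name'], job['providesalary_text'] or "Negotiable")
            for job in data['engine_search_result']]


jobs = extract_jobs(SAMPLE_HTML)
print(jobs)
```

Returning an empty list when the pattern is missing avoids the IndexError that indexing the findall result with [0] would raise on a blocked or redesigned page.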

Scraping with a thread pool

import json
import re
from concurrent.futures import ThreadPoolExecutor

import requests

# Search-results URL; {} is replaced by the page number.
URL = "https://search.51job.com/list/010000,000000,0000,00,9,99,%25E9%2594%2580%25E5%2594%25AE,2,{}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="


def run(url):
    """Scrape one results page and print each job's name and salary."""
    print("Start scraping")
    res = requests.get(url, headers={'user-agent': "baiduspider"}, timeout=10).text
    # The job data is embedded in the page as JSON assigned to __SEARCH_RESULT__.
    rule = '__SEARCH_RESULT__ = (.*?)</script>'
    matches = re.findall(rule, res)
    if not matches:
        return  # no embedded result data on this page
    job_dict = json.loads(matches[0])
    for job in job_dict['engine_search_result']:
        if not job['providesalary_text']:
            job['providesalary_text'] = "Negotiable"
        print(job['job_name'], job['providesalary_text'])


if __name__ == '__main__':
    # The with block waits for all submitted tasks before exiting.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for i in range(1, 1000):
            pool.submit(run, URL.format(i))
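One thing to be aware of with submit: it returns a Future, and an exception raised inside a worker stays hidden until result() is called on that Future. The sketch below shows one way to collect results and failures with as_completed. fetch_page is a hypothetical stand-in that needs no network and deliberately fails on page 3; crawl is likewise an illustrative name, not part of the original script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(page):
    """Hypothetical stand-in for the real request; fails on page 3."""
    if page == 3:
        raise ValueError("no __SEARCH_RESULT__ on page 3")
    return [f"job-{page}-{k}" for k in range(2)]


def crawl(pages, workers=4):
    """Run fetch_page over all pages; return (sorted jobs, failed pages)."""
    jobs, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each Future back to its page number for error reporting.
        futures = {pool.submit(fetch_page, p): p for p in pages}
        for fut in as_completed(futures):
            page = futures[fut]
            try:
                jobs.extend(fut.result())  # re-raises the worker's exception
            except Exception as exc:
                failed.append((page, str(exc)))
    return sorted(jobs), failed


jobs, failed = crawl(range(1, 6))
print(len(jobs), "jobs collected; failed pages:", failed)
```

Gathering results in the main thread also keeps printing in one place, so output from different workers does not interleave.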

Notes

1. Multithreading suits I/O-bound programs.

2. Multiprocessing suits CPU-bound programs.
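The first note can be checked with a small timing experiment. This is only a sketch under a stated assumption: fake_request is a hypothetical function that simulates waiting on the network with time.sleep, so the numbers say nothing about 51job itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(page):
    """Hypothetical stand-in for a network call: just waits 0.2 s."""
    time.sleep(0.2)
    return page


def timed(func):
    """Call func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def serial(pages):
    return [fake_request(p) for p in pages]


def threaded(pages):
    # One thread per page; the waits overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=len(pages)) as pool:
        return list(pool.map(fake_request, pages))


pages = list(range(1, 6))
serial_result, serial_time = timed(lambda: serial(pages))
threaded_result, threaded_time = timed(lambda: threaded(pages))
print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

While a thread is waiting it does not need the CPU, so other threads can run in the meantime. For CPU-bound work the waits are replaced by computation, and in CPython separate processes are the usual way to spread that computation across cores.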

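Coroutines:

Put simply, a coroutine is a unit of execution even lighter than a thread, which is why it is also called a micro-thread.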

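  1. Coroutines are also called micro-threads or fibers.
    A coroutine is a user-level abstraction: the operating system has no coroutine object and schedules only processes and threads.
  2. The purpose of coroutines is to coordinate program execution.
    When one line of execution stalls, for example while waiting on I/O, the program can switch to another.
  3. The key to implementing coroutines is suspension, which in Python is done with
    yield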
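A minimal sketch of suspension with yield, assuming plain generators and a hand-written scheduler: each task pauses at yield, and the scheduler resumes the tasks in turn. The names task and round_robin are hypothetical and exist only for this illustration.

```python
from collections import deque


def task(name, steps):
    """A generator-based coroutine: it suspends at every yield."""
    for i in range(steps):
        yield f"{name} step {i}"  # execution pauses here until resumed


def round_robin(tasks):
    """Resume each task in turn until all of them are exhausted."""
    log = []
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            log.append(next(current))  # run the task up to its next yield
            queue.append(current)      # not finished: back of the line
        except StopIteration:
            pass                       # finished: drop it
    return log


log = round_robin([task("A", 2), task("B", 3)])
for line in log:
    print(line)
```

Both tasks make progress in alternation inside a single thread, and every switch happens at a point the task itself chose.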
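Advantages:
    1. High concurrency, high scalability, and low overhead: a single CPU can support tens of thousands of coroutines without trouble, so they are well suited to high-concurrency workloads.
    2. No thread context-switch overhead. (What does this mean? Because of the global interpreter lock, CPython executes Python bytecode in only one thread at a time, so threaded concurrency comes from switching rapidly between tasks, doing a little of each, until they appear to finish together, which looks like parallel threads or processes. Coroutines switch inside a single thread, so they avoid the cost of those thread switches.)

Disadvantages:
    1. Coroutines cannot take advantage of multiple CPU cores. This is easy to see: a process contains threads, and coroutines are a finer subdivision that runs inside a single thread, so they cannot spread across cores. They can be combined with multiple processes for CPU-intensive computation, although that is rarely needed in everyday work.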
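The trade-offs listed here can be observed with the standard-library asyncio module, which this post does not otherwise use. In this sketch, worker and main are hypothetical names; ten thousand coroutines each suspend once, and the thread id recorded by each one shows that they all ran on the same thread.

```python
import asyncio
import threading


async def worker(n):
    """Suspend briefly, then report which thread resumed us."""
    await asyncio.sleep(0.01)  # suspension point: other coroutines run meanwhile
    return n, threading.get_ident()


async def main(count):
    # Schedule all coroutines on one event loop and wait for every result.
    return await asyncio.gather(*(worker(n) for n in range(count)))


results = asyncio.run(main(10_000))
thread_ids = {tid for _, tid in results}
print(len(results), "coroutines ran on", len(thread_ids), "thread(s)")
```

Because everything shares one thread, a coroutine that computes for a long time without reaching an await blocks all the others, which is the single-core limitation in practice.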