4

midway的使用教程 - ataola

 1 year ago
source link: https://www.cnblogs.com/cnroadbridge/p/16361744.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

一、写在前面

先说下本文的背景,这是一道笔者遇到的Node后端面试题,遂记录下,通过本文的阅读,你将对楼下知识点有所了解:

  • midway项目的创建与使用
  • typescript在Node项目中的应用
  • 如何基于Node自身API封装请求
  • cheerio在项目中的应用
  • 正则表达式在项目中的应用

二、midway项目的创建和使用

第一步: 输入命令**npm init midway**初始化midway项目

第二步:选择**koa-v3 - A web application boilerplate with midway v3(koa)**,按下回车

➜  www npm init midway
npx: installed 1 in 4.755s
? Hello, traveller.
  Which template do you like? …

 ⊙ v3
▸ koa-v3 - A web application boilerplate with midway v3(koa)
  egg-v3 - A web application boilerplate with midway v3(egg)
  faas-v3 - A serverless application boilerplate with midway v3(faas)
  component-v3 - A midway component boilerplate for v3

 ⊙ v2
  web - A web application boilerplate with midway and Egg.js
  koa - A web application boilerplate with midway and koa

第三步:输入你要创建的项目名称,例如**“midway-project”****, ****What name would you like to use for the new project? ‣ midway-project**

第四步:跟着提示走就好了,分别执行**cd midway-project****npm run dev**, 这个时候如果你没有特别设置的话,打开**http://localhost:7001**就可以看到效果了

➜  www npm init midway
npx: installed 1 in 4.755s
✔ Hello, traveller.
  Which template do you like? · koa-v3 - A web application boilerplate with midway v3(koa)
✔ What name would you like to use for the new project? · midway-project
Successfully created project midway-project
Get started with the following commands:

$ cd midway-project
$ npm run dev


Thanks for using Midway

Document ❤ Star: https://github.com/midwayjs/midway



   ╭────────────────────────────────────────────────────────────────╮
   │                                                                │
   │      New major version of npm available! 6.14.15 → 8.12.1      │
   │   Changelog: https://github.com/npm/cli/releases/tag/v8.12.1   │
   │               Run npm install -g npm to update!                │
   │                                                                │
   ╰────────────────────────────────────────────────────────────────╯

➜  www

具体的官网已经写的很详细了,不再赘述,参见:

三、如何抓取百度首页的内容

3.1、基于node自身API封装请求

在node.js的https模块有相关的get请求方法可以获取页面元素,具体的如下请参见:,我把它封装了一下

import { get } from 'https';

async function getPage(url = 'https://www.baidu.com/'): Promise<string> {
  let data = '';
  return new Promise((resolve, reject) => {
    get(url, res => {
      res.on('data', chunk => {
        data += chunk;
      });

      res.on('error', err => reject(err));

      res.on('end', () => {
        resolve(data);
      });
    });
  });
}

额,你要测试这个方法,在node环境的话,其实也很简单的,这样写

(async () => {
  const ret = await getPage();
  console.log('ret:', ret);
})();

四、如何获取对应标签元素的属性

题目是,从获取的HTML源代码文本里,解析出id=lg的div标签里面的img标签,并返回此img标签上的src属性值

4.1、cheerio一把梭

如果你没赶上JQuery时代,那么其实你可以学下cheerio这个库,它有这个JQuery类似的API ------为服务器特别定制的,快速、灵活、实施的jQuery核心实现.具体的参见:,github地址是:

在了解了楼上的知识点以后呢,那其实就很简单了,调调API出结果。下文代码块的意思是,获取id为lg的div标签,获取它的子标签的img标签,然后调用了ES6中数组的高阶函数map,这是一个幂等函数,会返回与输入相同的数据结构的数据,最后调用get获取一下并字符串一下。

 @Get('/useCheerio')
  async useCheerio(): Promise<IPackResp<IHomeData>> {
    const ret = await getPage();
    const $ = load(ret);
    const imgSrc = $('div[id=lg]')
      .children('img')
      .map(function () {
        return $(this).attr('src');
      })
      .get()
      .join(',');

    return packResp({ func: 'useCheerio', imgSrc });
  }

4.2、正则一把梭

看到一大坨字符串,嗯,正则也是应该要想到的答案。笔者正则不太好,这里写不出一步到位的正则,先写出匹配id为lg的div的正则,然后进一步匹配对应的img标签的src属性,是的,一步不行,那咱就走两步,最终结果和走一步是一样的。

 @Get('/useRegExp')
  async useRegExp(): Promise<IPackResp<IHomeData>> {
    const ret = await getPage();
    // 匹配id为lg的div正则
    const reDivLg = /(?<=<div.*?id="lg".*?>)(.*?)(?=<\/div>)/gi;
    // 匹配img标签的src属性
    const reSrc = /<img.*?src="(.*?)".*?\/?>/i;
    const imgSrc = ret.match(reDivLg)[0].match(reSrc)[1];

    return packResp({ func: 'useRegExp', imgSrc });
  }

五、单元测试

这里要实现两个测试点是,1、如果接口请求时间超过1秒钟,则Assert断言失败, 2、如果接口返回值不等于"//www.baidu.com/img/bd_logo1.png",则Assert断言失败
midway集成了jest的单元测试, 官网已经写的很详细了,具体的参见:

关于1秒钟这事,我们可以计算下请求的时间戳,具体的如下:

const startTime = Date.now();
// make request
const result: any = await createHttpRequest(app).get('/useRegExp');
const cost = Date.now() - startTime;

最后再断言下就好了 expect(cost).toBeLessThanOrEqual(1000);

最终的代码如下:

  it.only('should GET /useRegExp', async () => {
    const startTime = Date.now();
    // make request
    const result: any = await createHttpRequest(app).get('/useRegExp');
    const cost = Date.now() - startTime;

    // 2. 如果接口请求时间超过1秒钟,则Assert断言失败
    const {
      data: { imgSrc },
    } = result.body as IPackResp<IHomeData>;

    expect(imgSrc).not.toBe('//www.baidu.com/img/bd_logo1.png');
    notDeepStrictEqual(imgSrc, '//www.baidu.com/img/bd_logo1.png');
    expect(cost).toBeLessThanOrEqual(1000);
    expect(imgSrc).toBe('//www.baidu.com/img/flexible/logo/pc/index.png');
    deepStrictEqual(imgSrc, '//www.baidu.com/img/flexible/logo/pc/index.png');
  });

  it.only('should GET /useCheerio', async () => {
    const startTime = Date.now();
    // make request
    const result: any = await createHttpRequest(app).get('/useCheerio');
    const cost = Date.now() - startTime;

    const {
      data: { imgSrc },
    } = result.body as IPackResp<IHomeData>;

    expect(imgSrc).not.toBe('//www.baidu.com/img/bd_logo1.png');
    notDeepStrictEqual(imgSrc, '//www.baidu.com/img/bd_logo1.png');
    expect(cost).toBeLessThanOrEqual(1000);
    expect(imgSrc).toBe('//www.baidu.com/img/flexible/logo/pc/index.png');
    deepStrictEqual(imgSrc, '//www.baidu.com/img/flexible/logo/pc/index.png');
  });

六、写在后面

这里,如果你眼睛够细,你会发现一个很有意思的现象,你从浏览器打开百度首页,然后控制台输出楼上的需求是这样的

const lg = document.getElementById('lg');
undefined
lg.childNodes.forEach((node) => { if(node.nodeName.toLowerCase() === 'img') { console.log(node.src) } })
2VM618:1 https://dss0.bdstatic.com/5aV1bjqh_Q23odCf/static/superman/img/logo/logo_white-d0c9fe2af5.png
VM618:1 https://www.baidu.com/img/PCfb_5bf082d29588c07f842ccde3f97243ea.png
undefined

然而,通过Node自带的https库,你会发现//www.baidu.com/img/flexible/logo/pc/index.png这个
咦,震惊.jpg. 发生了什么?莫不是度度做了什么处理?
于是乎,我用wget测试了下wget -O baidu.html [https://www.baidu.com](https://www.baidu.com), 发现正常发请求是这样的

➜  tmp wget -O baidu.html https://www.baidu.com
--2022-06-10 00:36:17--  https://www.baidu.com/
Resolving www.baidu.com (www.baidu.com)... 182.61.200.6, 182.61.200.7
Connecting to www.baidu.com (www.baidu.com)|182.61.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2443 (2.4K) [text/html]
Saving to: ‘baidu.html’

baidu.html                                                      100%[=====================================================================================================================================================>]   2.39K  --.-KB/s    in 0s

2022-06-10 00:36:18 (48.3 MB/s) - ‘baidu.html’ saved [2443/2443]

➜  tmp cat baidu.html
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a>  <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号  <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
➜  tmp

但是当我给上模拟浏览器的请求后wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16" [https://www.baidu.com](https://www.baidu.com)

➜  tmp wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16"  https://www.baidu.com
--2022-06-10 00:38:53--  https://www.baidu.com/
Resolving www.baidu.com (www.baidu.com)... 182.61.200.7, 182.61.200.6
Connecting to www.baidu.com (www.baidu.com)|182.61.200.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

index.html                                                          [ <=>                                                                                                                                                  ] 350.76K  --.-KB/s    in 0.01s

2022-06-10 00:38:53 (35.1 MB/s) - ‘index.html’ saved [359175]

➜  tmp

这个是跟浏览器的行为一直的,输出的结果是三个img标签。

2055171-20220610005339008-2042947203.jpg

关于Node.js的https库对这块的处理我没有去深究了,我就是通过楼上的例子猜了下,应该是它那边服务器做了对客户端的相关判定,然后返回相应html文本,所以这里想办法给node.js设置一个楼上的user-agent我猜是可以得到跟PC一样的结果的,这个作业就交给读者了,欢迎在下方留言讨论!

项目地址: https://github.com/ataola/play-baidu-midway-crawler
线上访问: http://106.12.158.11:8090/


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK