Old Dog Gnaws on Crawlers: Using Cookies with Selenium

Old Dog Gnaws on Bones (老狗啃骨头) @Veiking 2021-02-18

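Picking up where the last post left off: this time we take the Cookies obtained from the Selenium login in the previous article and load them into a freshly opened browser, so that the new session starts out already logged in.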

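On the Baidu News mobile site, we picked a simple personal-center page to work with.

Clearly, the "退出登录" (Log out) element on that page is enough to tell whether we are logged in. A visitor who is not logged in never sees it, so we read its text and print it as our logged-in marker.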

Building on the Selenium simulated-login code from the previous article, we add a little:

package cn.veiking.selenium;

import java.util.Set;

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.Dimension;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.context.properties.EnableConfigurationProperties;
import org.springframework.stereotype.Component;

import cn.veiking.base.common.logs.SimLogger;
import cn.veiking.base.config.SeleniumConfig;

/**
 * @author :Veiking
 * @version :2020-12-30 Description: simulated login with Selenium
 */
@Component
@EnableConfigurationProperties(SeleniumConfig.class)
public class SeleniumForLogin {
    private SimLogger logger = new SimLogger(this.getClass());
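    // Cookies captured at login; kept in a static field so checkCookie() can reuse them after the login browser is closed
    private static Set<Cookie> cookies;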
    @Autowired
    private SeleniumConfig seleniumConfig;

    private WebDriver webDriver;
    private ChromeOptions options;
    private int sleepTime = 1000;

    private static String loginname = "dog@veiking"; // account
    private static String password = "********"; // password

    // Simulate login
    public void login() {
        // Initialize the driver
        checkInit();
        String url = "https://wappass.baidu.com/passport/?login&u=#/password_login"; // Baidu mobile login URL
        try {
            // Open the Baidu mobile login page
            webDriver.get(url);
            // Enter the user name
            webDriver.findElement(By.xpath("//*[@id=\"naPassWrapper\"]/section/article[1]/form/section[1]/div/input"))
                    .clear();
            webDriver.findElement(By.xpath("//*[@id=\"naPassWrapper\"]/section/article[1]/form/section[1]/div/input"))
                    .sendKeys(loginname);
            // Enter the password
            webDriver.findElement(By.xpath("//*[@id=\"naPassWrapper\"]/section/article[1]/form/section[2]/div/input"))
                    .clear();
            webDriver.findElement(By.xpath("//*[@id=\"naPassWrapper\"]/section/article[1]/form/section[2]/div/input"))
                    .sendKeys(password);
            // Click the login button
            webDriver.findElement(By.xpath("//*[@id=\"naPassWrapper\"]/section/article[1]/form/input")).click();
            // Wait patiently: Baidu's rotating captcha has to be solved by hand
            WebElement loginSuccess = null;
            while (null == loginSuccess) {
                try {
                    // After a successful login the page shows a user avatar in the top-right corner; use it as the success marker
                    loginSuccess = webDriver.findElement(By.xpath("//*[@id=\"personal-center\"]/div[2]/div[1]/img"));
                } catch (Exception e) {
                    // Avatar not found yet: sleep, then look again
                    Thread.sleep(sleepTime);
                }
            }
            logger.info("Login success !");
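        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag instead of swallowing it
            e.printStackTrace();
        }
        saveCookie();
        webDriver.quit(); // close all related windows and exit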
        webDriver = null;
    }

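    // Simulate loading the Cookies
    public void checkCookie() {
        checkInit();
        String url = "https://news.baidu.com/news#/profile/home"; // Baidu mobile personal (private) page
        // Open the personal page first, so the browser is on the site before the Cookies are added
        webDriver.get(url);
        // Load the Cookies
        for (Cookie cookie : cookies) {
            webDriver.manage().addCookie(cookie);
        }
        // Refresh: part of the page is rendered dynamically and does not show up until the page reloads
        webDriver.navigate().refresh();
        // Wait a moment to give the page time to render
        try {
            Thread.sleep(sleepTime);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag instead of swallowing it
        }
        String flag = webDriver.findElement(By.xpath("//*[@id=\"profile_view\"]/div[2]/div[3]/div/span")).getText();
        logger.info("Login success and checkCookie [flag={}]", flag);

        webDriver.quit(); // close all related windows and exit
        webDriver = null;
    }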

    // Initialization
    private void checkInit() {
        // Must be configured before the WebDriver is instantiated
        System.setProperty("webdriver.chrome.driver", seleniumConfig.getChromedriverPath());
        // Browser options
        options = new ChromeOptions();
        options.addArguments("--user-agent=Galaxy S5"); // present the browser as a mobile device
        webDriver = new ChromeDriver(options);
        webDriver.manage().window().setSize(new Dimension(500, 800)); // browser window size
    }

    // Handle the Cookies
    private void saveCookie() {
        // Fetch the Cookies of the current session
        cookies = webDriver.manage().getCookies();
        for (Cookie cookie : cookies) {
            logger.info("Login success and cookie [name={}|value={}]", cookie.getName(), cookie.getValue());
        }
    }

}
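The login loop above keeps polling for the avatar until it appears, which means it never returns if the captcha is never solved. As a minimal sketch of one way to bound that wait, the plain-Java helper below polls a probe until it yields a value or a timeout elapses. `BoundedPoll` and `until` are hypothetical names introduced here for illustration and are not part of the original code; Selenium also ships its own `WebDriverWait` for this purpose.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper (not from the original article): poll until a value appears or time runs out. */
final class BoundedPoll {
    private BoundedPoll() {}

    /**
     * Calls probe repeatedly until it returns a non-null value or the timeout elapses.
     * A RuntimeException thrown by probe is treated as "not ready yet".
     * Returns Optional.empty() on timeout or if the thread is interrupted.
     */
    static <T> Optional<T> until(Supplier<T> probe, Duration timeout, Duration interval) {
        long start = System.nanoTime();
        while (true) {
            try {
                T value = probe.get();
                if (value != null) {
                    return Optional.of(value);
                }
            } catch (RuntimeException notReadyYet) {
                // For example, the element lookup fails while the captcha is still on screen
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return Optional.empty(); // gave up: the caller decides how to report the failure
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
                return Optional.empty();
            }
        }
    }
}
```

In login(), the probe would be the avatar lookup, for instance `BoundedPoll.until(() -> webDriver.findElement(By.xpath(...)), Duration.ofMinutes(2), Duration.ofMillis(sleepTime))`, and an empty result would mean the login was not completed in time, so the Cookies should not be saved.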

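After a successful login we call saveCookie(). Then checkCookie() creates a brand-new WebDriver (the one used for login is gone: closing the browser destroys it) and loads the Cookies captured at login into it. We will see how the page renders in a moment.
A refresh is added here because this Baidu page loads its data dynamically based on the Cookies; without the refresh the logged-in state may not show up.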
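One detail worth keeping in mind: WebDriver only accepts a Cookie for the domain of the page that is currently open, which is why checkCookie() opens the target URL before adding anything. The Cookies here were captured on wappass.baidu.com but are replayed on news.baidu.com, so any Cookie scoped to the login host alone may be rejected (Selenium reports this as an `InvalidCookieDomainException`). The sketch below is a simplified domain check that could be used to skip such Cookies; `CookieDomains` and `matches` are hypothetical names, and the rule is a simplification of real browser cookie matching.

```java
import java.util.Locale;

/** Hypothetical helper (not from the original article): simplified cookie-domain matching. */
final class CookieDomains {
    private CookieDomains() {}

    /**
     * Returns true if a cookie with the given domain attribute would apply to host.
     * A leading dot is ignored, so ".baidu.com" and "baidu.com" behave the same.
     */
    static boolean matches(String host, String cookieDomain) {
        if (host == null || cookieDomain == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        String d = cookieDomain.toLowerCase(Locale.ROOT);
        if (d.startsWith(".")) {
            d = d.substring(1);
        }
        if (d.isEmpty()) {
            return false;
        }
        // Exact host, or host is a subdomain of the cookie domain
        return h.equals(d) || h.endsWith("." + d);
    }
}
```

Inside the loop in checkCookie() this would read roughly `if (CookieDomains.matches("news.baidu.com", cookie.getDomain())) { webDriver.manage().addCookie(cookie); }`, using the `getDomain()` accessor of Selenium's `Cookie` class.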

In the test class from the previous article we simply add one more line:

package cn.veiking.selenium;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import cn.veiking.StartTest;
import cn.veiking.base.common.logs.SimLogger;

/**
* @author    :Veiking
* @version    :2020-12-30
* Description :SeleniumForLoginTest test
*/
// JUnit 5 test. Assumption: Spring Boot 2.1 or later, where @SpringBootTest already registers
// the Spring extension, so the JUnit 4 @RunWith(SpringRunner.class) is not needed with the Jupiter @Test.
@SpringBootTest(classes = StartTest.class)
public class SeleniumForLoginTest {
    SimLogger logger = new SimLogger(this.getClass());

    @Autowired
    private SeleniumForLogin seleniumForLogin;

    @Test
    public void testLogin() {
        long startTime, endTime;
        logger.info("SeleniumForLoginTest [start={}] ", "crawl started");
        startTime = System.currentTimeMillis();

        // Simulate login
        seleniumForLogin.login();
        // Simulate loading the Cookies
        seleniumForLogin.checkCookie();

        endTime = System.currentTimeMillis();
        logger.info("SeleniumForLoginTest [end={}] ", "crawl finished, took about " + ((endTime - startTime) / 1000) + " s");
    }
}

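Run the test ...
The browser opens, goes through a flurry of actions, and closes once the login succeeds.
At this point the log output is worth a look: it lists each captured Cookie by name and value.

The Cookie data is in hand.
A new browser then opens automatically, and as expected, loading the Cookies is enough to put it in the logged-in state. With the Cookies in hand, we can send our spider to crawl the pages whose data we want!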
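Handing the Cookies to a crawler that does not drive a browser usually comes down to sending them in a `Cookie` request header. As a small sketch under that assumption, the helper below joins name/value pairs into the header format `name=value; name2=value2`. `CookieHeader` and `of` are hypothetical names, and the pairs would be collected from the Selenium Cookies via `getName()` and `getValue()`.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not from the original article): build a Cookie request header value. */
final class CookieHeader {
    private CookieHeader() {}

    /** Joins the pairs as "name=value; name2=value2", in the map's iteration order. */
    static String of(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }
}
```

Using a `LinkedHashMap` keeps the pairs in insertion order. The resulting string can then be set as the `Cookie` header through whatever header or cookie setter the crawler's HTTP client provides.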

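Logging in with Selenium and then using the captured Cookies to reach restricted pages lets our crawlers work in many more scenarios. One point still bears repeating: never use this to break the law or to profit at other people's expense. Letting the crawler do its utmost within what laws and regulations allow, saving the planet and benefiting humankind, is the right path.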
