从excel里面读取图片 | 代码魔改师

欢迎转载，请支持原创，保留原文链接:blog.ilibrary.me

Excel里面可以插入图片。插入以后用代码读出来比较麻烦。

没有直接的解决方案。

一个间接的解决方案是把excel另存为html, 然后用nokogiri解析html, 获取图片引用.

Notes:

实战显示, 嵌套的frameset[1]/frame[1]下面嵌套的#document/html/body/table没法正常通过nokogiri xpath读出, chrome $x('//table')也不行. 通过css也不行。里面的#document节点可能是js动态写入的。
1. 读了一下htm的源码，确实是js动态写入的。最外层htm文件只是一个壳，它会另外加载*.fld/目录下的sheet001.htm文件,
解决办法是手动解析*.fld/下的sheet001.htm文件. 这个方法有2个问题，另存为的图片是压缩图片, 得不到高清图片。第2个问题是内嵌在htm里面的base64binary不知道该如何解码, 也得不到高清图片.
1. (有问题，图片是缩略图, 解决办法见step 2)sheet001.htm里面有<td> <v:shape o:gfxdata="xxxxx..." > <v:imagedata src="image060.png" o:title=""/>通过base64编码的图片内容，不用管。td 里面还嵌套了一个table, 里面有一个<td><img width=68 height=68 src=image060.png v:shapes="图片_x0020_62"></td>, 可以读取图片.
2. 缩略图的问题可以通过读取htm里面base64编码的原图片来解决. 问题又来了，这段base64编码的图片在一段comment里面, 读取起来有点痛苦.
  1. 先用td.xpath("./comment()[1]")获取comment的内容. 获取到comment以后还要去掉头尾到 标签, 可以用gsub!去除.
    1. 这是一段完整的读取图片的代码bnr1base64 = Nokogiri::XML(cols[index_map.with_indifferent_access["顶部图片1"]].xpath("./comment()[1]").text.gsub!("[if gte vml 1]>", "").gsub!("<![endif]", "")).children[0].attributes.filter{|k,v| v.name == "o:gfxdata"}.values.first.value
    2. 读取到base64以后就不知道该怎么解码了，这是base64binary, 不熟悉这个格式，尝试用工具解码为jpeg, png,都不能正常打开。
  2. https://www.docx4java.org/forums/docx-java-f6/how-to-read-an-image-bar-pie-chart-from-docx-file-t2096.html
另外一种办法是在2的基础上, 解压xlsx文件，等到高清图片. 从html获取到图片名称以后，把图片名称映射到解压目录里的图片上. 这个方式也有无法修复的bug，见下面:
1. 解压xlsx文件: 7z e test.xlsx
2. html里面获取的都是image001.png这种png格式。
3. xlsx解压后的图片是用户贴进去的格式，可能是png，也可能是jpeg，文件名格式是image1.png, image22.jpeg
4. html的图片名需要映射(查找)到解压后的文件名上去.
5. 致命Bug: 两者文件名末尾的数字不是一一对应的，会对歪。image055可能会对应到image54.jpeg上.
坑太深了，准备放弃ruby+nokogiri方案，用python搞一下看看.

放弃了纯ruby方案。用了python+ruby混合方案.方案如下:

先用python预处理, 抽取出所有的图片，并且按单元格名称命名, 保存为png格式(如果保存为jpeg格式，会抛透明通道无法保存的错误), 如”A1.png”, “Z1.png”等.
然后用ruby解析excel里面的文字内容，同时通过单元格位置去磁盘上查找指定的文件.

python code:

import openpyxl
from openpyxl_image_loader import SheetImageLoader
import os.path
import click

""""""
# extract_excel_images.py
# 抽取excel里面的图片, 按单元格名称保存图片，保存格式为png
""""""

@click.command()
@click.option('--path', help='Excel的路径')
@click.option('--sheet', default="商品表格", help='从哪个sheet抽取')
def hello(path, sheet):
   extract_image(path)

# '../doc/商品明细/飞天272/漫威56.xlsx'
def extract_image(file_path):
   # loading the Excel File and the sheet
   pxl_doc = openpyxl.load_workbook(file_path)
   sheet = pxl_doc.worksheets[0]

   # calling the image_loader
   image_loader = SheetImageLoader(sheet)

   # get/create image folder
   dir = os.path.dirname(file_path)
   image_dir = os.path.splitext(file_path)[0] # creaet folder with the same name as xlsx file, no extention.
   if not os.path.exists(image_dir):
      os.mkdir(image_dir)
   count = 0    
   # max row
   for row in range(1, sheet.max_row + 1):
      for col in ['O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y']:
            try:
               # get the image (put the cell you need instead of 'A1')
               image = image_loader.get("{}{}".format(col, row)) # "A1" like cell name
               print("Saving {}/{}{}.png".format( image_dir, col, row))
               image.save("{}/{}{}.png".format( image_dir, col, row)) #"xxx/A1.png",
               count += 1
               # can not be saved as jpeg, OSError: cannot write mode RGBA as JPEG
            except ValueError:
               print("图片{}{}不存在".format(col, row))
   print("总共找到{}图片.".format(count))

if __name__ == '__main__':
   hello()

下面一段是excel导出为htm文件后解析htm文件的代码，base64binary解码没有成功，去不到高清原图，没有啥实际意义.

# 读取sheet里面的内容
doc = Nokogiri::HTML(URI.open('./doc/data/bird.fld/sheet001.htm')) # 读取sheet的htm
rows = doc.xpath('//body/table/tr') # 所有的行
row = b[0] # 第一行
cols = row.xpath('./td') # 这一行所有的cell

# 读取base64binary image data
bnr1base64 = Nokogiri::XML(cols[index_map.with_indifferent_access["顶部图片1"]].xpath("./comment()[1]").text.gsub!("[if gte vml 1]>", "").gsub!("<![endif]", "")).children[0].attributes.filter{|k,v| v.name == "o:gfxdata"}.values.first.value

python

python可以用openpyxl和openpyxl-image-loader来实现

# installing the modules
pip3 install openpyxl
pip3 install openpyxl-image-loader

#Importing the modules
import openpyxl
from openpyxl_image_loader import SheetImageLoader

#loading the Excel File and the sheet
pxl_doc = openpyxl.load_workbook('myfile.xlsx')
sheet = pxl_doc['Sheet_name']

#calling the image_loader
image_loader = SheetImageLoader(sheet)

#get the image (put the cell you need instead of 'A1')
image = image_loader.get('A1')

#showing the image
image.show()

#saving the image
image.save('my_path/image_name.jpg')

</div

从excel里面读取图片

python

Recommend

小米高端化战略倍道而进：新一代折叠屏、自动驾驶技术、人形仿生机器人等重磅发布

小米816重磅手机钜惠手机低至1599元还有24期免息

重塑业务，Meta首次发行100亿美元债券

小米新品价格、开售时间汇总！八大新品最低799元

CHN Team AKing IOI22! Congrats!

隐私协议如何利用区块链解决数据安全和隐私问题？

任泽平：不要神话雷军和小米捧得越高可能摔得越惨

开放版权能否打破 NFT 市场死水？

天桥资本创始人：预计未来加密市场将更加活跃

「骨に自信」の石坂さんも驚き！骨密度低下の要因と、話題の対策とは

About Joyk