![](/style/images/good.png)
![](/style/images/bad.png)
从 10 亿位数字里查找指定的数字,怎样才能快一些?
source link: https://www.v2ex.com/t/814478
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
从网上下一个 950M 的 txt 文件,里面保存的是圆周率小数点后的 10 亿位数字。想使用 python 查找某个指定的 6 位或 8 位数字在其中的位置,现在直接读文件后用 str.find()查找实在太慢了,请教各位有什么比较快的办法吗?
文件下载地址: https://stuff.mit.edu/afs/sipb/contrib/pi/pi-billion.txt
Junzhou 14 小时 23 分钟前
cclin 14 小时 20 分钟前 via Android
SmiteChow 14 小时 17 分钟前
djFFFFF 13 小时 22 分钟前
@hidemyself cpython 我印象里 str.find() 是用的 BMH 算法?反正虽然这个题面是个标准的 KMP 算法的场景,现实生产环境谁用谁是傻子。
yianing 11 小时 43 分钟前
![stats]( https://imgur.com/GjcslkB)
lesismal 8 小时 25 分钟前
1. converter.py
```python
# -*- coding:utf-8 -*-
#!/usr/bin/python3
import datetime
class PIConverter:
def __init__(self, minNum=100000, maxNum=99999999):
self.minNum = minNum
self.maxNum = maxNum
self.positions = [0]*(self.maxNum+1-self.minNum)
def convert(self, srcFile, dstFile):
fsrc = open(srcFile,'r')
fsrc.read(2)
try:
lastStr = ""
readSize = 1024*8
currPos = 0
readed = 0
starttime = datetime.datetime.now()
offset = len(str(self.minNum)) - 1
while True:
s = fsrc.read(readSize)
s = lastStr + s # 这里可以再优化下
currPos -= len(lastStr)
for i in range(len(s)-8):
strLen = len(str(self.minNum))
while strLen <= len(str(self.maxNum)):
subs = s[i:i+strLen]
strLen += 1
num = int(subs)
index = num - self.minNum
if self.positions[index] == 0:
self.positions[index] = currPos + i
if len(s) == 0:
break
lastStr = s[len(s)-5:]
currPos += readSize
readed += readSize
if readed % (1024*1024*8) == 0:
print("total read: {}, time used: {}s".format(readed, (datetime.datetime.now() - starttime).seconds))
print("total read: {}, time used: {}s".format(readed, (datetime.datetime.now() - starttime).seconds))
print("done")
try:
fdst = open(dstFile,'rw+')
for index in range(self.positions):
fdst.write(str(index)+"\n")
finally:
fdst.close()
finally:
fsrc.close()
def find(self, n):
if n < self.minNum or n > 99999999:
return -1
return self.positions[n - self.minNum]
piConverter = PIConverter()
# 把已经统计出来的生成更小的文件
piConverter.convert("./pi-billion.txt", "./pi-position.txt")
# converter 初始化太慢了,所以最好还是先 piConverter.convert 把已经统计出来的生成更小的文件,finder.py 用该文件初始化和做查找
# print("141592:", piConverter.find(141592))
# print("415926:", piConverter.find(415926))
```
2. finder.py
```python
# -*- coding:utf-8 -*-
#!/usr/bin/python3
class PIFinder:
def __init__(self, fname, minNum=100000, maxNum=99999999):
self.minNum = minNum
self.maxNum = maxNum
self.positions = [0]*(self.maxNum+1-self.minNum)
f = open(fname,'r')
try:
i = 0
for line in f:
num = int(line)
self.positions[i] = num
finally:
f.close()
def find(self, n):
if n < self.minNum or n > 99999999:
return -1
return self.positions[n - self.minNum]
piFinder = PIFinder("./pi-position.txt")
print("141592:", piFinder.find(141592))
print("415926:", piFinder.find(415926))
```
c0xt30a 6 小时 8 分钟前
1. 将 0-9 映射为 前 10 个素数 [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
2. 用一个定长为 6/8 的滑动窗口遍历这个 pi 的字符串,每次增长的时候,当前的 hash 先除以最后一位数字对应的素数再乘以新增数字对应的素数,可以得到最新的 hash 数值
3. 如果当前 hash 数值与要寻找的数字的 hash 相等,则停下来进一步比对字符串
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK