欢迎访问 生活随笔!

生活随笔

当前位置: 首页 >

构建一致性哈希ring Part2

发布时间:2023/12/29 47 豆豆
生活随笔 收集整理的这篇文章主要介绍了 构建一致性哈希ring Part2 小编觉得挺不错的,现在分享给大家,帮大家做个参考.

******************转载请注明出处!**********

最后更新:2011年8月22日22:47:38


Part1中,已经构建好了一致性哈希ring的原型。

但存在一个问题。100个结点对应着1000个虚结点。结点变动时,虚结点和结点的对应关系会发生变化。当100个结点扩张到1001个时,会发生什么情况?

新增的结点数目会挤兑掉原先的结点数目!原因就在于1000个虚结点是固定的,不变化的。如果再扩容1000个虚结点->更改虚结点和实结点之间的对应关系->调整数据,这似乎又回到ring2_0.py的老套路。

这里做一些改动以更接近真实情况。

首先是vnode,以后改称为partition。因为partition数量很少会变动,所以需要充分估计到系统预期的规模。假如不会超过6000个结点,那么虚结点可以设置为实结点的100倍。这样,当虚结点需要调整的时候,最多只会影响到1%的数据。

在计算机中,数字取2的幂阶会有一些好处。比如除法只需要移相应的位就可以了。所以ring3_0.py中,结点数为65536(2^16),partition数为8388608(2^23)个。


ring3_0.py

from array import array from hashlib import md5 from struct import unpack_from from time import timePARTITION_POWER = 23 PARTITION_SHIFT = 32 - PARTITION_POWER NODE_COUNT = 65536 DATA_ID_COUNT = 100000000begin = time() part2node = array('H') for part in xrange(2 ** PARTITION_POWER):part2node.append(part % NODE_COUNT) node_counts = [0] * NODE_COUNT for data_id in xrange(DATA_ID_COUNT):data_id = str(data_id)part = unpack_from('>I',md5(str(data_id)).digest())[0] >> PARTITION_SHIFTnode_id = part2node[part]node_counts[node_id] += 1 desired_count = DATA_ID_COUNT / NODE_COUNTprint '%d: Desier data ids per node' % desired_count max_count = max(node_counts) over = 100.0 * (max_count - desired_count) / desired_count print '%d Most data ids on one node, %.02f%% over' % (max_count, over) min_count = min(node_counts) under = 100.0 * (desired_count - min_count) / desired_count print '%d Least data ids on one node, %.02f%% under' % (min_count, under) print '%d seconds pass ...' % (time() - begin) 结果

1525: Desier data ids per node 1683 Most data ids on one node, 10.36% over 1360 Least data ids on one node, 10.82% under 234 seconds pass ...

说明:

代码比较简单,part2node是node和partition的对应关系,node_counts记录着每个node中有多少个数据映射进来。

gholt特意提到系统开销:2Byte存储16位对应结点id, 600多万个partition只对应占用12MB的内存。gholt还提到ring3_0.py结果出现10%波动的问题。主要是因为数据空间(10^8相对于partition数(约8*10^6)太小了。他尝试过更大的数据空间空间,比如10^9个数据,对应于8*10^6的partition,耗时6个小时。-_-!


ring3_0.py已经有点接近现实中一致性哈希ring了,ring4_0.py中将加入replica的特性。


ring4_0.py

from array import array from struct import unpack_from from hashlib import md5 from time import timeREPLICAS = 3 PARTITION_POWER = 16 PARTITION_SHIFT = 32 - PARTITION_POWER PARTITION_MAX = 2 ** PARTITION_POWER - 1 NODE_COUNT = 256 DATA_ID_COUNT = 10000000begin = time() part2node = array('H') for part in xrange(2 ** PARTITION_POWER):part2node.append(part % NODE_COUNT) #(3) node_counts = [0] * NODE_COUNT for data_id in xrange(DATA_ID_COUNT):data_id = str(data_id)part = unpack_from('>I',md5(str(data_id)).digest())[0] >> PARTITION_SHIFTnode_ids = [part2node[part]] #(1)node_counts[node_ids[0]] += 1for replica in xrange(1, REPLICAS):while part2node[part] in node_ids: #(2)part += 1if part > PARTITION_MAX:part = 0node_ids.append(part2node[part])node_counts[node_ids[-1]] += 1 desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS print'%d: Desired data ids per node' % desired_count max_count = max(node_counts) over = 100.0 * (max_count - desired_count) / desired_count print'%d: Most data ids on one node, %.02f%% over' % (max_count, over) min_count = min(node_counts) under = 100.0 * (desired_count - min_count) / desired_count print'%d: Least data ids on one node,%.02f%% under' % (min_count, under) print'%d seconds pass ...' % (time() - begin) 结果:

117186: Desired data ids per node 118133: Most data ids on one node, 0.81% over 116093: Least data ids on one node,0.93% under 52 seconds pass ...

说明:

#(1) node_ids记录的是3个replica存放的node id。part2node[part]是根据对应的partition id 找到对应的node id。

#(2)处循环3次,依次为数据的replica安排连续的partition。之后改变相应的记录,node_ids.和node_counts。

ring4_0.py看起来还不错,1%的波动。可是仍然会有两个问题:

1. #(2)处是分配连续的partition。#(3)处初始化时是有规律的。这会在一些情况下使部分数据表现很糟糕。比如这部分数据被映射到node x的partition N上,node x频繁宕机,而总是partition N上的数据需要进行复制。解决也相对容易: part2node初始化时进行随机打乱,partition N和node x不在存在某种联系。

2 数据安全的问题。假如replica所在的partiton都分布在同一个机架上。机架掉电,这会导致所有replica都不可用。因此需要一种机制对故障进行隔离。因此也就引入了zone的概念。


ring4_1.py

from array import array from struct import unpack_from from hashlib import md5 from time import time from random import shuffleREPLICAS = 3 PARTITION_POWER = 16 PARTITION_SHIFT = 32 - 16 PARTITION_MAX = 2 ** PARTITION_POWER - 1 NODE_COUNT = 256 ZONE_COUNT = 16 DATA_ID_COUNT = 10000000begin = time() node2zone = [] while len(node2zone) < NODE_COUNT: #(1)zone = 0while zone < ZONE_COUNT and len(node2zone) < NODE_COUNT:node2zone.append(zone)zone += 1 part2node = array('H') for part in xrange(2 ** PARTITION_POWER):part2node.append(part % NODE_COUNT) shuffle(part2node) #(2) node_counts = [0] * NODE_COUNT zone_counts = [0] * ZONE_COUNT for data_id in xrange(DATA_ID_COUNT):data_id = str(data_id)part = unpack_from('>I',md5(str(data_id)).digest())[0] >> PARTITION_SHIFTnode_ids = [part2node[part]]zones = [node2zone[node_ids[0]]]node_counts[node_ids[0]] += 1zone_counts[zones[0]] += 1for replica in xrange(1, REPLICAS): #(3)while part2node[part] in node_ids and \node2zone[part2node[part]] in zones:part += 1if part > PARTITION_MAX:part = 0node_ids.append(part2node[part])zones.append(node2zone[node_ids[-1]])node_counts[node_ids[-1]] += 1zone_counts[zones[-1]] += 1desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS print '%d Desired data ids per node' % desired_count max_count = max(node_counts) over = 100.0 * (max_count - desired_count) / desired_count print '%d: Most data ids on one node, %.02f%% over' % (max_count, over) min_count = min(node_counts) under = 100.0 * (desired_count - min_count) / desired_count print '%d: Least data ids on one node,%.02f%% under' % (min_count, under)desired_count = DATA_ID_COUNT / ZONE_COUNT * REPLICAS print'%d: Desired data ids per zone' % desired_count max_count = max(zone_counts) over = 100.0 * (max_count - desired_count) / desired_count print'%d: Most data ids in one zone, %.02f %% over' % (max_count, over) min_count = min(zone_counts) under = 100.0 * (desired_count - min_count) / desired_count print'%d: Least data ids in one zone, %.02f %%under' % (min_count, under) print '%d seconds pass ...' % (time() - begin) 结果:

117186 Desired data ids per node 118619: Most data ids on one node, 1.22% over 115810: Least data ids on one node,1.17% under 1875000: Desired data ids per zone 1879143: Most data ids in one zone, 0.22 % over 1871851: Least data ids in one zone, 0.17 %under 69 seconds pass ...

说明:

#(1)zonelist初始化。引入zone,以解决ring4_0.py问题2。

#(2) part2node洗牌,打乱顺序。解决ring4_0.py问题1。花的时间比较多。

#(3) 逐次探查partition位置是否合适,不能在同一个node上,也不能在同一个zone上。

 


到此,ring的基本功能都已经有了:一致性哈希ring,replica,zone。ring4_2.py给出类似图2.1的实现。当然,只是一种实现模型。swift曾经采用过这种模型,但现已废弃。这里权当扩展了解,不感兴趣的请跳到下一Part。Part3将把上述的特性封装成类,加入weight,并简单测试。

 

ring4_2.py体现的是一种anchor思想。即:维护一个node的anchor环。每次data都会查找anchor环找到和匹配自己的存储node位置。

anchor环就是ring4_2.py中的hash2index和index2node。每个数据都在#(2)处查找index,之后到index2node找对应的node。#(1)中的二分查找、hash计算都是比较耗时的。系统开销暂不提,这种实现并不能够均匀分布数据。为使数据分布的更为均匀,每个node都要维护anchor,不断遍历anchor环以查找合适位置。而且,这种情况下replica的管理会非常麻烦。

空间和时间,计算机中的博弈。ring4_1.py采用的空间换时间,简化了对应关系;而ring4_2.py采用了时间换空间,思路简单,但效果不尽人意。


ring4_2.py

from bisect import bisect_left from hashlib import md5 from struct import unpack_from from time import timeREPLICAS = 3 NODE_COUNT = 256 ZONE_COUNT = 16 DATA_ID_COUNT = 10000000 VNODE_COUNT = 100begin = time() node2zone = [] while len(node2zone) < NODE_COUNT:zone = 0while zone < ZONE_COUNT and len(node2zone) < NODE_COUNT:node2zone.append(zone)zone += 1 hash2index = [] index2node = [] for node in xrange(NODE_COUNT):for vnode in xrange(VNODE_COUNT):hsh = unpack_from('>I', md5(str(node)).digest())[0]index = bisect_left(hash2index, hsh)if index > len(hash2index):index = 0hash2index.insert(index, hsh)index2node.insert(index, node) node_counts = [0] * NODE_COUNT zone_counts = [0] * ZONE_COUNT for data_id in xrange(DATA_ID_COUNT): #(1)data_id = str(data_id)hsh = unpack_from('>I', md5(str(data_id)).digest())[0]index = bisect_left(hash2index, hsh) #(2)if index >= len(hash2index):index = 0node_ids = [index2node[index]]zones = [node2zone[node_ids[0]]]node_counts[node_ids[0]] += 1zone_counts[zones[0]] += 1for replica in xrange(1, REPLICAS):while index2node[index] in node_ids and node2zone[index2node[index]] in zones:index += 1if index >= len(hash2index):index = 0node_ids.append(index2node[index])zones.append(node2zone[node_ids[-1]])node_counts[node_ids[-1]] += 1zone_counts[zones[-1]] += 1 desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS print '%d: Desired data ids per node' % desired_count max_count = max(node_counts) over = 100.0 * (max_count - desired_count) / desired_count print '%d: Most data ids on one node, %.02f%% over' % (max_count, over) min_count = min(node_counts) under = 100.0 * (desired_count - min_count) / desired_count print '%d: Least data ids on one node, %.02f%% under' % (min_count, under)desired_count = DATA_ID_COUNT / ZONE_COUNT * REPLICAS print '%d: Desired data ids per zone' % desired_count max_count = max(zone_counts) over = 100.0 * (max_count - desired_count) / desired_count print '%d: Most data ids on one zone, %.02f%% over' % (max_count, over) min_count = min(zone_counts) under = 100.0 * (desired_count - min_count) / desired_count print '%d: Least data ids on one zone, %.02f%% under' % (min_count, under)print '%d seconds pass ...' % (time() - begin) 结果:

117186: Desired data ids per node 351282: Most data ids on one node, 199.76% over 15965: Least data ids on one node, 86.38% under 1875000: Desired data ids per zone 2248496: Most data ids on one zone, 19.92% over 1378013: Least data ids on one zone, 26.51% under 990 seconds pass ...


总结

以上是生活随笔为你收集整理的构建一致性哈希ring Part2的全部内容,希望文章能够帮你解决所遇到的问题。

如果觉得生活随笔网站内容还不错,欢迎将生活随笔推荐给好友。