A fast algorithm for finding k nearest neighbors in vector space: the Annoy index

阿新 • Published: 2019-01-21

Key points to note:

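1. This algorithm trades accuracy for speed. The k nearest neighbors it returns are not guaranteed to be the true global k nearest neighbors (for example, for points near a splitting hyperplane), so it is not suitable when exact k nearest neighbors are required; that still needs an exhaustive search.
2. The number of trees can be set to balance accuracy against speed.
3. Each build of an index over the same data produces a different index, so two runs may return different results.
4. GitHub: https://github.com/spotify/annoy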

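I recently used Annoy at work, so I took some time to read through its code. These are my notes.

Annoy supports three distance metrics: cosine distance, Euclidean distance, and Manhattan distance. The walkthrough below uses the simplest one, Euclidean distance.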

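First, look at the structure of a node.

n_descendants records the number of descendants beneath the node, and children[2] holds the left and right subtrees. v and a are explained in detail later; for now it is enough to know that v[1] holds the vector associated with the node and a is the offset.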
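The fields described above can be mirrored in a small Python sketch. This is only an illustration, not the library's C++ struct: the class name `Node`, the field types, and the choice to store child references as integer indices are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical Python mirror of the node fields named in the text."""
    n_descendants: int = 0  # how many items sit beneath this node
    # Left and right children, stored here as node indices (an assumption).
    children: List[int] = field(default_factory=lambda: [0, 0])
    a: float = 0.0  # offset of the splitting hyperplane
    # The item vector, or the hyperplane normal when the node is a split node.
    v: List[float] = field(default_factory=list)


# An item node holding one vector, and a split node pointing at two children.
leaf = Node(n_descendants=1, v=[0.5, -1.0, 2.0])
split = Node(n_descendants=2, children=[0, 1], a=0.25, v=[1.0, 0.0, 0.0])
print(leaf)
print(split)
```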

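Next, the AnnoyIndex class.

_n_items is the total number of vectors to index, _n_nodes is the total number of nodes, _s is the size in memory of one node, _f is the vector dimension, _nodes holds all nodes, and _roots holds the root node of every tree.

When building a tree, Annoy stops recursing once a region contains at most k nodes. I used to wonder how to tune this k; after reading the code I found that it cannot be tuned, because _K is a fixed value. When a region contains no more than _K nodes, its node no longer stores a vector v, and the space for v is used to store the ids of those nodes instead.

Another oddity is how Annoy allocates space for nodes. For example, with three items whose ids at indexing time are 3, 6, and 10, Annoy allocates 11 node slots, numbered 0 to 10, because storage is sized by the largest id plus one.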
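A minimal sketch of that allocation rule, and of the usual workaround of keeping an external map to dense ids. The helper name `slots_allocated` is hypothetical and does not come from the library.

```python
def slots_allocated(item_ids):
    """Slots needed when storage is indexed directly by id: max(id) + 1."""
    return max(item_ids) + 1 if item_ids else 0


ids = [3, 6, 10]
print(slots_allocated(ids))  # 11, i.e. slots 0..10

# Remapping to dense ids 0..n-1 avoids the unused slots; the caller has to
# keep this map in order to translate results back to the original ids.
dense = {orig: new for new, orig in enumerate(sorted(ids))}
print(dense)  # {3: 0, 6: 1, 10: 2}
print(slots_allocated(list(dense.values())))  # 3
```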

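Next comes tree building. At each step Annoy picks two centroids in the space as split points, much like a k-means step, so that the two subtrees are split as evenly as possible and lookups stay at O(log n). The space is divided by the hyperplane perpendicular to the line through the two points, and each half is then split recursively until a subspace contains at most k points.

Now look at how a splitting hyperplane is created. The inputs are all points in the current space (nodes), the dimension f, the random function random, and the split node n.

best_iv and best_jv are the two selected points. n->v stores the vector along the line connecting them, which is the normal vector of the splitting hyperplane; it is computed as the difference of the two points' vectors. n->a stores the offset of the splitting hyperplane. Taking three dimensions as an example, a plane is written Ax + By + Cz + D = 0, and n->a stores D. It is computed as follows. The normal vector is already fixed, and the plane passes through the midpoint of the segment between best_iv and best_jv. Define the midpoint as m = ((best_iv[0] + best_jv[0])/2, (best_iv[1] + best_jv[1])/2, (best_iv[2] + best_jv[2])/2). Substituting it into the plane equation gives A * m[0] + B * m[1] + C * m[2] + D = 0, so D = -(A * m[0] + B * m[1] + C * m[2]).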
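The computation of the normal and the offset can be sketched in a few lines of Python. The function name `create_split` is hypothetical. Normalizing the normal to unit length is an assumption here, made so that the result matches the later remark that the stored hyperplane vector has length 1.

```python
import math


def create_split(p, q):
    """Return (normal, offset) of the hyperplane equidistant from p and q."""
    normal = [pi - qi for pi, qi in zip(p, q)]
    length = math.sqrt(sum(x * x for x in normal))
    normal = [x / length for x in normal]  # unit length (assumption)
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # The plane passes through the midpoint, so normal . mid + offset = 0.
    offset = -sum(n * m for n, m in zip(normal, mid))
    return normal, offset


p, q = [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]
normal, offset = create_split(p, q)
print(normal, offset)  # [-1.0, 0.0, 0.0] 1.0, i.e. the plane x = 1

# The midpoint (1, 0, 0) satisfies the plane equation.
mid = [1.0, 0.0, 0.0]
print(sum(n * m for n, m in zip(normal, mid)) + offset)  # 0.0
```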

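Next, how the two points are chosen, i.e. two_means.

To keep lookups at O(log n), each split should yield two subtrees that are as balanced as possible, so the goal is to find two centroids in the space. The procedure resembles k-means: two points are picked at random to start, and in each iteration a random point is drawn, assigned to whichever of the two centroids (subtrees) it belongs to, and that centroid's coordinates are updated.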
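The procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the iteration count of 200, the use of squared Euclidean distance, and the weighting of each distance by the centroid's current count (which nudges the two groups toward similar sizes) are choices made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=200, rng=None):
    """Return two running centroids found by a randomized k-means-like pass."""
    rng = rng or random.Random()
    i, j = rng.sample(range(len(points)), 2)  # two distinct starting points
    p, q = list(points[i]), list(points[j])
    pc = qc = 1  # number of points merged into each centroid so far
    for _ in range(iterations):
        x = points[rng.randrange(len(points))]
        # Assign x to the nearer centroid, weighting by the current count.
        if pc * sq_dist(p, x) < qc * sq_dist(q, x):
            p = [(pv * pc + xv) / (pc + 1) for pv, xv in zip(p, x)]
            pc += 1
        else:
            q = [(qv * qc + xv) / (qc + 1) for qv, xv in zip(q, x)]
            qc += 1
    return p, q


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
p, q = two_means(points, rng=random.Random(0))
print(p, q)  # one centroid near each group of four points
```

The two returned centroids play the role of best_iv and best_jv when the splitting hyperplane is built.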

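Once the trees are built, the next step is search: finding the top-k nearest neighbors of a given point. The basic idea is to start at the root and, at each tree node, compare the point's vector against the node's splitting hyperplane to decide which subtree to descend into.

This still has a problem: the nearest neighbor is not necessarily in the same leaf node as the query point.

There are two remedies. The first is to build multiple trees. The second is to let the query follow more than one path when traversing a tree. These correspond to the two parameters treenum and searchnum (n_trees and search_k in the Python API).

During traversal a priority queue drives the search and a candidate set is maintained; the results from all trees are deduplicated, and finally distances are computed for the candidates and the top k are returned.

First, two small functions. They are used during tree traversal to compute the distance from a point to the hyperplane and to decide which subtree the point falls in. The distance from a point (x, y, z) to the plane Ax + By + Cz + D = 0 is |Ax + By + Cz + D| / sqrt(A^2 + B^2 + C^2).

Because the hyperplane's normal vector in every tree node has already been normalized to length 1, only the numerator needs to be computed.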
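With a unit-length normal the two helpers reduce to a dot product plus the offset. A minimal sketch follows; the names `margin` and `side` are chosen for this example, and the handling of points lying exactly on the plane is left out.

```python
def margin(normal, offset, x):
    """Signed distance from x to the plane normal . x + offset = 0.

    Assumes `normal` has unit length, so no division is needed.
    """
    return offset + sum(n * xi for n, xi in zip(normal, x))


def side(normal, offset, x):
    """True if x lies on the positive side of the plane, else False."""
    return margin(normal, offset, x) > 0


normal, offset = [1.0, 0.0, 0.0], -2.0  # the plane x = 2
print(margin(normal, offset, [5.0, 1.0, 1.0]))  # 3.0
print(side(normal, offset, [0.0, 0.0, 0.0]))    # False
```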

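Next, the search function.

nns is the candidate set, and search_k is the searchnum mentioned earlier.

First, the root nodes of all trees are pushed onto the priority queue. Each step pops the head of the queue: if it is a leaf node, all points stored in that node are added to nns; if it is an internal node, both of its subtrees are pushed onto the queue. This loop continues until nns holds more than searchnum entries. Finally, the ids in nns are deduplicated, distances are computed, and the results are returned.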
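The loop described above can be sketched over a toy forest. Everything in this example is illustrative: the tree encoding (a list of ids for a leaf, a tuple `(normal, offset, left, right)` for a split node), the priority rule (the smallest margin seen on the path to a node), and the stopping test are assumptions made for the sketch, not a copy of the library's code.

```python
import heapq
import itertools
import math


def margin(normal, offset, x):
    """Signed distance to a hyperplane with a unit-length normal."""
    return offset + sum(n * xi for n, xi in zip(normal, x))


def search(roots, vectors, query, n, search_k):
    """Approximate n nearest neighbors of `query` over several trees."""
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    heap = [(-math.inf, next(tie), root) for root in roots]
    heapq.heapify(heap)
    nns = []
    while heap and len(nns) < search_k:
        neg_d, _, node = heapq.heappop(heap)
        d = -neg_d
        if isinstance(node, list):  # leaf: collect every id it stores
            nns.extend(node)
        else:  # split node: queue both children, nearer side first
            normal, offset, left, right = node
            m = margin(normal, offset, query)
            heapq.heappush(heap, (-min(d, m), next(tie), right))
            heapq.heappush(heap, (-min(d, -m), next(tie), left))
    candidates = set(nns)  # deduplicate ids found in several trees
    ranked = sorted(candidates, key=lambda i: (math.dist(vectors[i], query), i))
    return ranked[:n]


vectors = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (4.0, 0.0), 3: (5.0, 1.0)}
tree_x = ((1.0, 0.0), -2.5, [0, 1], [2, 3])  # splits on x = 2.5
tree_y = ((0.0, 1.0), -0.5, [0, 2], [1, 3])  # splits on y = 0.5
query = (2.4, 0.9)

print(search([tree_x, tree_y], vectors, query, n=2, search_k=2))  # [1, 3]
print(search([tree_x, tree_y], vectors, query, n=2, search_k=6))  # [1, 2]
```

In this toy case the exact two nearest neighbors of the query are items 1 and 2. The small search_k stops after one leaf and misses item 2, while the larger one visits enough leaves to recover it, which is the accuracy versus speed trade-off the parameter controls.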

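Finally, one more problem we ran into. Our similarity measure is the vector inner product, and nearest neighbors under inner product cannot be found with LSH-style methods directly. To work around this, we converted inner-product distance into cosine distance. When building the index, every dimension of every vector is divided by c, where c is the largest norm among all vectors, and every vector gets one extra dimension set to the square root of 1 minus the sum of squares of the other (scaled) dimensions. Every indexed vector then has unit length, which removes the indexed vector's norm from the denominator of the cosine, and the query vector's norm does not affect the ranking.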
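A small sketch of this transformation, checking on a toy example that ranking by cosine on the augmented vectors matches ranking by inner product on the originals. The function names are hypothetical, and padding the query with a zero in the extra dimension is an assumption; the text above does not say how the query is extended.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def augment_items(items):
    """Scale by the largest norm c and append sqrt(1 - |v / c|^2)."""
    c = max(norm(v) for v in items)
    out = []
    for v in items:
        scaled = [x / c for x in v]
        extra = math.sqrt(max(0.0, 1.0 - dot(scaled, scaled)))
        out.append(scaled + [extra])
    return out


def augment_query(q):
    """Pad the query with a zero (assumption) so dimensions match."""
    return list(q) + [0.0]


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


items = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
query = [1.0, 1.0]
aug = augment_items(items)
aq = augment_query(query)

by_dot = sorted(range(len(items)), key=lambda i: -dot(items[i], query))
by_cos = sorted(range(len(items)), key=lambda i: -cosine(aug[i], aq))
print(by_dot, by_cos)  # [0, 2, 1] [0, 2, 1]
```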

References:

http://blog.csdn.net/hero_fantao/article/details/70245387

Slides by the author of Annoy

Annoy


Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Install

To install, simply do sudo pip install annoy to pull down the latest version from PyPI.

For the C++ version, just clone the repo and #include "annoylib.h".

Background

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries (see below), but there is actually another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share an index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing about Annoy is that it tries to minimize memory footprint, so the indexes are quite small.

Why is this useful? If you want to find nearest neighbors and you have many CPUs, you only need the RAM to fit the index once. You can also pass around and distribute static files to use in production environments, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.

We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.

Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Summary of features

Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
Small memory usage
Lets you share memory between multiple processes
Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
Native Python support, tested with 2.6, 2.7, 3.3, 3.4, 3.5

Python code example

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # f is the length of the item vectors that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

Right now it only accepts integers as identifiers for items. Note that it will allocate memory for max(id)+1 items because it assumes your items are numbered 0 … n-1. If you need other id's, you will have to keep track of a map yourself.

Full Python API

AnnoyIndex(f, metric='angular') returns a new index that's read-write and stores vector of f dimensions. Metric can be "angular", "euclidean", "manhattan", or "hamming".
a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
a.build(n_trees) builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added.
a.save(fn) saves the index to disk.
a.load(fn) loads (mmaps) an index from disk.
a.unload() unloads.
a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n closest items. During the query it will inspect up to search_k nodes which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between better accuracy and speed. If you set include_distances to True, it will return a 2 element tuple with two lists in it: the second one containing all corresponding distances.
a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) same but query by vector v.
a.get_item_vector(i) returns the vector for item i that was previously added.
a.get_distance(i, j) returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.
a.get_n_items() returns the number of items in the index.

Notes:

There's no bounds checking performed on the values so be careful.
Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u,v is equal to sqrt(2(1-cos(u,v)))
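The identity behind the angular distance can be checked numerically with the standard library alone. This sketch does not call Annoy; the helper names are made up for the example.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(u, v):
    """Cosine similarity of u and v."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))), clamped against tiny negative rounding."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine(u, v))))


u, v = [1.0, 2.0, 3.0], [-2.0, 0.5, 1.0]
lhs = math.dist(normalize(u), normalize(v))  # Euclidean distance after normalizing
rhs = angular_distance(u, v)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```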

The C++ API is very similar: just #include "annoylib.h" to get access to it.

Tradeoffs

There are just two parameters you can use to tune Annoy: the number of trees n_trees and the number of nodes to inspect during searching search_k.

n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.
search_k is provided in runtime and affects the search performance. A larger value will give more accurate results, but will take longer time to return.

If search_k is not provided, it will default to n * n_trees where n is the number of approximate nearest neighbors. Otherwise, search_k and n_trees are roughly independent, i.e. the value of n_trees will not affect search time if search_k is held constant and vice versa. Basically it's recommended to set n_trees as large as possible given the amount of memory you can afford, and it's recommended to set search_k as large as possible given the time constraints you have for the queries.

How does it work

Using random projections and by building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.

We do this k times so that we get a forest of trees. k has to be tuned to your need, by looking at what tradeoff you have between precision and performance.

Hamming distance (contributed by Martin Aumüller) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.

More info

Andy Sloane provides a Java version of Annoy although currently limited to cosine and read-only.
Pishen Tsai provides a Scala wrapper of Annoy which uses JNA to call the C++ library of Annoy.
During part of Spotify Hack Week 2016 (and a bit afterward), Jim Kang wrote Node bindings for Annoy.
Radim Řehůřek's blog posts comparing Annoy to a couple of other similar Python libraries: Intro, Contestants, Querying
ann-benchmarks is a benchmark for several approximate nearest neighbor libraries. Annoy seems to be fairly competitive, especially at higher precisions.

Source code

It's all written in C++ with a handful of ugly optimizations for performance and memory usage. You have been warned :)

The code should support Windows, thanks to Qiang Kou and Timothy Riley.

To run the tests, execute python setup.py nosetests. The test suite includes a big real world dataset that is downloaded from the internet, so it will take a few minutes to execute.

Discuss

Feel free to post any questions or comments to the annoy-user group. I'm @fulhack on Twitter.
