I have recently been working on Entity Linking, so I read these two papers from Yahoo along with their open-source [FEL](https://github.com/yahoo/FEL).

Fast and Space-Efficient Entity Linking in Queries

ABSTRACT

Entity Linking generally has to finish before downstream retrieval starts, typically within milliseconds.
In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base.
The algorithm's speed and efficiency come mainly from the following three points:

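Ignoring the dependencies between different entities
Using hashing and compression techniques to reduce the memory footprint
To give the algorithm contextual knowledge without sacrificing speed, the distance between the distributional semantics of the query words and of the entities is factored into the model.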

INTRODUCTION

Linking free text to entities typically comprises three steps:

Identifying candidate mentions, i.e., which part(s) of the text to link
Identifying candidate entities for each mention
Disambiguating the candidate entities based on some notion of context and coherence

MODELING ENTITY LINKING

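The model connects entities to their aliases through anchor text or user click behavior (this paper only considers anchor text within Wikipedia and clicks on Wikipedia results in web search results).
Problems to solve: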

Automatically segmenting the query
Selecting the correct entity for each segment

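The Fast Entity Linker (FEL) solves this by computing a probabilistic score for each segment-entity pair and then optimizing the score of the whole query. It uses no supervision, letting the model and the data operate in a parameter-free way; an additional layer that exploits annotated training data can also be added to improve the model's performance.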
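To make the "automatic segmentation" step concrete, here is a minimal sketch that enumerates every contiguous segmentation of a query. The function name `segmentations` and the example query are hypothetical and are not taken from the paper or from the FEL code base.

```python
from itertools import combinations


def segmentations(tokens):
    """Yield every way to split `tokens` into contiguous segments."""
    n = len(tokens)
    if n == 0:
        return
    # Each of the n-1 gaps between tokens is either a cut point or not,
    # so a query with n tokens has 2**(n-1) segmentations.
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [" ".join(tokens[i:j]) for i, j in zip(bounds, bounds[1:])]


query = "new york times"  # hypothetical example query
for seg in segmentations(query.split()):
    print(seg)
```

The number of segmentations grows exponentially with the query length, which is one reason an actual linker would likely avoid materializing them all.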

Fast Entity Linker

Notation:

$S\times E$: an event space where $S$ is the set of all sequences and $E$ the set of all entities known to the system.
$s$: a sequence of terms
$\hat{s}$: a sequence of segments $s$, i.e., a segmentation
$\hat{e}$: a sequence of entities $e$, i.e., a set of entities
$a_s$: indicates if $s$ is an alias
$a_{s,e}$: indicates if $s$ is an alias pointing (linking/clicked) to $e$
$c$: indicates which collection acts as the source of information, the query log or Wikipedia ($c_q$ or $c_w$)
$n(s, c)$: the count of $s$ in $c$
$n(e, c)$: the count of $e$ in $c$

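Let $q$ be the input query with tokens $t_1,…,t_k$, and let $S_q$ be the set of all possible segmentations of those tokens. The algorithm returns the set of entities $\hat{e}$ that maximizes
$\arg\max_{\hat{e}\in E}\log P(\hat{e}|q)=\arg\max_{\hat{e}\in E,\hat{s}\in S_q}\sum_{e\in \hat{e},s\in \hat{s}}\log P(e|s)$

In other words: given a query $q$, first obtain the set $S_q$ of all possible segmentations; within each segmentation, take the most likely entity for every segment, and pick the segmentation whose overall probability is highest.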
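Because the objective is a sum over segments, it can be optimized without enumerating every segmentation. The sketch below uses a standard dynamic program over query prefixes; it is an illustration and not necessarily the procedure used in FEL. The table `P_E_GIVEN_S`, the entity names, and the penalty `UNLINKED_LOG_P` for segments without any candidate are all invented assumptions.

```python
import math

# Hypothetical P(e|s) values, invented purely for illustration.
P_E_GIVEN_S = {
    "new york": {"New_York_City": 0.7, "New_York_(state)": 0.3},
    "new york times": {"The_New_York_Times": 0.9},
    "times": {"The_Times": 0.4},
    "york": {"York": 0.6},
}
# Assumed penalty for a segment with no candidate entity (not from the paper).
UNLINKED_LOG_P = math.log(1e-3)


def best_entity(segment):
    """Return (log P(e|s), e) for the most likely entity, or the penalty."""
    candidates = P_E_GIVEN_S.get(segment)
    if not candidates:
        return UNLINKED_LOG_P, None
    entity, prob = max(candidates.items(), key=lambda kv: kv[1])
    return math.log(prob), entity


def link(tokens):
    """Dynamic program over prefixes: best[i] covers tokens[:i]."""
    n = len(tokens)
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for i in range(1, n + 1):
        for j in range(i):
            # Try tokens[j:i] as the last segment of the prefix tokens[:i].
            segment = " ".join(tokens[j:i])
            score, entity = best_entity(segment)
            total = best[j][0] + score
            if total > best[i][0]:
                best[i] = (total, best[j][1] + [(segment, entity)])
    return best[n]


score, links = link("new york times".split())
print(round(score, 3), links)
```

With these made-up numbers the single segment "new york times" wins, since splitting the query adds further negative log-probability terms. The independence assumption is what lets each segment be scored on its own.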
The formula above assumes that entities are mutually independent. Each individual entity/segment probability is then estimated as:
$P(e|s)=\sum_{c\in \{c_q,c_w\}}P(c|s)P(e|c,s)=\sum_{c\in \{c_q,c_w\}}P(c|s)[P(a_s=0|c,s)P(e|a_s=0,c,s)+P(a_s=1|c,s)P(e|a_s=1,c,s)]$

In other words: the probability that segment $s$ maps to entity $e$ combines two sources of information, the query log $c_q$ and Wikipedia $c_w$; for each source there are two cases: either $s$ is an alias ($a_s=1$) or it is not ($a_s=0$).
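Clearly $P(e|a_s=0, c, s)=0$: a segment that is not an alias cannot point to any entity, so only the $a_s=1$ case contributes.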
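Since the $a_s=0$ branch vanishes, the estimate reduces to $P(e|s)=\sum_{c}P(c|s)P(a_s=1|c,s)P(e|a_s=1,c,s)$. The following is a small numeric sketch of that mixture over the two sources; every probability in it is a hypothetical number, and the function and key names (`p_entity_given_segment`, `p_c`, `p_alias`, `p_e_given_alias`) are illustrative only.

```python
# All probabilities below are hypothetical numbers chosen for illustration.
def p_entity_given_segment(sources):
    """P(e|s) = sum_c P(c|s) * P(a_s=1|c,s) * P(e|a_s=1,c,s).

    The a_s=0 branch is omitted because P(e|a_s=0,c,s) = 0.
    """
    return sum(
        src["p_c"] * src["p_alias"] * src["p_e_given_alias"]
        for src in sources.values()
    )


sources = {
    "c_q": {"p_c": 0.6, "p_alias": 0.5, "p_e_given_alias": 0.8},  # query log
    "c_w": {"p_c": 0.4, "p_alias": 0.9, "p_e_given_alias": 0.7},  # Wikipedia
}
p = p_entity_given_segment(sources)
print(f"P(e|s) = {p:.3f}")
```

Here the query log contributes 0.6 × 0.5 × 0.8 = 0.24 and Wikipedia contributes 0.4 × 0.9 × 0.7 = 0.252, giving 0.492 in total.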
