The carnage caused by poolboy's max_overflow
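Problem
This was a production issue. A service node was running at a fairly low load (2,000 database accesses per second), configured with 100 worker processes and a max_overflow of 100. Its performance suddenly dropped, and it could only handle 1,500 database accesses per second. Request latency rose from a few ms to several hundred ms, and then gradually recovered.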
Cause
I gradually narrowed it down to checkout in the poolboy process pool used for mongodb:
check out
handle_call({checkout, CRef, Block}, {FromPid, _} = From, State) ->
    #state{supervisor = Sup,
           workers = Workers,
           monitors = Monitors,
           overflow = Overflow,
           max_overflow = MaxOverflow} = State,
    case Workers of
        [Pid | Left] ->
            MRef = erlang:monitor(process, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{workers = Left}};
        [] when MaxOverflow > 0, Overflow < MaxOverflow ->
            {Pid, MRef} = new_worker(Sup, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{overflow = Overflow + 1}};
        [] when Block =:= false ->
            {reply, full, State};
        [] ->
            MRef = erlang:monitor(process, FromPid),
            Waiting = queue:in({From, CRef, MRef}, State#state.waiting),
            {noreply, State#state{waiting = Waiting}}
    end;
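As you can see, when max_overflow is not 0, a momentary overload creates new workers. Each of those workers connects to mongodb, which takes 1-2 ms, and because new_worker is called inside handle_call, that creation cost blocks the master process.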
check in
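On checkin, the overflow worker is destroyed again, so connections are constantly being created and destroyed, and all of that work is stuck in the master process. Every request ends up blocked behind the master process's connection setup and teardown, and the qps collapses.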
handle_checkin(Pid, State) ->
    #state{supervisor = Sup,
           waiting = Waiting,
           monitors = Monitors,
           overflow = Overflow,
           strategy = Strategy} = State,
    case queue:out(Waiting) of
        {{value, {From, CRef, MRef}}, Left} ->
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            gen_server:reply(From, Pid),
            State#state{waiting = Left};
        {empty, Empty} when Overflow > 0 ->
            ok = dismiss_worker(Sup, Pid),
            State#state{waiting = Empty, overflow = Overflow - 1};
        {empty, Empty} ->
            Workers = case Strategy of
                lifo -> [Pid | State#state.workers];
                fifo -> State#state.workers ++ [Pid]
            end,
            State#state{workers = Workers, waiting = Empty, overflow = 0}
    end.
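To make the blocking effect concrete, here is a minimal sketch that is not poolboy code. It assumes a toy gen_server whose handle_call sleeps for 2 ms to stand in for opening a mongodb connection. The module name blocking_master_demo and the CONNECT_MS value are hypothetical and chosen only for illustration. Because a gen_server handles one call at a time, N concurrent callers wait roughly N × 2 ms in total, which is the same serialization the master process suffers when every checkout creates a worker.

```erlang
-module(blocking_master_demo).
-behaviour(gen_server).

-export([run/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Hypothetical cost of opening one connection, in milliseconds.
-define(CONNECT_MS, 2).

%% Start the toy server, fire N concurrent "checkouts", and return
%% the wall-clock time in milliseconds until all of them were served.
run(N) ->
    {ok, Server} = gen_server:start(?MODULE, [], []),
    Parent = self(),
    T0 = erlang:monotonic_time(millisecond),
    Pids = [spawn(fun() ->
                      ok = gen_server:call(Server, checkout, infinity),
                      Parent ! {done, self()}
                  end) || _ <- lists:seq(1, N)],
    %% Wait for every caller to report back.
    [receive {done, P} -> ok end || P <- Pids],
    Elapsed = erlang:monotonic_time(millisecond) - T0,
    ok = gen_server:stop(Server),
    Elapsed.

init([]) ->
    {ok, no_state}.

%% The slow work happens inside the server process itself, so all
%% other callers queue up behind it in the server's mailbox.
handle_call(checkout, _From, State) ->
    timer:sleep(?CONNECT_MS),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Calling blocking_master_demo:run(100) from the shell should take at least about 200 ms even though all 100 callers start at once. At 1-2 ms per connection, a single process that does this work inline cannot serve more than about 500 to 1,000 such checkouts per second.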
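Conclusion
Don't use poolboy's max_overflow. If creating or destroying the child processes has a non-trivial cost, it can easily block the poolboy master process, and frequent worker creation and destruction then leads to an avalanche.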
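As a sketch of what that looks like in configuration, the pool below is fixed-size with overflow disabled. The pool name mongo_pool, the worker module my_mongo_worker, and the size of 200 are assumptions for illustration, not values from the incident. The size should be chosen for your own peak load.

```erlang
%% Pool arguments for poolboy:child_spec/3.
PoolArgs = [
    {name, {local, mongo_pool}},        %% hypothetical pool name
    {worker_module, my_mongo_worker},   %% hypothetical worker module
    {size, 200},                        %% assumed: fixed pool sized for peak load
    {max_overflow, 0}                   %% never create workers on checkout
],
%% The third argument is passed to the worker's start_link/1.
ChildSpec = poolboy:child_spec(mongo_pool, PoolArgs, []).
```

With max_overflow set to 0, all workers are started when the pool starts, and checkout either hands out an idle worker or queues the caller, so no connection is opened or closed inside the master process during normal operation.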
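Every bug looks obvious in hindsight, yet tracking it down takes real effort. The monitoring data isn't suitable for posting on a personal blog, so much of the reasoning has inevitably been left out. I hope this conclusion is useful to you.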