Ceph BlueStore RocksDB Analysis
For BlueStore, Ceph's new storage engine, RocksDB matters a great deal: it stores all of BlueStore's metadata. Understanding it helps in understanding the BlueStore implementation and in analyzing problems encountered later.
BlueStore Architecture
BlueStore's architecture is shown in the diagram below, still the most widely used one:
As the diagram shows, among BlueStore's key components it is RocksDB that handles BlueStore's metadata. Setting the other components aside, this article describes in detail what RocksDB stores here and how that is implemented.
BlueStore Structure Definition
BlueStore's definition in Ceph, with its main data members, is:
```cpp
class BlueStore : public ObjectStore,
                  public md_config_obs_t {
  ...
private:
  BlueFS *bluefs = nullptr;
  unsigned bluefs_shared_bdev = 0;  ///< which bluefs bdev we are sharing
  bool bluefs_single_shared_device = true;
  utime_t bluefs_last_balance;

  KeyValueDB *db = nullptr;
  BlockDevice *bdev = nullptr;
  std::string freelist_type;
  FreelistManager *fm = nullptr;
  Allocator *alloc = nullptr;
  uuid_d fsid;
  int path_fd = -1;   ///< open handle to $path
  int fsid_fd = -1;   ///< open handle (locked) to $path/fsid
  bool mounted = false;

  vector<Cache*> cache_shards;

  std::mutex osr_lock;              ///< protect osr_set
  std::set<OpSequencerRef> osr_set; ///< set of all OpSequencers
  ...
};
```
The key data members are as follows:
1) BlueFS
Definition: BlueFS *bluefs = nullptr;
A custom FS that backs RocksDB; it implements only the API interfaces that RocksEnv needs;
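RocksDB accepts a user-supplied rocksdb::Env for all of its file I/O, and that is the hook BlueFS plugs into (via Ceph's BlueRocksEnv). The standalone sketch below shows the mechanism using only stock RocksDB API; LoggingEnv and the /tmp path are illustrative stand-ins, not Ceph code, and the exact NewWritableFile signature can vary across RocksDB versions:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical stand-in for Ceph's BlueRocksEnv: forwards everything to the
// base Env, overriding only the calls we want to intercept.
class LoggingEnv : public rocksdb::EnvWrapper {
 public:
  explicit LoggingEnv(rocksdb::Env *base) : rocksdb::EnvWrapper(base) {}
  rocksdb::Status NewWritableFile(const std::string &fname,
                                  std::unique_ptr<rocksdb::WritableFile> *result,
                                  const rocksdb::EnvOptions &opts) override {
    // BlueRocksEnv would route this to a BlueFS file instead of logging.
    std::cout << "create file: " << fname << std::endl;
    return rocksdb::EnvWrapper::NewWritableFile(fname, result, opts);
  }
};

int main() {
  LoggingEnv env(rocksdb::Env::Default());
  rocksdb::Options options;
  options.create_if_missing = true;
  options.env = &env;  // RocksDB now performs all file I/O through our Env
  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/env_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```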
The code initializes it in _open_db():
```cpp
int BlueStore::_open_db(bool create)
{
  rocksdb::Env *env = NULL;
  if (do_bluefs) {
    bluefs = new BlueFS(cct);
  }
}
```
2) RocksDB
Definition: KeyValueDB *db = nullptr;
BlueStore's metadata and the object OMaps are all stored through this DB, and RocksDB is what is used here. It too is initialized in _open_db():
```cpp
int BlueStore::_open_db(bool create)
{
  // determine the kv backend
  string kv_backend;
  if (create) {
    kv_backend = cct->_conf->bluestore_kvbackend;
  } else {
    r = read_meta("kv_backend", &kv_backend);
  }

  // mkfs also ends up here; on create, decide from config whether to use bluefs
  if (create) {
    do_bluefs = cct->_conf->bluestore_bluefs;
  } else {
    string s;
    r = read_meta("bluefs", &s);
  }

  rocksdb::Env *env = NULL;

  // set up bluefs
  if (do_bluefs) {
    bluefs = new BlueFS(cct);

    bfn = path + "/block.db";
    if (::stat(bfn.c_str(), &st) == 0) {
      r = bluefs->add_block_device(BlueFS::BDEV_DB, bfn);
      if (bluefs->bdev_support_label(BlueFS::BDEV_DB)) {
        r = _check_or_set_bdev_label(
          bfn,
          bluefs->get_block_device_size(BlueFS::BDEV_DB),
          "bluefs db", create);
      }
      if (create) {
        bluefs->add_block_extent(
          BlueFS::BDEV_DB,
          SUPER_RESERVED,
          bluefs->get_block_device_size(BlueFS::BDEV_DB) - SUPER_RESERVED);
      }
      bluefs_shared_bdev = BlueFS::BDEV_SLOW;
      bluefs_single_shared_device = false;
    } else {
      if (::lstat(bfn.c_str(), &st) == -1) {
        bluefs_shared_bdev = BlueFS::BDEV_DB;
      }
    }

    // shared device
    bfn = path + "/block";
    r = bluefs->add_block_device(bluefs_shared_bdev, bfn);

    bfn = path + "/block.wal";
    if (::stat(bfn.c_str(), &st) == 0) {
      r = bluefs->add_block_device(BlueFS::BDEV_WAL, bfn);
      if (bluefs->bdev_support_label(BlueFS::BDEV_WAL)) {
        r = _check_or_set_bdev_label(
          bfn,
          bluefs->get_block_device_size(BlueFS::BDEV_WAL),
          "bluefs wal", create);
      }
      if (create) {
        bluefs->add_block_extent(
          BlueFS::BDEV_WAL, BDEV_LABEL_BLOCK_SIZE,
          bluefs->get_block_device_size(BlueFS::BDEV_WAL) -
            BDEV_LABEL_BLOCK_SIZE);
      }
      cct->_conf->set_val("rocksdb_separate_wal_dir", "true");
      bluefs_single_shared_device = false;
    }
  }

  // create the RocksDB instance
  db = KeyValueDB::create(cct, kv_backend, fn, static_cast<void*>(env));
  FreelistManager::setup_merge_operators(db);
  db->set_merge_operator(PREFIX_STAT, merge_op);
  db->set_cache_size(cache_size * cache_kv_ratio);

  if (kv_backend == "rocksdb")
    options = cct->_conf->bluestore_rocksdb_options;
  db->init(options);

  if (create)
    r = db->create_and_open(err);
  else
    r = db->open(err);
}
```
3) BlockDevice
Definition: BlockDevice *bdev = nullptr;
The block devices that hold BlueStore's data / db / wal at the bottom layer; the following kinds exist:
- KernelDevice
- NVMEDevice
- PMEMDevice
The code initializes it as follows:
```cpp
int BlueStore::_open_bdev(bool create)
{
  string p = path + "/block";
  bdev = BlockDevice::create(cct, p, aio_cb, static_cast<void*>(this));
  int r = bdev->open(p);
  if (bdev->supported_bdev_label()) {
    r = _check_or_set_bdev_label(p, bdev->get_size(), "main", create);
  }

  // initialize global block parameters
  block_size = bdev->get_block_size();
  block_mask = ~(block_size - 1);
  block_size_order = ctz(block_size);

  r = _set_cache_sizes();
  return 0;
}
```
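As a quick worked example of the block parameters computed above (a standalone check, not Ceph code): with a 4 KiB block size, block_mask clears the low 12 bits and ctz() yields order 12:

```cpp
#include <cstdint>
#include <cassert>

int main() {
  uint64_t block_size = 4096;                          // from bdev->get_block_size()
  uint64_t block_mask = ~(block_size - 1);             // 0xFFFFFFFFFFFFF000
  int block_size_order = __builtin_ctzll(block_size);  // ctz(4096) == 12

  assert(block_size_order == 12);
  // Align an arbitrary offset down to a block boundary:
  uint64_t offset = 10000;
  assert((offset & block_mask) == 8192);               // 8192 = 2 * 4096
  return 0;
}
```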
4) FreelistManager
Definition: FreelistManager *fm = nullptr;
Manages BlueStore's free space, i.e. which extents on the device are unallocated;
The default is BitmapFreelistManager.
```cpp
int BlueStore::_open_fm(bool create)
{
  fm = FreelistManager::create(cct, freelist_type, db, PREFIX_ALLOC);
  int r = fm->init(bdev->get_size());
}
```
5) Allocator
Definition: Allocator *alloc = nullptr;
BlueStore's blob/extent allocator; the following kinds are supported:
- BitmapAllocator
- StupidAllocator
The default is StupidAllocator (a simplified sketch of the idea follows below).
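StupidAllocator is essentially an extent-based allocator over the free space reported by the FreelistManager. The sketch below is a heavily simplified, self-contained illustration of that idea (first-fit over a sorted offset-to-length map); it is not Ceph's actual implementation, which additionally bins free extents by size:

```cpp
#include <cstdint>
#include <map>
#include <iostream>

// Simplified free-extent allocator: offset -> length of each free extent.
class ToyAllocator {
  std::map<uint64_t, uint64_t> free_;  // sorted by offset, like an interval_set
 public:
  void init_add_free(uint64_t off, uint64_t len) { free_[off] = len; }

  // First-fit allocation; StupidAllocator additionally bins extents by size.
  int64_t allocate(uint64_t want) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second >= want) {
        uint64_t off = it->first, len = it->second;
        free_.erase(it);
        if (len > want)
          free_[off + want] = len - want;  // keep the remainder free
        return off;
      }
    }
    return -1;  // ENOSPC
  }
};

int main() {
  ToyAllocator alloc;
  alloc.init_add_free(0, 1 << 20);            // 1 MiB of free space
  std::cout << alloc.allocate(4096) << "\n";  // 0
  std::cout << alloc.allocate(4096) << "\n";  // 4096
  return 0;
}
```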
6) Summary: the BlueStore mount process
During BlueStore's mount, a series of functions initializes each of the components it uses, in the following order:
```cpp
int BlueStore::_mount(bool kv_only)
{
  int r = read_meta("type", &type);
  if (type != "bluestore") {
    return -EIO;
  }
  ...
  int r = _open_path();
  r = _open_fsid(false);
  r = _read_fsid(&fsid);
  r = _lock_fsid();
  r = _open_bdev(false);
  r = _open_db(false);
  if (kv_only)
    return 0;
  r = _open_super_meta();
  r = _open_fm(false);
  r = _open_alloc();
  r = _open_collections();
  r = _reload_logger();
  if (bluefs) {
    r = _reconcile_bluefs_freespace();
  }
  _kv_start();
  r = _deferred_replay();
  mempool_thread.init();
  mounted = true;
  return 0;
}
```
RocksDBStore Definition
RocksDBStore is defined as follows; it implements the KeyValueDB interface:
```cpp
/**
 * Uses RocksDB to implement the KeyValueDB interface
 */
class RocksDBStore : public KeyValueDB {
  ...
  string path;
  void *priv;
  rocksdb::DB *db;
  rocksdb::Env *env;
  std::shared_ptr<rocksdb::Statistics> dbstats;
  rocksdb::BlockBasedTableOptions bbt_opts;
  string options_str;

  uint64_t cache_size = 0;
  ...
  // manage async compactions
  Mutex compact_queue_lock;
  Cond compact_queue_cond;
  list< pair<string,string> > compact_queue;
  bool compact_queue_stop;
  class CompactThread : public Thread {
    RocksDBStore *db;
  public:
    explicit CompactThread(RocksDBStore *d) : db(d) {}
    void *entry() override {
      db->compact_thread_entry();
      return NULL;
    }
    friend class RocksDBStore;
  } compact_thread;
  ...
  struct RocksWBHandler : public rocksdb::WriteBatch::Handler {
    std::string seen;
    int num_seen = 0;
  };

  class RocksDBTransactionImpl : public KeyValueDB::TransactionImpl {
  public:
    rocksdb::WriteBatch bat;
    RocksDBStore *db;
  };

  // the concrete DB iterator implementation; fairly important
  class RocksDBWholeSpaceIteratorImpl :
    public KeyValueDB::WholeSpaceIteratorImpl {
  protected:
    rocksdb::Iterator *dbiter;
  public:
    explicit RocksDBWholeSpaceIteratorImpl(rocksdb::Iterator *iter) :
      dbiter(iter) { }
    //virtual ~RocksDBWholeSpaceIteratorImpl() { }
    ~RocksDBWholeSpaceIteratorImpl() override;

    int seek_to_first() override;
    int seek_to_first(const string &prefix) override;
    int seek_to_last() override;
    int seek_to_last(const string &prefix) override;
    int upper_bound(const string &prefix, const string &after) override;
    int lower_bound(const string &prefix, const string &to) override;
    bool valid() override;
    int next() override;
    int prev() override;
    string key() override;
    pair<string,string> raw_key() override;
    bool raw_key_is_prefixed(const string &prefix) override;
    bufferlist value() override;
    bufferptr value_as_ptr() override;
    int status() override;
    size_t key_size() override;
    size_t value_size() override;
  };
  ...
};
```
The base class KeyValueDB is defined as follows; only a few of the key pieces are listed:
```cpp
/**
 * Defines virtual interface to be implemented by key value store
 *
 * Kyoto Cabinet or LevelDB should implement this
 */
class KeyValueDB {
public:
  class TransactionImpl {
    ...
  };
  typedef ceph::shared_ptr< TransactionImpl > Transaction;

  class WholeSpaceIteratorImpl {
    ...
  };
  typedef ceph::shared_ptr< WholeSpaceIteratorImpl > WholeSpaceIterator;

  class IteratorImpl : public GenericIteratorImpl {
    const std::string prefix;
    WholeSpaceIterator generic_iter;
    ...
    int seek_to_first() override {
      return generic_iter->seek_to_first(prefix);
    }
    int seek_to_last() {
      return generic_iter->seek_to_last(prefix);
    }
    int upper_bound(const std::string &after) override {
      return generic_iter->upper_bound(prefix, after);
    }
    int lower_bound(const std::string &to) override {
      return generic_iter->lower_bound(prefix, to);
    }
    bool valid() override {
      if (!generic_iter->valid())
        return false;
      return generic_iter->raw_key_is_prefixed(prefix);
    }
  };
  typedef ceph::shared_ptr< IteratorImpl > Iterator;

  WholeSpaceIterator get_iterator() {
    return _get_iterator();
  }
  Iterator get_iterator(const std::string &prefix) {
    return std::make_shared<IteratorImpl>(prefix, get_iterator());
  }
};
```
In the code, the usual pattern for using RocksDB is:
```cpp
KeyValueDB::Iterator it;
it = db->get_iterator(PREFIX_OBJ);  // set the key prefix
it->lower_bound(key);               // or it->upper_bound(key): position the iterator at the key
while (it->valid()) {               // check whether the iterator is still valid
  ...
  it->key();                        // or it->value(): read the key/value at the iterator
  it->next();                       // advance to the next position
}
```
KV Categories in RocksDB
All of BlueStore's kv data is stored in RocksDB; the implementation categorizes the data by key prefix, as follows (a toy illustration of the shared keyspace follows the listing):
```cpp
// kv store prefixes
const string PREFIX_SUPER = "S";       // field -> value
const string PREFIX_STAT = "T";        // field -> value(int64 array)
const string PREFIX_COLL = "C";        // collection name -> cnode_t
const string PREFIX_OBJ = "O";         // object name -> onode_t
const string PREFIX_OMAP = "M";        // u64 + keyname -> value
const string PREFIX_DEFERRED = "L";    // id -> deferred_transaction_t
const string PREFIX_ALLOC = "B";       // u64 offset -> u64 length (freelist)
const string PREFIX_SHARED_BLOB = "X"; // u64 offset -> shared_blob_t
```
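All of these prefixes share one flat RocksDB keyspace: the prefix is simply joined with the key (RocksDBStore combines them with a separator byte), so listing one category is an ordinary range scan. A toy illustration, with std::map standing in for RocksDB; the '\0' separator is an assumption here:

```cpp
#include <map>
#include <string>
#include <iostream>

// Toy model of the prefixed keyspace: join prefix and key with a separator
// byte so that all prefixes live in one ordered keyspace.
static std::string combine(const std::string &prefix, const std::string &key) {
  return prefix + '\0' + key;  // assumption: '\0' separator, as in RocksDBStore
}

int main() {
  std::map<std::string, std::string> db;  // stand-in for RocksDB
  db[combine("S", "nid_max")] = "...";
  db[combine("O", "object_key")] = "onode bytes";
  db[combine("C", "1.2_head")] = "cnode bytes";

  // "list O": scan everything that starts with the O prefix.
  std::string lo = combine("O", "");
  for (auto it = db.lower_bound(lo);
       it != db.end() && it->first.compare(0, lo.size(), lo) == 0; ++it)
    std::cout << it->first.substr(lo.size()) << "\n";  // prints: object_key
  return 0;
}
```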
Each prefix category is described in detail below:
1) PREFIX_SUPER
BlueStore's superblock information, i.e. BlueStore's own metadata, for example:
```
S blobid_max
S bluefs_extents
S freelist_type
S min_alloc_size
S min_compat_ondisk_format
S nid_max
S ondisk_format
```
2) PREFIX_STAT
The bluestore_statfs information.
```cpp
class BlueStore : public ObjectStore,
                  public md_config_obs_t {
  ...
  struct volatile_statfs {
    enum {
      STATFS_ALLOCATED = 0,
      STATFS_STORED,
      STATFS_COMPRESSED_ORIGINAL,
      STATFS_COMPRESSED,
      STATFS_COMPRESSED_ALLOCATED,
      STATFS_LAST
    };
    int64_t values[STATFS_LAST];
    ...
  };
};
```

It is written here:

```cpp
void BlueStore::_txc_update_store_statfs(TransContext *txc)
{
  if (txc->statfs_delta.is_empty())
    return;
  ...
  {
    std::lock_guard<std::mutex> l(vstatfs_lock);
    vstatfs += txc->statfs_delta;
  }

  bufferlist bl;
  txc->statfs_delta.encode(bl);
  txc->t->merge(PREFIX_STAT, "bluestore_statfs", bl);
  txc->statfs_delta.reset();
}
```
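The merge() call above is the point: rather than read-modify-write on every transaction, BlueStore encodes the statfs delta and lets the merge operator registered on PREFIX_STAT fold it into the stored value. A self-contained sketch of the intended merge semantics (element-wise addition of the int64 array; not Ceph's actual encoder):

```cpp
#include <array>
#include <cstdint>
#include <iostream>

constexpr int STATFS_LAST = 5;  // allocated, stored, compressed_* (see enum above)
using statfs_array = std::array<int64_t, STATFS_LAST>;

// Sketch of the PREFIX_STAT merge operator's semantics: add the delta array
// element-wise into the existing value.
statfs_array merge(const statfs_array &existing, const statfs_array &delta) {
  statfs_array out;
  for (int i = 0; i < STATFS_LAST; ++i)
    out[i] = existing[i] + delta[i];
  return out;
}

int main() {
  statfs_array cur{4096, 4000, 0, 0, 0};
  statfs_array delta{8192, 8000, 0, 0, 0};  // one transaction's statfs_delta
  statfs_array merged = merge(cur, delta);
  std::cout << "allocated=" << merged[0] << " stored=" << merged[1] << "\n";
  return 0;
}
```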
3) PREFIX_COLL
Collection metadata. A Collection corresponds to a logical PG, and every ObjectStore implements its own Collection type;
For every PG it stores, BlueStore writes one Collection kv entry into RocksDB;
```cpp
class BlueStore : public ObjectStore,
                  public md_config_obs_t {
  ...
  typedef boost::intrusive_ptr<Collection> CollectionRef;
  struct Collection : public CollectionImpl {
    BlueStore *store;
    Cache *cache;  ///< our cache shard
    coll_t cid;
    bluestore_cnode_t cnode;
    RWLock lock;

    bool exists;

    SharedBlobSet shared_blob_set;  ///< open SharedBlobs

    // cache onodes on a per-collection basis to avoid lock
    // contention.
    OnodeSpace onode_map;

    // pool options
    pool_opts_t pool_opts;
    ...
  };
};
```
4) PREFIX_OBJ
Object metadata. For every object stored in BlueStore, its struct Onode information (plus more) is written into RocksDB as a value;
To access an object, RocksDB is queried first, the in-memory Onode structure is constructed from the result, and the object is then accessed through it (a sketch of this lookup follows the definition below);
```cpp
class BlueStore : public ObjectStore,
                  public md_config_obs_t {
  ...
  /// an in-memory object
  struct Onode {
    std::atomic_int nref;  ///< reference count
    Collection *c;
    ghobject_t oid;

    /// key under PREFIX_OBJ where we are stored
    mempool::bluestore_cache_other::string key;

    boost::intrusive::list_member_hook<> lru_item;

    bluestore_onode_t onode;  ///< metadata stored as value in kv store
    bool exists;              ///< true if object logically exists

    ExtentMap extent_map;
    ...
  };
  typedef boost::intrusive_ptr<Onode> OnodeRef;
};
```
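A sketch of the lookup flow just described: check the per-collection cache (the onode_map / OnodeSpace), and only on a miss read the encoded onode from the kv store and materialize it. std::map stands in for both RocksDB and the cache; in Ceph this logic lives in Collection::get_onode():

```cpp
#include <map>
#include <string>
#include <memory>
#include <iostream>

struct Onode { std::string key; std::string metadata; };
using OnodeRef = std::shared_ptr<Onode>;

std::map<std::string, std::string> kv_obj;  // PREFIX_OBJ keyspace stand-in
std::map<std::string, OnodeRef> onode_map;  // OnodeSpace cache stand-in

// Sketch of Collection::get_onode(): hit the per-collection cache first,
// otherwise read the encoded onode from the kv store and materialize it.
OnodeRef get_onode(const std::string &key) {
  auto c = onode_map.find(key);
  if (c != onode_map.end())
    return c->second;        // cache hit
  auto d = kv_obj.find(key);
  if (d == kv_obj.end())
    return nullptr;          // object does not exist
  auto o = std::make_shared<Onode>(Onode{key, d->second});  // "decode" step
  onode_map[key] = o;        // populate the cache for the next access
  return o;
}

int main() {
  kv_obj["obj#1"] = "encoded bluestore_onode_t";
  OnodeRef o = get_onode("obj#1");
  std::cout << (o ? o->metadata : "miss") << "\n";
  return 0;
}
```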
5) PREFIX_OMAP
The object's OMap information. The attr and map data previously stored with the object is now stored in RocksDB under the PREFIX_OMAP prefix;
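Per the prefix table, an OMAP key is "u64 + keyname": each onode owns a u64 id, and the user's key is appended after an encoding of that id, so one object's omap entries sort together and can be scanned as a range. A sketch of that layout (the big-endian id and the '.' separator are assumptions here; Ceph's get_omap_key() helper uses its own exact byte layout):

```cpp
#include <cstdint>
#include <string>
#include <map>
#include <iostream>

// Sketch: encode the onode's u64 id big-endian so byte-wise key order equals
// numeric order, then append the user's omap key.
static std::string omap_key(uint64_t id, const std::string &name) {
  std::string out;
  for (int shift = 56; shift >= 0; shift -= 8)
    out.push_back(char((id >> shift) & 0xff));
  out.push_back('.');  // assumption: separator between id and name
  out += name;
  return out;
}

int main() {
  std::map<std::string, std::string> kv_omap;  // PREFIX_OMAP stand-in
  kv_omap[omap_key(42, "snapset")] = "...";
  kv_omap[omap_key(42, "user_attr")] = "...";
  kv_omap[omap_key(43, "snapset")] = "...";
  // All keys of object 42 are contiguous: scan [omap_key(42,""), omap_key(43,"")).
  for (auto it = kv_omap.lower_bound(omap_key(42, ""));
       it != kv_omap.end() && it->first < omap_key(43, ""); ++it)
    std::cout << it->first.substr(9) << "\n";  // prints: snapset, user_attr
  return 0;
}
```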
6) PREFIX_DEFERRED
Information on BlueStore's deferred transactions; the corresponding data structure is defined as follows:
```cpp
/// writeahead-logged transaction
struct bluestore_deferred_transaction_t {
  uint64_t seq = 0;
  list<bluestore_deferred_op_t> ops;
  interval_set<uint64_t> released;  ///< allocations to release after tx

  bluestore_deferred_transaction_t() : seq(0) {}

  DENC(bluestore_deferred_transaction_t, v, p) {
    DENC_START(1, 1, p);
    denc(v.seq, p);
    denc(v.ops, p);
    denc(v.released, p);
    DENC_FINISH(p);
  }
  void dump(Formatter *f) const;
  static void generate_test_instances(list<bluestore_deferred_transaction_t*>& o);
};
WRITE_CLASS_DENC(bluestore_deferred_transaction_t)
```
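During mount, _deferred_replay() (see the mount sequence above) walks every key under PREFIX_DEFERRED in seq order, decodes each bluestore_deferred_transaction_t, and re-queues its ops. A minimal stand-in sketch of that walk (std::map in place of RocksDB, plain strings in place of the DENC-encoded values):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <iostream>

struct bluestore_deferred_transaction_t { uint64_t seq; std::string ops; };

// Sketch of _deferred_replay(): iterate every key under the L prefix,
// decode the deferred transaction, and re-submit its ops.
int main() {
  // Stand-in for the PREFIX_DEFERRED keyspace; values would be DENC-encoded.
  std::map<uint64_t, bluestore_deferred_transaction_t> kv_deferred{
    {1, {1, "write 0x1000~0x2000"}},
    {2, {2, "write 0x8000~0x1000"}},
  };
  for (auto &p : kv_deferred) {
    // real code: decode bluestore_deferred_transaction_t, then replay its ops
    std::cout << "replay seq " << p.second.seq << ": " << p.second.ops << "\n";
  }
  return 0;
}
```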
7) PREFIX_ALLOC
FreelistManager-related keys; BitmapFreelistManager is the default. Its metadata keys include:
```
B blocks
B blocks_per_key
B bytes_per_block
B size
```
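BitmapFreelistManager keeps one bitmap value per blocks_per_key span of the device and registers a bitwise-XOR merge operator on PREFIX_ALLOC (the setup_merge_operators() call seen in _open_db), so allocate and release are both just bit flips merged into the stored value, with no read-modify-write. A sketch of the XOR semantics on a single byte:

```cpp
#include <cstdint>
#include <bitset>
#include <iostream>

int main() {
  // One value under PREFIX_ALLOC covers blocks_per_key blocks, one bit each.
  uint8_t bitmap = 0b00000000;  // all blocks free

  // Sketch of the XOR merge operator: allocating blocks 0-2 merges a mask
  // with those bits set; releasing them later merges the *same* mask.
  uint8_t alloc_mask = 0b00000111;
  bitmap ^= alloc_mask;  // allocate -> bits flip to 1
  std::cout << std::bitset<8>(bitmap) << "\n";  // 00000111
  bitmap ^= alloc_mask;  // release -> bits flip back to 0
  std::cout << std::bitset<8>(bitmap) << "\n";  // 00000000
  return 0;
}
```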
8) PREFIX_SHARED_BLOB
Shared blob metadata. Because a blob can be fairly large, several of the extent maps above may map down onto the same blob;
RocksDB tool
Ceph provides a command for reading the data inside a kvstore, ceph-kvstore-tool; its help output is:
```
root@ceph6:~# ceph-kvstore-tool -h
Usage: ceph-kvstore-tool <leveldb|rocksdb|bluestore-kv> <store path> command [args...]

Commands:
  list [prefix]
  list-crc [prefix]
  exists <prefix> [key]
  get <prefix> <key> [out <file>]
  crc <prefix> <key>
  get-size [<prefix> <key>]
  set <prefix> <key> [ver <N>|in <file>]
  rm <prefix> <key>
  rm-prefix <prefix>
  store-copy <path> [num-keys-per-tx] [leveldb|rocksdb|...]
  store-crc <path>
  compact
  compact-prefix <prefix>
  compact-range <prefix> <start> <end>
  repair
```
Usage example:
```
root@ceph6:~# systemctl stop [email protected]
root@ceph6:~# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-20/ list B > list-B
2018-09-21 11:43:42.679 7f4ec14deb80  1 bluestore(/var/lib/ceph/osd/ceph-20/) _mount path /var/lib/ceph/osd/ceph-20/
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev create path /var/lib/ceph/osd/ceph-20//block type kernel
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev(0x55ddf4e58000 /var/lib/ceph/osd/ceph-20//block) open path /var/lib/ceph/osd/ceph-20//block
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev(0x55ddf4e58000 /var/lib/ceph/osd/ceph-20//block) open size 4000783007744 (0x3a381400000, 3.6 TiB) block_size 4096 (4 KiB) rotational
2018-09-21 11:43:42.679 7f4ec14deb80  1 bluestore(/var/lib/ceph/osd/ceph-20/) _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev create path /var/lib/ceph/osd/ceph-20//block.db type kernel
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev(0x55ddf4e58380 /var/lib/ceph/osd/ceph-20//block.db) open path /var/lib/ceph/osd/ceph-20//block.db
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev(0x55ddf4e58380 /var/lib/ceph/osd/ceph-20//block.db) open size 3221225472 (0xc0000000, 3 GiB) block_size 4096 (4 KiB) non-rotational
2018-09-21 11:43:42.679 7f4ec14deb80  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-20//block.db size 3 GiB
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev create path /var/lib/ceph/osd/ceph-20//block type kernel
2018-09-21 11:43:42.679 7f4ec14deb80  1 bdev(0x55ddf4e58700 /var/lib/ceph/osd/ceph-20//block) open path /var/lib/ceph/osd/ceph-20//block
2018-09-21 11:43:42.683 7f4ec14deb80  1 bdev(0x55ddf4e58700 /var/lib/ceph/osd/ceph-20//block) open size 4000783007744 (0x3a381400000, 3.6 TiB) block_size 4096 (4 KiB) rotational
2018-09-21 11:43:42.683 7f4ec14deb80  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-20//block size 3.6 TiB
2018-09-21 11:43:42.683 7f4ec14deb80  1 bdev create path /var/lib/ceph/osd/ceph-20//block.wal type kernel
2018-09-21 11:43:42.683 7f4ec14deb80  1 bdev(0x55ddf4e58a80 /var/lib/ceph/osd/ceph-20//block.wal) open path /var/lib/ceph/osd/ceph-20//block.wal
2018-09-21 11:43:42.683 7f4ec14deb80  1 bdev(0x55ddf4e58a80 /var/lib/ceph/osd/ceph-20//block.wal) open size 3221225472 (0xc0000000, 3 GiB) block_size 4096 (4 KiB) non-rotational
2018-09-21 11:43:42.683 7f4ec14deb80  1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-20//block.wal size 3 GiB
2018-09-21 11:43:42.683 7f4ec14deb80  1 bluefs mount
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option compaction_readahead_size = 2097152
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option compression = kNoCompression
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option max_write_buffer_number = 4
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option recycle_log_file_num = 4
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option writable_file_max_buffer_size = 0
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option write_buffer_size = 268435456
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option compaction_readahead_size = 2097152
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option compression = kNoCompression
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option max_write_buffer_number = 4
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option recycle_log_file_num = 4
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option writable_file_max_buffer_size = 0
2018-09-21 11:43:42.691 7f4ec14deb80  0 set rocksdb option write_buffer_size = 268435456
2018-09-21 11:43:42.691 7f4ec14deb80  1 rocksdb: do_open column families: [default]
2018-09-21 11:43:42.699 7f4ec14deb80  1 bluestore(/var/lib/ceph/osd/ceph-20/) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2018-09-21 11:43:42.703 7f4ec14deb80  1 bluestore(/var/lib/ceph/osd/ceph-20/) umount
2018-09-21 11:43:42.703 7f4ec14deb80  1 bluefs umount
2018-09-21 11:43:42.703 7f4ec14deb80  1 stupidalloc 0x0x55ddf4a92a70 shutdown
2018-09-21 11:43:42.703 7f4ec14deb80  1 stupidalloc 0x0x55ddf4a92ae0 shutdown
2018-09-21 11:43:42.703 7f4ec14deb80  1 stupidalloc 0x0x55ddf4a92b50 shutdown
2018-09-21 11:43:42.703 7f4ec14deb80  1 bdev(0x55ddf4e58a80 /var/lib/ceph/osd/ceph-20//block.wal) close
2018-09-21 11:43:42.991 7f4ec14deb80  1 bdev(0x55ddf4e58380 /var/lib/ceph/osd/ceph-20//block.db) close
2018-09-21 11:43:43.227 7f4ec14deb80  1 bdev(0x55ddf4e58700 /var/lib/ceph/osd/ceph-20//block) close
2018-09-21 11:43:43.463 7f4ec14deb80  1 bdev(0x55ddf4e58000 /var/lib/ceph/osd/ceph-20//block) close
root@ceph6:~# systemctl start [email protected]
```