
RocksDB source code analysis: how compaction works


    Overview


    I. How RocksDB organizes data on disk

    Introduction to RocksDB

    RocksDB, like LevelDB, is a storage engine built around an operation log (op log).
    Because every update first goes to the op log, random disk writes are turned into sequential writes to the log, and the newest data lives in an in-memory memtable, which improves IO efficiency. Each column family has its own memtable and SSTables. When a column family's active memtable exceeds the configured size, it becomes an immutable memtable and a new op log is created. Immutable memtables are read-only: they can still serve reads but no longer accept updates. When the number of immutable memtables exceeds the configured limit, a flush is triggered: the DB schedules a background thread that merges several memtables and dumps them to disk as a new SSTable in Level 0. As Level 0 accumulates SSTables, compaction is triggered: a background compaction thread merges Level 0 SSTables with the overlapping SSTables in Level 1 by key and writes new SSTables, and so on from the lower levels upward through the key space, producing the familiar leveled structure. The number of levels is configurable by the user.

    In LevelDB the memtable's in-memory data structure is a skiplist. In RocksDB the memtable can take one of three forms: skiplist, hash-skiplist, or hash-linklist. The names describe the layout: in a hash-skiplist every hash bucket holds a skiplist, and in a hash-linklist every hash bucket holds a linked list. Which representation to use is chosen in the configuration, as sketched below.
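
    A minimal sketch of selecting the representation through Options, assuming the public RocksDB C++ API (NewFixedPrefixTransform, NewHashSkipListRepFactory, NewHashLinkListRepFactory); the prefix length, bucket counts and DB path are illustrative values, not recommendations:

    #include <rocksdb/db.h>
    #include <rocksdb/memtablerep.h>
    #include <rocksdb/slice_transform.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      // The default memtable is a plain skiplist; the hash-based variants need a
      // prefix extractor so that keys can be routed to a bucket.
      options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));

      // hash-skiplist: one skiplist per hash bucket.
      options.memtable_factory.reset(
          rocksdb::NewHashSkipListRepFactory(/*bucket_count=*/1000000));
      // Alternatively, hash-linklist: one linked list per bucket.
      // options.memtable_factory.reset(rocksdb::NewHashLinkListRepFactory(50000));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/memtable_demo", &db);
      if (s.ok()) delete db;
      return s.ok() ? 0 : 1;
    }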

    Compaction falls into two categories: dumping an immutable memtable to an SST file on disk is called a flush (or minor compaction); moving SST files from a lower level into a higher level on disk is called compaction (or major compaction). In MyRocks both are driven by background threads, one pool for minor compactions and one for major compactions, controlled by the parameters rocksdb_max_background_flushes and rocksdb_max_background_compactions. Minor compaction keeps writing in-memory data to disk so that there is always enough memory for new writes; major compaction keeps removing duplicated and stale data across levels, which shrinks the disk space used by SST files, and reads benefit too because fewer SST files have to be examined. Because compaction runs continuously in the background, the amount of work per unit of time is small and does not noticeably hurt overall performance; the parameters can of course be tuned for a specific workload. The overall architecture is shown in figure 1. With these concepts in place, the rest of this article walks through the two flows, flush (minor compaction) and compaction (major compaction), whose entry points are BackgroundFlush and BackgroundCompaction respectively.
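
    The rocksdb_max_background_* names above are the MyRocks/MySQL variable names; inside RocksDB the corresponding knobs live on Options. A hedged sketch with illustrative values:

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Threads that flush immutable memtables to L0 (minor compaction).
      options.max_background_flushes = 2;
      // Threads that merge SST files between levels (major compaction).
      options.max_background_compactions = 4;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bg_threads_demo", &db);
      if (s.ok()) delete db;
      return s.ok() ? 0 : 1;
    }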

    (Figure 1: overall compaction architecture)

    1 How files are laid out on disk

    RocksDB's on-disk files are organized into levels, named level-0, level-1, and so on.
    Files on level-0 are produced by dumping in-memory memtables to disk; each file is sorted by key internally, but different level-0 files may overlap.
    On every other level, the files are ordered by key.

    (Figure: how SST files are organized on disk)

    The log-structured merge tree

    Conceptually, the basic LSM idea is simple. Instead of maintaining one big search structure that is updated in place (which causes random reads and writes and hurts write performance), writes are appended sequentially to a series of similar, sorted files (the SSTables), so each file contains a batch of changes from a short period of time. Because each file is sorted, lookups inside it are still fast. Files are immutable: they are never updated, and new changes only go into new files. A read may have to check several files, and the number of files is kept under control by periodically merging them; a toy sketch of this pattern follows.
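
    To make the idea concrete, here is a toy single-threaded sketch (plain C++, not RocksDB code): writes go into a sorted in-memory map, a full map is frozen into an immutable sorted run, and reads check the memtable first and then the runs from newest to oldest. The size limit is arbitrary.

    #include <cstdio>
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    class ToyLSM {
     public:
      void Put(const std::string& k, const std::string& v) {
        mem_[k] = v;
        if (mem_.size() >= kMemLimit) {       // "flush": freeze the memtable
          runs_.push_back(std::move(mem_));   // it becomes an immutable sorted run
          mem_.clear();
        }
      }
      std::optional<std::string> Get(const std::string& k) const {
        if (auto it = mem_.find(k); it != mem_.end()) return it->second;
        for (auto r = runs_.rbegin(); r != runs_.rend(); ++r)   // newest run first
          if (auto it = r->find(k); it != r->end()) return it->second;
        return std::nullopt;
      }
     private:
      static constexpr size_t kMemLimit = 4;
      std::map<std::string, std::string> mem_;                 // active memtable
      std::vector<std::map<std::string, std::string>> runs_;   // frozen sorted runs
    };

    int main() {
      ToyLSM db;
      for (int i = 0; i < 10; ++i) db.Put("key" + std::to_string(i), "v" + std::to_string(i));
      std::printf("key7 -> %s\n", db.Get("key7").value_or("<missing>").c_str());
      return 0;
    }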

    (Figure: log-structured merge tree)

    2 data range partition

    On non-zero levels the key space is split into sorted ranges, each stored in its own file.

    (Figure: data range partition)

    The read path

    (Figure: the LevelDB read path)

    Status DBImpl::Get(const ReadOptions& read_options,
                       ColumnFamilyHandle* column_family, const Slice& key,
                       PinnableSlice* value) {
      return GetImpl(read_options, column_family, key, value);
    }
    
    Status DBImpl::GetImpl(const ReadOptions& read_options,
                           ColumnFamilyHandle* column_family, const Slice& key,
                           PinnableSlice* pinnable_val, bool* value_found,
                           ReadCallback* callback, bool* is_blob_index) {
      assert(pinnable_val != nullptr);
      StopWatch sw(env_, stats_, DB_GET);
      PERF_TIMER_GUARD(get_snapshot_time);
    
      auto cfh = reinterpret_cast<ColumnFamilyHandleImpl*>(column_family);
      auto cfd = cfh->cfd();
    
      // Acquire SuperVersion
    // holds references to memtable, all immutable memtables and version
    
      SuperVersion* sv = GetAndRefSuperVersion(cfd);
    
      TEST_SYNC_POINT("DBImpl::GetImpl:1");
      TEST_SYNC_POINT("DBImpl::GetImpl:2");
    
      SequenceNumber snapshot;
      if (read_options.snapshot != nullptr) {
        // Note: In WritePrepared txns this is not necessary but not harmful either.
        // Because prep_seq > snapshot => commit_seq > snapshot so if a snapshot is
        // specified we should be fine with skipping seq numbers that are greater
        // than that.
    
    // Abstract handle to particular state of a DB.
    // A Snapshot is an immutable object and can therefore be safely
    // accessed from multiple threads without any external synchronization.
    
        snapshot = reinterpret_cast<const SnapshotImpl*>(
            read_options.snapshot)->number_;
      } else {
        // Since we get and reference the super version before getting
        // the snapshot number, without a mutex protection, it is possible
        // that a memtable switch happened in the middle and not all the
        // data for this snapshot is available. But it will contain all
        // the data available in the super version we have, which is also
        // a valid snapshot to read from.
        // We shouldn't get snapshot before finding and referencing the
        // super versipon because a flush happening in between may compact
        // away data for the snapshot, but the snapshot is earlier than the
        // data overwriting it, so users may see wrong results.
        snapshot = last_seq_same_as_publish_seq_
                       ? versions_->LastSequence()
                       : versions_->LastPublishedSequence();
      }
      TEST_SYNC_POINT("DBImpl::GetImpl:3");
      TEST_SYNC_POINT("DBImpl::GetImpl:4");
    
      // Prepare to store a list of merge operations if merge occurs.
      MergeContext merge_context;
      RangeDelAggregator range_del_agg(cfd->internal_comparator(), snapshot);
    
      Status s;
      // First look in the memtable, then in the immutable memtable (if any).
      // s is both in/out. When in, s could either be OK or MergeInProgress.
      // merge_operands will contain the sequence of merges in the latter case.
      LookupKey lkey(key, snapshot);
      PERF_TIMER_STOP(get_snapshot_time);
    
      bool skip_memtable = (read_options.read_tier == kPersistedTier &&
                            has_unpersisted_data_.load(std::memory_order_relaxed));
      bool done = false;
      if (!skip_memtable) {
        if (sv->mem->Get(lkey, pinnable_val->GetSelf(), &s, &merge_context,
                         &range_del_agg, read_options, callback, is_blob_index)) {
          done = true;
          pinnable_val->PinSelf();
          RecordTick(stats_, MEMTABLE_HIT);
        } else if ((s.ok() || s.IsMergeInProgress()) &&
                   sv->imm->Get(lkey, pinnable_val->GetSelf(), &s, &merge_context,
                                &range_del_agg, read_options, callback,
                                is_blob_index)) {
          done = true;
          pinnable_val->PinSelf();
          RecordTick(stats_, MEMTABLE_HIT);
        }
        if (!done && !s.ok() && !s.IsMergeInProgress()) {
          ReturnAndCleanupSuperVersion(cfd, sv);
          return s;
        }
      }
      if (!done) {
        PERF_TIMER_GUARD(get_from_output_files_time);
        sv->current->Get(read_options, lkey, pinnable_val, &s, &merge_context,
                         &range_del_agg, value_found, nullptr, nullptr, callback,
                         is_blob_index);
        RecordTick(stats_, MEMTABLE_MISS);
      }
    
      {
        PERF_TIMER_GUARD(get_post_process_time);
    
        ReturnAndCleanupSuperVersion(cfd, sv);
    
        RecordTick(stats_, NUMBER_KEYS_READ);
        size_t size = pinnable_val->size();
        RecordTick(stats_, BYTES_READ, size);
        MeasureTime(stats_, BYTES_PER_READ, size);
        PERF_COUNTER_ADD(get_read_bytes, size);
      }
      return s;
    }
    

    The LookupKey used above is built as follows:

    LookupKey::LookupKey(const Slice& _user_key, SequenceNumber s) {
      size_t usize = _user_key.size();
      size_t needed = usize + 13;  // A conservative estimate
      char* dst;
      if (needed <= sizeof(space_)) {
        dst = space_;
      } else {
        dst = new char[needed];
      }
      start_ = dst;
      // NOTE: We don't support users keys of more than 2GB :)
      dst = EncodeVarint32(dst, static_cast<uint32_t>(usize + 8));
      kstart_ = dst;
      memcpy(dst, _user_key.data(), usize);
      dst += usize;
      EncodeFixed64(dst, PackSequenceAndType(s, kValueTypeForSeek));
      dst += 8;
      end_ = dst;
    }
    


    I am interested in classic open-source projects and want to learn architecture and design from systems built by experienced engineers, so I enjoy reading their code. Since one of my projects needed RocksDB, I studied its implementation in detail while using it, and in January 2014 I decided to organize the notes into this series of posts, partly as a summary of that work, sharing them in the hope that they help others. I had already read the LevelDB source, referring to related articles along the way, so this introduction will look similar to LevelDB's; after all, RocksDB is designed on top of LevelDB, optimizes it in a few places, and some of the code is literally identical. Later chapters of the series analyze the RocksDB implementation in detail.

    3 Looking up a key in the SST files

    Within each level above level 0 the files are globally ordered by key, and each file is sorted internally.
    To look up a key on such a level:

    • First binary-search over every file's start/end keys to determine which file may contain the key.
    • Then binary-search inside that candidate file to locate the key's exact position.
      Together this is a single binary search over all the files of one level; a simplified sketch follows.
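
    A simplified sketch of the per-level search under these assumptions (FileMeta and FindFile are illustrative names, not the actual Version::Get code):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct FileMeta {
      std::string smallest;  // start key of the SST file
      std::string largest;   // end key of the SST file
    };

    // Files in a level (>= 1) are sorted and non-overlapping, so one binary
    // search over file boundaries picks the only candidate file for `key`.
    int FindFile(const std::vector<FileMeta>& files, const std::string& key) {
      auto it = std::lower_bound(files.begin(), files.end(), key,
                                 [](const FileMeta& f, const std::string& k) {
                                   return f.largest < k;  // first file whose largest >= key
                                 });
      if (it == files.end() || key < it->smallest) return -1;  // not in this level
      return static_cast<int>(it - files.begin());
      // A second binary search then runs inside that file's index block to find
      // the data block that may contain the key.
    }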

    The write path

    RocksDB writes are implemented by WriteImpl, whose structure is shown below:

    Status DBImpl::Write(const WriteOptions& write_options, WriteBatch* my_batch) {
      return WriteImpl(write_options, my_batch, nullptr, nullptr);
    }
    
    // The main write queue. This is the only write queue that updates LastSequence.
    // When using one write queue, the same sequence also indicates the last
    // published sequence.
    Status DBImpl::WriteImpl(const WriteOptions& write_options,
                             WriteBatch* my_batch, WriteCallback* callback,
                             uint64_t* log_used, uint64_t log_ref,
                             bool disable_memtable, uint64_t* seq_used,
                             PreReleaseCallback* pre_release_callback) {
      if (my_batch == nullptr) {
        return Status::Corruption("Batch is nullptr!");
      }
      if (write_options.sync && write_options.disableWAL) {
        return Status::InvalidArgument("Sync writes has to enable WAL.");
      }
      if (two_write_queues_ && immutable_db_options_.enable_pipelined_write) {
        return Status::NotSupported(
            "pipelined_writes is not compatible with concurrent prepares");
      }
      if (seq_per_batch_ && immutable_db_options_.enable_pipelined_write) {
        return Status::NotSupported(
            "pipelined_writes is not compatible with seq_per_batch");
      }
      // Otherwise IsLatestPersistentState optimization does not make sense
      assert(!WriteBatchInternal::IsLatestPersistentState(my_batch) ||
             disable_memtable);
    
      Status status;
      if (write_options.low_pri) {
        status = ThrottleLowPriWritesIfNeeded(write_options, my_batch);
        if (!status.ok()) {
          return status;
        }
      }
    
      if (two_write_queues_ && disable_memtable) {
        return WriteImplWALOnly(write_options, my_batch, callback, log_used,
                                log_ref, seq_used, pre_release_callback);
      }
    
      if (immutable_db_options_.enable_pipelined_write) {
        return PipelinedWriteImpl(write_options, my_batch, callback, log_used,
                                  log_ref, disable_memtable, seq_used);
      }
    
      PERF_TIMER_GUARD(write_pre_and_post_process_time);
      WriteThread::Writer w(write_options, my_batch, callback, log_ref,
                            disable_memtable, pre_release_callback);
    
      if (!write_options.disableWAL) {
        RecordTick(stats_, WRITE_WITH_WAL);
      }
    
      StopWatch write_sw(env_, immutable_db_options_.statistics.get(), DB_WRITE);
    
      write_thread_.JoinBatchGroup(&w);
      if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
        // we are a non-leader in a parallel group
        PERF_TIMER_GUARD(write_memtable_time);
    
        if (w.ShouldWriteToMemtable()) {
          ColumnFamilyMemTablesImpl column_family_memtables(
              versions_->GetColumnFamilySet());
          w.status = WriteBatchInternal::InsertInto(
              &w, w.sequence, &column_family_memtables, &flush_scheduler_,
              write_options.ignore_missing_column_families, 0 /*log_number*/, this,
              true /*concurrent_memtable_writes*/, seq_per_batch_);
        }
    
        if (write_thread_.CompleteParallelMemTableWriter(&w)) {
          // we're responsible for exit batch group
          for (auto* writer : *(w.write_group)) {
            if (!writer->CallbackFailed() && writer->pre_release_callback) {
              assert(writer->sequence != kMaxSequenceNumber);
              Status ws = writer->pre_release_callback->Callback(writer->sequence);
              if (!ws.ok()) {
                status = ws;
                break;
              }
            }
          }
          // TODO(myabandeh): propagate status to write_group
          auto last_sequence = w.write_group->last_sequence;
          versions_->SetLastSequence(last_sequence);
          MemTableInsertStatusCheck(w.status);
          write_thread_.ExitAsBatchGroupFollower(&w);
        }
        assert(w.state == WriteThread::STATE_COMPLETED);
        // STATE_COMPLETED conditional below handles exit
    
        status = w.FinalStatus();
      }
      if (w.state == WriteThread::STATE_COMPLETED) {
        if (log_used != nullptr) {
          *log_used = w.log_used;
        }
        if (seq_used != nullptr) {
          *seq_used = w.sequence;
        }
        // write is complete and leader has updated sequence
        return w.FinalStatus();
      }
      // else we are the leader of the write batch group
      assert(w.state == WriteThread::STATE_GROUP_LEADER);
    
      // Once reaches this point, the current writer "w" will try to do its write
      // job.  It may also pick up some of the remaining writers in the "writers_"
      // when it finds suitable, and finish them in the same write batch.
      // This is how a write job could be done by the other writer.
      WriteContext write_context;
      WriteThread::WriteGroup write_group;
      bool in_parallel_group = false;
      uint64_t last_sequence = kMaxSequenceNumber;
      if (!two_write_queues_) {
        last_sequence = versions_->LastSequence();
      }
    
      mutex_.Lock();
    
      bool need_log_sync = write_options.sync;
      bool need_log_dir_sync = need_log_sync && !log_dir_synced_;
      if (!two_write_queues_ || !disable_memtable) {
        // With concurrent writes we do preprocess only in the write thread that
        // also does write to memtable to avoid sync issue on shared data structure
        // with the other thread
        status = PreprocessWrite(write_options, &need_log_sync, &write_context);
      }
      log::Writer* log_writer = logs_.back().writer;
    
      mutex_.Unlock();
    
      // Add to log and apply to memtable.  We can release the lock
      // during this phase since &w is currently responsible for logging
      // and protects against concurrent loggers and concurrent writes
      // into memtables
    
      last_batch_group_size_ =
          write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
    
      if (status.ok()) {
        // Rules for when we can update the memtable concurrently
        // 1. supported by memtable
        // 2. Puts are not okay if inplace_update_support
        // 3. Merges are not okay
        //
        // Rules 1..2 are enforced by checking the options
        // during startup (CheckConcurrentWritesSupported), so if
        // options.allow_concurrent_memtable_write is true then they can be
        // assumed to be true.  Rule 3 is checked for each batch.  We could
        // relax rules 2 if we could prevent write batches from referring
        // more than once to a particular key.
        bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
                        write_group.size > 1;
        size_t total_count = 0;
        size_t valid_batches = 0;
        uint64_t total_byte_size = 0;
        for (auto* writer : write_group) {
          if (writer->CheckCallback(this)) {
            valid_batches++;
            if (writer->ShouldWriteToMemtable()) {
              total_count += WriteBatchInternal::Count(writer->batch);
              parallel = parallel && !writer->batch->HasMerge();
            }
    
            total_byte_size = WriteBatchInternal::AppendedByteSize(
                total_byte_size, WriteBatchInternal::ByteSize(writer->batch));
          }
        }
        // Note about seq_per_batch_: either disableWAL is set for the entire write
        // group or not. In either case we inc seq for each write batch with no
        // failed callback. This means that there could be a batch with
        // disalbe_memtable in between; although we do not write this batch to
        // memtable it still consumes a seq. Otherwise, if !seq_per_batch_, we inc
        // the seq per valid written key to mem.
        size_t seq_inc = seq_per_batch_ ? valid_batches : total_count;
    
        const bool concurrent_update = two_write_queues_;
        // Update stats while we are an exclusive group leader, so we know
        // that nobody else can be writing to these particular stats.
        // We're optimistic, updating the stats before we successfully
        // commit.  That lets us release our leader status early.
        auto stats = default_cf_internal_stats_;
        stats->AddDBStats(InternalStats::NUMBER_KEYS_WRITTEN, total_count,
                          concurrent_update);
        RecordTick(stats_, NUMBER_KEYS_WRITTEN, total_count);
        stats->AddDBStats(InternalStats::BYTES_WRITTEN, total_byte_size,
                          concurrent_update);
        RecordTick(stats_, BYTES_WRITTEN, total_byte_size);
        stats->AddDBStats(InternalStats::WRITE_DONE_BY_SELF, 1, concurrent_update);
        RecordTick(stats_, WRITE_DONE_BY_SELF);
        auto write_done_by_other = write_group.size - 1;
        if (write_done_by_other > 0) {
          stats->AddDBStats(InternalStats::WRITE_DONE_BY_OTHER, write_done_by_other,
                            concurrent_update);
          RecordTick(stats_, WRITE_DONE_BY_OTHER, write_done_by_other);
        }
        MeasureTime(stats_, BYTES_PER_WRITE, total_byte_size);
    
        if (write_options.disableWAL) {
          has_unpersisted_data_.store(true, std::memory_order_relaxed);
        }
    
        PERF_TIMER_STOP(write_pre_and_post_process_time);
    
        if (!two_write_queues_) {
          if (status.ok() && !write_options.disableWAL) {
            PERF_TIMER_GUARD(write_wal_time);
            status = WriteToWAL(write_group, log_writer, log_used, need_log_sync,
                                need_log_dir_sync, last_sequence + 1);
          }
        } else {
          if (status.ok() && !write_options.disableWAL) {
            PERF_TIMER_GUARD(write_wal_time);
            // LastAllocatedSequence is increased inside WriteToWAL under
            // wal_write_mutex_ to ensure ordered events in WAL
            status = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
                                          seq_inc);
          } else {
            // Otherwise we inc seq number for memtable writes
            last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
          }
        }
        assert(last_sequence != kMaxSequenceNumber);
        const SequenceNumber current_sequence = last_sequence + 1;
        last_sequence += seq_inc;
    
        if (status.ok()) {
          PERF_TIMER_GUARD(write_memtable_time);
    
          if (!parallel) {
            // w.sequence will be set inside InsertInto
            w.status = WriteBatchInternal::InsertInto(
                write_group, current_sequence, column_family_memtables_.get(),
                &flush_scheduler_, write_options.ignore_missing_column_families,
                0 /*recovery_log_number*/, this, parallel, seq_per_batch_);
          } else {
            SequenceNumber next_sequence = current_sequence;
            // Note: the logic for advancing seq here must be consistent with the
            // logic in WriteBatchInternal::InsertInto(write_group...) as well as
            // with WriteBatchInternal::InsertInto(write_batch...) that is called on
            // the merged batch during recovery from the WAL.
            for (auto* writer : write_group) {
              if (writer->CallbackFailed()) {
                continue;
              }
              writer->sequence = next_sequence;
              if (seq_per_batch_) {
                next_sequence++;
              } else if (writer->ShouldWriteToMemtable()) {
                next_sequence += WriteBatchInternal::Count(writer->batch);
              }
            }
            write_group.last_sequence = last_sequence;
            write_group.running.store(static_cast<uint32_t>(write_group.size),
                                      std::memory_order_relaxed);
            write_thread_.LaunchParallelMemTableWriters(&write_group);
            in_parallel_group = true;
    
            // Each parallel follower is doing each own writes. The leader should
            // also do its own.
            if (w.ShouldWriteToMemtable()) {
              ColumnFamilyMemTablesImpl column_family_memtables(
                  versions_->GetColumnFamilySet());
              assert(w.sequence == current_sequence);
              w.status = WriteBatchInternal::InsertInto(
                  &w, w.sequence, &column_family_memtables, &flush_scheduler_,
                  write_options.ignore_missing_column_families, 0 /*log_number*/,
                  this, true /*concurrent_memtable_writes*/, seq_per_batch_);
            }
          }
          if (seq_used != nullptr) {
            *seq_used = w.sequence;
          }
        }
      }
      PERF_TIMER_START(write_pre_and_post_process_time);
    
      if (!w.CallbackFailed()) {
        WriteCallbackStatusCheck(status);
      }
    
      if (need_log_sync) {
        mutex_.Lock();
        MarkLogsSynced(logfile_number_, need_log_dir_sync, status);
        mutex_.Unlock();
        // Requesting sync with two_write_queues_ is expected to be very rare. We
        // hance provide a simple implementation that is not necessarily efficient.
        if (two_write_queues_) {
          if (manual_wal_flush_) {
            status = FlushWAL(true);
          } else {
            status = SyncWAL();
          }
        }
      }
    
      bool should_exit_batch_group = true;
      if (in_parallel_group) {
        // CompleteParallelWorker returns true if this thread should
        // handle exit, false means somebody else did
        should_exit_batch_group = write_thread_.CompleteParallelMemTableWriter(&w);
      }
      if (should_exit_batch_group) {
        if (status.ok()) {
          for (auto* writer : write_group) {
            if (!writer->CallbackFailed() && writer->pre_release_callback) {
              assert(writer->sequence != kMaxSequenceNumber);
              Status ws = writer->pre_release_callback->Callback(writer->sequence);
              if (!ws.ok()) {
                status = ws;
                break;
              }
            }
          }
          versions_->SetLastSequence(last_sequence);
        }
        MemTableInsertStatusCheck(w.status);
        write_thread_.ExitAsBatchGroupLeader(write_group, status);
      }
    
      if (status.ok()) {
        status = w.FinalStatus();
      }
      return status;
    }
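
    For context, a minimal usage sketch of the public entry points that funnel into WriteImpl above: a single Put, and an atomic multi-key WriteBatch. The DB path and keys are illustrative.

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_write_demo", &db);
      if (!s.ok()) return 1;

      // Single key: goes through DBImpl::Write -> WriteImpl as a one-entry batch.
      s = db->Put(rocksdb::WriteOptions(), "k1", "v1");

      // Atomic multi-key update: the whole batch gets one slot in the write group.
      rocksdb::WriteBatch batch;
      batch.Put("k2", "v2");
      batch.Delete("k1");
      s = db->Write(rocksdb::WriteOptions(), &batch);

      delete db;
      return s.ok() ? 0 : 1;
    }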
    

    flush(minor-compaction)

    RocksDB is a NoSQL storage engine open-sourced by Facebook. Its design is based on Google's LevelDB, fixes a number of LevelDB's problems, and is claimed to outperform it. The design is so close to LevelDB's that anyone who has read the LevelDB source will have little trouble with RocksDB's: it likewise consists of an in-memory memtable, an LRU cache, SSTables on disk, an operation log, and so on. This series analyzes RocksDB's design, implementation and performance at the source-code level.

    II. Compaction

    The compaction operation

    (Figure: minor compaction)

    In a minor compaction the skiplist (the memtable) is dumped directly to local disk.

    (Figure: major compaction)

    A major compaction is essentially a multi-way merge sort.

    In RocksDB all in-memory data lives in memtables, which come in two kinds: the active memtable and immutable memtables. The active memtable is the one currently accepting writes; once it grows past the threshold (parameter write_buffer_size) it is marked read-only and a new memtable is created for new writes; the read-only one is an immutable memtable. The flush operation is the process of writing immutable memtables out to level 0. Flush works per column family; a column family is a set of SST files, and in MyRocks a table can have its own column family or several tables can share one. A column family may hold one or more immutable memtables, and one flush thread grabs all of them, merges them, and flushes the result to level 0. While that thread is flushing, new writes keep arriving and producing new immutable memtables, and other flush threads can start new flush jobs, so under RocksDB the active-memtable -> immutable-memtable -> SST pipeline is streamed and flushes can run concurrently; compared with LevelDB, concurrent compaction is considerably faster. The parameter max_write_buffer_number caps the total number of memtables; if writes are fast and compaction lags behind, the number of memtables can exceed the cap and cause a write stall. Another parameter, min_write_buffer_number_to_merge, controls how many immutable memtables must exist before a flush is triggered (default 1). The basic flush flow is as follows:
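
    The three parameters mentioned above, as a hedged Options sketch (names from the public options.h; the sizes are illustrative):

    #include <rocksdb/options.h>

    rocksdb::Options MakeFlushTunedOptions() {
      rocksdb::Options options;
      // Size at which the active memtable becomes immutable (read-only).
      options.write_buffer_size = 64 << 20;          // 64 MB
      // Total memtables (active + immutable) per column family; exceeding this
      // while flush lags behind leads to write stalls.
      options.max_write_buffer_number = 4;
      // Minimum number of immutable memtables merged in one flush.
      options.min_write_buffer_number_to_merge = 1;
      return options;
    }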

    Since RocksDB is designed on top of LevelDB and optimizes some of its details, let us first look at LevelDB's basic framework.

    1 L0 compaction

    When the number of L0 files reaches level0_file_num_compaction_trigger, a compaction of L0 into L1 is triggered. Normally all L0 files must be merged into L1, because the key ranges of the L0 files overlap.

    (Figure: compacting L0 into L1)

    Column Family

    Since version 3.0 RocksDB supports column families: every key/value pair can be associated with exactly one column family (the default one is "default"), giving relatively independent, isolated storage. Column families provide a way to partition a database logically, while still supporting atomic writes that span several column families (implemented via WriteBatch).

    All column families share the WAL file (write-ahead log), but each column family has its own MemTable and SSTables.
    Sharing the WAL enables atomic cross-family operations and more efficient group commit.
    Separate MemTables and SSTables make per-family compaction, per-family configuration, and fast removal of an entire column family easier. A usage sketch follows.
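
    A hedged usage sketch of column families through the public API (the column family name, keys and DB path are illustrative): one WriteBatch spanning two families is committed atomically because they share the WAL.

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_cf_demo", &db);
      if (!s.ok()) return 1;

      // Each column family gets its own memtable and SST files.
      rocksdb::ColumnFamilyHandle* cf = nullptr;
      s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "orders", &cf);

      // One WriteBatch can span column families; the shared WAL makes it atomic.
      rocksdb::WriteBatch batch;
      batch.Put(db->DefaultColumnFamily(), "user:1", "alice");
      batch.Put(cf, "order:1", "user:1");
      s = db->Write(rocksdb::WriteOptions(), &batch);

      db->DestroyColumnFamilyHandle(cf);
      delete db;
      return s.ok() ? 0 : 1;
    }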

    References

    • 写优化之JoinBatchGroup
    • leveldb 源码剖析(一)
    • LevelDB 源码解析(二)：主体结构
    • Log Structured Merge Trees (LSM) 原理
    • FlatBuffers 介绍
    • Android 数据库 ObjectBox 源码剖析
    • Column Family - RocksDB 源码分析(1)
    • rocksdb 读/写/空间放大分析

    1. Walk the immutable-memtable list; any memtable not already being flushed by another thread is added to the flush queue.


    2 Compaction of higher levels

    After an L0 compaction finishes, the total size or file count of L1 may exceed its threshold, triggering a compaction from L1 into L2. At least one file is picked from L1 and merged with the L2 files whose key ranges overlap it.

    (Figure: merging L1 into L2)

    In the same way, the merge may in turn trigger a compaction of the next level.

    (Figure: L2 after the merge)

    (Figure: merging L2 into L3)

    The merged L3 may then need compaction as well.

    (Figure: L3 after the merge)

    2. Scan the key-value pairs one by one through an iterator and append them to the current data block.

    As the architecture diagram shows, LevelDB consists of the memtable, the immutable memtable, the WAL log and the SSTables. There is exactly one memtable and one immutable memtable in memory; the immutable memtable is produced when the memtable reaches its size threshold, and the two share the same data structure. LevelDB's implementation details are not covered here; if you are interested, refer to one of the LevelDB source walkthroughs, or to the later chapters of this series, which also help in understanding LevelDB.

    3 Parallel compaction

    (Figure: parallel compaction)

    max_background_compactions limits how many compactions may run in parallel.

    3. If the data block has grown past block_size (16 KB, say), or the current pair is the last one, flush the block.

    RocksDB's improvements over LevelDB include:

    4 L0 subcompaction

    The L0-to-L1 compaction cannot run in parallel with other level compactions, which can become the bottleneck of overall compaction throughput. Setting max_subcompactions splits the L0-to-L1 work into sub-tasks and speeds it up.

    (Figure: subcompaction)

    4. Compress the block with the configured compression algorithm and append an index-block entry (begin_key, last_key, offset) for it.

    Added column families, so related keys can be stored together; the column family design is interesting and will be analyzed separately later.

    5 Choosing which level to compact

    When several levels all satisfy their compaction trigger, RocksDB computes a score for each to decide which one to compact first.

    • For a non-zero level: score = total size of the level's files / the level's size threshold. Files already being compacted are not counted in the total.
    • For L0: score = max{number of files / level0_file_num_compaction_trigger, total size of L0 files / max_bytes_for_level_base}, and only when the number of L0 files is greater than level0_file_num_compaction_trigger.
      The level with the highest score is compacted first; a simplified sketch of this rule follows.
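
    A simplified sketch of that scoring rule (illustrative struct and function names, not the actual VersionStorageInfo code):

    #include <cstdint>

    struct LevelStats {
      int level;
      int num_files;       // files in this level
      uint64_t bytes;      // total size, excluding files already being compacted
      uint64_t max_bytes;  // configured target size for this level (level >= 1)
    };

    // L0 is driven mainly by file count, the other levels by size ratio.
    double CompactionScore(const LevelStats& s,
                           int level0_file_num_compaction_trigger,
                           uint64_t max_bytes_for_level_base) {
      if (s.level == 0) {
        double by_count = static_cast<double>(s.num_files) /
                          level0_file_num_compaction_trigger;
        double by_size  = static_cast<double>(s.bytes) / max_bytes_for_level_base;
        return by_count > by_size ? by_count : by_size;
      }
      return static_cast<double>(s.bytes) / s.max_bytes;
    }
    // The level with the highest score (> 1) is compacted first.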

    5. At this point a number of blocks have been written to the file, each with its own index-block entry.

    Multiple immutable memtables can exist in memory, which avoids the write stalls seen in LevelDB.

    6 Compaction trigger thresholds

    How each level's compaction threshold is set depends on level_compaction_dynamic_level_bytes.

    6. Write the index block, meta block, metaindex block and the footer at the end of the file.

    Supports multi-threaded compaction; in theory compacting with several threads is faster than with one.

    When level_compaction_dynamic_level_bytes is false

    The L1 trigger threshold is max_bytes_for_level_base.
    The thresholds of the levels below it follow the formula: Target_Size(Ln+1) = Target_Size(Ln) * max_bytes_for_level_multiplier * max_bytes_for_level_multiplier_additional[n].

    For example, with:
    max_bytes_for_level_base = 16384
    max_bytes_for_level_multiplier = 10
    max_bytes_for_level_multiplier_additional = 1
    the trigger thresholds of L1, L2, L3 and L4 are 16384, 163840, 1638400 and 16384000 respectively.

    7. Write the metadata of the newly generated SST file into the manifest file.

    Added the merge operator, essentially an in-place update, which makes read-modify-write operations cheaper; a sketch follows.
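
    A hedged sketch of a counter built on AssociativeMergeOperator, a public RocksDB base class; the class name and the decimal-string encoding are illustrative choices:

    #include <rocksdb/db.h>
    #include <rocksdb/merge_operator.h>
    #include <cstdint>
    #include <string>

    // A 64-bit counter: Merge("cnt", "+3") adds 3 without a client-side
    // read-modify-write; RocksDB folds operands during reads and compactions.
    class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& value, std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        int64_t base = existing_value ? std::stoll(existing_value->ToString()) : 0;
        int64_t delta = std::stoll(value.ToString());
        *new_value = std::to_string(base + delta);
        return true;
      }
      const char* Name() const override { return "CounterMergeOperator"; }
    };

    // Usage sketch:
    //   options.merge_operator = std::make_shared<CounterMergeOperator>();
    //   db->Merge(rocksdb::WriteOptions(), "cnt", "3");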

    When level_compaction_dynamic_level_bytes is true

    The size of the last level is taken as given (it holds the bulk of the data).
    The thresholds of the levels above it follow the formula: Target_Size(Ln-1) = Target_Size(Ln) / max_bytes_for_level_multiplier.
    If the computed value is smaller than max_bytes_for_level_base / max_bytes_for_level_multiplier, that level is kept empty, and an L0 compaction merges directly into the first level with a valid threshold.
    For example, with:
    max_bytes_for_level_base = 1G
    num_levels = 6
    level 6 size = 276G
    the trigger thresholds of L1 through L6 are 0, 0, 0.276G, 2.76G, 27.6G and 276G respectively.

    This allocation keeps the LSM-tree shape stable, with roughly 90% of the data stored in the last level and 9% in the level above it.



    Reference: the official RocksDB wiki

    A flush is essentially one ordered traversal of the records in the memtable, during which redundant records are dropped, and the result is written to an SST file block by block, compressing each block or not according to the compression policy. Why are there redundant records? Because in RocksDB every write, whether insert, update or delete, is appended to the memtable. For example, performing insert(1), update(1), delete(1) on key=1 produces three separate records (whereas InnoDB updates the same key in place and keeps a single record). After the delete the record should no longer exist, so the merge can discard the redundant insert(1) and update(1), which already makes the SST flushed to level 0 fairly compact. The main kinds of redundancy are listed below, where (user_key, op) denotes an operation such as put or delete on user_key.

    Supports DB-level TTL.

    1. For (user_key, put) followed by (user_key, delete), the put can be dropped.

    Flush and compaction are scheduled by separate thread pools with different priorities, flush ahead of compaction, which speeds up flushes and avoids stalls.

    2. For (user_key, single-delete) paired with (user_key, put): single-delete guarantees that the put and the delete appear as a pair, so both records can be dropped together.

    Optimized for SSD storage, and can run in a pure in-memory mode.

    3. For (user_key, put1), (user_key, put2), (user_key, put3), the older puts can be dropped.

    Added a management mechanism for the write-ahead log (WAL), making the WAL (which plays the role of a binlog) easier to manage.

    In all three cases snapshots must be respected: if a key scheduled for removal is still visible to some snapshot, it cannot be removed. Note that in case 1 the (user_key, delete) record itself cannot be dropped: to the user the key no longer exists, but because of the LSM-tree layout older versions of user_key may still live in level 0, level 1 or level N, so the (user_key, delete) marker has to survive until the compaction that reaches the bottom level finally discards it. Case 2 is different: single-delete is a special delete that guarantees put and delete come as a pair, so at flush time both records can be dropped together. A sketch of the user-visible semantics follows.
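
    A hedged sketch of those semantics through the public API (Delete, SingleDelete, Get); the DB path and keys are illustrative:

    #include <cassert>
    #include <rocksdb/db.h>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/rocksdb_del_demo", &db).ok()) return 1;

      std::string v;
      // Ordinary delete: the tombstone must survive until the bottom level,
      // because older versions of "k" may still live in deeper levels.
      db->Put(rocksdb::WriteOptions(), "k", "v1");
      db->Delete(rocksdb::WriteOptions(), "k");
      assert(db->Get(rocksdb::ReadOptions(), "k", &v).IsNotFound());

      // SingleDelete: only valid if "k2" was Put exactly once and never
      // overwritten; the Put/SingleDelete pair can then be dropped at flush time.
      db->Put(rocksdb::WriteOptions(), "k2", "v2");
      db->SingleDelete(rocksdb::WriteOptions(), "k2");
      assert(db->Get(rocksdb::ReadOptions(), "k2", &v).IsNotFound());

      delete db;
      return 0;
    }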

    The above is only a brief summary; more details require further analysis. RocksDB's core framework is shown in the figure below.

    compaction(major-compaction)

    (Figure: RocksDB core architecture)

    What we usually call compaction is major compaction: merging SST files from a lower level into a higher level. The process resembles flush: an iterator merges the keys of several SST files and writes out new SST files. Flush is triggered when the number of immutable memtables exceeds min_write_buffer_number_to_merge; compaction has two kinds of trigger, file count and file size. For level 0 the trigger is the number of SST files, controlled by level0_file_num_compaction_trigger, and the score is the ratio of the current file count to that parameter. For level 1 through level N the trigger is the total size of SST files: max_bytes_for_level_base and max_bytes_for_level_multiplier set the maximum capacity of each level, and the score is the ratio of the level's current size to its maximum capacity. RocksDB maintains a task queue of compaction work: whenever a level satisfies its compaction condition it is added to the queue, and background threads take tasks off the queue and execute them. The main steps of a compaction are:

    As figures 1-1 and 1-2 show, the LevelDB and RocksDB frameworks are very similar. RocksDB supports column families since version 3.0, so we look at the RocksDB framework from the column family perspective.

    1. Find the level with the highest score; if its score > 1, compact starting from that level.

    (Figure: RocksDB architecture from the column family perspective)

    2. Following the configured policy, pick one SST file of that level to compact; for level 0 there may be several, because the (minkey, maxkey) ranges of level-0 files overlap.

    Each column family has its own memtables and SSTables, so every column family can be configured independently; all column families share one WAL file, which guarantees atomicity for writes that span column families.

    3. From the files picked from the level, compute their overall (minkey, maxkey) range.


    4. From level+1, pick the SST files whose key ranges overlap (minkey, maxkey).

    5. Merge-sort the selected SST files and write the merged result out to new SST files.

    Part II

    6. Compress the output SST files according to the compression policy.

    Some background knowledge helps before reading the RocksDB source; if you are already familiar with LevelDB's architecture you can mostly skip this part and move on to the later chapters.

    7. When the merge finishes, update the VersionSet with a VersionEdit and update the statistics.

    The steps above outline the compaction flow: in short, some SST files of a level are merged with the overlapping SST files of level+1, and the merged result is written into level+1. Whether a level needs compaction is decided by checking whether its score exceeds 1. Several policies exist for choosing which SST file of the level to compact; the default prefers larger files that contain many delete markers, since merging them early reclaims space sooner (see the CompactionPri definitions in options.h). Each compaction picks one file from the level to merge downwards, but because level-0 files can overlap, a single compaction may involve several level-0 files. RocksDB usually runs several background compaction threads; they keep taking tasks off the queue and keep checking whether any level needs compaction, so overall the process is concurrent, under the basic rule that concurrent tasks must not touch overlapping key ranges. For level 0, where several SST files can overlap, the key range covered by the participating level-0 and level-1 files is partitioned into several sub-tasks that run concurrently, and the compaction finishes when all sub-tasks do. One more point: a compaction does not always have to merge; if the input SST files do not overlap anything in level+1, they can simply be moved down into level+1.

    Background knowledge

    1. Byte order

    Universal Compaction

    Like LevelDB, RocksDB stores numbers little-endian: the functions that encode int32 and int64 into char* write the low-order bytes first.

    The compaction style described so far is level compaction; RocksDB also has another style called universal compaction. In universal mode all SST files may have overlapping key ranges. For sorted runs R1, R2, R3, ..., Rn, each R is one SST file, R1 holds the newest data and Rn the oldest. The precondition for any merge is that the number of SST files exceeds level0_file_num_compaction_trigger; below that threshold no compaction is triggered. Once the precondition holds, the following merges are tried in priority order.

    2. Varint

    1. If space amplification exceeds a configured percentage, all SST files are compacted in one go, a so-called full compaction, controlled by max_size_amplification_percent.

    To save space, int32 and int64 values are stored in variable-length form (varint) when encoded into strings. The encoding is essentially the same as in protocol buffers: each byte carries 7 payload bits, and the high bit says whether another byte follows (1 means more bytes follow, 0 means this is the last). See the Encodexxx and Decodexxx family of functions; a sketch of the encoding follows.
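
    A simplified sketch of that encoding (in the spirit of EncodeVarint32/GetVarint32 in util/coding, but not the actual RocksDB code):

    #include <cstdint>
    #include <string>

    // Each output byte carries 7 payload bits; the high bit says "more follows".
    void PutVarint32(std::string* dst, uint32_t v) {
      while (v >= 0x80) {
        dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
        v >>= 7;
      }
      dst->push_back(static_cast<char>(v));
    }

    // Returns bytes consumed and writes the decoded value to *v (0 on error).
    size_t GetVarint32(const char* p, uint32_t* v) {
      uint32_t result = 0;
      for (size_t i = 0, shift = 0; shift <= 28; ++i, shift += 7) {
        uint32_t byte = static_cast<unsigned char>(p[i]);
        result |= (byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) { *v = result; return i + 1; }
      }
      return 0;  // malformed input
    }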

    2. If size(R1) and size(R2) are within the configured ratio of each other (1% by default), R1 and R2 are compacted together; if (R1+R2)*(100+ratio)% is still no smaller than size(R3), R3 joins the task as well, and so on, pulling in SST files in order.

    3. Basic data structures

    3. If neither case 1 nor case 2 triggers a compaction, the first N files are forcibly merged.

    3.1 Slice

    Compared with level compaction, universal compaction merges more files at a time and avoids level compaction's multi-level merging, so its write amplification is smaller; the price is larger space amplification. Besides level compaction and universal compaction, RocksDB also supports a FIFO compaction. FIFO, as the name suggests, periodically drops old data: all files stay in level 0, and once the total SST size exceeds max_table_files_size the oldest SST files are deleted. Compaction is the heart of the LSM-tree data structure and of RocksDB; this article has walked through the basic flow of the different compaction styles, but many details were left out, and interested readers can use it as a starting point for reading the source to deepen their understanding. A configuration sketch for the universal and FIFO styles follows.
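
    A hedged configuration sketch for the two styles (option names from the public options.h; the values are illustrative):

    #include <rocksdb/options.h>

    rocksdb::Options MakeUniversalOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      // Trigger: at least this many sorted runs before any merge is considered.
      options.level0_file_num_compaction_trigger = 4;
      // Rule 1: full compaction once space amplification exceeds this percentage.
      options.compaction_options_universal.max_size_amplification_percent = 200;
      // Rule 2: merge adjacent runs whose sizes are within this ratio (percent).
      options.compaction_options_universal.size_ratio = 1;
      return options;
    }

    rocksdb::Options MakeFifoOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleFIFO;
      // Oldest SST files are dropped once the total size exceeds this limit.
      options.compaction_options_fifo.max_table_files_size = 10ull << 30;  // 10 GB
      return options;
    }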

    Slice is RocksDB's basic data structure: a length plus a pointer to externally owned storage. It is binary safe (the data may contain '\0') and provides helpers to convert to and from std::string and char*.

    Appendix

    3.2 Status

    Related files:

    Status is RocksDB's status class; it packs the error code together with the error message. Again to save space, the Status class stores the return code, the error message and its length in a single character array.

    rocksdb/db/flush_job.cc 

    The format is:

    include/rocksdb/universal_compaction.h

    state_[0..3] == length of the message

    rocksdb/db/compaction_job.cc

    state_[4]    == error code

    db/compaction_picker.cc

    state_[5..]  == the message text
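
    A simplified sketch of that packing (an illustrative MiniStatus class, not the actual rocksdb::Status code):

    #include <cstdint>
    #include <cstring>
    #include <string>

    class MiniStatus {
     public:
      MiniStatus() : state_(nullptr) {}                  // OK status: no allocation
      MiniStatus(unsigned char code, const std::string& msg) {
        const uint32_t len = static_cast<uint32_t>(msg.size());
        char* buf = new char[5 + len];
        std::memcpy(buf, &len, 4);                       // state_[0..3]: message length
        buf[4] = static_cast<char>(code);                // state_[4]:    error code
        std::memcpy(buf + 5, msg.data(), len);           // state_[5..]:  message bytes
        state_ = buf;
      }
      MiniStatus(const MiniStatus&) = delete;
      MiniStatus& operator=(const MiniStatus&) = delete;
      ~MiniStatus() { delete[] state_; }
      bool ok() const { return state_ == nullptr; }
     private:
      const char* state_;
    };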

    rocksdb/table/block_based_table_builder.cc

    3.3 Arena

    Related interfaces:

    Arena is RocksDB's simple memory pool.

    FlushMemTableToOutputFile // flush a memtable to level 0

    When memory is requested, the newly allocated blocks are appended to std::vector blocks_; when the Arena's lifetime ends, all of the allocated memory is released in one go. The internal layout is shown in figure 1-3.
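
    A minimal sketch in the same spirit (an illustrative MiniArena class, not the actual Arena code): allocations are bumped out of fixed-size blocks, and all blocks are released together when the arena is destroyed.

    #include <cstddef>
    #include <vector>

    class MiniArena {
     public:
      static constexpr size_t kBlockSize = 4096;
      char* Allocate(size_t bytes) {
        if (bytes > remaining_) {
          size_t block = bytes > kBlockSize ? bytes : kBlockSize;
          blocks_.emplace_back(block);       // grab a fresh block
          ptr_ = blocks_.back().data();
          remaining_ = block;
        }
        char* result = ptr_;                 // bump-pointer allocation
        ptr_ += bytes;
        remaining_ -= bytes;
        return result;
      }
     private:
      std::vector<std::vector<char>> blocks_;  // all blocks freed with the arena
      char* ptr_ = nullptr;
      size_t remaining_ = 0;
    };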

    FlushJob::Run  // the flush-memtable job

    (Figure 1-3: internal structure of the Arena)

    PickMemtablesToFlush // pick the immutable memtables that can be flushed

    In addition, RocksDB can also use tcmalloc or jemalloc, which may improve performance considerably.

    WriteLevel0Table // write the SST file to level 0

    3.4 memtable

    BuildTable // actually builds the SST file

    In LevelDB the memtable's in-memory data structure is a skiplist; in RocksDB the memtable can be a skiplist, a hash-skiplist or a hash-linklist: in a hash-skiplist each hash bucket holds a skiplist, in a hash-linklist each bucket holds a linked list, and the representation is chosen in the configuration. The skiplist structure is shown below.

    UniversalCompactionPicker::NeedsCompaction // does compaction need to run

    (Figure: skiplist structure)

    PickCompaction // pick the SST files that need compaction

    The hash-skiplist structure:

    PickCompactionUniversalReadAmp // pick adjacent SST files to merge

    (Figure: hash-skiplist structure)

    NeedsCompaction // decide whether a level needs compaction

    The hash-linklist structure:

    LevelCompactionPicker::PickCompaction // pick the SST files of a level to compact

    (Figure: hash-linklist structure)

    LevelCompactionPicker::PickCompactionBySize

    3.5 Cache

    IsTrivialMove // whether a file can simply be moved to the deeper level (no overlap)

    RocksDB implements a standard LRUCache based on a doubly linked list. Because the design is a generic, classic one, it is worth going through in detail, looking at the building blocks from the smallest up.

    ShouldFormSubcompactions  // decide whether the compaction can be split into sub-tasks

    A. The LRUHandle struct is the smallest-granularity element in the cache, representing one key/value pair. Its full definition is:

    CompactionJob::Prepare    // split the compaction into sub-tasks

    CompactionJob::Run()      // does the actual compaction work

    BlockBasedTableBuilder::Finish  // finalizes and writes out the SST file

    struct LRUHandle {
      void* value;                                 // the stored value
      void (*deleter)(const Slice&, void* value);  // callback invoked when the entry is removed
      LRUHandle* next_hash;  // chaining used to resolve hash collisions
      LRUHandle* next;       // next/prev form the doubly linked list used by the LRU algorithm
      LRUHandle* prev;
      size_t charge;      // TODO(opt): Only allow uint32_t?
      size_t key_length;  // length of the key
      uint32_t refs;      // a number of refs to this entry
                          // cache itself is counted as 1
      bool in_cache;      // true, if this entry is referenced by the hash table
      uint32_t hash;      // Hash of key(); used for fast sharding and comparisons
      char key_data[1];   // Beginning of key

      Slice key() const {
        // For cheaper lookups, we allow a temporary Handle object
        // to store a pointer to a key in "value".
        if (next == this) {
          return *(reinterpret_cast<Slice*>(value));
        } else {
          return Slice(key_data, key_length);
        }
      }

      void Free() {
        assert((refs == 1 && in_cache) || (refs == 0 && !in_cache));
        (*deleter)(key(), value);
        free(this);
      }
    };

    B. HandleTable is RocksDB's own hash table, claimed to be considerably faster than the hash table shipped with g++ 4.4.3.

    class HandleTable {
     public:
      HandleTable() : length_(0), elems_(0), list_(nullptr) { Resize(); }

      template <typename T>
      void ApplyToAllCacheEntries(T func) {
        for (uint32_t i = 0; i < length_; i++) {
          LRUHandle* h = list_[i];
          while (h != nullptr) {
            auto n = h->next_hash;
            assert(h->in_cache);
            func(h);
            h = n;
          }
        }
      }

      ~HandleTable() {
        ApplyToAllCacheEntries([](LRUHandle* h) {
          if (h->refs == 1) {
            h->Free();
          }
        });
        delete[] list_;
      }

      LRUHandle* Lookup(const Slice& key, uint32_t hash) {
        return *FindPointer(key, hash);
      }

      LRUHandle* Insert(LRUHandle* h) {
        LRUHandle** ptr = FindPointer(h->key(), h->hash);
        LRUHandle* old = *ptr;
        h->next_hash = (old == nullptr ? nullptr : old->next_hash);
        *ptr = h;
        if (old == nullptr) {
          ++elems_;
          if (elems_ > length_) {
            // Since each cache entry is fairly large, we aim for a small
            // average linked list length (<= 1).
            Resize();
          }
        }
        return old;
      }

      LRUHandle* Remove(const Slice& key, uint32_t hash) {
        LRUHandle** ptr = FindPointer(key, hash);
        LRUHandle* result = *ptr;
        if (result != nullptr) {
          *ptr = result->next_hash;
          --elems_;
        }
        return result;
      }

     private:
      // The table consists of an array of buckets where each bucket is
      // a linked list of cache entries that hash into the bucket.
      uint32_t length_;
      uint32_t elems_;
      LRUHandle** list_;

      // Return a pointer to slot that points to a cache entry that
      // matches key/hash.  If there is no such cache entry, return a
      // pointer to the trailing slot in the corresponding linked list.
      LRUHandle** FindPointer(const Slice& key, uint32_t hash) {
        LRUHandle** ptr = &list_[hash & (length_ - 1)];
        while (*ptr != nullptr &&
               ((*ptr)->hash != hash || key != (*ptr)->key())) {
          ptr = &(*ptr)->next_hash;
        }
        return ptr;
      }

      void Resize() {
        uint32_t new_length = 16;
        while (new_length < elems_ * 1.5) {
          new_length *= 2;
        }
        LRUHandle** new_list = new LRUHandle*[new_length];
        memset(new_list, 0, sizeof(new_list[0]) * new_length);
        uint32_t count = 0;
        for (uint32_t i = 0; i < length_; i++) {
          LRUHandle* h = list_[i];
          while (h != nullptr) {
            LRUHandle* next = h->next_hash;
            uint32_t hash = h->hash;
            LRUHandle** ptr = &new_list[hash & (new_length - 1)];
            h->next_hash = *ptr;
            *ptr = h;
            h = next;
            count++;
          }
        }
        assert(elems_ == count);
        delete[] list_;
        list_ = new_list;
        length_ = new_length;
      }
    };

    The layout of HandleTable is simple: an array of hash slots, with chaining used to resolve hash collisions.

    (Figure: HandleTable buckets with chained entries)

    C. LRUCache

    LRUCache is built from LRUHandle and HandleTable, and it takes a lock internally, so it is safe to use from multiple threads.

    The HandleTable part is easy to understand: it hashes the cached entries so lookups are fast.

    LRUHandle lru_ is a dummy node, the head of the doubly linked list that keeps the LRU order: the head side holds the entries that entered the cache earliest and the tail side holds the most recent ones, so when the cache is full, eviction starts from the head.

    class LRUCache {
     public:
      LRUCache();
      ~LRUCache();

      // Separate from constructor so caller can easily make an array of LRUCache
      // if current usage is more than new capacity, the function will attempt to
      // free the needed space
      void SetCapacity(size_t capacity);

      // Like Cache methods, but with an extra "hash" parameter.
      Cache::Handle* Insert(const Slice& key, uint32_t hash,
                            void* value, size_t charge,
                            void (*deleter)(const Slice& key, void* value));
      Cache::Handle* Lookup(const Slice& key, uint32_t hash);
      void Release(Cache::Handle* handle);
      void Erase(const Slice& key, uint32_t hash);

      // Although in some platforms the update of size_t is atomic, to make sure
      // GetUsage() and GetPinnedUsage() work correctly under any platform, we'll
      // protect them with mutex_.
      size_t GetUsage() const {
        MutexLock l(&mutex_);
        return usage_;
      }

      size_t GetPinnedUsage() const {
        MutexLock l(&mutex_);
        assert(usage_ >= lru_usage_);
        return usage_ - lru_usage_;
      }

      void ApplyToAllCacheEntries(void (*callback)(void*, size_t),
                                  bool thread_safe);

     private:
      void LRU_Remove(LRUHandle* e);
      void LRU_Append(LRUHandle* e);

      // Just reduce the reference count by 1.
      // Return true if last reference
      bool Unref(LRUHandle* e);

      // Free some space following strict LRU policy until enough space
      // to hold (usage_ + charge) is freed or the lru list is empty
      // This function is not thread safe - it needs to be executed while
      // holding the mutex_
      void EvictFromLRU(size_t charge, autovector<LRUHandle*>* deleted);

      // Initialized before use.
      size_t capacity_;

      // Memory size for entries residing in the cache
      size_t usage_;

      // Memory size for entries residing only in the LRU list
      size_t lru_usage_;

      // mutex_ protects the following state.
      // We don't count mutex_ as the cache's internal state so semantically we
      // don't mind mutex_ invoking the non-const actions.
      mutable port::Mutex mutex_;

      // Dummy head of LRU list.
      // lru.prev is newest entry, lru.next is oldest entry.
      // LRU contains items which can be evicted, ie reference only by cache
      LRUHandle lru_;

      HandleTable table_;
    };

    At this point the pieces add up to a complete, standard LRUCache. Even more interesting, RocksDB then implements ShardedLRUCache, a thin wrapper that shards the cache: in multi-threaded use, keys are hashed onto different LRUCache shards, reducing lock contention and improving performance. The essence is this one line:

    LRUCache shard_[kNumShards]
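
    A sketch of the sharding idea (kNumShardBits and the hash here are illustrative; ShardedLRUCache itself hashes the key once and uses the top bits to pick a shard, so threads mostly contend on different shard mutexes):

    #include <cstdint>
    #include <functional>
    #include <string>

    constexpr int kNumShardBits = 4;
    constexpr int kNumShards = 1 << kNumShardBits;

    inline uint32_t HashSlice(const std::string& key) {
      return static_cast<uint32_t>(std::hash<std::string>{}(key));
    }

    inline int ShardFor(const std::string& key) {
      return HashSlice(key) >> (32 - kNumShardBits);  // top bits select the shard
    }

    // Usage sketch: shard_[ShardFor(key)].Insert(key, hash, ...), where each
    // shard_ element is an independent LRUCache with its own mutex.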

    D. Another very useful piece is Env. Different Env implementations are derived for different platforms, providing all the system-level facilities. It is quite powerful and well worth borrowing from if you are building cross-platform software. The concrete Env implementations are not listed here (there are too many); for the other utility classes, see the corresponding implementations under src.



