Earlier this year I wrote a post about using failpoint in Go. Take a look at it if you are interested.
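Failpoints is a tool for injecting errors or other behavior at runtime. It is mainly used in testing, including unit tests, integration tests and stress tests, to exercise things like state machine errors, disk errors and network I/O latency.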
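Injectable behaviors include panic, early return, sleep and so on, and they can be controlled through environment variables or code. In practice it is usually better to drive them over HTTP or through your company's configuration platform. A trigger rule can be a hit count, a probability, or a combination of the two.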
Getting started

First, add the dependency to Cargo.toml:
```toml
[dependencies]
fail = "0.4"
```
We depend on version 0.4.
```rust
use fail::{fail_point, FailScenario};

fn do_fallible_work() {
    fail_point!("read-dir");
    println!("mock working now");
}

fn main() {
    let scenario = FailScenario::setup();
    do_fallible_work();
    scenario.teardown();
    println!("done");
}
```
The do_fallible_work function does just two things: it evaluates the read-dir fail point, then prints a message to simulate handling a request.
```shell
$ FAILPOINTS=read-dir="panic" cargo run
mock working now
done
```
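Here a panic is injected through the environment variable, but the failpoints feature is behind conditional compilation and is off by default, so the program prints its normal output.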
```shell
$ FAILPOINTS=read-dir="panic" cargo run --features fail/failpoints
mock working now
thread 'main' panicked at 'failpoint read-dir panic', /Users/zerun.dong/.cargo/registry/src/github.com-1ecc6299db9ec823/fail-0.4.0/src/lib.rs:488:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
With --features fail/failpoints passed to cargo, the panic fires as expected.
```shell
$ FAILPOINTS=read-dir="sleep(2000)" cargo run --features fail/failpoints
```
Other behaviors can be specified as well. For example, sleep(2000) sleeps for 2 seconds.
```rust
use fail::{fail_point, FailScenario};
use std::io;

fn do_fallible_work() -> io::Result<()> {
    println!("mock working now");
    fail_point!("read-dir", |_| {
        Err(io::Error::new(io::ErrorKind::PermissionDenied, "error"))
    });
    Ok(())
}

fn main() -> io::Result<()> {
    let scenario = FailScenario::setup();
    do_fallible_work()?;
    do_fallible_work()?;
    scenario.teardown();
    println!("done");
    Ok(())
}
```
This example tests an early return. The error to return has to be wrapped in a closure.
```shell
$ FAILPOINTS=read-dir=return cargo run --features fail/failpoints
mock working now
Error: Custom { kind: PermissionDenied, error: "error" }
```
That covers basic usage. Multiple actions can also be chained.
```shell
$ FAILPOINTS=read-dir="1*sleep(2000)->return" cargo run --features fail/failpoints
mock working now
mock working now
Error: Custom { kind: PermissionDenied, error: "error" }
```
"1*sleep(2000)->return" means sleep for 2 seconds on the first hit, then return early on the second. For more advanced usage, see the documentation at https://docs.rs/fail
Zero performance cost

The most important requirement is that code instrumented with failpoints has zero performance cost when it runs in production.
```go
func test() {
    failpoint.Inject("testValue", func(v failpoint.Value) {
        fmt.Println(v)
    })
}
```
This is Go test code. failpoint.Inject is a marker function whose arguments are a name and a closure.
```go
func Inject(fpname string, fpbody interface{}) {}
```
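Because Inject has an empty body, the call is optimized away at compile time, so there is zero cost at runtime. For offline testing, you run failpoint-ctl to rewrite every marker function into an injection function.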
```go
func test() {
    if v, _err_ := failpoint.Eval(_curpkg_("testValue")); _err_ == nil {
        fmt.Println(v)
    }
}
```
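That is the rewritten code. The principle is simple: parse the source into an AST and replace nodes in the syntax tree. So how does Rust do it? The answer is macros plus conditional compilation.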
```rust
#[macro_export]
#[cfg(not(feature = "failpoints"))]
macro_rules! fail_point {
    ($name:expr, $e:expr) => {{}};
    ($name:expr) => {{}};
    ($name:expr, $cond:expr, $e:expr) => {{}};
}
```
When cargo build runs without the failpoints feature, the fail_point macro expands to an empty block.
```rust
#[cfg(feature = "failpoints")]
macro_rules! fail_point {
    ($name:expr) => {{
        $crate::eval($name, |_| {
            panic!("Return is not supported for the fail point \"{}\"", $name);
        });
    }};
    ($name:expr, $e:expr) => {{
        if let Some(res) = $crate::eval($name, $e) {
            return res;
        }
    }};
    ($name:expr, $cond:expr, $e:expr) => {{
        if $cond {
            fail_point!($name, $e);
        }
    }};
}
```
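When the feature is enabled, the macro above is used instead and expands at compile time into the corresponding logic. fail_point has three forms. Each pattern matches a different list of arguments (designators, also called fragment specifiers) and expands to a different block: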
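- One argument, the name string: can run actions such as panic, print, sleep and pause.
- Two arguments, name and e, where e is a closure: makes an early return possible.
- Three arguments, name, cond and e, where cond is a condition expression that should evaluate to a bool and e is a closure: the injection only happens when the condition holds.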
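The macro plus conditional compilation idea can be tried in a single file. The following is a minimal sketch, not the crate's code: `my_fail_point!` and the toy `eval` are hypothetical names, and the rule "only read-dir triggers" is an assumption made for illustration.

```rust
// Hypothetical stand-in for the crate's `eval`. It is only compiled when the
// (assumed) `failpoints` feature cfg is set.
#[cfg(feature = "failpoints")]
fn eval<R>(name: &str, f: impl FnOnce(Option<String>) -> R) -> Option<R> {
    // Toy rule: only the fail point named "read-dir" triggers an early return.
    if name == "read-dir" {
        Some(f(None))
    } else {
        None
    }
}

#[cfg(feature = "failpoints")]
macro_rules! my_fail_point {
    ($name:expr, $e:expr) => {{
        if let Some(res) = eval($name, $e) {
            return res;
        }
    }};
}

#[cfg(not(feature = "failpoints"))]
macro_rules! my_fail_point {
    // Expands to an empty block, so the closure is never even compiled.
    ($name:expr, $e:expr) => {{}};
}

fn do_work() -> Result<&'static str, String> {
    my_fail_point!("read-dir", |_| Err("injected".to_string()));
    Ok("real work done")
}

fn main() {
    println!("{:?}", do_work());
}
```

Built normally, `do_work` returns `Ok("real work done")`. Compiling with `rustc --cfg 'feature="failpoints"'` selects the other macro definition, and the same call returns the injected error instead.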
 
How it works

1. Registry

```rust
type Registry = HashMap<String, Arc<FailPoint>>;

#[derive(Debug, Default)]
struct FailPointRegistry {
    registry: RwLock<Registry>,
}

lazy_static::lazy_static! {
    static ref REGISTRY: FailPointRegistry = FailPointRegistry::default();
    static ref SCENARIO: Mutex<&'static FailPointRegistry> = Mutex::new(&REGISTRY);
}
```
The Registry is a HashMap. The key is the name used in the examples above, and the value is an Arc<FailPoint>. Arc provides shared ownership across threads.
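As a rough illustration of that layout, here is a std-only sketch of a lazily initialized global registry. It uses `OnceLock` in place of `lazy_static`, and `Point`, `register` and `lookup` are hypothetical names rather than the crate's API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// Hypothetical stand-in for the crate's `FailPoint`.
#[derive(Debug)]
struct Point {
    actions: String,
}

type Registry = HashMap<String, Arc<Point>>;

/// Process-wide registry, created on first use.
fn registry() -> &'static RwLock<Registry> {
    static REGISTRY: OnceLock<RwLock<Registry>> = OnceLock::new();
    REGISTRY.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Inserts or replaces the entry for `name` under the write lock.
fn register(name: &str, actions: &str) {
    let point = Arc::new(Point { actions: actions.to_string() });
    registry().write().unwrap().insert(name.to_string(), point);
}

/// Clones the `Arc` so the read lock is released before the point is used.
fn lookup(name: &str) -> Option<Arc<Point>> {
    registry().read().unwrap().get(name).cloned()
}

fn main() {
    register("read-dir", "panic");
    println!("{:?}", lookup("read-dir").map(|p| p.actions.clone()));
    println!("{:?}", lookup("missing").map(|p| p.actions.clone()));
}
```

Cloning the `Arc` keeps the lock held only for the map lookup, so a slow action such as a sleep does not block other threads from reading the registry.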
```rust
struct FailPoint {
    pause: Mutex<bool>,
    pause_notifier: Condvar,
    actions: RwLock<Vec<Action>>,
    actions_str: RwLock<String>,
}
```
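pause records whether the fail point is paused, and pause_notifier is used to wake a paused thread. actions is a Vec because a single fail point can carry several actions. actions_str is the string form of those actions, which from_str turns into Action structs.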
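The pause and pause_notifier pair follows the usual Mutex plus Condvar pattern. A minimal sketch, assuming a hypothetical `PausePoint` type with `pause` and `resume` methods (these names are not part of the crate):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical reduction of the `pause` / `pause_notifier` fields.
struct PausePoint {
    paused: Mutex<bool>,
    notifier: Condvar,
}

impl PausePoint {
    fn new() -> Self {
        PausePoint { paused: Mutex::new(false), notifier: Condvar::new() }
    }

    /// Blocks the caller until `resume` clears the flag.
    fn pause(&self) {
        let mut guard = self.paused.lock().unwrap();
        *guard = true;
        // Loop so that spurious wake-ups do not release the thread early.
        while *guard {
            guard = self.notifier.wait(guard).unwrap();
        }
    }

    /// Clears the flag and wakes every waiting thread.
    fn resume(&self) {
        *self.paused.lock().unwrap() = false;
        self.notifier.notify_all();
    }

    fn is_paused(&self) -> bool {
        *self.paused.lock().unwrap()
    }
}

/// Parks a worker thread on the pause point, then releases it.
fn pause_and_resume() -> &'static str {
    let point = Arc::new(PausePoint::new());
    let worker = {
        let point = Arc::clone(&point);
        thread::spawn(move || {
            point.pause();
            "worker resumed"
        })
    };
    // Wait until the worker is parked before releasing it.
    while !point.is_paused() {
        thread::yield_now();
    }
    point.resume();
    worker.join().unwrap()
}

fn main() {
    println!("{}", pause_and_resume());
}
```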
2. Building tasks

FailScenario::setup() reads the FAILPOINTS environment variable to initialize the injected actions. Configuration over HTTP is not supported for now.
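To give a feel for that first parsing step, here is a hedged sketch of splitting a FAILPOINTS-style string into name and action pairs. It assumes entries are separated by `;` and written as `name=actions`, which is how I understand the crate's documented format. `parse_failpoints` is a hypothetical helper, not the crate's function.

```rust
/// Splits a `FAILPOINTS`-style string into `(name, actions)` pairs.
/// Hypothetical helper: errors are reduced to a plain `String`.
fn parse_failpoints(spec: &str) -> Result<Vec<(String, String)>, String> {
    let mut pairs = Vec::new();
    for entry in spec.trim().split(';') {
        let entry = entry.trim();
        if entry.is_empty() {
            continue; // tolerate a trailing `;`
        }
        match entry.split_once('=') {
            Some((name, actions)) if !name.is_empty() && !actions.is_empty() => {
                pairs.push((name.to_string(), actions.to_string()));
            }
            _ => return Err(format!("invalid failpoint entry: {:?}", entry)),
        }
    }
    Ok(pairs)
}

fn main() {
    // In a real program the string would come from std::env::var("FAILPOINTS").
    let pairs = parse_failpoints("read-dir=1*sleep(2000)->return;write=panic");
    println!("{:?}", pairs);
}
```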
After the variable is parsed, the set function parses each fail point's actions and registers them in the Registry described above.
```rust
fn set(
    registry: &mut HashMap<String, Arc<FailPoint>>,
    name: String,
    actions: &str,
) -> Result<(), String> {
    let actions_str = actions;
    // Split the chain on "->" and parse each term; the first error aborts.
    let actions = actions
        .split("->")
        .map(Action::from_str)
        .collect::<Result<_, _>>()?;
    // Reuse the existing fail point for `name`, or create one on first use.
    let p = registry
        .entry(name)
        .or_insert_with(|| Arc::new(FailPoint::new()));
    p.set_actions(actions_str, actions);
    Ok(())
}
```
Here Action::from_str parses each string into an Action.
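```rust
#[derive(Clone, Debug, PartialEq)]
enum Task {
    // Do nothing.
    Off,
    // Return early, optionally with a value.
    Return(Option<String>),
    // Sleep for the given number of milliseconds.
    Sleep(u64),
    // Panic, optionally with a message.
    Panic(Option<String>),
    // Log a message.
    Print(Option<String>),
    // Block until the fail point is reconfigured.
    Pause,
    // Yield the current thread.
    Yield,
    // Busy-wait for the given number of milliseconds.
    Delay(u64),
    // Run a user-supplied callback.
    Callback(SyncCallback),
}

#[derive(Debug)]
struct Action {
    task: Task,
    freq: f32,
    count: Option<AtomicUsize>,
}
```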
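Each Action carries a different Task. freq controls how often it fires (a probability), and count limits the number of times it can fire.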
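To show how a term such as `20%3*print(hello)` could map onto these fields, here is a simplified parser sketch. It assumes the term grammar `[<pct>%][<cnt>*]<task>[(<arg>)]`. `ParsedAction` and `parse_chain` are hypothetical names, and the real `Action::from_str` does more validation than this.

```rust
use std::str::FromStr;

/// Hypothetical, simplified counterpart of the crate's `Action`.
#[derive(Debug, PartialEq)]
struct ParsedAction {
    freq: f32,            // probability in 0.0..=1.0, defaults to 1.0
    count: Option<usize>, // maximum number of hits, `None` means unlimited
    task: String,         // e.g. "sleep"
    arg: Option<String>,  // e.g. "2000"
}

impl FromStr for ParsedAction {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let s = s.trim();
        // Split off `(<arg>)` first so `%` or `*` inside the argument is left alone.
        let (head, arg) = match s.split_once('(') {
            Some((head, tail)) => {
                let arg = tail
                    .strip_suffix(')')
                    .ok_or("missing closing parenthesis")?;
                (head, Some(arg.to_string()))
            }
            None => (s, None),
        };
        let mut rest = head;
        let mut freq = 1.0_f32;
        let mut count: Option<usize> = None;
        // Optional `<pct>%` prefix.
        if let Some((pct, tail)) = rest.split_once('%') {
            let pct = pct
                .parse::<f32>()
                .map_err(|e| format!("bad percentage: {}", e))?;
            freq = pct / 100.0;
            rest = tail;
        }
        // Optional `<cnt>*` prefix.
        if let Some((cnt, tail)) = rest.split_once('*') {
            let cnt = cnt
                .parse::<usize>()
                .map_err(|e| format!("bad count: {}", e))?;
            count = Some(cnt);
            rest = tail;
        }
        if rest.is_empty() {
            return Err("missing task name".to_string());
        }
        Ok(ParsedAction { freq, count, task: rest.to_string(), arg })
    }
}

/// Parses a whole chain; the first invalid term aborts with its error.
fn parse_chain(s: &str) -> Result<Vec<ParsedAction>, String> {
    s.split("->").map(ParsedAction::from_str).collect()
}

fn main() {
    println!("{:?}", parse_chain("1*sleep(2000)->return"));
    println!("{:?}", parse_chain("20%3*print(hello)"));
}
```

Collecting into `Result<Vec<_>, _>` is the same trick the set function uses: iteration stops at the first term that fails to parse.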
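3. Triggering tasks

The precondition, of course, is that failpoints was enabled through conditional compilation. Let's go straight to the eval function that the macro calls.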
```rust
pub fn eval<R, F: FnOnce(Option<String>) -> R>(name: &str, f: F) -> Option<R> {
    let p = {
        let registry = REGISTRY.registry.read().unwrap();
        match registry.get(name) {
            None => return None,
            Some(p) => p.clone(),
        }
    };
    p.eval(name).map(f)
}
```
The logic is simple: look up the fail point in the Registry map, call FailPoint::eval on it, and apply the closure f to the result if there is one.
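The nested Option is easy to misread, so here is a small sketch of the three cases that `p.eval(name).map(f)` can see. `describe` is a hypothetical function, and it assumes that `return(arg)` carries its argument as the inner `Some`.

```rust
/// Hypothetical stand-in for what `FailPoint::eval` hands back:
/// - `None`: no early return, keep executing the function
/// - `Some(None)`: a `return` action without an argument
/// - `Some(Some(arg))`: a `return(arg)` action
fn describe(outcome: Option<Option<String>>) -> Option<String> {
    // Same shape as `p.eval(name).map(f)`: the closure only runs for `Some(..)`.
    outcome.map(|arg| match arg {
        Some(value) => format!("early return with {:?}", value),
        None => "early return without argument".to_string(),
    })
}

fn main() {
    println!("{:?}", describe(None));
    println!("{:?}", describe(Some(None)));
    println!("{:?}", describe(Some(Some("42".to_string()))));
}
```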
```rust
#[cfg_attr(feature = "cargo-clippy", allow(clippy::option_option))]
fn eval(&self, name: &str) -> Option<Option<String>> {
    let task = {
        let actions = self.actions.read().unwrap();
        match actions.iter().filter_map(Action::get_task).next() {
            Some(Task::Pause) => {
                let mut guard = self.pause.lock().unwrap();
                *guard = true;
                loop {
                    guard = self.pause_notifier.wait(guard).unwrap();
                    if !*guard {
                        break;
                    }
                }
                return None;
            }
            Some(t) => t,
            None => return None,
        }
    };

    match task {
        Task::Off => {}
        Task::Return(s) => return Some(s),
        Task::Sleep(t) => thread::sleep(Duration::from_millis(t)),
        Task::Panic(msg) => match msg {
            Some(ref msg) => panic!("{}", msg),
            None => panic!("failpoint {} panic", name),
        },
        Task::Print(msg) => match msg {
            Some(ref msg) => log::info!("{}", msg),
            None => log::info!("failpoint {} executed.", name),
        },
        Task::Pause => unreachable!(),
        Task::Yield => thread::yield_now(),
        Task::Delay(t) => {
            let timer = Instant::now();
            let timeout = Duration::from_millis(t);
            while timer.elapsed() < timeout {}
        }
        Task::Callback(f) => {
            f.run();
        }
    }
    None
}
```
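FailPoint::eval is not hard either. It first calls get_task to find the action to run. Pause is handled separately, and everything else goes through the match. You can also see that when a Return action fires, eval returns Some(s), where s is None if return was given no argument. If the fail point was declared without a closure f, that value reaches the macro's default closure, which panics.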
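The selection step, `actions.iter().filter_map(Action::get_task).next()`, picks the first action in the chain that is still allowed to fire. The following is a single-threaded model of that behavior for the chain `1*sleep(2000)->return`. `Step`, `next_action` and `simulate` are hypothetical names, and probabilities are left out.

```rust
/// Hypothetical model of one action in a chain such as `1*sleep(2000)->return`.
struct Step {
    name: &'static str,
    /// `Some(n)`: may still fire n times. `None`: unlimited.
    remaining: Option<u32>,
}

/// Picks the first action that can still fire and consumes one use of it.
fn next_action(chain: &mut [Step]) -> Option<&'static str> {
    for step in chain.iter_mut() {
        match &mut step.remaining {
            Some(0) => continue, // exhausted, fall through to the next action
            Some(n) => {
                *n -= 1;
                return Some(step.name);
            }
            None => return Some(step.name),
        }
    }
    None
}

/// Simulates `hits` evaluations of the chain `1*sleep(2000)->return`.
fn simulate(hits: usize) -> Vec<&'static str> {
    let mut chain = vec![
        Step { name: "sleep(2000)", remaining: Some(1) },
        Step { name: "return", remaining: None },
    ];
    (0..hits).filter_map(|_| next_action(&mut chain)).collect()
}

fn main() {
    // The first hit sleeps, every later hit returns early.
    println!("{:?}", simulate(3));
}
```

This matches the earlier run with two calls to do_fallible_work: the first call sleeps and completes, the second returns the error.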
```rust
fn get_task(&self) -> Option<Task> {
    use rand::Rng;

    if let Some(ref cnt) = self.count {
        let c = cnt.load(Ordering::Acquire);
        if c == 0 {
            return None;
        }
    }
    if self.freq < 1f32 && !rand::thread_rng().gen_bool(f64::from(self.freq)) {
        return None;
    }
    if let Some(ref ref_cnt) = self.count {
        let mut cnt = ref_cnt.load(Ordering::Acquire);
        loop {
            if cnt == 0 {
                return None;
            }
            let new_cnt = cnt - 1;
            match ref_cnt.compare_exchange_weak(
                cnt,
                new_cnt,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => break,
                Err(c) => cnt = c,
            }
        }
    }
    Some(self.task.clone())
}
```
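get_task first checks the remaining count and returns None if it is 0. It then checks the frequency and returns None if the random draw does not fire. Finally it checks the count again and decrements it with a CAS loop. The count field has type Option<AtomicUsize>. If no count is specified, the number of hits is unlimited.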
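The CAS loop is what keeps the count exact under concurrency. Here is a self-contained sketch of just that part, with the probability check left out. `try_consume` and `race` are hypothetical names for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Tries to take one unit from `remaining`; returns false once it reaches zero.
/// Hypothetical helper mirroring the compare-and-swap loop in `get_task`.
fn try_consume(remaining: &AtomicUsize) -> bool {
    let mut current = remaining.load(Ordering::Acquire);
    loop {
        if current == 0 {
            return false;
        }
        match remaining.compare_exchange_weak(
            current,
            current - 1,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return true,
            // Another thread changed the value, or the weak CAS failed spuriously: retry.
            Err(observed) => current = observed,
        }
    }
}

/// Lets `threads` threads race for `budget` units and counts the winners.
fn race(budget: usize, threads: usize) -> usize {
    let remaining = Arc::new(AtomicUsize::new(budget));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let remaining = Arc::clone(&remaining);
            thread::spawn(move || try_consume(&remaining))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count()
}

fn main() {
    // Exactly `budget` threads succeed, however they interleave.
    println!("{}", race(3, 8));
}
```

A plain load followed by a store would let two threads both observe 1 and both fire. The compare-and-swap only succeeds for the thread whose observed value is still current.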
Summary

Writing articles takes effort. If this one was helpful, please like and share it.
What do you think about failpoints? Comments and discussion are welcome, especially from the experts ^_^