当前位置: 首页>后端>正文

服务启动时出现大量 Hystrix timed-out 第二次排查

异常日志

com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String)  timed-out and no fallback available.
com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String)  short-circuited 
服务启动时出现大量 Hystrix timed-out 第二次排查,第1张
Arthas 耗时

查看 log

$ tail -n 100000 service.log | grep -i "Started Application"
2021-07-27 09:51:33,460 com.xxx.athena.Application:61 [ INFO]:TID:N/A: Started Application in 147.318 seconds (JVM running for 165.223)
2021-07-27 09:51:59,436 org.springframework.boot.SpringApplication:61 [ INFO]:TID:N/A: Started application in 19.344 seconds (JVM running for 191.198)


$ tail -n 100000 service.log | grep "DynamicServerListLoadBalancer"
2021-07-27 09:51:33,906 com.netflix.loadbalancer.BaseLoadBalancer:197 [ INFO]:TID:[Ignored Trace]: Client: user-service instantiated a LoadBalancer: DynamicServerListLoadBalancer:{NFLoadBalancer:name=user-service,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2021-07-27 09:51:33,924 com.netflix.loadbalancer.DynamicServerListLoadBalancer:222 [ INFO]:TID:[Ignored Trace]: Using serverListUpdater PollingServerListUpdater
2021-07-27 09:51:34,010 com.netflix.loadbalancer.DynamicServerListLoadBalancer:150 [ INFO]:TID:[Ignored Trace]: DynamicServerListLoadBalancer for client user-service initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=user-service,current list of Servers=[xxxx:8822, xxxx:8822, xxx:8822],Load balancer stats=Zone stats: {defaultzone=[Zone:defaultzone;       Instance count:3;       Active connections count: 0;    Circuit breaker tripped count: 0;       Active connections per server: 0.0;]

...

2021-07-27 09:51:40,516 com.netflix.loadbalancer.BaseLoadBalancer:197 [ INFO]:TID:[Ignored Trace]: Client: content-provider instantiated a LoadBalancer: DynamicServerListLoadBalancer:{NFLoadBalancer:name=content-provider,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2021-07-27 09:51:40,519 com.netflix.loadbalancer.DynamicServerListLoadBalancer:222 [ INFO]:TID:[Ignored Trace]: Using serverListUpdater PollingServerListUpdater
2021-07-27 09:51:40,551 com.netflix.loadbalancer.DynamicServerListLoadBalancer:150 [ INFO]:TID:[Ignored Trace]: DynamicServerListLoadBalancer for client content-provider initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=content-provider,current list of Servers=[xxxx:8811, xxxx:8811, xxxx:8811],Load balancer stats=Zone stats: {defaultzone=[Zone:defaultzone;  Instance count:3;       Active connections count: 0;    Circuit breaker tripped count: 0;       Active connections per server: 0.0;]

$ tail -n 100000 service.log | grep "com.netflix.hystrix.exception.HystrixRuntimeException" -B 1
2021-07-27 09:51:35,013 org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer:328 [ WARN]:TID:[Ignored Trace]: consume message failed. messageExt:MessageExt [brokerName=broker-2, queueId=0, storeSize=505, queueOffset=3783772, sysFlag=0, bornTimestamp=1627350691944, bornHost=/10.0.16.14:41676, storeTimestamp=1627350691944, storeHost=/10.0.16.11:10911, msgId=0A00100B00002A9F0000003DE9918691, commitLogOffset=265911633553, bodyCRC=1259644652, reconsumeTimes=0, preparedTransactionOffset=0, toString()=Message{topic='waiting-notify-learning-queue', flag=0, properties={MIN_OFFSET=3669598, sw6=1-NzcuNDEyLjE2MjczNTA2OTE5NDQ2Mjgx-NzcuNDEyLjE2MjczNTA2OTE5NDQ2Mjgw-0-77-77-MzA=-I1JvY2tldE1RL3dhaXRpbmctbm90aWZ5LWxlYXJuaW5nLXF1ZXVlL1Byb2R1Y2Vy-I1JvY2tldE1RL3dhaXRpbmctbm90aWZ5LWxlYXJuaW5nLXF1ZXVlL1Byb2R1Y2Vy, MAX_OFFSET=3783773, CONSUME_START_TIME=1627350691948, id=209bb1aa-1160-f9e2-2bad-093f79ee12b9, UNIQ_KEY=AC11000162690B25B0958802E8685AB1, CLUSTER=DefaultCluster, WAIT=false, contentType=text/plain, timestamp=1627350691944}, body=[54, 48, 50, 50, 50, 56, 56], transactionId='null'}]
com.xxxx.api.result.ApiException: com.netflix.hystrix.exception.HystrixRuntimeException: UserClient#getUserInfo(long) timed-out and no fallback available.
...

可以看到,应用启动后,RocketMQ 消费者就开始消费,但是 Ribbon 还没有初始化完成。

有人说是要开启 Ribbon 饥饿加载,看了下服务配置,已经开启了。

ribbon.eager-load.enabled=true
ribbon.eager-load.clients=user-service

Ribbon 自动配置

package org.springframework.cloud.netflix.ribbon;

@Configuration
public class RibbonAutoConfiguration {

    @Autowired
    private RibbonEagerLoadProperties ribbonEagerLoadProperties;

    @Bean
    @ConditionalOnProperty("ribbon.eager-load.enabled")
    public RibbonApplicationContextInitializer ribbonApplicationContextInitializer() {
        return new RibbonApplicationContextInitializer(springClientFactory(),
                ribbonEagerLoadProperties.getClients());
    }
}

@ConfigurationProperties(prefix = "ribbon.eager-load")
public class RibbonEagerLoadProperties {
    private boolean enabled = false;
    private List<String> clients;
}

public class RibbonApplicationContextInitializer
        implements ApplicationListener<ApplicationReadyEvent> {

    protected void initialize() {
        if (clientNames != null) {
            for (String clientName : clientNames) {
                // 初始化 IClientConfig
                this.springClientFactory.getContext(clientName);
            }
        }
    }

    @Override
    public void onApplicationEvent(ApplicationReadyEvent event) {
        // 在 ApplicationReadyEvent 事件发起后初始化 RibbonApplicationContext
        initialize();
    }

}

package org.springframework.cloud.openfeign.ribbon;


public class LoadBalancerFeignClient implements Client {

    @Override
    public Response execute(Request request, Request.Options options) throws IOException {
        try {
            URI asUri = URI.create(request.url());
            String clientName = asUri.getHost();
            URI uriWithoutHost = cleanUrl(request.url(), clientName);
            FeignLoadBalancer.RibbonRequest ribbonRequest = new FeignLoadBalancer.RibbonRequest(
                    this.delegate, request, uriWithoutHost);

            // 如果没有初始化 IClientConfig,要先初始化 IClientConfig
            // 初始化 IClientConfig 耗时比较长
            IClientConfig requestConfig = getClientConfig(options, clientName);
            return lbClient(clientName)
                    .executeWithLoadBalancer(ribbonRequest, requestConfig).toResponse();
        }
        catch (ClientException e) {
            IOException io = findIOException(e);
            if (io != null) {
                throw io;
            }
            throw new RuntimeException(e);
        }
    }
}

Spring getSingleton()

    protected Object getSingleton(String beanName, boolean allowEarlyReference) {
        Object singletonObject = this.singletonObjects.get(beanName);
        if (singletonObject == null && isSingletonCurrentlyInCreation(beanName)) {
            synchronized (this.singletonObjects) {
                singletonObject = this.earlySingletonObjects.get(beanName);
                if (singletonObject == null && allowEarlyReference) {
                    ObjectFactory<?> singletonFactory = this.singletonFactories.get(beanName);
                    if (singletonFactory != null) {
                        singletonObject = singletonFactory.getObject();
                        this.earlySingletonObjects.put(beanName, singletonObject);
                        this.singletonFactories.remove(beanName);
                    }
                }
            }
        }
        return singletonObject;
    }

存在锁 synchronized (this.singletonObjects)因此 Ribbon 的初始化是串行的。

解决方案

因为 Ribbon 的初始化是串行的,所以在初始化完成之前要禁止如下两项:

  • (1) 定时任务的执行。
  • (2) RocketMQ 消费者进行消息消费。

1. 延迟执行定时任务

(1)新增 ScheduleDisableAspect 切面,用来拦截 @Scheduled 方法。

@Component
@Aspect
@RequiredArgsConstructor
public class ScheduleDisableAspect {
    private static volatile boolean enable = false;

    @Around("@annotation(org.springframework.scheduling.annotation.Scheduled)")
    public Object beforePointCut(ProceedingJoinPoint pjp) throws Throwable {
        // 先看看是否被禁用,如果被禁用的话,则直接拦截
        if (!enable) {
            return null;
        }

        return pjp.proceed();
    }

    public static void enable() {
        enable = true;
    }
}

(2)在 SpringApplicationRunListenerrunning() 方法中打开定时任务的开关,让其可以执行。

package com.example.demo;

@Slf4j
public class ScheduleEnableRunListener implements SpringApplicationRunListener {

    // Spring 容器实例化使用
    public MySpringApplicationRunListener(SpringApplication application, String[] args) {
    }

    @Override
    public void running(ConfigurableApplicationContext context) {
        ScheduleDisableAspect.enable();
    }
}

(3)在 META-INF/spring.factories 配置:

org.springframework.boot.SpringApplicationRunListener=\
com.example.demo.ScheduleEnableRunListener

因为 ScheduleEnableRunListener 会在 org.springframework.cloud.netflix.ribbon.RibbonApplicationContextInitializer#onApplicationEvent 之后执行,所以在 Ribbon 初始化完成之后才会执行定时任务。

执行顺序的测试,可以参考:SpringApplicationRunListener 的使用

2. 延迟启动 RocketMQ 消费者

(1)去掉 org.apache.rocketmq.spring.autoconfigure.ListenerContainerConfiguration#registerContainer 中的如下代码。

服务启动时出现大量 Hystrix timed-out 第二次排查,第2张

(2)去掉 org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer#start 中的逻辑,将其拷贝到 doStart() 方法中。

服务启动时出现大量 Hystrix timed-out 第二次排查,第3张

(3)在 SpringApplicationRunListenerrunning() 方法中调用 container.doStart() 方法启动 RocketMQ 消费者。

package org.apache.rocketmq.spring.autoconfigure;

public class ConsumerDelayStartApplicationRunListener implements SpringApplicationRunListener {
    private final static Logger log = LoggerFactory.getLogger(ConsumerDelayStartApplicationRunListener.class);

    public ConsumerDelayStartApplicationRunListener(SpringApplication application, String[] args) {
    }

    @Override
    public void running(ConfigurableApplicationContext context) {
        log.info("ConsumerDelayStartApplicationRunListener.running executed");
        context.getBeansOfType(DefaultRocketMQListenerContainer.class).forEach((beanName,) -> {
            if (!container.isRunning()) {
                try {
                    log.info("beanName:{}, topic:{}, consumerGroup:{}.start() begin", beanName,.getTopic(),.getConsumerGroup());
                   .doStart();
                    log.info("beanName:{}, topic:{}, consumerGroup:{}.start() end", beanName,.getTopic(),.getConsumerGroup());
                } catch (Exception e) {
                    log.error("Started failed. {}",, e);
                    throw new RuntimeException(e);
                }
            }
        });
    }
}

(4)在 META-INF/spring.factories 配置:

org.springframework.boot.SpringApplicationRunListener=\
org.apache.rocketmq.spring.autoconfigure.ConsumerDelayStartApplicationRunListener

提示:因为定时任务的配置和 MQ 消费者的配置不在同一个项目,所以需要在不同的项目中进行配置。

至此,解决了项目启动时的 Hystrix timed-out 问题。

参考

  • Spring Cloud实战小贴士:Ribbon的饥饿加载(eager-load)模式

https://www.xamrdz.com/backend/3he1921021.html

相关文章: