WWW21-Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data
Published: 2022-12-25 | Category: Thesis Notes



FedScale uses this paper's data for its heterogeneity-aware data generation.

This paper is also a good reference for how to handle the ethical considerations of such data.

Abstract:

  • Heterogeneity affects the FL training process, e.g., by making devices unavailable during training or unable to upload their model updates.
  • This paper: the first empirical study of device heterogeneity in FL. The authors collect traces from 136k smartphones that faithfully reflect real-world heterogeneity, and build a heterogeneity-aware FL platform that follows the standard FL protocol while accounting for heterogeneity.
  • They compare state-of-the-art FL algorithms with and without heterogeneity. The results show that heterogeneity causes non-trivial performance degradation in FL: up to a 9.2% accuracy drop, 2.32x longer training time, and undermined fairness.
  • An analysis of the underlying factors identifies device failure and participant bias as two major causes of the performance degradation.

Implemented on top of the LEAF framework.

Introduction

Existing methods such as FedAvg, Structured Updates, and q-FedAvg are usually evaluated in simulation. However, these simulations often make over-idealized assumptions: all devices are available throughout training, and all hardware configurations are identical.
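For concreteness, the server-side step of FedAvg is just a sample-count-weighted average of the clients' models; a minimal sketch (the function and data layout are my own, not from the paper — plain dicts of float lists stand in for tensors):

```python
# Minimal FedAvg server aggregation: weight each client's parameters
# by its local sample count, then sum. Client models are dicts
# mapping layer name -> list of floats.

def fedavg_aggregate(client_models, client_sizes):
    total = sum(client_sizes)
    agg = {}
    for name in client_models[0]:
        agg[name] = [0.0] * len(client_models[0][name])
        for model, n in zip(client_models, client_sizes):
            w = n / total  # this client's weight in the average
            for i, v in enumerate(model[name]):
                agg[name][i] += w * v
    return agg
```

Heterogeneity breaks the idealized version of this step: when some selected clients never return an update, the effective average is taken over a biased subset.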

Heterogeneity is divided into:

  • Hardware heterogeneity (different CPUs, RAM, and battery life).
  • State heterogeneity (CPU busy/free, stable or unstable network connection to the server, etc.).

Main contributions:

  • They develop a holistic platform that conforms to the mainstream FL paradigm, enabling for the first time the development of FL algorithms under device heterogeneity. They collect data from 136k users of a commercial input method app (IMA) and plug these traces into the platform to simulate device state and hardware heterogeneity.

  • They compare state-of-the-art FL algorithms with and without heterogeneity, on four datasets covering four classic FL tasks (three commonly used datasets plus the IMA dataset).

  • Findings: heterogeneity affects accuracy, training time, and fairness, and impacts FedAvg as well as the compression and aggregation algorithms. For example, heterogeneity makes q-FedAvg less effective at ensuring fairness, and compression algorithms can hardly help; in the worst case, transmission time increases by 3.5x.

  • Analysis of the underlying factors: (1) Device failure: 11.6% of devices fail to upload their model updates, which slows model convergence and wastes precious hardware resources. (2) Participant bias: by the time the model converges, more than 30% of devices have never participated in training, and convergence is dominated by the active devices (30% of the devices account for 81% of the computation).

Background

Heterogeneity matters a great deal for FL, yet many algorithms handle it loosely. For example:

  • FedProx simulates hardware heterogeneity by allowing each participant to perform a variable amount of local work, but each device's capability is set randomly, and changes in device state are not considered.
  • FedCS manages devices based on their resource conditions, allowing the server to aggregate as many device updates as possible, but it assumes the network is stable and congestion-free, and sets the training time randomly within a 5-500 second range.

In short, existing treatments of heterogeneity are flawed.
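FedProx's mechanism mentioned above can be made concrete: each local step also applies the gradient of the proximal term (mu/2)*||w - w_global||^2, i.e. mu*(w - w_global), pulling the local model back toward the global one. A minimal sketch with plain lists standing in for tensors (all names are illustrative, and `grad_fn` is a hypothetical loss-gradient callback):

```python
# One FedProx-style local SGD step: the usual loss gradient plus the
# proximal gradient mu*(w - w_global), which discourages the local
# model from drifting far from the global model.

def fedprox_step(w, w_global, grad_fn, mu=0.01, lr=0.1):
    g = grad_fn(w)  # gradient of the local loss at w (hypothetical callback)
    return [wi - lr * (gi + mu * (wi - wgi))
            for wi, gi, wgi in zip(w, g, w_global)]
```

With mu = 0 this reduces to plain local SGD, which is why FedProx is usually compared against FedAvg as a baseline.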

The Measurement Approach

The overall experimental pipeline:


Datasets

Device state traces:


  • One week of data starting January 31, 2020.
  • Traces from 136k devices, containing 180 million state entries and taking 111 GB of storage.


  • Traces are only usable during a device's available intervals.
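As a sketch of how a state trace might be reduced to available intervals — assuming, as in Google's FL system, that a device counts as available only while it is charging, idle, and on an unmetered network (the entry schema below is illustrative, not the paper's):

```python
# Derive "available intervals" from a sorted device state trace.
# A device is available while all three conditions hold at once.

def available_intervals(entries):
    """entries: sorted list of (timestamp, charging, idle, unmetered)."""
    intervals, start = [], None
    for t, charging, idle, unmetered in entries:
        ok = charging and idle and unmetered
        if ok and start is None:
            start = t           # availability begins
        elif not ok and start is not None:
            intervals.append((start, t))  # availability ends
            start = None
    if start is not None:       # trace ends while still available
        intervals.append((start, entries[-1][0]))
    return intervals
```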

Compute capability data:

Over 1000 device models are clustered via a three-step mapping:

  • (1) The total device models are first mapped to the device models profiled by AI-Benchmark, a comprehensive AI performance benchmark. For a few device models that AI-Benchmark does not cover, we make a random mapping. It reduces the number of device models to 296.
  • (2) The remaining device models are then mapped to three representative and widely-used device models that the authors could afford to profile (Samsung Note 10, Redmi Note 8, and Nexus 6).
  • (3) To profile these devices, we run on-device training using the open-source ML library DL4J [15] and record their training time for each ML model used in our experiments.

Communication data (collected from volunteers):

we recruit 30 volunteers and deploy a testing app on their devices to periodically obtain (i.e., every two hours) the downstream/upstream bandwidth between the devices and a cloud server.
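From such bandwidth measurements, a per-round communication time can be estimated as the global-model download plus the update upload; a back-of-envelope sketch (units and numbers are illustrative, not the paper's):

```python
# Per-round communication time for one device: download the global
# model, train, then upload the update. Model size in MB, bandwidth
# in Mbps, hence the factor of 8 (bits per byte).

def comm_time_s(model_mb, down_mbps, up_mbps):
    return model_mb * 8 / down_mbps + model_mb * 8 / up_mbps
```

Since mobile upstream bandwidth is typically much lower than downstream, the upload term usually dominates, which is one reason update compression matters.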

Benchmarks:

  • Three commonly used datasets (Reddit, FEMNIST, and CelebA). FEMNIST and CelebA are image classification tasks; Reddit and MType are next-word prediction tasks, using CNN and LSTM models respectively.

  • MType: the IMA's real-world input dataset.

Platform Simulation

The FL system is configured following Google's report:


  • See the original paper for the detailed platform settings.

Experimental Setup

Algorithm configuration:

  • Basic algorithm: FedAvg
  • Aggregation algorithms: q-FedAvg, FedProx
  • Compression algorithms: Structured Updates, Gradient Dropping (GDrop), SignSGD


Metric:

  • Convergence accuracy
  • Training time/round
  • Compression ratio
  • Variance of accuracy
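The fairness metric, variance of accuracy, is simply the variance of the per-device test accuracies: lower variance means the global model serves devices more uniformly. A minimal sketch:

```python
# Population variance of per-device test accuracies; used here as a
# fairness metric (lower = more uniform performance across devices).

def accuracy_variance(accs):
    mean = sum(accs) / len(accs)
    return sum((a - mean) ** 2 for a in accs) / len(accs)
```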

Experimental environment:


Experimental Results

Impacts on Basic Algorithm’s Performance


  • Heterogeneity causes a substantial drop in FL accuracy.
  • Heterogeneity noticeably slows the FL training process, increasing both the time per round and the number of rounds to converge.

Impacts on Advanced Algorithms’ Performance

FedProx lets devices adjust their local training work according to their system resources, and also adds a proximal term to the local optimization. Since q-FedAvg and FedProx have different optimization objectives, they are compared separately, with FedAvg as the baseline.

The q-FedAvg results are shown in Table 3; the FedProx results are shown in Figure 5.

  • Finding (1): q-FedAvg, which is supposed to address fairness issues, is less effective in ensuring fairness under heterogeneity-aware settings.

  • Finding (2): FedProx is less effective in improving the training process when heterogeneity is considered.

Gradient compression algorithms:

Structured Updates, Gradient Dropping (GDrop), and SignSGD (see the original paper for the detailed settings).


  • Heterogeneity introduces a similar accuracy drop to compression algorithms as it does to the basic algorithm
  • Gradient compression algorithms can hardly speed up the model convergence under heterogeneity-aware settings
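As a concrete example of this compression family, Gradient Dropping keeps only the largest-magnitude fraction of gradient entries and zeroes the rest, trading update fidelity for a smaller payload. A pure-Python sketch (not the paper's implementation):

```python
# Top-k gradient sparsification ("gradient dropping"): keep the
# keep_ratio fraction of entries with largest magnitude, zero the rest.

def gdrop(grad, keep_ratio=0.1):
    k = max(1, int(len(grad) * keep_ratio))
    # k-th largest magnitude becomes the keep threshold
    threshold = sorted((abs(g) for g in grad), reverse=True)[k - 1]
    return [g if abs(g) >= threshold else 0.0 for g in grad]
```

Under heterogeneity, shrinking the payload helps less than expected because slow and unavailable devices, not bandwidth alone, dominate the round time.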

Analysis of Influencing Factors

Two phenomena are analyzed:

(1) A selected device fails to upload its model update for some reason; this is called device failure.

(2) Devices that upload successfully may still contribute to the global model in a biased way; this is called participant bias.

The individual impacts of the two kinds of heterogeneity (state heterogeneity and hardware heterogeneity) are also examined:


  • Both state heterogeneity and hardware heterogeneity slow down the model convergence.
  • State heterogeneity is more influential than hardware heterogeneity on the model accuracy.

Device Failure

  • (1) Network failure
  • (2) Interruption failure
  • (3) Training failure

Questions addressed:

  • (1) how often the devices may fail and what the corresponding reasons for the failure are;
  • (2) which type of heterogeneity is the major factor.


  • Heterogeneity introduces non-trivial device failure even when an optimal deadline setting is given.


  • Hardware heterogeneity leads to more device failure than state heterogeneity.

Participant Bias

Participant bias refers to the phenomenon that devices do not participate in FL with the same probability. It can lead to different contributions to the global model, thus making some devices underrepresented.


  • The computation loads get more uneven under heterogeneity-aware settings.
  • The number of inactive devices increases significantly under heterogeneity-aware settings.
  • Up to 30% of devices have not participated in the FL process when the global model reaches the target accuracy under heterogeneity-aware settings.
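Statistics like these can be computed directly from per-round participant lists; a sketch (the 30% cutoff follows the text above; the function and data layout are my own):

```python
from collections import Counter

# Count devices that never participated, and the share of total
# participations taken by the most-active 30% of all devices.

def participation_stats(rounds, all_devices):
    counts = Counter(d for r in rounds for d in r)
    inactive = [d for d in all_devices if d not in counts]
    top_k = max(1, int(len(all_devices) * 0.3))
    top = sum(c for _, c in counts.most_common(top_k))
    total = sum(counts.values())
    return len(inactive), top / total
```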

  • State heterogeneity is more responsible for participant bias