Spark join shuffle: reading the data files
Let's look at the classes that are called when the data files are read during a shuffle.
// Read the partition data of this shuffle, then run the related aggregate operations
val blockFetcherItr = new ShuffleBlockFetcherIterator(
  context,
  blockManager.shuffleClient,
  blockManager,
  mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
  // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
  SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)
The class above is responsible for fetching the shuffle blocks, both remote and local.
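The last constructor argument is the in-flight byte limit. A minimal sketch of that arithmetic follows; `sizeAsMb` is a hypothetical, simplified stand-in for `SparkConf.getSizeAsMb` and only understands a bare number or the suffixes `m` and `g`.

```scala
object MaxBytesInFlightSketch {
  // Hypothetical, simplified stand-in for SparkConf.getSizeAsMb: a bare
  // number is treated as MiB, "m" means MiB and "g" means GiB.
  def sizeAsMb(value: String): Long = {
    val v = value.trim.toLowerCase
    if (v.endsWith("g")) v.dropRight(1).toLong * 1024L
    else if (v.endsWith("m")) v.dropRight(1).toLong
    else v.toLong
  }

  // Same arithmetic as the constructor argument above: MiB to bytes.
  def maxBytesInFlight(value: String): Long = sizeAsMb(value) * 1024 * 1024
}
```

With the default of `48m` the limit works out to 50331648 bytes.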
The work starts in ShuffleBlockFetcherIterator.initialize:
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())
  // Split local and remote blocks.
  // This yields the list of requests that must be fetched remotely
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  // Send out initial requests for blocks, up to our maxBytesInFlight
  // Take requests off the queue and send them
  fetchUpToMaxBytes()
  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))
  // Get Local Blocks
  // Read the local data
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
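As shown above, the data to fetch falls into two kinds, remote and local. The first step is therefore to classify the blocks into local and remote ones: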
private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
  // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
  // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
  // nodes, rather than blocking on reading output from one node.
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  logDebug("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)
  // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
  // at most maxBytesInFlight in order to limit the amount of data in flight.
  val remoteRequests = new ArrayBuffer[FetchRequest]
  // Tracks total number of blocks (including zero sized blocks)
  var totalBlocks = 0
  for ((address, blockInfos) <- blocksByAddress) {
    // Handle the addresses one at a time
    totalBlocks += blockInfos.size
    if (address.executorId == blockManager.blockManagerId.executorId) {
      // Filter out zero-sized blocks
      // The data lives on this executor
      localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
      numBlocksToFetch += localBlocks.size
    } else {
      // The data is not local
      val iterator = blockInfos.iterator
      var curRequestSize = 0L
      var curBlocks = new ArrayBuffer[(BlockId, Long)]
      while (iterator.hasNext) {
        val (blockId, size) = iterator.next()
        // Skip empty blocks
        if (size > 0) {
          curBlocks += ((blockId, size))
          remoteBlocks += blockId
          numBlocksToFetch += 1
          curRequestSize += size
        } else if (size < 0) {
          throw new BlockException(blockId, "Negative block size " + size)
        }
        if (curRequestSize >= targetRequestSize) {
          // Add this FetchRequest
          // Each block is a ShuffleBlockId
          remoteRequests += new FetchRequest(address, curBlocks)
          curBlocks = new ArrayBuffer[(BlockId, Long)]
          logDebug(s"Creating fetch request of $curRequestSize at $address")
          curRequestSize = 0
        }
      }
      // Add in the final request
      if (curBlocks.nonEmpty) {
        remoteRequests += new FetchRequest(address, curBlocks)
      }
    }
  }
  logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
  // Return the requests that must be fetched remotely
  remoteRequests
}
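As shown above, the executorId in each block's address is compared with the executorId of the current process. If they match, the blocks are local; otherwise they are remote.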
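The classification and batching can be tried out in isolation. The following is a minimal sketch, not Spark's code: `Address` and `Request` are hypothetical stand-ins for `BlockManagerId` and `FetchRequest`, block ids are plain strings, and a negative size raises `IllegalArgumentException` instead of `BlockException`.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitBlocksSketch {
  // Hypothetical, simplified stand-ins for BlockManagerId and FetchRequest.
  final case class Address(executorId: String, host: String)
  final case class Request(address: Address, blocks: Seq[(String, Long)]) {
    val size: Long = blocks.map(_._2).sum
  }

  // Returns (local block ids, remote requests).
  def split(
      localExecutorId: String,
      blocksByAddress: Seq[(Address, Seq[(String, Long)])],
      maxBytesInFlight: Long): (Seq[String], Seq[Request]) = {
    // Same target as in the excerpt: one fifth of the in-flight limit.
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    val local = ArrayBuffer.empty[String]
    val remote = ArrayBuffer.empty[Request]
    for ((address, blocks) <- blocksByAddress) {
      if (address.executorId == localExecutorId) {
        // Same executor: local blocks, zero-sized ones are dropped.
        local ++= blocks.filter(_._2 != 0).map(_._1)
      } else {
        var cur = ArrayBuffer.empty[(String, Long)]
        var curSize = 0L
        for ((id, size) <- blocks) {
          require(size >= 0, s"Negative block size $size for $id")
          if (size > 0) {
            cur += ((id, size))
            curSize += size
          }
          // Close the current request once it reaches the target size.
          if (curSize >= targetRequestSize) {
            remote += Request(address, cur.toList)
            cur = ArrayBuffer.empty
            curSize = 0L
          }
        }
        // Whatever is left becomes the final request for this address.
        if (cur.nonEmpty) remote += Request(address, cur.toList)
      }
    }
    (local.toList, remote.toList)
  }
}
```

With a limit of 50 bytes the target is 10, so remote blocks of sizes 6, 0, 6 and 3 on one address end up as two requests of 12 and 3 bytes.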
Once the split is done, the remote requests are taken off the queue and sent:
private def fetchUpToMaxBytes(): Unit = {
  // Send fetch requests up to maxBytesInFlight
  while (fetchRequests.nonEmpty &&
    (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    // This condition is what throttles the requests
    sendRequest(fetchRequests.dequeue())
  }
}
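The throttling condition is easier to see on plain numbers. Below is a minimal sketch under the assumption that a request is represented only by its size in bytes; `fetchUpToMaxBytes` here is a standalone function that returns what it sent, not the Spark method.

```scala
import scala.collection.mutable

object FetchThrottleSketch {
  // Dequeues request sizes while the in-flight limit allows it.
  // Returns (sizes sent in this round, updated bytesInFlight).
  def fetchUpToMaxBytes(
      queue: mutable.Queue[Long],
      bytesInFlightStart: Long,
      maxBytesInFlight: Long): (List[Long], Long) = {
    var bytesInFlight = bytesInFlightStart
    val sent = mutable.ListBuffer.empty[Long]
    // bytesInFlight == 0 lets a single oversized request through, so a
    // request larger than the limit cannot block the queue forever.
    while (queue.nonEmpty &&
      (bytesInFlight == 0 || bytesInFlight + queue.head <= maxBytesInFlight)) {
      val size = queue.dequeue()
      bytesInFlight += size
      sent += size
    }
    (sent.toList, bytesInFlight)
  }
}
```

For a queue of 40, 30, 50, 10 and a limit of 100, the first round sends 40 and 30 and stops at 50, because 70 + 50 would exceed the limit. The remaining requests go out in a later round, after the in-flight counter has dropped again.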
The method that actually performs the remote fetch is:
private[this] def sendRequest(req: FetchRequest) {
  logDebug("Sending request for %d blocks (%s) from %s".format(
    req.blocks.size, Utils.bytesToString(req.size), req.address.hostPort))
  bytesInFlight += req.size
  // so we can look up the size of each blockID
  val sizeMap = req.blocks.map { case (blockId, size) => (blockId.toString, size) }.toMap
  val blockIds = req.blocks.map(_._1.toString)
  // Fetch from this address. Note that each block is a ShuffleBlockId, which records
  // which partition is being requested.
  // The block data is further split into chunks during the fetch.
  val address = req.address
  // shuffleClient is either NettyBlockTransferService or ExternalShuffleClient
  shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
    new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
        // Only add the buffer to results queue if the iterator is not zombie,
        // i.e. cleanup() has not been called yet.
        if (!isZombie) {
          // Increment the ref count because we need to pass this to a different thread.
          // This needs to be released after use.
          buf.retain()
          results.put(new SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf))
          shuffleMetrics.incRemoteBytesRead(buf.size)
          shuffleMetrics.incRemoteBlocksFetched(1)
        }
        logTrace("Got remote block " + blockId + " after " + Utils.getUsedTimeMs(startTime))
      }
      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
        results.put(new FailureFetchResult(BlockId(blockId), address, e))
      }
    }
  )
}
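As shown, the remote fetch goes through one of two clients: NettyBlockTransferService, or ExternalShuffleClient when the external shuffle service is enabled.
Local data is read as follows: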
private[this] def fetchLocalBlocks() {
  val iter = localBlocks.iterator
  while (iter.hasNext) {
    val blockId = iter.next()
    try {
      // Read the block data
      val buf = blockManager.getBlockData(blockId)
      shuffleMetrics.incLocalBlocksFetched(1)
      shuffleMetrics.incLocalBytesRead(buf.size)
      buf.retain()
      results.put(new SuccessFetchResult(blockId, blockManager.blockManagerId, 0, buf))
    } catch {
      case e: Exception =>
        // If we see an exception, stop immediately.
        logError(s"Error occurred while fetching local blocks", e)
        results.put(new FailureFetchResult(blockId, blockManager.blockManagerId, e))
        return
    }
  }
}
The data is read through the blockManager object:
override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    // The block is shuffle data
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
      .asInstanceOf[Option[ByteBuffer]]
    if (blockBytesOpt.isDefined) {
      val buffer = blockBytesOpt.get
      new NioManagedBuffer(buffer)
    } else {
      throw new BlockNotFoundException(blockId.toString)
    }
  }
}
Because the block here is shuffle data, the call goes to the shuffleBlockResolver object. It has two implementations, FileShuffleBlockResolver and IndexShuffleBlockResolver; here we look at FileShuffleBlockResolver.
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // Look up the data file
  val file = blockManager.diskBlockManager.getFile(blockId)
  // Return a buffer object that reads this file
  new FileSegmentManagedBuffer(transportConf, file, 0, file.length)
}
As shown, the method first locates the file for the block and then wraps it in a FileSegmentManagedBuffer.
DiskBlockManager resolves the block's name to a file as follows:
def getFile(filename: String): File = {
  // filename encodes shuffleid_index_part
  var attempts = 0
  val numLocalDirs = localDirs.length
  val maxAttempts = numLocalDirs
  val subDirsMap = new mutable.HashMap[Int, Array[File]]()
  subDirs.zipWithIndex.foreach(s => subDirsMap.put(s._2, s._1))
  // Figure out which local directory it hashes to, and which subdirectory in that
  // Hash of the file name
  val hash = Utils.nonNegativeHash(filename)
  // Hash onto one of the root directories
  var dirId = hash % subDirsMap.size
  // Subdirectory inside that root directory
  var subDirId = (hash / subDirsMap.size) % subDirsPerLocalDir
  var dir: File = null
  while (dir == null) {
    attempts += 1
    if (attempts > maxAttempts) {
      /*
      throw new IOException("Failed to create local dir after " +
        maxAttempts + " attempts!")
      */
      logError("Failed to create local dir in root " + dirId + " after " +
        maxAttempts + " attempts!")
      subDirsMap.remove(dirId)
      if (subDirsMap.size == 0) {
        throw new IOException("Failed to create local dir after try in all root dir")
      }
      // Several disks can hold the data, so re-hash over the remaining root directories
      dirId = hash % subDirsMap.size
      subDirId = (hash / subDirsMap.size) % subDirsPerLocalDir
    }
    try {
      // Create the subdirectory if it doesn't already exist
      dir = subDirs(dirId).synchronized {
        val old = subDirs(dirId)(subDirId)
        if (old != null) {
          new File(old, filename)
        } else {
          val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
          if (newDir.exists() || newDir.mkdir()) {
            subDirs(dirId)(subDirId) = newDir
            new File(newDir, filename)
          } else {
            logWarning(s"Failed to create local dir in $newDir.")
            null
          }
        }
      }
    } catch { case e: SecurityException => dir = null; }
  }
  dir
}
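This shows that when Spark's data files are spread over several disks, each file name is hashed to pick the root directory and the subdirectory where the file is stored.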
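The placement arithmetic can be isolated in a few lines. This is a minimal sketch that leaves out the retry and directory-creation logic; `nonNegativeHash` is a local stand-in written on the assumption that `Utils.nonNegativeHash` maps `hashCode` to a non-negative Int, and `locate` and `subDirName` are hypothetical helper names.

```scala
object LocalDirHashSketch {
  // Assumed behaviour of Utils.nonNegativeHash: absolute value of hashCode,
  // with Int.MinValue (which has no positive counterpart) mapped to 0.
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  // Returns (index of the root directory, index of the subdirectory in it).
  def locate(filename: String, numLocalDirs: Int, subDirsPerLocalDir: Int): (Int, Int) = {
    val hash = nonNegativeHash(filename)
    val dirId = hash % numLocalDirs
    val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
    (dirId, subDirId)
  }

  // Subdirectories are named with two hex digits, as in the excerpt.
  def subDirName(subDirId: Int): String = "%02x".format(subDirId)
}
```

Because the result depends only on the file name and the directory counts, a reader can recompute the same path that the writer used, with no lookup table.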
FileSegmentManagedBuffer then reads the file with plain NIO. The excerpt below shows the read path of nioByteBuffer:
@Override
public ByteBuffer nioByteBuffer() throws IOException {
  FileChannel channel = null;
  try {
    channel = new RandomAccessFile(file, "r").getChannel();
    // Just copy the buffer if it's sufficiently small, as memory mapping has a high overhead.
    if (length < conf.memoryMapBytes()) {
      ByteBuffer buf = ByteBuffer.allocate((int) length);
      channel.position(offset);
      while (buf.remaining() != 0) {
        // Keep reading until the buffer is full (ByteBuffer.allocate gives a heap buffer)
        if (channel.read(buf) == -1) {
          throw new IOException(String.format("Reached EOF before filling buffer\n" +
            "offset=%s\nfile=%s\nbuf.remaining=%s",
            offset, file.getAbsoluteFile(), buf.remaining()));
        }
      }
      buf.flip();
      return buf;
    } else {
      // Larger segments are memory-mapped instead of copied
      return channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }
The sections above show how Spark spreads its data files over several disks. The idea is general: other temporary files are stored the same way.
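The copy-or-map decision in nioByteBuffer can be reproduced outside Spark. The following is a minimal sketch using only java.nio; `readSegment` and its `memoryMapBytes` parameter are hypothetical names that stand in for the method and for `conf.memoryMapBytes()`.

```scala
import java.io.{File, IOException, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

object SegmentReadSketch {
  // Reads `length` bytes starting at `offset`. Small segments are copied
  // into a heap buffer, larger ones are memory-mapped.
  def readSegment(file: File, offset: Long, length: Long, memoryMapBytes: Long): ByteBuffer = {
    val raf = new RandomAccessFile(file, "r")
    try {
      val channel = raf.getChannel
      if (length < memoryMapBytes) {
        val buf = ByteBuffer.allocate(length.toInt) // heap buffer
        channel.position(offset)
        while (buf.remaining() != 0) {
          if (channel.read(buf) == -1) {
            throw new IOException(
              s"Reached EOF before filling buffer: offset=$offset file=$file")
          }
        }
        buf.flip() // switch the buffer from writing to reading
        buf
      } else {
        // A mapping stays valid after the channel is closed.
        channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
      }
    } finally {
      raf.close()
    }
  }
}
```

Both branches return a buffer that covers the same bytes. The difference is only where the bytes live: on the JVM heap for the copy, in a mapped region for the larger case.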
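Summary
- The reader fetches the data of one partition of a given shuffleId.
- It uses the MapStatus information for that partition to find out which nodes hold the data.
- It then classifies the blocks as remote or local.
- Remote blocks are fetched over Netty; local blocks are read from the data files.
- For a local block, shuffleId + mapId + partition identifies the data file, which is then read with NIO.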
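The last point relies on the block name carrying all three ids. A minimal sketch of that naming follows, assuming the `shuffle_<shuffleId>_<mapId>_<reduceId>` format that ShuffleBlockId uses for its name; `ShuffleBlock` and `parse` are hypothetical helpers, not Spark classes.

```scala
object ShuffleBlockNameSketch {
  // Hypothetical stand-in for ShuffleBlockId.
  final case class ShuffleBlock(shuffleId: Int, mapId: Int, reduceId: Int) {
    // The name is what ends up being hashed to a directory and used as the file name.
    def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
  }

  private val Pattern = "shuffle_([0-9]+)_([0-9]+)_([0-9]+)".r

  // Recovers the three ids from a block name, or None if it does not match.
  def parse(name: String): Option[ShuffleBlock] = name match {
    case Pattern(s, m, r) => Some(ShuffleBlock(s.toInt, m.toInt, r.toInt))
    case _ => None
  }
}
```

For example, shuffle 3, map 7, partition 2 becomes `shuffle_3_7_2`, and parsing that name gives the same three ids back.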