A Distributed Task Scheduler Optimizing Data Transfer Time • Taura lab. • Kei Takahashi (56428)
Task Schedulers • A system which distributes many serial tasks onto the grid environment • Task assignments • File transfers • Many serial tasks can be executed in parallel • Some constraints need to be considered • Machine availability • Data location
Data Intensive Applications • A computation using large data • From gigabytes to petabytes • Natural language processing, data mining, etc. • A simple algorithm can extract useful general knowledge • A scheduler must additionally consider: • Reduction in data transfers • Effective placement of data replicas
An Example of Scheduling • The scheduler maps tasks onto machines • Task t0 requires f0 • Task t1 requires f1 • Task t2 requires f1 • [Figure: the scheduler assigns t0, t1 and t2 to machines A and B, which hold files f0 and f1; two assignments are compared, and one of them gives a shorter processing time]
Related Work • Schedulers for data intensive applications • GrADS: predicts transfer time, but only uses a static bandwidth value between two hosts • Efficient multicast • Topology-aware bandwidth simulation • PBS: schedules parallel jobs to nodes while considering network topology
Topology-aware Transfer • If a bandwidth topology map is given, file transfers can be estimated more precisely • Detecting congestion • With this estimation, the task schedule can be optimized more precisely • [Figure: source nodes and destination nodes connected through a switch, where congestion occurs]
Research Purpose • Design and implement a distributed task scheduler for data intensive applications • Predict data transfer time by using a network topology map with bandwidth • Map tasks onto machines • Plan efficient data transfers
Input and Output • Given information: • File Locations and size • Network Topology • Output: • Task Scheduling: • Assignments of nodes to tasks • Transfer Scheduling: • From/to which host data are transferred • Limit bandwidth during the transfer if needed • The amount of data transfer changes depending on a task schedule
Agenda • Background • Purpose • Our Approach • Related Work • Conclusion
Our Approach • Task execution time is hard to predict in general • Schedule one task for each host at a time • Assign new tasks when a certain number of hosts have completed, or a certain time has passed • When the data size and bandwidth are known, file transfer time is predictable • Optimize data transfer time by using network topology and bandwidth information • Multicast • Transfer priorities
Problem Formulation • Final goal: minimizing the makespan • Immediate goal: minimizing the sum of the times that elapse before each task can start, $\sum_i \max_j T(file_{i,j})$ • More immediate goal: minimizing the sum of the arrival times of every file on every node, $\sum_i \sum_j T(file_{i,j})$ ($file_{i,j}$: the $j$-th file required by $task_i$; $T(file_{i,j})$: its arrival time at the node assigned to $task_i$)
Algorithm • When some nodes are unscheduled: • Create an initial candidate task schedule • For each candidate task schedule: • Decide priorities on each file transfer (including ongoing transfers) • Plan an efficient file transfer schedule • Estimate the usage of each link, and find the most crowded link • Search for a better schedule whose most crowded link is less crowded (by using heuristics like GA, SA)
Transfer Priorities (1) • If several transfers share a link, they divide the bandwidth of that link • Since a task cannot be started before the "entire" file has arrived, it is more efficient to put priorities on each transfer • If two transfers are going on, it is more efficient to transfer the smaller file first • [Figure: F0 and F1 (1 GB each) share a 100 MBps link. When the link is shared equally at 50 MBps each, both F0 and F1 arrive after 20 seconds. When F0 is given the full 100 MBps first, F0 arrives after 10 seconds and F1 arrives after 20 seconds]
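The arithmetic of this two-transfer example can be checked with a small sketch. It assumes a single shared link, treats 1 GB as 1000 MB, and assumes that fair sharing splits the bandwidth equally among the transfers still in progress. The function names are illustrative and are not part of the scheduler.

```python
def fair_share_finish_times(sizes_mb, bandwidth_mbps):
    """Finish time of each transfer when active transfers split the link equally."""
    remaining = dict(enumerate(sizes_mb))
    finish = [0.0] * len(sizes_mb)
    now = 0.0
    while remaining:
        rate = bandwidth_mbps / len(remaining)
        # The transfer with the least data left is the next one to finish.
        idx = min(remaining, key=remaining.get)
        dt = max(remaining[idx], 0.0) / rate
        now += dt
        for k in remaining:
            remaining[k] -= rate * dt
        finish[idx] = now
        del remaining[idx]
    return finish


def prioritized_finish_times(sizes_mb, bandwidth_mbps):
    """Finish times when each transfer gets the whole link in turn, smallest first."""
    finish = [0.0] * len(sizes_mb)
    now = 0.0
    for idx in sorted(range(len(sizes_mb)), key=lambda i: sizes_mb[i]):
        now += sizes_mb[idx] / bandwidth_mbps
        finish[idx] = now
    return finish


shared = fair_share_finish_times([1000, 1000], 100)
ordered = prioritized_finish_times([1000, 1000], 100)
print(shared, sum(shared))
print(ordered, sum(ordered))
```

With priorities, the last file arrives at the same time as before, but the first file arrives earlier, so the sum of arrival times is smaller.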
Transfer Priorities (2) • Objective: minimizing the sum of the arrival times of every file on every node ($file_{i,j}$: the $j$-th file required by $task_i$) • Priorities are determined by these criteria: • A file needed by more tasks • A smaller file • A file with fewer replicas (sources)
Transfer Planning • Transfers are planned starting from the file with the highest priority (greedy method) • The highest priority transfer can use as much bandwidth as possible • A transfer with lower priority can use the rest of the bandwidth • The bandwidth for ongoing transfers is also reassigned
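The greedy planning step can be sketched as follows. This is a minimal illustration under assumptions not stated in the slides: each transfer follows a fixed list of links, its rate is bounded by the smallest residual capacity along that list, and the names `allocate_bandwidth`, `link_capacity` and `transfers` are hypothetical.

```python
def allocate_bandwidth(link_capacity, transfers):
    """Assign a rate to each transfer, highest priority first.

    link_capacity: dict mapping link id to bandwidth.
    transfers: list of (name, links) pairs, already sorted by descending priority.
    """
    residual = dict(link_capacity)
    rates = {}
    for name, links in transfers:
        # A transfer can go no faster than the tightest remaining link on its path.
        rate = min(residual[link] for link in links)
        rates[name] = rate
        for link in links:
            residual[link] -= rate
    return rates


capacity = {"a": 100, "b": 60, "c": 100}
plan = allocate_bandwidth(
    capacity,
    [("File1", ["a", "b"]), ("File2", ["a", "c"]), ("File0", ["b"])],
)
print(plan)
```

In this toy input the lowest priority transfer receives no bandwidth until an earlier transfer finishes, which is the point at which rates would be reassigned.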
Transfer Priorities (example) • Task0, Task1, Task2 are scheduled • File0: 3GB, needed by task0 • File1: 2GB, needed by task0, task1, task2 • File2: 1GB, needed by task1 • Transferring priority: • File1 > File2 > File0 • File1 is transferred using a multicast pipeline • [Figure: how the total bandwidth to Task0, Task1 and Task2 is divided among File1, File2 and File0 as the transfers proceed]
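The ordering in this example can be reproduced with a short sketch. The slides list three criteria but do not say how they are combined, so applying them lexicographically in the listed order is an assumption, as are the `FileInfo` type and the replica count of 1 for every file.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    size_gb: float
    num_tasks: int      # number of scheduled tasks that need the file
    num_replicas: int   # number of sources holding the file (assumed values)


def priority_order(files):
    """Sort by: more tasks first, then smaller size, then fewer replicas."""
    return sorted(files, key=lambda f: (-f.num_tasks, f.size_gb, f.num_replicas))


files = [
    FileInfo("File0", 3, 1, 1),
    FileInfo("File1", 2, 3, 1),
    FileInfo("File2", 1, 1, 1),
]
order = [f.name for f in priority_order(files)]
print(order)
```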
Pipeline Multicast (1) • For a given schedule, it is known which nodes require which files • When multiple nodes need a common file, a pipeline multicast shortens transfer time (in the case of large files) • The speed of a pipeline broadcast is limited by the narrowest link in the tree • A broadcast can be sped up by efficiently using multiple sources
Pipeline Multicast (2) • The tree is constructed in a depth-first manner • Every related link is used only twice (once upward, once downward) • Since disk access is as slow as the network, the disk access bandwidth should also be counted • [Figure: a pipeline running from the source through the destination nodes of a tree topology]
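A minimal sketch of the depth-first construction, assuming the network is a tree given as an adjacency dictionary of hosts and switches. The helper names and the toy topology are hypothetical. The sketch chains the destinations in depth-first visiting order and then counts how often each link is crossed in each direction.

```python
def pipeline_order(tree, source, destinations):
    """Destinations in the order a depth-first walk from the source reaches them."""
    order, seen = [], {source}

    def visit(node):
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                if nxt in destinations:
                    order.append(nxt)
                visit(nxt)

    visit(source)
    return order


def tree_path(tree, start, goal):
    """The unique path between two nodes of a tree, as a list of nodes."""
    stack, seen = [(start, [start])], {start}
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def directed_link_usage(tree, chain):
    """Count how many pipeline hops cross each link in each direction."""
    usage = {}
    for a, b in zip(chain, chain[1:]):
        path = tree_path(tree, a, b)
        for u, v in zip(path, path[1:]):
            usage[(u, v)] = usage.get((u, v), 0) + 1
    return usage


tree = {
    "src": ["sw0"], "sw0": ["src", "sw1", "sw2"],
    "sw1": ["sw0", "a", "b"], "sw2": ["sw0", "c"],
    "a": ["sw1"], "b": ["sw1"], "c": ["sw2"],
}
order = pipeline_order(tree, "src", {"a", "b", "c"})
usage = directed_link_usage(tree, ["src"] + order)
print(order, max(usage.values()))
```

Because the chain follows depth-first order, no link is crossed more than once in either direction, which matches the "used only twice" property on the slide.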
Multi-source Multicast • M nodes have the same source data; N nodes need it • For each link, in descending order of bandwidth: • If the link connects two nodes/switches which are already connected to a source node: → Discard the link • Otherwise: → Adopt the link (Kruskal's algorithm: it maximizes the narrowest link in the pipelines) • [Figure: two pipelines, Pipeline 1 and Pipeline 2, each starting from its own source; the link that would join them is discarded]
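The link selection can be sketched with a union-find structure in which each component remembers whether it already reaches a source. This is an illustration, not the author's implementation: the function name and the toy links are hypothetical, and the extra check that skips links inside one component (standard in Kruskal's algorithm) is an addition that the slide does not spell out.

```python
def multi_source_forest(links, sources):
    """Pick links widest-first so every component reaches at most one source.

    links: list of (u, v, bandwidth); sources: set of nodes holding the data.
    """
    parent, has_source = {}, {}

    def find(x):
        parent.setdefault(x, x)
        has_source.setdefault(x, x in sources)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    adopted = []
    for u, v, bw in sorted(links, key=lambda link: -link[2]):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # assumed extra check: the link would close a cycle
        if has_source[ru] and has_source[rv]:
            continue  # both ends already reach a source: discard the link
        parent[ru] = rv
        has_source[rv] = has_source[ru] or has_source[rv]
        adopted.append((u, v, bw))
    return adopted


links = [
    ("s1", "sw", 100), ("s2", "sw", 80), ("sw", "d1", 90),
    ("sw", "d2", 50), ("s2", "d2", 70),
]
forest = multi_source_forest(links, {"s1", "s2"})
print(forest)
```

In this toy input, d1 is served from s1 through the switch and d2 is served directly from s2, so the 50-bandwidth link is never needed.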
Find Crowded Link • For a given schedule, for each link: • list every transfer using that link • sum up the transfer sizes • calculate the "rough transfer time" as follows: (total transfer size) / (bandwidth) • Find the longest "rough transfer time"
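These steps can be sketched directly. The data layout (a bandwidth per link, and a size plus a list of links per transfer) and the function name are assumptions made for illustration.

```python
def most_crowded_link(link_bandwidth, transfers):
    """Return (link, rough_transfer_time) for the link with the longest rough time.

    link_bandwidth: dict mapping link id to bandwidth (e.g. MB/s).
    transfers: list of (size, links) pairs, size in matching units (e.g. MB).
    """
    total = {link: 0.0 for link in link_bandwidth}
    for size, links in transfers:
        for link in links:
            total[link] += size
    # Rough transfer time = (total transfer size) / (bandwidth).
    rough = {link: total[link] / link_bandwidth[link] for link in link_bandwidth}
    worst = max(rough, key=rough.get)
    return worst, rough[worst]


bandwidth = {"a": 100, "b": 50, "c": 100}
transfers = [(1000, ["a", "b"]), (2000, ["a", "c"]), (1000, ["b"])]
print(most_crowded_link(bandwidth, transfers))
```

Link "a" carries the most data here, but link "b" has the longest rough transfer time because its bandwidth is lower.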
Improve the Schedule • After the most crowded link is found, the scheduler tries to reduce the transfer size by altering task assignments • We are considering GA or Simulated Annealing. Since the most crowded link is known, we can try to reduce the transfers on this link in the mutation phase.
Actual Transfers • After the transfer schedule has been determined, the plan is performed as simulated • In a pipeline transfer, the bandwidth is the same on all links • The bandwidth of the narrowest link is limited to the calculated value • When a significant change in bandwidth is detected, the schedule is reconstructed • The bandwidth is measured by using existing methods (e.g. nettimer, bprobe)
Re-scheduling Transfers • When a file transfer has finished, the transfer schedule is recalculated • Calculate the priority of each task • Assign bandwidth to each task • A new bandwidth value is assigned to each transfer, but the pipeline is not changed
Task Description • A user needs to specify required files, output files and dependencies for each task • In order to enable flexible task description, we provide a scheduling API for scripting languages • Files are identified by URIs (e.g. "abc.com:/home/kay/some/location") • The scheduler analyses dependencies from file paths
Task Submission • A task is submitted by calling submit() with the required file paths

    # List the input files, submit one parser task per file, then gather the outputs.
    fs = sched.list_files("abc.com:/home/kay/some/location/**")
    parsed_files = []
    for fn in fs:
        new_fn = fn + ".parsed"
        command = "parser " + fn + " " + new_fn
        sched.submit(command, [new_fn], [])
        parsed_files.append(new_fn)
    sched.gather(parsed_files, "abc.com:/home/kay/parsed/")
Conclusion • Introduced a new scheduling algorithm • Predict transfer time by using network topology, and search for a better task schedule • Efficiently transfer files by limiting bandwidth • Dynamically re-schedule transfers • Current status: • The implementation is ongoing
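Publications • Kei Takahashi, Kenjiro Taura, Takashi Chikayama. A Distributed Collection Object Supporting Migration (in Japanese). Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP2005), Takeo, August 2005. • Kei Takahashi, Kenjiro Taura, Takashi Chikayama. A Distributed Collection Object Supporting Migration (in Japanese). Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2005), Tsukuba, May 2005.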