This project involves developing a distributed data mining system in Java. The group members are Wang Chunsheng, Lin Junfu, and Wang Huifen. The project includes tasks such as client/server socket programming, multi-thread programming, data chunks maintenance, synchronization mechanism, failure handling, backup server logic, RMI mechanism, GUI system infrastructure, and more.
Distributed Data Mining System in Java • Group members: 王春笙, 林俊甫, 王慧芬
Overview of Project • Project participants • 王春笙,林俊甫,王慧芬
Project Programming Tasks • D92725002 林俊甫 • Multicast polling and reply between client and server • Client/server socket programming • Dynamic client join and leave mechanism • Multi-thread programming • Synchronization mechanism • Data chunk maintenance and dispatching mechanism • Client/server communication link control
Project Programming Tasks (cont’d) • Client failure handling • Reassign the backup server if the failed client was the backup • Restore the failed client's work (with 王春笙) • Server failure handling • Backup server designation mechanism and logic design • RMI mechanism (with 王春笙) • Basic GUI
System Infrastructure • System diagram: several clients and one Server/Coordinator share a LAN; the server sends mining data chunks to the clients, and the clients return mining results to the server.
Basic Operation (client and server timelines) • 1. Client polls on port 4444, group 230.0.0.1, sending “@”: who is the server? The server listens for multicast group queries and replies. • 2. Server replies “Servername: I am the server”. The client has found the server and connects to it. • 3. Client connects to <servername, port 4445>. The server forks a thread to handle the client connection. • 4. Server sends “Client do: filechunk#”. The client receives the server's instruction and invokes RMI to get the file chunk. • 5. Client answers “ok”. The server waits for the client's processed result, then orders the client to get another file chunk. • 6. Server sends “Client do: next filechunk#”. • 7. … 8. … and so on.
Port Assignment • Port 4444: for multicast • Port 4445: for TCP/IP socket connection • Port 4446: for RMI services
Finding a Server • Once a client starts up, it queries every 3 seconds over multicast group 230.0.0.1, port 4444, sending the 1-byte string “@” to locate the server host. • Once a server starts up, it forks a thread to handle these queries.
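The discovery query described above can be sketched in Java as follows. This is a minimal illustration, not the project's code: the class and method names (`ServerDiscovery`, `buildQuery`, `isQuery`) are hypothetical, and only the group address, the port, the “@” payload, and the 3-second interval come from the slides.

```java
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative, values are taken from the slides.
final class ServerDiscovery {
    static final String GROUP = "230.0.0.1";   // multicast group
    static final int MULTICAST_PORT = 4444;    // port for multicast queries
    static final long POLL_INTERVAL_MS = 3000; // client polls every 3 seconds

    // Builds the 1-byte "@" query addressed to the multicast group.
    static DatagramPacket buildQuery() {
        byte[] payload = "@".getBytes(StandardCharsets.US_ASCII);
        try {
            InetAddress group = InetAddress.getByName(GROUP);
            return new DatagramPacket(payload, payload.length, group, MULTICAST_PORT);
        } catch (UnknownHostException e) {
            // Not expected for a literal IP address; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
    }

    // Server side: true if a received packet is the 1-byte "@" query.
    static boolean isQuery(DatagramPacket packet) {
        return packet.getLength() == 1
                && packet.getData()[packet.getOffset()] == '@';
    }
}
```

A client could send `buildQuery()` through a `DatagramSocket` once per `POLL_INTERVAL_MS` until a reply arrives, while the server's query thread would receive on a `MulticastSocket` bound to port 4444 and answer packets for which `isQuery` returns true.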
File Dispatching • The server maintains a file chunk pool. • The server finds an available file chunk for a client, sets its status to 1, and orders the client to fetch the chunk through RMI; the status is updated to 2 when the client returns its result. • Recovery: when the server detects that a client's link is broken, it restores the file chunk allocated to that client to 0. • The file chunk class is declared Serializable so that it can be passed to the backup server through RMI. • The file chunk class uses synchronization for concurrency control. • FileChunks status values: -1: empty, 0: available, 1: in use, 2: used
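A minimal sketch of such a chunk pool, assuming the status codes listed on the slide. The class name `FileChunkPool`, its method names, and the per-chunk owner array are assumptions introduced here so that recovery can find the chunks held by a failed client; the slides do not show the real class.

```java
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical chunk pool; status codes follow the slide, everything else is illustrative.
final class FileChunkPool implements Serializable {
    private static final long serialVersionUID = 1L;

    static final int EMPTY = -1;     // slot holds no chunk
    static final int AVAILABLE = 0;  // chunk waiting to be mined
    static final int IN_USE = 1;     // chunk handed to a client
    static final int USED = 2;       // client returned its result

    private final int[] status;
    private final int[] owner;       // assumed: id of the client holding each chunk, -1 if none

    FileChunkPool(int chunkCount) {
        status = new int[chunkCount]; // Java zero-fills, so every chunk starts AVAILABLE
        owner = new int[chunkCount];
        Arrays.fill(owner, -1);
    }

    // Marks the first available chunk as in use and returns its number, or -1 if none is left.
    synchronized int assign(int clientId) {
        for (int i = 0; i < status.length; i++) {
            if (status[i] == AVAILABLE) {
                status[i] = IN_USE;
                owner[i] = clientId;
                return i;
            }
        }
        return -1;
    }

    // Called when the client returns the mining result for a chunk.
    synchronized void complete(int chunk) {
        status[chunk] = USED;
        owner[chunk] = -1;
    }

    // Recovery: puts every chunk still held by a failed client back to available.
    synchronized int release(int clientId) {
        int released = 0;
        for (int i = 0; i < status.length; i++) {
            if (status[i] == IN_USE && owner[i] == clientId) {
                status[i] = AVAILABLE;
                owner[i] = -1;
                released++;
            }
        }
        return released;
    }

    synchronized int statusOf(int chunk) {
        return status[chunk];
    }
}
```

Marking every method `synchronized` keeps the sketch simple: the per-client handler threads on the server can call `assign`, `complete`, and `release` concurrently without two clients receiving the same chunk.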
Backup Server Selection • The server maintains and assigns a unique id to each client. • The unique id is incremented as a serial number. • The client with the smallest id is assigned as backup server. • When a client fails, the server checks whether it was the backup server and, if so, restarts the selection process.
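The selection rule above can be sketched as follows. `ClientRegistry` and its methods are hypothetical names, and starting the serial numbers at 1 is an assumption; the slides only state that ids increase serially and that the smallest id becomes the backup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical registry illustrating "smallest id is the backup server".
final class ClientRegistry {
    private final List<Integer> clientIds = new ArrayList<>();
    private int nextId = 1;    // assumed starting value of the serial number
    private int backupId = -1; // -1 means no backup server is selected

    // Registers a new client and returns its unique serial id.
    synchronized int register() {
        int id = nextId++;
        clientIds.add(id);
        if (backupId == -1) {
            selectBackup();
        }
        return id;
    }

    // Removes a failed client; reruns the selection only if it was the backup.
    synchronized void remove(int id) {
        clientIds.remove(Integer.valueOf(id));
        if (id == backupId) {
            selectBackup();
        }
    }

    synchronized int backupId() {
        return backupId;
    }

    // Picks the connected client with the smallest id, or -1 if none is connected.
    private void selectBackup() {
        backupId = clientIds.stream().mapToInt(Integer::intValue).min().orElse(-1);
    }
}
```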
Nodes Maintenance • The server keeps the records of connected clients in an ArrayList. • The ArrayList is composed of Nodes objects, each recording one client's detailed information. • Diagram (key/value): ArrayList ht holds Nodes entries with the fields Id, Address, Port, Work on, Status
RMI Services • The RMI services are written as an independent program because both the server and the client acting as backup server use them. • The RMI services provide: • Backing up server data to the backup server • Getting a file chunk from the server • Returning mining results to the server • Receiving node information from the server
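As an illustration only, the four services could be declared as one remote interface. Every name and every parameter or return type below is an assumption made for this sketch; the slides list the services but not their signatures.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Hypothetical remote interface; method names and types are not from the slides.
interface MiningRmiService extends Remote {

    // Backup server pulls a snapshot of the server data.
    byte[] getBackupData() throws RemoteException;

    // Client fetches the contents of the file chunk it was told to process.
    byte[] getFileChunk(int chunkNumber) throws RemoteException;

    // Client returns the frequent patterns mined from one chunk.
    void returnMiningResult(int chunkNumber, List<String> frequentPatterns)
            throws RemoteException;

    // Backup server receives the current list of connected nodes.
    List<String> getNodeInfo() throws RemoteException;
}
```

An implementation would typically be exported and bound in an RMI registry listening on port 4446, the port the slides reserve for RMI services.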
Client Failure • Actions taken by the server: • Recovery • Reassignment • Redo the backup server selection if the failed node was the backup • Actions taken by the clients: • Nothing, unless the server tells one of them to act as backup
Server Failure (timelines for Client A, Server S, Client B) • Server S: runs the backup selection and chooses A as backup (“A: Do backup”); answers RMI requests and sends “Client do #” instructions; then the server crashes. • Client A: 1. A is told by S that it is the backup; A invokes RMI to get all server data. 2. A periodically gets server services and file chunk data. 3. The broken communication link is detected; A starts the ServerAction class. 4. A creates a server socket at 4445 and forks a thread to listen for queries and wait for connections; A replies “I am the server”. • Client B: 1. B receives instructions as discussed before (RMI get file, “Client do #”, reply). 2. The broken communication link is detected; B multicasts the query “@: who is the server now?”. 3. B learns that A is the backup and reconnects to A:4445.
Server/Client Life Cycle • Diagram: a Client can evolve into a Server; both the Server and the Client end in normal or abnormal termination.
Project Programming Tasks • D91725001 王春笙 • Web log file preprocessing and separating • Web pages traversal sequences parsing • Page items transferring and mapping • Web pages sequential patterns mining • Mining results maintenance • RMI mining results transfer • Mining results lookup and display
Project Programming Tasks (cont’d) • Backup mechanism • A separate thread backs up server files and in-memory data • Restore the failed client's work (with 林俊甫) • RMI mechanism (with 林俊甫) • GUI global state refresh • System integration • Testing and debugging
Web Log File Format • User IP • Date • Time • Web pages URL
Web File Preprocessing • Select *.htm and *.html pages • Sort first by user ID • Then sort by time • Split page sequences wherever the time gap is more than 30 seconds
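The last step, splitting one user's time-sorted page visits into sequences, might look like the sketch below. `Sessionizer` and `split` are hypothetical names, and reading the rule as “start a new sequence when two consecutive requests are more than 30 seconds apart” is an interpretation of the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one user's visits (already sorted by time) into sequences.
final class Sessionizer {
    static final long MAX_GAP_SECONDS = 30;

    // timesSec[i] is the request time of pages[i], in seconds, in ascending order.
    static List<List<String>> split(long[] timesSec, String[] pages) {
        List<List<String>> sequences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            // A gap longer than 30 seconds closes the current sequence.
            if (i > 0 && timesSec[i] - timesSec[i - 1] > MAX_GAP_SECONDS) {
                sequences.add(current);
                current = new ArrayList<>();
            }
            current.add(pages[i]);
        }
        if (!current.isEmpty()) {
            sequences.add(current);
        }
        return sequences;
    }
}
```

For example, visits at seconds 0, 10, 50, 60, and 100 to pages a, b, c, d, e split into three sequences: [a, b], [c, d], and [e].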
Chunk Data Files • Part*.ppp, sample records: • 6023 2 1 1 2 8 • 6024 1 1 206 • 6025 7 1 1 1 1 1 1 1 2 5 17 18 19 20 11 • 6026 3 1 1 1 144 145 338 • 6027 2 1 1 2 9 • 6028 3 1 1 1 2 8 3 • Items.ppp, sample records (page URL, item ID): • /~visualdep/htm/p5b.htm 168 • /~businessdep/student/picture.html 169 • /~comedu/inde.htm 170 • /~account/91tuition.htm 171 • /~stuaffair/life/procedure-17.htm 172 • /~stuaffair/life/procedure-25.htm 173
Apriori algorithm • 1: find all L1 • 2: generate C2 from L1 • 3: count C2 and find all L2 • 4: k = 3 • 5: generate and prune Ck from Lk-1 • 6: count Ck and find all Lk • 7: if Lk is not empty, then k++ and go to 5
Apriori algorithm (cont’d) • Join phase: s1 joins s2 if s1 (with its first item dropped) equals s2 (with its last item dropped) • s1 join s2 => • Prune phase: delete a k-candidate if any of its (k-1)-subsequences is not large • C and L are stored in hash data structures
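The join and prune phases can be sketched as below. The slide does not show the result of the join, so this sketch assumes the usual construction for sequence candidates: s1 followed by the last item of s2. `CandidateGen` and its methods are hypothetical names, and a `HashSet` stands in for the hash structure that stores L.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical candidate generation for sequences of page ids.
final class CandidateGen {

    // Builds Ck from Lk-1: join, then prune.
    static List<List<Integer>> generate(Set<List<Integer>> large) {
        List<List<Integer>> candidates = new ArrayList<>();
        for (List<Integer> s1 : large) {
            for (List<Integer> s2 : large) {
                // Join condition: s1 without its first item equals s2 without its last item.
                if (s1.subList(1, s1.size()).equals(s2.subList(0, s2.size() - 1))) {
                    // Assumed join result: s1 extended by the last item of s2.
                    List<Integer> candidate = new ArrayList<>(s1);
                    candidate.add(s2.get(s2.size() - 1));
                    if (allSubsequencesLarge(candidate, large)) {
                        candidates.add(candidate);
                    }
                }
            }
        }
        return candidates;
    }

    // Prune test: every subsequence obtained by dropping one item must be large.
    static boolean allSubsequencesLarge(List<Integer> candidate, Set<List<Integer>> large) {
        for (int i = 0; i < candidate.size(); i++) {
            List<Integer> sub = new ArrayList<>(candidate);
            sub.remove(i); // int argument: removes the item at index i
            if (!large.contains(sub)) {
                return false;
            }
        }
        return true;
    }
}
```

With L2 = {[1, 2], [2, 3], [1, 3], [2, 4]}, the join produces [1, 2, 3] and [1, 2, 4]; the second is pruned because its subsequence [1, 4] is not large, leaving [1, 2, 3] as the only 3-candidate.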
Mining Result Display • Client frequent patterns • Web page ID • Support • Saved as *.pppl files • Client frequent patterns • Web page ID • Support • Web page name
Backup Mechanism • When a backup server is selected, that client starts a backup thread • The backup thread loops every 0.5 second • RMI data transfer • Chunk data files (part*.ppp, items.ppp) • Client information • File chunk information • Determine MaxID and set “in use” to “available” • Frequent pattern information
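One way to read the “determine MaxID and set in use to available” step is sketched below: the backup keeps a copy of the server state in which chunks that were being mined are put back in the pool, and it remembers the highest client id issued. That reading, and the names `BackupSnapshot` and `fromServerState`, are assumptions for illustration.

```java
import java.util.List;

// Hypothetical snapshot kept by the backup server; status codes follow the file chunk slide.
final class BackupSnapshot {
    static final int AVAILABLE = 0;
    static final int IN_USE = 1;

    final int[] chunkStatus; // copy of the server's chunk states, with IN_USE reset
    final int maxClientId;   // highest client id seen, 0 if there are no clients

    private BackupSnapshot(int[] chunkStatus, int maxClientId) {
        this.chunkStatus = chunkStatus;
        this.maxClientId = maxClientId;
    }

    // Builds the backup copy without modifying the server's own array.
    static BackupSnapshot fromServerState(int[] serverChunkStatus, List<Integer> clientIds) {
        int[] copy = serverChunkStatus.clone();
        for (int i = 0; i < copy.length; i++) {
            if (copy[i] == IN_USE) {
                copy[i] = AVAILABLE; // unfinished chunks must be mined again after a takeover
            }
        }
        int maxId = 0;
        for (int id : clientIds) {
            maxId = Math.max(maxId, id);
        }
        return new BackupSnapshot(copy, maxId);
    }
}
```

The backup thread could rebuild such a snapshot on each pass of its 0.5-second loop, for instance from a `ScheduledExecutorService` scheduled at a fixed rate of 500 milliseconds.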
System Integration • Java class integration • Server component • Client component • Data mining component • GUI component • Testing • Debugging
Project Programming Tasks • D92725001 王慧芬 • Graphical User Interface • Because the system performs its data mining task in a distributed way, its GUI provides four panels: • A system console • A result window • A connection table • A graphical network configuration
GUI • The system console shows how the system proceeds
GUI (cont’d) • The result window displays the progress and results of data mining
GUI (cont’d) • A connection table lists all of the on-line client connection information
GUI (cont’d) • A connection table consists of 5 fields • NO: client-server connection id • IP address: client's IP address • Port: client's port number • Status: connection status, one of • 0: offline • 1: online • 2: file transfer from server to client • 3: client is doing data mining • 4: client returns its result to the server after data mining has finished • 5: client is doing backup and data mining at the same time • # chunk works on: during data mining and backup, the number of the chunk that the connection is working on
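The status codes above map naturally onto a Java enum. The sketch below is illustrative; the enum and constant names are assumptions, and only the numeric codes and their meanings come from the slide.

```java
// Hypothetical enum for the Status field of the connection table.
enum ConnectionStatus {
    OFFLINE(0),
    ONLINE(1),
    FILE_TRANSFER(2),      // file transfer from server to client
    MINING(3),             // client is doing data mining
    RETURNING_RESULT(4),   // client returns its result to the server
    BACKUP_AND_MINING(5);  // client is doing backup and data mining

    private final int code;

    ConnectionStatus(int code) {
        this.code = code;
    }

    int code() {
        return code;
    }

    // Converts a table value back to a status; rejects unknown codes.
    static ConnectionStatus fromCode(int code) {
        for (ConnectionStatus status : values()) {
            if (status.code == code) {
                return status;
            }
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}
```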
GUI (cont’d) • A graphical network configuration follows the connection table to depict the dynamic network configuration
GUI (cont’d) • In the dynamic network configuration, we use different client GIFs to express the status: • Offline • Online • Data mining • Backup and mining
GUI interface • mw.showMsg() • provided by GUI for server/client module to show the console message • mw.showResultString() • provided by GUI for server/client module to show the results of data mining • Connection table • modified by server/client module for connection information • read by GUI every 0.01 second to depict the dynamic network configuration
GUI design • Java Swing is used to generate labels, text, scrollbars, tables, etc. • Java AWT 2D painting is used to animate the connection lines in the dynamic configuration panel • ‘Photo Impact’ and ‘GIF animator’ are used to generate the node icons • EasyRGB is used to tune the color harmonies.
GUI design (cont’d) • A new thread is forked from the GUI task to animate the connection lines in the dynamic configuration panel: • it reads the table every 0.03 second and shows the connection status with a moving ball.
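Installation • Example: running one server and two clients • Create three folders: Ser (Server), Cli (Client 1), Cli2 (Client 2) • Extract the attached archive into the Ser folder; also download weblog10.zip into this folder and extract it • Extract the attached archive into the empty Cli and Cli2 folders • Open two DOS windows (windows 1 and 2) and change to the Ser folder • Open three DOS windows (windows 3, 4, and 5); in windows 3 and 4 change to the Cli folder, in window 5 change to the Cli2 folder • In window 1, run compile.bat, then run rmi.bat • In window 2, run server.bat • In window 3, run compile.bat, then run rmi.bat • In window 4, run client.bat • In window 5, run compile.bat, then run client.bat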