
Building Your Own Web Spider

Who am I: Graduate, Computer Systems Technology

Albert_Lan

Presentation Transcript


    1. Building Your Own Web Spider Thoughts, Considerations and Problems

    2. Who am I

    3. Why Discuss This?

    4. What Will We Talk About?

    5. Why Build a Spider?

    6. Current Products

    7. Design Considerations, aka Spider Dos and Don'ts

    8. Dos and Don'ts #2

    9. Dos and Don'ts #3 & #4

    10. Dos and Don'ts #5 & #6

    11. Hurdles

    12. Hurdles #2

    13. Hurdles #3

    14. Simple Spider Sample

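    15. Simple Spider Sample Continued

        # Python 2. The re, urllib and urlparse modules are assumed to be
        # imported on the previous slide, which this transcript does not reproduce.

        def getLinks(start_page, page_data):
            # Collect the href of every anchor tag and resolve it against start_page.
            url_list = []
            anchor_href_regex = '<\s*a\s*href\s*=\s*[\x27\x22]?([a-zA-Z0-9:/\\\\._-]*)[\x27\x22]?\s*'
            urls = re.findall(anchor_href_regex, page_data)
            for url in urls:
                url_list.append(urlparse.urljoin(start_page, url))
            return url_list

        def getPage(url):
            # Download the page and return its raw contents.
            page_data = urllib.urlopen(url).read()
            return page_data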

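    16. Simple Spider Sample Continued (2)

        # Python 2. sys and RECURSION_LEVEL are assumed to be imported and
        # defined on an earlier slide, which this transcript does not reproduce.

        if __name__ == '__main__':
            end_results = []
            recursion_count = 0
            try:
                page_array = [sys.argv[1]]
            except IndexError:
                print 'Please provide a valid url.'
                sys.exit()
            # Crawl one level per pass until RECURSION_LEVEL passes are done.
            while recursion_count < RECURSION_LEVEL:
                results = []
                for current_page in page_array:
                    page_data = getPage(current_page)
                    link_list = getLinks(current_page, page_data)
                    for item in link_list:
                        # Keep only links that contain the current page's URL.
                        if item.find(current_page) != -1:
                            results.append(item)
                # De-duplicate, then use this level's links as the next level's pages.
                results = list(set(results))
                page_array = results
                end_results += results
                end_results = list(set(end_results))
                recursion_count += 1
            for item in end_results:
                print item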

    17. Q & A
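
The sample on slides 15 and 16 is written for Python 2. As a rough illustration only, the same idea might look like this in Python 3. This is a sketch under stated assumptions, not the presenter's code: it swaps the regular expression for the standard library's `html.parser`, filters links against the start URL rather than the current page, and tracks already-seen URLs so no page is fetched twice. The names `LinkParser`, `get_links`, `get_page`, `crawl`, and the `max_depth` and `fetch` parameters are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def get_links(start_page, page_data):
    """Return absolute URLs for all anchors found in page_data."""
    parser = LinkParser()
    parser.feed(page_data)
    return [urljoin(start_page, href) for href in parser.hrefs]


def get_page(url):
    """Fetch a page; decoding as UTF-8 with replacement is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, fetch=get_page):
    """Level-by-level crawl, limited to links that contain start_url.

    max_depth plays the role of RECURSION_LEVEL in the slides; fetch is
    injectable so the loop can be exercised without network access.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for link in get_links(page, fetch(page)):
                if start_url in link and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    # Report only the links discovered, not the starting page itself.
    return sorted(seen - {start_url})
```

Passing a dictionary lookup as `fetch` (for example `crawl(url, 2, fetch=pages.__getitem__)`) lets the loop be tried against canned HTML before pointing it at a live site.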
