Web Crawler 101

Presented by 劉純睿 a.k.a 阿吉

About Me

aji.tw

Biggest web crawler company

What is a Web Crawler?

"Crawler" is a generic term for any program (such as a robot or spider) used to automatically discover and scan websites by following links from one webpage to another. (reference)
## Legal Issues
### `robots.txt`

> A robots.txt file is a file at the root of your site that indicates those parts of your site you don’t want accessed by search engine crawlers [(reference)](https://support.google.com/webmasters/answer/6062608?hl=en)
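A crawler can check these rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the `example.com` URLs are made up for illustration and inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
print(parser.crawl_delay("example-bot"))
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing an inline string.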
## Programming Language

What is the ideal programming language for a web crawler?

What's the best way of scraping data from a website?

Crawler Frameworks

### Some great Python frameworks

+ [Scrapy](https://github.com/scrapy/scrapy)
+ [Selenium](https://github.com/SeleniumHQ/selenium/tree/master/py)
+ [Mechanize](https://github.com/python-mechanize/mechanize)
+ [RoboBrowser](https://github.com/jmcarp/robobrowser)
+ [Grab](https://github.com/lorien/grab)
+ [Frontera](https://github.com/scrapinghub/frontera)
## The Basics
### HTTP Headers

```
Accept: text/html,application/xhtml+xm…plication/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Cookie: __cfduid=df5348086ef38be7990d6…d=GA1.2.1406394816.1522116242
Host: static.tutsplus.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linu…) Gecko/20100101 Firefox/59.0
```
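A crawler sends the same kind of headers with every request. Here is a minimal sketch of attaching them with the standard library's `urllib.request.Request`; the header values, the `example-crawler/0.1` user agent, and the URL are illustrative assumptions, and nothing is sent over the network.

```python
from urllib.request import Request

# Illustrative header values; the User-Agent is a made-up crawler name.
HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "example-crawler/0.1",
}


def build_request(url):
    """Build (but do not send) a request that carries the custom headers."""
    return Request(url, headers=HEADERS)


request = build_request("https://example.com/")
# urllib normalizes header names, e.g. "User-Agent" is stored as "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Passing the object to `urllib.request.urlopen(request)` would perform the actual fetch.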
## HTTP Status Code

+ `1xx` Informational responses
+ `2xx` Success
+ `3xx` Redirection
+ `4xx` Client errors
+ `5xx` Server errors
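The first digit alone tells a crawler how to react. A small sketch of that classification, using the standard library's `http.HTTPStatus` for the reason phrases; the `describe` helper and the `Unassigned` label are names chosen for this example.

```python
from http import HTTPStatus

# Status classes keyed by the first digit of the code.
CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe(code):
    """Return a short description such as '404 Not Found (Client error)'."""
    category = CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered statuses known to Python.
        phrase = "Unassigned"
    return f"{code} {phrase} ({category})"


for status in (200, 301, 404, 503):
    print(describe(status))
```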
## Data Format

+ HTML
+ XML
+ JSON
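Each format needs its own parser. The sketch below pulls the same `title` field out of all three with standard-library modules only; the sample documents and helper names are invented for illustration, and a real crawler would more likely use one of the DOM libraries for HTML.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_from_html(text):
    parser = TitleParser()
    parser.feed(text)
    return parser.title


def title_from_xml(text):
    # findtext looks for a direct child named "title" of the root element.
    return ET.fromstring(text).findtext("title")


def title_from_json(text):
    return json.loads(text)["title"]


print(title_from_html("<html><head><title>Hello</title></head></html>"))
print(title_from_xml("<post><title>Hello</title></post>"))
print(title_from_json('{"title": "Hello"}'))
```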
### Session vs Cookie
### IFrame
### AJAX
### Browser Consoles

![chrome](https://raw.githubusercontent.com/alrra/browser-logos/master/src/chrome/chrome_256x256.png)
![firefox](https://raw.githubusercontent.com/alrra/browser-logos/master/src/firefox/firefox_256x256.png)
![safari](https://raw.githubusercontent.com/alrra/browser-logos/master/src/safari/safari_256x256.png)
## Manipulating DOM
### DOM Parsers

| XPath                          | CSS Selector        |
|--------------------------------|---------------------|
| `//*`                          | `*`                 |
| `//p`                          | `p`                 |
| `//p/*`                        | `p > *`             |
| `//*[@id='foo']`               | `#foo`              |
| `//*[contains(@class,'foo')]`  | `.foo`              |
| `//*[@title]`                  | `*[title]`          |
| `//p/*[1]`                     | `p > *:first-child` |
| `//p[a]`                       | Not possible        |
| `//p/following-sibling::*[1]`  | `p + *`             |

XPath positions start at 1, not 0.
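A few of these XPath expressions can be tried without installing anything. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath (paths are written relative to the root with `.//`, and functions such as `contains()` are not available); the sample document is invented for this example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample document (made up for illustration).
DOC = ET.fromstring("""
<div>
  <p id="foo" title="first"><a href="#">link</a><b>bold</b></p>
  <p>plain</p>
</div>
""")

# //p : every <p> element
all_paragraphs = DOC.findall(".//p")

# //*[@id='foo'] : any element whose id attribute equals "foo"
by_id = DOC.findall(".//*[@id='foo']")

# //*[@title] : any element that has a title attribute
with_title = DOC.findall(".//*[@title]")

# //p[1] : positions are 1-based, so this is the first <p> among its siblings
first_paragraph = DOC.findall(".//p[1]")

# //p[a] : <p> elements that have an <a> child
paragraphs_with_link = DOC.findall(".//p[a]")

# //p/* : all element children of <p> elements
children = DOC.findall(".//p/*")

print(len(all_paragraphs), [child.tag for child in children])
```

For full XPath 1.0 or CSS selector support, a dedicated parsing library is the better fit.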
### Python Libraries

+ [PyQuery](https://pyquery.readthedocs.io/en/latest/)
+ [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
+ [lxml](http://lxml.de/)
+ [cssselect](https://cssselect.readthedocs.io/en/latest/)
### More Beautiful than Beautiful Soup

[PyCon Taiwan 2017 slides](https://aji.tw/slides/pycon2017)
## Anti-crawling
### (Infinite) redirection
### 404 Not Found
### 503 Service Unavailable
### 504 Gateway Timeout
### JS detection
## Anti-anti-crawling
## S***, I got banned
### HTTP Headers Checking
### Obfuscation

```
RKbW=function(){'RKbW';var _R=function(){return '=76'}; return _R();};function AIV(AIV_){function _A(AIV_){function ph(){return getName();}function AIV_(){}return ph();return AIV_}; return _A(AIV_);}DmZP='iew';_IX161 = 'assign';function zoQ(zoQ_){function ti(){return getName();};return ti();return 'zoQ'}function r2oe(){'return r2oe';return 'ad&'}_eloda = 'replace';F59s=function(){'return F59s';return 'p?m';};HP=function(){'return HP';return 'n';};function getName(){var caller=getName.caller;if(caller.name){return caller.name} var str=caller.toString().replace(/[\s]*/g,"");var name=str.match(/^function([^\(]+?)\(/);if(name && name[1]){return name[1];} else {return '';}}uM=function(){'return uM';return '9';};eG9=function(eG9_){'return eG9';return eG9_;};function kp(kp_){function _k(kp_){function o(){return getName();}function kp_(){}return o();return kp_}; return _k(kp_);}vD='1';BN=function(){'BN';var _B=function(){return 'r'}; return _B();};HALw='rum';_RZnE9 = 'href';o5y=function(o5y_){'return o5y';return o5y_;};function PH(){'return PH';return '.'}_BDkwZ = location;function w2(){'return w2';return '_'}KTI4=function(){'return KTI4';return '910';};_NUuAJ = window;wX=function(){'wX';var _w=function(){return 'd'}; return _w();};iyL=function(iyL_){'return iyL';return iyL_;};location.replace((function(){'return Q8mM';return '/fo'})()+HALw+PH()+AIV('Gs8')+F59s()+kp('nm')+(function(){'return njFH';return (function(){return 'd=v';})();})()+DmZP+eG9('th')+BN()+(function(){'return XD';return 'e'})()+r2oe()+zoQ('yKM')+wX()+RKbW()+'8&'+w2()+o5y('ds')+iyL('ig')+HP()+(function(){'return l26W';return '=6f'})()+uM()+(function(){'return by';return (function(){return '7';})();})()+KTI4()+vD);_NUuAJ['href']=(function(){'return Q8mM';return '/fo'})()+HALw+PH()+AIV('Gs8')+F59s()+kp('nm')+(function(){'return njFH';return (function(){return 'd=v';})();})()+DmZP;
```

Reverse Engineering

```
location.href = '/forum.php?mod=viewthread&tid=768&_dsign=6f979101'
```
### IP Banning

![Tor](https://upload.wikimedia.org/wikipedia/commons/1/15/Tor-logo-2011-flat.svg)

Tor (The Onion Router)
### IP Banning (Cont'd)

TCP Proxy

+ SOCKS
+ HTTP
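Both proxy types are usually configured through a URL. The sketch below builds the scheme-to-URL mapping that the Requests library accepts in its `proxies` argument; the `build_proxies` helper and the HTTP proxy address are hypothetical, and `127.0.0.1:9050` assumes a Tor daemon listening on its default SOCKS port.

```python
def build_proxies(scheme, host, port):
    """Return a proxies mapping that routes HTTP and HTTPS traffic via one proxy."""
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# SOCKS proxy: "socks5h" asks the proxy to resolve host names as well.
# Assumes a local Tor daemon on its default SOCKS port.
TOR_PROXIES = build_proxies("socks5h", "127.0.0.1", 9050)

# Plain HTTP proxy at a hypothetical local address.
HTTP_PROXIES = build_proxies("http", "127.0.0.1", 8080)

print(TOR_PROXIES)
print(HTTP_PROXIES)
```

With Requests installed, the mapping would be passed as `requests.get(url, proxies=TOR_PROXIES)`; SOCKS URLs additionally need the optional `requests[socks]` extra.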
### Captcha

![captcha](https://upload.wikimedia.org/wikipedia/commons/b/b6/Modern-captcha.jpg)

OCR (Optical Character Recognition)
### No CAPTCHA reCAPTCHA

![reCAPTCHA](https://upload.wikimedia.org/wikipedia/commons/a/ad/RecaptchaLogo.svg)

Human Computation ("worker intelligence", a Chinese pun on "artificial intelligence")
## Live Demo
### Example: PTT - Gossiping

+ Requests
+ Selenium
+ RoboBrowser
### Requests

```python
import requests

url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
# The over18 cookie marks the age confirmation as already accepted.
cookies = {'over18': 'yes'}
response = requests.get(url, cookies=cookies)
```
### Requests (cont'd)

```python
from requests import Session

# A session keeps the cookie for every later request.
session = Session()
session.cookies['over18'] = 'yes'
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
response = session.get(url)
```
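### Selenium

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
browser.get(url)
# Selenium 4 removed find_element_by_css_selector; use find_element with By.
element = browser.find_element(By.CSS_SELECTOR, 'button[name="yes"]')
element.click()
```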
### RoboBrowser

```python
from robobrowser import RoboBrowser

browser = RoboBrowser(history=True)
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
browser.open(url)
form = browser.get_form(action='/ask/over18')
form['from'].value = 'yes'
browser.submit_form(form)
```
## Final Words

+ API first
+ Caching
+ Throttling
+ Hiding
+ Testing
+ Monitoring
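Caching and throttling can live in one thin wrapper around whatever function actually downloads a page. This is a minimal sketch, not a production design: `PoliteFetcher`, `fake_fetch`, and the interval value are names and numbers invented for this example, and the cache is an in-memory dictionary that never expires.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with an in-memory cache and a minimum delay."""

    def __init__(self, fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_request = None

    def get(self, url):
        # Caching: a URL already fetched is served without a new request.
        if url in self._cache:
            return self._cache[url]
        # Throttling: wait until min_interval has passed since the last request.
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        body = self._fetch(url)
        self._cache[url] = body
        return body


def fake_fetch(url):
    """Stand-in for a real download so the example runs offline."""
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(fake_fetch, min_interval=0.01)
print(fetcher.get("https://example.com/a"))
print(fetcher.get("https://example.com/a"))  # served from the cache
print(fetcher.get("https://example.com/b"))  # waits briefly before fetching
```

Injecting `clock` and `sleep` keeps the wrapper testable, which also covers the "Testing" bullet without waiting in real time.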