More Beautiful than Beautiful Soup

pyquery

Presented by 劉純睿 a.k.a 阿吉

About Me

aji.tw

Outline

  1. Introduction
  2. The Basics
  3. PyQuery vs BeautifulSoup
  4. Digging deeper
  5. Practical Experience

Introduction

Installation

Using `pip`:

```bash
pip install pyquery
```

which includes:

```
lxml>=2.1
cssselect>0.7.9
```
FAQ

StackOverflow: [pip install lxml error](https://stackoverflow.com/questions/5178416/pip-install-lxml-error)

Performance

Fast, because of:
+ [`lxml`](http://lxml.de/performance.html): based on `libxml2`
+ [`libxml2`](http://xmlsoft.org): written in `C`

Camel Case Alias

| snake_case      | camelCase      |
|-----------------|----------------|
| `next_all`      | `nextAll`      |
| `prev_all`      | `prevAll`      |
| `remove_attr`   | `removeAttr`   |
| `has_class`     | `hasClass`     |
| `add_class`     | `addClass`     |
| `remove_class`  | `removeClass`  |
| `toggle_class`  | `toggleClass`  |
| `outer_html`    | `outerHtml`    |
| `append_to`     | `appendTo`     |
| `prepend_to`    | `prependTo`    |
| `insert_after`  | `insertAfter`  |
| `insert_before` | `insertBefore` |
| `wrap_all`      | `wrapAll`      |
| `replace_with`  | `replaceWith`  |
| `replace_all`   | `replaceAll`   |

Resource

The Basics

Categories

  1. Traversing
  2. Properties
  3. Attributes
  4. CSS
  5. HTML
  6. Manipulating
  7. pyquery specifics

Traversing

  • .parent
  • .prev
  • .next
  • .next_all
  • .prev_all
  • .siblings
  • .parents
  • .children
  • .closest
  • .contents
  • .filter
  • .not_
  • .is_
  • .find
  • .eq
  • .each
  • .map
  • .end

Properties

  • .length
  • .size

Properties (cont.)

In essence they are the same: `length` is a property, while `size()` is a plain method returning the same value.

```python
class PyQuery(list):

    # ...

    @property
    def length(self):
        return len(self)

    def size(self):
        return len(self)
```

Attributes

  • .attr
  • .remove_attr

CSS

  • .height
  • .width
  • .has_class
  • .add_class
  • .remove_class
  • .toggle_class
  • .css
  • .hide
  • .show

HTML

  • .val
  • .html
  • .outer_html
  • .text

HTML (cont.)

  • .outer_html

```python
>>> html = '''
<div>
    <p>Hello, PYCON TW!</p>
</div>
'''
>>> dom = PyQuery(html)
>>> print(dom.html())

    <p>Hello, PYCON TW!</p>

>>> print(dom.outer_html())
<div>
    <p>Hello, PYCON TW!</p>
</div>
```

Manipulating

  • .append
  • .append_to
  • .prepend
  • .prepend_to
  • .after
  • .insert_after
  • .before
  • .insert_before
  • .wrap
  • .wrap_all
  • .replace_with
  • .replace_all
  • .clone
  • .empty
  • .remove
  • .fn

PyQuery Specifics

  • .base_url
  • .make_links_absolute

PyQuery Specifics (cont.)


```python
url = 'https://aji.tw'

dom = PyQuery(url).make_links_absolute()

# or
import requests
resp = requests.get(url)
dom = PyQuery(resp.text).make_links_absolute(url)
```

PyQuery

vs

BeautifulSoup

Built-in URL Opener


```python
>>> url = 'http://www.kyle.com.tw/'

# pyquery
>>> PyQuery(url)
[<html>]

# BeautifulSoup
>>> BeautifulSoup(url)
Beautiful Soup is not an HTTP client. You should probably
use an HTTP client like requests to get the document behind
the URL, and feed that document to Beautiful Soup.
```

Find First Element


```python
html = '''
<div class="r-list-container">
    <div class="r-ent">...</div>
    <div class="r-ent">...</div>
    <div class="r-ent">...</div>
    <div class="r-ent">...</div>
</div>
'''
soup = BeautifulSoup(html)
dom = PyQuery(html)

# BeautifulSoup
soup.find(class_='r-ent').text
soup.select_one('.r-ent').text

# PyQuery
dom('.r-ent:first').text()
dom('.r-ent').eq(0).text()
```

Pseudo-element Support

Find the text `First`

```html
<ul>
    <li>First</li>
    <li>Second</li>
    <li>Third</li>
</ul>
```
| CSS Selector                | PyQuery | BeautifulSoup |
|-----------------------------|---------|---------------|
| `ul > li:nth-of-type(1)`    | 😄      | 😄            |
| `ul > li:first`             | 😄      | 😢            |
| `ul > li:eq(0)`             | 😄      | 😢            |
| `ul > li:contains("First")` | 😄      | 😢            |

Iterate Elements and Get Attributes


```python
html = '''
<div class="r-list-container">
    <div class="r-ent">...
        <a href="https://ptt.cc...">[耍冷] 不要用手指月亮</a>
    </div>
    <div class="r-ent">...
        <a href="https://ptt.cc...">[笑話] 很大隻的黑狗</a>
    </div>
    <div class="r-ent">...
        <a href="https://ptt.cc...">[猜謎] 寂寞了要找哪個單位</a>
    </div>
    <div class="r-ent">...
        <a href="https://ptt.cc...">[猜謎] 今天是幾月幾號?</a>
    </div>
</div>
'''
# BeautifulSoup
[i.find('a').attrs['href'] for i in soup.find_all(class_='r-ent')]
[i.attrs['href'] for i in soup.select('.r-ent a')]

# PyQuery
[i.attr('href') for i in dom('.r-ent a').items()]          # list comprehension
dom('.r-ent a').map(lambda i, e: PyQuery(e).attr('href'))  # PyQuery.map
dom('.r-ent a').map(lambda i, e: e.attrib['href'])         # PyQuery.map with lxml.html.HtmlElement.attrib
```

Select Text Without Tag

Select Text Without Tag (cont.)


```html
<div id="main-content" class="bbs-screen bbs-content">
    <div class="article-metaline">...</div>
    <div class="article-metaline">...</div>
    <div class="article-metaline">...</div>
    <div class="article-metaline">...</div>
    助人為快樂之本溫泉 --
    <span class="f2">...</span>
    <span class="f2">...</span>
    <div class="push">...</div>
    <div class="push">...</div>
    <div class="push">...</div>
</div>
```

```python
# BeautifulSoup
soup.find_all(class_='article-metaline')[-1].nextSibling

# PyQuery
dom('#main-content').remove('*').text()          # mutates the tree in place
dom('#main-content').clone().remove('*').text()  # clone first to keep the original intact
```

Find All Elements Before Certain Element

Find All Elements Before Certain Element (cont.)


```html
<div class="r-ent">[囧rz] 一直睡覺的狗</div>
<div class="r-ent">[猜謎] 超黑的人 猜一部電影</div>
<div class="r-ent">[笑話] 古代文字獄</div>
<div class="r-list-sep"></div>
<div class="r-ent">![公告] joke版規 (2017/05/09 更新)</div>
```
```python
# BeautifulSoup
eles = []
for ele in soup.find_all(attrs={'class': ['r-ent', 'r-list-sep']}):
    if ele.attrs['class'][0] == 'r-list-sep':
        break
    else:
        eles.append(ele)

# PyQuery
eles = dom('.r-list-sep').prev_all('.r-ent')
```

Digging Deeper

More about URL Opener


```python
from pyquery import PyQuery

# These are equivalent
PyQuery(url='https://aji.tw')
PyQuery('https://aji.tw')

# add cookies
PyQuery('http://www.anno.solutions/', cookies=dict(over18='1'))

# add headers
PyQuery(
    'https://aji.tw',
    headers={'User-Agent': 'I am not a robot!'}
)
```

More about URL Opener (cont.)

`PyQuery` allows you to add a custom url opener by passing `opener` as an argument.

```python
from pyquery import PyQuery
from selenium.webdriver import Firefox

def selenium_opener(url):
    driver = Firefox()
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html

if __name__ == '__main__':
    PyQuery('https://aji.tw', opener=selenium_opener)
```

Custom Function

You can do this by sub-classing `PyQuery`

```python
from pyquery import PyQuery

class PyQueryPlus(PyQuery):

    def exists(self):
        return True if len(self) else False
```

Custom Function (cont.)

Or, you can do it in a more jQuery-like way.

```python
from pyquery import PyQuery

# `this` is injected by `PyQuery.fn` at call time

# using lambda (more succinct)
PyQuery.fn.exists = lambda: True if len(this) else False

# using def (more readable)
def exists():
    return True if len(this) else False

PyQuery.fn.exists = exists
```

Custom Function (cont.)

A more complex example: `next_until`
pyquery.py

```python
class PyQuery(list):

    # ...

    @with_camel_case_alias
    def next_until(self, selector, filter_=None):
        eles = []
        for e in self:
            next_ = e.getnext()
            while next_ is not None:
                if next_.cssselect(selector):
                    break
                if filter_:
                    if next_.cssselect(filter_):
                        eles.append(next_)
                else:
                    eles.append(next_)
                next_ = next_.getnext()
        return self._copy(eles, parent=self)
```

Custom Pseudo-element Parser

Add `:second` which selects the second element (acting like `:first`) as an example
cssselectpatch.py

```python
class JQueryTranslator:

    # ...

    def xpath_second_pseudo(self, xpath):
        xpath.add_post_condition('position() = 2')
        return xpath

if __name__ == '__main__':
    dom = PyQuery('https://aji.tw')
    dom('h2:second')
```

Monkey-patching

`:not` is parsed incorrectly, so a monkey-patch can be applied.
```python
import re

from pyquery import PyQuery
from pyquery.cssselectpatch import JQueryTranslator

class JQueryTranslatorPlus(JQueryTranslator):

    def css_to_xpath(self, *args, **kwargs):
        target = None
        pat = re.search(r'(.+):not\(:(.+)\)', args[0])
        if pat:
            node, pseudo = map(pat.group, (1, 2))
            if pseudo == 'first':
                target = 1
            elif pseudo == 'last':
                target = 'last()'
            else:
                pseudo_pat = re.search(r'(nth-of-type|eq)\((\d+)\)', pseudo)
                if pseudo_pat:
                    pseudo_func, num = map(pseudo_pat.group, (1, 2))
                    num = int(num)
                    target = num + 1 if pseudo_func == 'eq' else num
        if target:
            xpath = super().css_to_xpath(node) + '[position() != {}]'.format(target)
        else:
            xpath = super().css_to_xpath(*args, **kwargs)
        return xpath

class PyQueryPlus(PyQuery):
    _translator_class = JQueryTranslatorPlus

if __name__ == '__main__':
    dom = PyQueryPlus('http://aji.tw')
    dom('h2:not(:first)')
    dom('h2:not(:last)')
    dom('h2:not(:eq(0))')
    dom('h2:not(:nth-of-type(1))')
```

Practical Experience

Selenium + PyQuery

`selenium` does not support selectors with jQuery-style pseudo-elements very well:
```python
from selenium.webdriver import Firefox

firefox = Firefox()
firefox.find_element_by_css_selector   # plain CSS only
firefox.find_elements_by_css_selector  # `:first`, `:eq()`, etc. are not understood
```

Selenium + PyQuery (cont.)

Directly use `JQueryTranslator`
```python
from selenium.webdriver import Firefox
from pyquery.cssselectpatch import JQueryTranslator

class FirefoxPlus(Firefox):

    def find_element_by_pyquery(self, css):
        xpath = JQueryTranslator().css_to_xpath(css)
        return self.find_element_by_xpath(xpath)

    def find_elements_by_pyquery(self, css):
        xpath = JQueryTranslator().css_to_xpath(css)
        return self.find_elements_by_xpath(xpath)

if __name__ == '__main__':
    browser = FirefoxPlus()
    # visit "PTT Gossiping"
    browser.get('http://ptt.cc/bbs/Gossiping/')
    # click the "我同意,我已年滿十八歲" (I agree, I am over 18) button
    browser.find_element_by_pyquery('button[name="yes"]').click()
```

Selenium + PyQuery (cont.)

Add `PyQuery` object to `webdriver`
```python
from selenium.webdriver import Firefox
from pyquery import PyQuery

class FirefoxPlus(Firefox):

    @property
    def dom(self):
        return PyQuery(self.page_source)

if __name__ == '__main__':
    browser = FirefoxPlus()
    browser.get('http://ptt.cc/bbs/Gossiping/')
    print(browser.dom)
```

Scrapy + PyQuery

If you don't like Scrapy's ad hoc CSS pseudo classes...

```python
response.css('.r-ent::text').extract()
response.css('.r-ent::attr(href)').extract()
```

Scrapy + PyQuery (cont.)

You can attach a `PyQuery` object to the Scrapy response
middlewares.py

```python
from pyquery import PyQuery

class PyQueryMiddleware:

    def process_response(self, request, response, spider):
        response.dom = PyQuery(response.text)
        return response
```
settings.py
```python
DOWNLOADER_MIDDLEWARES = {
    # ...
    'someproject.middlewares.PyQueryMiddleware': 543,
}
```

Final Words

+ PyQuery might be your best friend for web crawling.
+ Same logic at all times.
+ Let's make `pyquery` stronger! ([Github issues](https://github.com/gawel/pyquery/issues))

Thank you