BeautifulSoup
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
import requests
from bs4 import BeautifulSoup
request = requests.get('{{ url }}')
soup = BeautifulSoup(request.text, "html.parser")
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
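The outputs above come from the "three sisters" sample document rather than from a live request. A minimal sketch for reproducing them offline, assuming the markup below (the variable name html_doc is only illustrative and stands in for request.text):

```python
from bs4 import BeautifulSoup

# Sample markup assumed for the examples on this page.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)               # text of the <title> tag
print(soup.p["class"])                 # class is multi-valued, so this is a list
print(len(soup.find_all("a")))         # number of <a> tags in the document
print(soup.find(id="link3")["href"])   # href of the tag whose id is link3
```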
Installation⚑
pip install beautifulsoup4
The default parser, html.parser, ships with Python but does not follow the HTML5 parsing rules, so it can handle some modern pages poorly. If you run into that, use the html5lib parser instead. It is a separate package, so you need to install it as well:
pip install html5lib
Usage⚑
Kinds of objects⚑
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Tag⚑
A Tag object corresponds to an XML or HTML tag in the original document:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
The most important features of a tag are its name and attributes.
Name⚑
Every tag has a name, accessible as .name:
tag.name
# 'b'
If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
Attributes⚑
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute id whose value is boldest. You can access a tag’s attributes by treating the tag like a dictionary:
tag = BeautifulSoup('<b id="boldest">Extremely bold</b>', 'html.parser').b
tag['id']
# 'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {'id': 'boldest'}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <b another-attribute="1" id="verybold">Extremely bold</b>
del tag['id']
del tag['another-attribute']
tag
# <b>Extremely bold</b>
tag['id']
# KeyError: 'id'
print(tag.get('id'))
# None
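Because indexing a missing attribute raises KeyError, scraping code usually reads attributes defensively. A short sketch, assuming a made-up two-link snippet, that uses the tag's get() and has_attr() methods:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one without.
soup = BeautifulSoup('<a href="/about">About</a><a>No link</a>', "html.parser")

# tag.get() returns a default instead of raising KeyError.
hrefs = [a.get("href", "") for a in soup.find_all("a")]
print(hrefs)

# has_attr() tests for presence without reading the value.
with_href = [a for a in soup.find_all("a") if a.has_attr("href")]
print(len(with_href))
```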
Multi-valued attributes⚑
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
# ['body']
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'
When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
If you parse a document as XML, there are no multi-valued attributes; every attribute value is kept as a single string.
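Parsing as XML needs the lxml package, but the same single-string behaviour can be seen with the built-in parser. The sketch below assumes a Beautiful Soup release recent enough to accept the multi_valued_attributes constructor argument (4.8 or later, to the best of my knowledge):

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout"></p>'

# Default HTML behaviour: class is split into a list.
as_list = BeautifulSoup(markup, "html.parser").p["class"]

# With multi_valued_attributes=None every attribute stays a single string.
as_string = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
).p["class"]

print(as_list)
print(as_string)
```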
NavigableString⚑
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is just like a Python string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a plain Python string with str():
plain_string = str(tag.string)
plain_string
# 'Extremely bold'
type(plain_string)
# <class 'str'>
You can’t edit a string in place, but you can replace one string with another, using replace_with():
tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser').b
tag.string.replace_with("No longer bold")
tag
# <b class="boldest">No longer bold</b>
BeautifulSoup⚑
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
Navigating the tree⚑
Going down⚑
Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.
Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.
Navigating using tag names⚑
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:
soup.body.b
# <b>The Dormouse's story</b>
Using a tag name as an attribute will give you only the first tag by that name:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
.contents and .children⚑
 A tag’s children are available in a list called .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in title_tag.children:
    print(child)
# The Dormouse's story
.descendants⚑
 The .contents and .children attributes only consider a tag’s direct children. For instance, the <head> tag has a single direct child–the <title> tag:
head_tag.contents
# [<title>The Dormouse's story</title>]
But the <title> tag itself has a child: the string The Dormouse’s story. There’s a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
.string⚑
 If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag.string
# "The Dormouse's story"
If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:
head_tag.contents
# [<title>The Dormouse's story</title>]
head_tag.string
# "The Dormouse's story"
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
.strings and .stripped_strings⚑
 If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
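A common use of .stripped_strings is to flatten a fragment into readable text. A small sketch, assuming a made-up snippet with stray whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with leading, trailing, and inner whitespace.
soup = BeautifulSoup("<p>\n  Hello, <b> world </b>\n</p>", "html.parser")

# Collect the whitespace-stripped text fragments; whitespace-only strings are skipped.
fragments = list(soup.stripped_strings)
print(fragments)

# Join them with a single space to get readable text.
text = " ".join(fragments)
print(text)
```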
Going up⚑
Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.
.parent⚑
 You can access an element’s parent with the .parent attribute.
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
.parents⚑
 You can iterate over all of an element’s parents with .parents.
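As an illustration, this sketch walks from an <a> tag up to the top of a small, made-up document. The last item is the BeautifulSoup object itself, whose name is [document]:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link nested inside a paragraph.
soup = BeautifulSoup(
    "<html><body><p>See <a href='#'>this link</a></p></body></html>",
    "html.parser",
)

# Walk from the <a> tag up to the top of the tree.
parent_names = [parent.name for parent in soup.a.parents]
print(parent_names)
```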
Going sideways⚑
When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
.next_sibling and .previous_sibling⚑
You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
The <b> tag has a .next_sibling, but no .previous_sibling, because there’s nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace.
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# ',\n'
The second <a> tag is actually the .next_sibling of the comma:
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
.next_siblings and .previous_siblings⚑
 You can iterate over a tag’s siblings with .next_siblings or .previous_siblings:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# '; and they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
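Since the text between tags shows up as sibling strings, it is often convenient to keep only the siblings that are tags. One way to sketch that, assuming a small made-up paragraph, is an isinstance() check against bs4.element.Tag:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: three links separated by text.
soup = BeautifulSoup(
    "<p><a>one</a>,\n<a>two</a> and\n<a>three</a></p>", "html.parser"
)

# Keep only siblings that are tags, dropping the text nodes between them.
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
sibling_texts = [str(t.string) for t in tag_siblings]
print(sibling_texts)
```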
Searching the tree⚑
By passing a filter to a method like find_all(), you can zoom in on the parts of the document you’re interested in.
Kinds of filters⚑
A string⚑
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:
soup.find_all('b')
# [<b>The Dormouse's story</b>]
A regular expression⚑
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter b; in this case, the <body> tag and the <b> tag:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
A list⚑
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
A function⚑
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.
Here’s a function that returns True if a tag defines the class attribute but doesn’t define the id attribute:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
Pass this function into find_all() and you’ll pick up all the <p> tags:
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
find_all()⚑
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
soup.find_all("title")
# [<title>The Dormouse's story</title>]
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'
The name argument⚑
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.
This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
The keyword arguments⚑
 Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s id attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.
You can filter multiple attributes at once by passing in more than one keyword argument:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
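The different kinds of attribute filters can be sketched side by side. The markup here is made up for the illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: three links with different attributes.
soup = BeautifulSoup(
    '<a id="x" href="/a">A</a><a href="http://example.com/b">B</a><a>C</a>',
    "html.parser",
)

# True matches any tag that has the attribute, whatever its value.
with_id = soup.find_all(id=True)

# A compiled regular expression is matched against the attribute value.
absolute = soup.find_all(href=re.compile(r"^https?://"))

# A list matches if the value equals any item in the list.
either = soup.find_all(href=["/a", "/missing"])

print(len(with_id), len(absolute), len(either))
```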
Searching by CSS class⚑
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The string argument⚑
With string you can search for strings instead of tags.
soup.find_all(string="Elsie")
# ['Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(string=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string.
soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
The limit argument⚑
find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for limit.
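For instance, a sketch on a made-up three-item list, where the search stops as soon as two matches have been found:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list with three items.
soup = BeautifulSoup("<ul><li>1</li><li>2</li><li>3</li></ul>", "html.parser")

# Stop searching after the first two matches.
first_two = soup.find_all("li", limit=2)
print(first_two)
```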
The recursive argument⚑
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.
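The difference is easiest to see on a nested fragment. This sketch assumes a made-up <div> with one direct <p> child and one nested deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under <div>, one inside <section>.
soup = BeautifulSoup(
    "<div><p>direct</p><section><p>nested</p></section></div>", "html.parser"
)

all_p = soup.div.find_all("p")                      # every <p> under <div>
direct_p = soup.div.find_all("p", recursive=False)  # only direct children
print(len(all_p), len(direct_p))
```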
Calling a tag is like calling find_all()⚑
Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:
soup.find_all("a")
soup("a")
find()⚑
find() works like find_all(), but returns only the first matching result, or None if nothing matches.
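A short sketch of the difference in return values, using a made-up two-paragraph snippet. Checking for None before using the result avoids an AttributeError:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs and no table.
soup = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")

first = soup.find("p")                 # first matching tag
missing = soup.find("table")           # no match, so None
as_list = soup.find_all("p", limit=1)  # same search, but wrapped in a list

print(first.string, missing, len(as_list))
```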
find_parent() and find_parents()⚑
 These methods work their way up the tree, looking at a tag’s (or a string’s) parents.
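A minimal sketch, assuming a made-up fragment, that starts from a string and searches upward. find_parent() returns the nearest match and find_parents() returns every match from the inside out:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link inside a paragraph inside a div.
soup = BeautifulSoup(
    '<div id="outer"><p class="story">Names: <a id="link2">Lacie</a></p></div>',
    "html.parser",
)

a_string = soup.find(string="Lacie")

# Nearest enclosing <p>.
nearest_p = a_string.find_parent("p")

# Every enclosing <p> or <div>, innermost first.
enclosing = [tag.name for tag in a_string.find_parents(["p", "div"])]

print(nearest_p["class"], enclosing)
```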
find_next_siblings() and find_next_sibling()⚑
 These methods use .next_siblings to iterate over the rest of an element’s siblings in the tree. The find_next_siblings() method returns all the siblings that match, and find_next_sibling() only returns the first one:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
To go in the other direction, you can use find_previous_siblings() and find_previous_sibling().
Modifying the tree⚑
replace_with⚑
 PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
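If replace_with() does not give the result you expect, you can instead empty the parent tag and append the new element. Note that this discards all of the parent's existing children, not only the one you wanted to replace: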
a_tag.clear()
a_tag.append(new_tag)
Tips⚑
Show content beautified / prettified⚑
Use print(soup.prettify()).
Cleaning escaped HTML code⚑
soup = BeautifulSoup(s.replace(r"\"", '"').replace(r"\/", "/"), "html.parser")
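The line above assumes s holds HTML whose quotes and slashes arrive backslash-escaped, for example because it was embedded in a JSON-like string. A sketch with a made-up input shows the effect:

```python
from bs4 import BeautifulSoup

# Hypothetical input: markup with backslash-escaped quotes and slashes.
s = r'<a href=\"http:\/\/example.com\/\">home<\/a>'

# Undo the escaping before handing the markup to the parser.
cleaned = s.replace(r"\"", '"').replace(r"\/", "/")
soup = BeautifulSoup(cleaned, "html.parser")

print(cleaned)
print(soup.a["href"])
```

If the escaped markup really is a JSON string, decoding it with the json module first is likely the more robust route.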