If you’ve ever worked with Parsel to scrape HTML, you’ve probably used XPath to extract parts of the page.
But here’s a trap that gets a lot of people (myself included):
table.xpath("//tbody")
You’d expect this to return the <tbody> inside the current table, right?
But it matches every <tbody> in the document, so you may get the <tbody> from a completely different table elsewhere on the page.
Let’s walk through why this happens, with a clear example, and how to fix it using relative XPath.
🧪 Example HTML with Two Tables
Here’s a simple HTML snippet with two tables:
<html>
<body>
<table id="first">
<thead><tr><th>Item</th><th>Price</th></tr></thead>
<tbody>
<tr><td>Apple</td><td>$1</td></tr>
</tbody>
</table>
<table id="second">
<thead><tr><th>Name</th><th>Age</th></tr></thead>
<tbody>
<tr><td>Alice</td><td>30</td></tr>
<tr><td>Bob</td><td>25</td></tr>
</tbody>
</table>
</body>
</html>
You want to get rows from only the second table (with id="second").
❌ Absolute XPath: The Wrong Way (Usually)
from parsel import Selector
# html is a string holding the two-table markup shown above
selector = Selector(text=html)
table = selector.xpath('//table[@id="second"]')
tbody = table.xpath('//tbody') # ⛔ This is the problem!
What’s wrong here?
- //tbody is an absolute XPath.
- It ignores the fact that you’re inside table.
- It starts from the top of the document and finds all <tbody> elements.
Result? It returns both <tbody> elements (from the first and second tables), and if you take only the first match with .get(), you end up with the one from the wrong table.
✅ Relative XPath: The Right Way
tbody = table.xpath('.//tbody') # ✅ Note the dot!
- The dot . means: start from this node (in this case, the second table).
- .//tbody says: look inside this table, and find all <tbody> elements underneath.
This returns only the <tbody> for table#second, as expected.
💡 Why This Happens
XPath expressions behave differently depending on how you write them:
Expression | Means |
---|---|
//tbody | Look for all <tbody> elements anywhere in the document (starts from the root) |
.//tbody | Look for <tbody> elements inside the current node |
Even though you’re calling .xpath() on a specific node, starting with // resets the search back to the whole page.
That’s why using the dot . is so important when you want to limit your search to a specific part of the page.
✅ Real Example: Extracting Rows from a Specific Table
Here’s how you might use this properly in code:
from parsel import Selector

# Assumes the example HTML above is saved as two_tables.html
with open("two_tables.html", encoding="utf-8") as f:
    html = f.read()
selector = Selector(text=html)
# Get only the second table
table = selector.xpath('//table[@id="second"]')
# Use relative XPath to get rows inside this table
rows = table.xpath('.//tbody/tr')
for row in rows:
    # ./td selects only the direct <td> children of this row
    cols = row.xpath('./td/text()').getall()
    print(cols)
Output:
['Alice', '30']
['Bob', '25']
Perfect!
Summary: Absolute vs. Relative XPath
XPath | Starts From | Use Case |
---|---|---|
//tbody | Entire HTML | Use when you want to search globally |
.//tbody | Current Node | Use when you’re drilling into a specific element |
./td | Current Node | Get child nodes from the current row |
Final Thoughts
If you’re chaining .xpath() calls in Parsel and wondering why you’re getting unexpected results, check whether you’re using absolute (//) or relative (.//) XPath.
Adding that little . makes all the difference.