Awesome Open Source
Awesome Open Source

HTML Table Extractor

Build Status

HTML Table Extractor is a python library that uses Beautiful Soup to extract data from complicated and messy html table

Important links

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

1 2
3 4
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

1 2
3 4
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

It will print out:

[[1, 2], [3, 4]]

Example 3 - Pass BS4 Tag

1 2
3 4
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

1 2 3
4
5
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

1 2 3
4
5
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]

Example 6 - Write to file

1 2
3 4
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

It will write to a given path and create a new csv file called output.csv:

1,2
3,4

Team

Errors/ Bugs

If something is not working correctly, or if you have any suggestion on improvements, report it here

Copyright

Copyright (c) 2017 Justin Li. Released under the MIT License

Third-party copyright in this distribution is noted where applicable.

Misc

How to upload the package to pypi (for the reference of the owner)

  • python setup.py bdist_wheel --universal
  • twine upload dist/* --verbose


Alternative Project Comparisons
Related Awesome Lists
Top Programming Languages
Top Projects

Get A Weekly Email With Trending Projects For These Topics
No Spam. Unsubscribe easily at any time.
Python (821,947
Html (246,614
Table (23,072
Extract (11,916
Scraping (11,844
Crawler (10,174
Extractor (2,783
Beautifulsoup (1,857
Html Table (325
Extract Data (74