tabula-py: Read tables in a PDF into DataFrame
tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, TSV, or JSON file.
We highly recommend looking at the example notebook and trying it on Google Colab.
For the high-level API reference, see High level interfaces.
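The two capabilities described above (reading tables into DataFrames and converting a PDF to CSV/TSV/JSON) can be sketched roughly as follows. This is a minimal illustration, not the project's reference usage: the helper names `output_path_for`, `extract_tables`, and `export_tables` are hypothetical, and the sketch assumes tabula-py and a Java runtime are installed. Only `tabula.read_pdf` and `tabula.convert_into` come from tabula-py itself.

```python
from pathlib import Path

# Output formats named in the introduction above.
SUPPORTED_FORMATS = {"csv", "tsv", "json"}


def output_path_for(pdf_path, output_format):
    """Hypothetical helper: derive the output file path from the PDF path."""
    fmt = output_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return Path(pdf_path).with_suffix("." + fmt)


def extract_tables(pdf_path, pages="all"):
    """Return the tables found in the PDF as a list of pandas DataFrames."""
    # Imported lazily so this module can be loaded without tabula-py or Java.
    import tabula

    return tabula.read_pdf(str(pdf_path), pages=pages)


def export_tables(pdf_path, output_format="csv", pages="all"):
    """Convert the tables in the PDF into a CSV, TSV, or JSON file."""
    # Validate the format and build the path before touching tabula.
    out_path = output_path_for(pdf_path, output_format)

    import tabula

    tabula.convert_into(
        str(pdf_path), str(out_path), output_format=out_path.suffix[1:], pages=pages
    )
    return out_path
```

With a PDF at a placeholder path such as `report.pdf`, calling `extract_tables("report.pdf")` would return one DataFrame per detected table, and `export_tables("report.pdf", "json")` would write `report.json` next to it.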
Contents
- Getting Started
- FAQ
  - tabula-py does not work
  - I can’t run `from tabula import read_pdf`
  - I got an empty DataFrame. How can I resolve it?
  - The result is different from tabula-java. Or, `stream` option seems not to work appropriately
  - Can I use option `xxx`?
  - How can I ignore useless area?
  - I faced `ParserError: Error tokenizing data. C error`. How can I extract multiple tables?
  - I want to prevent tabula-py from stealing focus on every call on my mac
  - I got `?` character with result on Windows. How can I avoid it?
  - I can’t extract file/directory name with space on Windows
  - I want to use a different tabula .jar file
  - I want to extract multiple tables from a document
  - Table cell contents sometimes overflow into the next row.
  - I got a warning/error message from PDFBox including `org.apache.pdfbox.pdmodel.`. Is it the cause of empty dataframe?
  - I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?
- Contributing to tabula-py
API Reference