tabula-py: Read tables in a PDF into DataFrame
tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into pandas DataFrames. tabula-py can also convert a PDF file into a CSV, TSV, or JSON file.
We highly recommend looking at the example notebook and trying it on Google Colab.
For high-level API reference, see High level interfaces.
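As a rough illustration of the two uses described above, the sketch below wraps tabula.read_pdf and tabula.convert_into. The helper names (output_path_for, extract_tables, convert_pdf) are hypothetical and exist only for this example. It assumes tabula-py and a Java runtime are installed; tabula is imported inside the functions so the path helper can be used without them.

```python
from pathlib import Path

# Output formats named in the introduction above.
SUPPORTED_FORMATS = {"csv", "tsv", "json"}


def output_path_for(pdf_path, output_format):
    """Hypothetical helper: derive the output file name from the PDF name."""
    fmt = output_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return Path(pdf_path).with_suffix("." + fmt)


def extract_tables(pdf_path, pages="all"):
    """Read the tables on the given pages into pandas DataFrames."""
    import tabula  # imported lazily: requires tabula-py and Java

    # In recent tabula-py versions this returns a list of DataFrames,
    # one per detected table.
    return tabula.read_pdf(str(pdf_path), pages=pages)


def convert_pdf(pdf_path, output_format="csv", pages="all"):
    """Write the tables in a PDF to a CSV, TSV, or JSON file next to it."""
    import tabula  # imported lazily: requires tabula-py and Java

    out_path = output_path_for(pdf_path, output_format)
    tabula.convert_into(
        str(pdf_path),
        str(out_path),
        output_format=output_format.lower(),
        pages=pages,
    )
    return out_path
```

With a local file such as report.pdf (a placeholder name), extract_tables("report.pdf") would return the detected tables, and convert_pdf("report.pdf", "json") would write report.json alongside it.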
- Getting Started
- FAQ
  - tabula-py does not work
  - I can’t run from tabula import read_pdf
  - I got an empty DataFrame. How can I resolve it?
  - The result is different from tabula-java. Or, stream option seems not to work appropriately
  - Can I use option xxx?
  - How can I ignore useless area?
  - I faced ParserError: Error tokenizing data. C error. How can I extract multiple tables?
  - I want to prevent tabula-py from stealing focus on every call on my mac
  - I got ? character with results on Windows. How can I avoid it?
  - I can’t extract file/directory names with space on Windows
  - I want to use a different tabula .jar file
  - I want to extract multiple tables from a document
  - Table cell contents sometimes overflow into the next row.
  - I got a warning/error message from PDFBox including org.apache.pdfbox.pdmodel. Is it the cause of the empty dataframe?
  - java_options is ignored once read_pdf or similar function is called.
  - I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?
- Contributing to tabula-py