tabula-py does not work¶
There are several possible reasons, but
tabula-py is just a wrapper of tabula-java , make sure you’ve installed Java and you can use
java command on your terminal. Many issue reporters forget to set PATH for
You can check whether tabula-py can call
java from Python process with
I can’t run
from tabula import read_pdf¶
If you’ve installed
tabula, it will be conflict the namespace. You should install
tabula-py after removing
pip uninstall tabula pip install tabula-py
I got a empty DataFrame. How can I resolve it?¶
tabula-py and tabula-java don’t support image based PDF. It should contain text based table information.
Before tuning the tabula-py option, you have to check you set an appropriate pages option. By default, tabula-py extracts table from first page of your PDF, with pages=1 argument. If you want to extract from all pages, you need to set pages option like pages=”all” or pages=[1, 2, 3]. You might want to extract multiple tables from multiple pages, if so you need to set multiple_tables=True together.
Depending on the PDF’s complexity, it might be difficult to extract table contents accuracy.
Tuning points of tabula-py are limited:
- Set specific
areafor accurate table detection
lattice=Trueoption for the table having explicit line. Or try
tabula app can:
- specify the area with GUI
- show preview of the extraction with lattice or stream mode
- export template that is reusable for tabula-py
Even if you can’t extract tabula-py for those table contents which can be extracted tabula app appropriately, file an issue on GitHub.
The result is different from
stream option seems not to work appropriately¶
True by default, for beginners. It is known to make a conflict between
stream option. If you feel something strange with your result, please set
Can I use option
Yes. You can use
options argument as following. The format is same as cli of tabula-java.
read_pdf(file_path, options="--columns 10.1,20.2,30.3")
How can I ignore useless area?¶
In short, you can extract with
In : tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91)) Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8 Out: Unnamed: 0 Col2 Col3 Col4 Col5 0 A B 12 R G 1 NaN R T 23 H 2 B B 33 R A 3 C T 99 E M 4 D I 12 34 M 5 E I I W 90 6 NaN 1 2 W h 7 NaN 4 3 E H 8 F E E4 R 4
How to use
According to tabula-java wiki, there is a explain how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
For example, using macOS’s preview, I got area information of this PDF:
java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
# Note the left, top, height, and width parameters and calculate the following: y1 = top x1 = left y2 = top + height x2 = left + width
I confirmed with tabula-java:
java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf
--spreadsheet) option, it does not work properly.
ParserError: Error tokenizing data. C error. How can I extract multiple tables?¶
This error occurs when pandas trys to extract multiple tables with different column size at once.
multiple_tables option, then you can avoid this error.
I want to prevent tabula-py from stealing focus on every call on my mac¶
java_options=["-Djava.awt.headless=true"]. kudos @jakekara
? character with result on Windows. How can I avoid it?¶
If the encoding of PDF is UTF-8, you should set
chcp 65001 on your terminal before launching a Python process.
Then you can extract UTF-8 PDF with
java_options="-Dfile.encoding=UTF8" option. This option will be added with
encoding='utf-8' option, which is also set by default.
# This is an example for java_options is set explicitly df = read_pdf(file_path, java_options="-Dfile.encoding=UTF8")
UTF-8 appropriately, if the file encoding isn’t UTF-8.
I can’t extract file/directory name with space on Windows¶
You should escape file/directory name yourself.
I want to use a different tabula .jar file¶
You can specify the jar location via enviroment variable
I want to extract multiple tables from a document¶
You can use the following example code
df = read_pdf(file_path, multiple_tables=True)
The result will be a list of DataFrames. If you want separate tables across all pages in a document, use the
Table cell contents sometimes overflow into the next row.¶
You can try using
lattice=True, which will often work if there are lines separating cells in the table.
I got a warning/error message from PDFBox including
org.apache.pdfbox.pdmodel.. Is it the cause of empty dataframe?¶
Sometimes, you might see message like `` Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font WARNING: Using fallback font NimbusSanL-Regu for Univers. Nothing was parsed from this one.`` This error message came from Apache PDFBox which is used under tabula-java, and this is caused by the PDF itself. Neither tabula-py nor tabula-java can’t handle the warning itself, except for silent option that suppress the warning.
I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?¶
I know tabula-py has limitation depending on tabula-java. Sometimes your PDF is too complex to tabula-py. If you want to find plan B, there are similar packages as the following: