FAQ
tabula-py does not work
There are several possible reasons, but tabula-py is just a wrapper of tabula-java, so first make sure you’ve installed Java and can run the java command in your terminal. Many issue reporters forget to set PATH for the java command.
You can check whether tabula-py can call java from the Python process with the tabula.environment_info() function.
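As a quick check that does not depend on tabula-py, the sketch below tests whether a java executable is reachable from the current Python process. The helper name java_available is hypothetical and uses only the standard library; tabula.environment_info() remains the fuller report.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs (hypothetical helper)."""
    java = shutil.which("java")
    if java is None:
        # Nothing named `java` on PATH for this process.
        return False
    try:
        # `java -version` exits with status 0 when the runtime starts correctly.
        result = subprocess.run([java, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("java is callable from this Python process")
    else:
        print("java was not found; check that Java is installed and on PATH")
```

If this reports that java was not found while the command works in your terminal, the Python process is probably running with a different PATH.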
I can’t run from tabula import read_pdf
If you’ve installed tabula, it will conflict with the namespace. You should install tabula-py after removing tabula.
pip uninstall tabula
pip install tabula-py
I got an empty DataFrame. How can I resolve it?
tabula-py and tabula-java don’t support image-based PDFs. The PDF must contain text-based table information.
Before tuning tabula-py options, check that you have set an appropriate pages option. By default, tabula-py extracts tables from the first page of your PDF, with the pages=1 argument.
If you want to extract from all pages, set the pages option, for example pages="all" or pages=[1, 2, 3].
If you want to extract multiple tables from multiple pages, also set multiple_tables=True.
Depending on the PDF’s complexity, it might be difficult to extract table contents accurately.
Tuning points of tabula-py are limited:
Set a specific area for accurate table detection
Try the lattice=True option for tables with explicit lines, or try the stream=True option
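The tuning points above can be combined in one call. The following is a minimal sketch, assuming tabula-py and Java are installed; build_read_kwargs and extract are hypothetical helper names, and only the keyword arguments named in this FAQ (pages, multiple_tables, area, lattice, stream, guess) are passed to read_pdf.

```python
from typing import Any, Dict, Optional, Sequence, Union

Pages = Union[int, str, Sequence[int]]


def build_read_kwargs(
    pages: Pages = "all",
    area: Optional[Sequence[float]] = None,
    lattice: bool = False,
) -> Dict[str, Any]:
    """Collect keyword arguments for tabula.read_pdf (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "pages": pages,
        "multiple_tables": True,
        # lattice suits tables with explicit lines; otherwise fall back to stream.
        "lattice": lattice,
        "stream": not lattice,
    }
    if area is not None:
        # area is given as (top, left, bottom, right).
        kwargs["area"] = list(area)
        # Turn off guessing explicitly so the chosen area is what gets analyzed.
        kwargs["guess"] = False
    return kwargs


def extract(pdf_path: str, **options: Any):
    """Run tabula.read_pdf with the collected options."""
    import tabula  # imported lazily so the helper above works without tabula-py

    return tabula.read_pdf(pdf_path, **build_read_kwargs(**options))
```

For example, extract("table.pdf", pages=[1, 2], lattice=True) would try lattice mode on the first two pages, and rerunning with lattice=False tries stream mode for comparison.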
To learn the limitations of tabula-java, I highly recommend using the tabula app, the GUI version of tabula-java.
tabula app can:
specify the area with GUI
show a preview of the extraction with lattice or stream mode
export template that is reusable for tabula-py
If tabula-py can’t extract table contents that the tabula app extracts correctly, file an issue on GitHub.
The result is different from tabula-java. Or, the stream option seems not to work appropriately
tabula-py sets the guess option to True by default, for beginners. This is known to conflict with the stream option. If your result looks strange, please set guess=False.
Can I use option xxx?
Yes. You can use the options argument as follows. The format is the same as the CLI of tabula-java.
read_pdf(file_path, options="--columns 10.1,20.2,30.3")
How can I ignore useless area?
In short, you can extract with the area and spreadsheet options.
In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4
How to use the area option
The tabula-java wiki explains how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
For example, using macOS’s Preview, I got the area information of this PDF:
java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
given
# Note the left, top, height, and width parameters and calculate the following:
y1 = top
x1 = left
y2 = top + height
x2 = left + width
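The same calculation can be written as a small Python helper that returns the area in the order tabula expects (y1, x1, y2, x2). This is a sketch; area_from_box is a hypothetical name, not part of tabula-py.

```python
from typing import Tuple


def area_from_box(
    left: float, top: float, width: float, height: float
) -> Tuple[float, float, float, float]:
    """Convert a left/top/width/height selection into (y1, x1, y2, x2)."""
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)
```

The returned tuple can be passed directly as the area argument of read_pdf, or joined with commas for the -a option of tabula-java.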
I confirmed with tabula-java:
java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf
Without the -r (same as --spreadsheet) option, it does not work properly.
I faced ParserError: Error tokenizing data. C error. How can I extract multiple tables?
This error occurs when pandas tries to read multiple tables with different numbers of columns at once.
Use the multiple_tables option to avoid this error.
I want to prevent tabula-py from stealing focus on every call on my mac
Set java_options=["-Djava.awt.headless=true"]. Kudos to @jakekara.
I got a ? character in results on Windows. How can I avoid it?
If the encoding of the PDF is UTF-8, you should run chcp 65001 in your terminal before launching a Python process.
chcp 65001
Then you can extract a UTF-8 PDF with the java_options="-Dfile.encoding=UTF8" option. This option is added automatically when encoding='utf-8' is set, which is also the default.
# Example with java_options set explicitly
df = read_pdf(file_path, java_options="-Dfile.encoding=UTF8")
Replace 65001 and UTF-8 appropriately if the file encoding isn’t UTF-8.
I can’t extract file/directory names with space on Windows
You should escape the file/directory name yourself.
I want to use a different tabula .jar file
You can specify the jar location via the TABULA_JAR environment variable:
export TABULA_JAR=".../tabula-x.y.z-jar-with-dependencies.jar"
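The variable can also be set from Python before the first extraction call. The sketch below assumes tabula-py reads TABULA_JAR from the process environment, as the export example suggests; use_custom_jar is a hypothetical helper that checks the file exists before setting the variable.

```python
import os
from pathlib import Path


def use_custom_jar(jar_path: str) -> str:
    """Point TABULA_JAR at a custom tabula jar file (hypothetical helper)."""
    path = Path(jar_path).expanduser().resolve()
    if not path.is_file():
        # Fail early instead of letting the Java side report a missing jar.
        raise FileNotFoundError(f"tabula jar not found: {path}")
    os.environ["TABULA_JAR"] = str(path)
    return str(path)
```

Call use_custom_jar with the path to your own jar before calling read_pdf, so the variable is already set when the jar is looked up.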
I want to extract multiple tables from a document
You can use the following example code:
df = read_pdf(file_path, multiple_tables=True)
The result will be a list of DataFrames. If you want separate tables across all pages in a document, use the pages argument.
Table cell contents sometimes overflow into the next row.
You can try using lattice=True, which will often work if there are lines separating cells in the table.
I got a warning/error message from PDFBox including org.apache.pdfbox.pdmodel. Is it the cause of the empty DataFrame?
No.
Sometimes, you might see a message like the following:
Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font WARNING: Using fallback font NimbusSanL-Regu for Univers. Nothing was parsed from this one.
This message comes from Apache PDFBox, which tabula-java uses internally, and it is caused by the PDF itself. Neither tabula-py nor tabula-java can handle the warning itself, except through the silent option, which suppresses it.
java_options is ignored once read_pdf or a similar function is called.
Since jpype doesn’t support changing JVM options after the JVM is started, java_options is ignored once read_pdf or a similar function has been called. If you want to change JVM options, you need to restart the Python process.
See also: https://jpype.readthedocs.io/en/latest/api.html#jpype.shutdownJVM
I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?
tabula-py has limitations inherited from tabula-java, and sometimes your PDF is too complex for tabula-py. If you want a plan B, similar packages include the following: