tabula-py does not work
There are several possible reasons. Because tabula-py is just a wrapper of tabula-java, make sure you’ve installed Java and that you can use the java command on your terminal. Many issue reporters forget to set PATH for the java command.
You can check whether tabula-py can call java from the Python process with tabula.environment_info().
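The PATH check can also be approximated from Python with the standard library alone. The sketch below is only an illustration and is not part of tabula-py: find_java is a hypothetical helper name, and it confirms nothing more than that a java executable is visible to the current Python process.

```python
import shutil
import subprocess


def find_java():
    """Return (path, version_text) for the java on PATH, or None if absent.

    find_java is a hypothetical helper, not a tabula-py function.
    """
    java_path = shutil.which("java")  # searches the PATH of this Python process
    if java_path is None:
        return None
    # `java -version` usually writes its report to stderr, so read both streams.
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    return java_path, (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    info = find_java()
    if info is None:
        print("java was not found on PATH; install Java or fix PATH")
    else:
        print("java found at", info[0])
        print(info[1])
```

If this reports that java is missing while the command works in your terminal, the Python process is probably started with a different PATH.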
I can’t run from tabula import read_pdf
If you’ve installed tabula, it will conflict with the namespace. You should install tabula-py after removing tabula.
pip uninstall tabula
pip install tabula-py
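To confirm which of the two distributions is installed before uninstalling anything, a small check with importlib.metadata may help. This is a sketch under the assumption that the conflicting package is published under the distribution name tabula, as the pip commands suggest; installed_version and has_namespace_conflict are hypothetical helper names.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_namespace_conflict():
    """True when the unrelated `tabula` distribution is installed."""
    return installed_version("tabula") is not None


if __name__ == "__main__":
    print("tabula    :", installed_version("tabula"))
    print("tabula-py :", installed_version("tabula-py"))
    if has_namespace_conflict():
        print("Remove `tabula`, then reinstall `tabula-py`.")
```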
I got an empty DataFrame. How can I resolve it?
tabula-py and tabula-java don’t support image-based PDFs. The PDF must contain text-based table information.
Before tuning the tabula-py options, check that you have set an appropriate pages option. By default, tabula-py extracts tables from the first page of your PDF only, with pages=1.
If you want to extract from all pages, you need to set the pages option, like pages="all" or pages=[1, 2, 3].
If you want to extract multiple tables from multiple pages, you also need to set multiple_tables=True.
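As a rough illustration of combining the two options, the sketch below wraps read_pdf in a small function. The function name extract_all_tables and its reader parameter are hypothetical; reader exists only so the wrapper can be exercised without Java or a real PDF, and it defaults to tabula.read_pdf.

```python
def extract_all_tables(pdf_path, reader=None):
    """Extract every table from every page of pdf_path.

    reader is a hypothetical hook for testing; by default tabula.read_pdf is used.
    """
    if reader is None:
        import tabula  # provided by the tabula-py package

        reader = tabula.read_pdf
    # pages="all" scans the whole document instead of only the first page;
    # multiple_tables=True lets each detected table be returned separately.
    return reader(pdf_path, pages="all", multiple_tables=True)


# Typical use (requires Java and tabula-py):
# tables = extract_all_tables("./table.pdf")
# print(len(tables), "tables found")
```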
Depending on the PDF’s complexity, it might be difficult to extract table contents accurately.
Tuning points of tabula-py are limited:
Set area for accurate table detection
Try the lattice=True option for tables that have explicit lines, or try the stream=True option
tabula app can:
specify the area with GUI
show a preview of the extraction with lattice or stream mode
export template that is reusable for tabula-py
If tabula-py can’t extract table contents that the tabula app extracts appropriately, file an issue on GitHub.
The result is different from tabula-java. Or, the stream option seems not to work appropriately
tabula-py sets the guess option to True by default, for beginners. It is known to conflict with the stream option. If you feel something strange with your result, please set guess=False.
Can I use option xxx?
Yes. You can use the options argument as follows. The format is the same as the CLI of tabula-java.
read_pdf(file_path, options="--columns 10.1,20.2,30.3")
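When the option value comes from data rather than a literal, the string can be assembled in Python first. The helper below is a hypothetical sketch (columns_option is not a tabula-py function); it only builds the same --columns string that appears in the example above.

```python
def columns_option(x_coordinates):
    """Build a tabula-java style `--columns` option string.

    columns_option is a hypothetical helper, not part of tabula-py.
    """
    return "--columns " + ",".join(str(x) for x in x_coordinates)


if __name__ == "__main__":
    option_string = columns_option([10.1, 20.2, 30.3])
    print(option_string)
    # The string would then be passed as:
    # read_pdf(file_path, options=option_string)
```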
How can I ignore useless area?
In short, you can extract with the area and spreadsheet options.
In : tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4
How to use the area option
The tabula-java wiki explains how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
For example, using macOS’s preview, I got area information of this PDF:
java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
# Note the left, top, height, and width parameters and calculate the following:
y1 = top
x1 = left
y2 = top + height
x2 = left + width
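The same arithmetic can be written as a small Python helper that returns the values in the top, left, bottom, right order that the area option expects. This is a sketch: to_area is a hypothetical name, and the width and height in the example are back-computed from the coordinates used elsewhere in this FAQ, not measured from the PDF.

```python
def to_area(left, top, width, height):
    """Convert a selection box to (top, left, bottom, right), i.e. (y1, x1, y2, x2).

    to_area is a hypothetical helper, not part of tabula-py.
    """
    return (top, left, top + height, left + width)


if __name__ == "__main__":
    # width and height are illustrative values derived from
    # area=(337.29, 226.49, 472.85, 384.91).
    area = to_area(left=226.49, top=337.29, width=158.42, height=135.56)
    print([round(value, 2) for value in area])
    # The tuple would then be passed as: tabula.read_pdf(path, area=area)
```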
I confirmed with tabula-java:
java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf
Without the -r (same as --spreadsheet) option, it does not work properly.
I faced ParserError: Error tokenizing data. C error. How can I extract multiple tables?
This error occurs when pandas tries to parse multiple tables with different numbers of columns at once.
Set the multiple_tables option to avoid this error.
I want to prevent tabula-py from stealing focus on every call on my mac
Set java_options=["-Djava.awt.headless=true"]. Kudos to @jakekara.
I got ? characters in the result on Windows. How can I avoid it?
If the encoding of the PDF is UTF-8, you should run chcp 65001 on your terminal before launching a Python process.
Then you can extract the UTF-8 PDF with the java_options="-Dfile.encoding=UTF8" option. This option will be added with the encoding='utf-8' option, which is also set by default.
# This is an example where java_options is set explicitly
df = read_pdf(file_path, java_options="-Dfile.encoding=UTF8")
Replace 65001 and UTF-8 appropriately if the file encoding isn’t UTF-8.
I can’t extract from files or directories with spaces in their names on Windows
You should escape the file/directory name yourself.
I want to use a different tabula .jar file
You can specify the jar location via the TABULA_JAR environment variable.
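If you prefer to set the variable from Python rather than from the shell, something like the following may work. It is a sketch under two assumptions: the variable is set before the first extraction call (setting it before importing tabula is the safest order), and the jar path shown is a placeholder, not a real location. use_custom_jar is a hypothetical helper name.

```python
import os


def use_custom_jar(jar_path):
    """Point tabula-py at a different tabula-java jar via TABULA_JAR.

    use_custom_jar is a hypothetical helper; call it before using tabula.
    """
    os.environ["TABULA_JAR"] = jar_path
    return os.environ["TABULA_JAR"]


if __name__ == "__main__":
    # Placeholder path for illustration only; substitute your own jar file.
    use_custom_jar("/path/to/your/tabula-jar-with-dependencies.jar")
    print("TABULA_JAR =", os.environ["TABULA_JAR"])
    # import tabula  # import and call read_pdf only after the variable is set
```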
I want to extract multiple tables from a document
You can use the following example code:
df = read_pdf(file_path, multiple_tables=True)
The result will be a list of DataFrames. If you want separate tables across all pages in a document, use the pages argument.
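Because the result is a list, each table usually needs to be handled in a loop. The sketch below writes every table to its own CSV file; save_tables and the table_N.csv naming are hypothetical choices, and the function relies only on each item having a to_csv method, as pandas DataFrames do.

```python
from pathlib import Path


def save_tables(tables, out_dir):
    """Write each extracted table to out_dir/table_N.csv and return the paths.

    save_tables is a hypothetical helper, not part of tabula-py.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        target = out_path / f"table_{index}.csv"
        table.to_csv(target, index=False)  # DataFrame.to_csv; index column omitted
        written.append(target)
    return written


# Typical use (requires Java and tabula-py):
# tables = read_pdf(file_path, pages="all", multiple_tables=True)
# save_tables(tables, "./extracted")
```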
Table cell contents sometimes overflow into the next row.
You can try using lattice=True, which will often work if there are lines separating cells in the table.
I got a warning/error message from PDFBox including org.apache.pdfbox.pdmodel. Is it the cause of the empty DataFrame?
Sometimes, you might see a message like:
Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font WARNING: Using fallback font NimbusSanL-Regu for Univers. Nothing was parsed from this one.
This message comes from Apache PDFBox, which tabula-java uses internally, and it is caused by the PDF itself. Neither tabula-py nor tabula-java can handle the warning itself, apart from the silent option, which suppresses it.
java_options is ignored once read_pdf or a similar function is called.
Since jpype doesn’t support changing JVM options after the JVM is started, java_options is ignored once read_pdf or a similar function has been called. If you want to change JVM options, you need to restart the Python process.
See also: https://jpype.readthedocs.io/en/latest/api.html#jpype.shutdownJVM
I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?
I know tabula-py has limitations that come from tabula-java. Sometimes your PDF is too complex for tabula-py. If you want a plan B, there are similar packages, such as the following: