FAQ

`tabula-py` does not work

There are several possible causes. tabula-py is just a wrapper around tabula-java, so make sure that Java is installed and that the java command works in your terminal. Many issue reporters forget to add the java command to PATH.

You can check whether tabula-py can call java from the Python process with the tabula.environment_info() function.
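As a quick check that does not depend on tabula-py, the following sketch uses only the Python standard library to see whether a java executable is reachable from the Python process. The helper names find_java and java_version_banner are illustrative and not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return its banner (Java prints it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install Java or fix PATH first")
else:
    print(f"java found at {java_path}")
    print(java_version_banner(java_path))
```

If this reports that java is missing, fix the Java installation or PATH before changing any tabula-py options.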

I can’t run `from tabula import read_pdf`

If you have installed the tabula package, it conflicts with the namespace used by tabula-py. Remove tabula first, then install tabula-py.

pip uninstall tabula
pip install tabula-py

I got an empty DataFrame. How can I resolve it?

tabula-py and tabula-java do not support image-based PDFs. The PDF must contain text-based table information.

Before tuning other tabula-py options, check that the pages option is set appropriately. By default, tabula-py extracts tables only from the first page of the PDF (pages=1). To extract from all pages, set pages="all", or pass a list of pages such as pages=[1, 2, 3]. To extract multiple tables from multiple pages, also set multiple_tables=True.
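A minimal sketch of the pages and multiple_tables options described above. The helper name read_all_tables is illustrative, and the sketch assumes a tabula-py version in which read_pdf accepts these keyword arguments as documented in this FAQ.

```python
def read_all_tables(pdf_path):
    """Extract every table from every page of a text-based PDF."""
    # Imported inside the function so that this sketch can be loaded even
    # where tabula-py is not installed yet.
    import tabula

    # pages="all" overrides the default of reading only the first page;
    # multiple_tables=True keeps each detected table as its own DataFrame.
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
```

Calling read_all_tables with the path to your PDF should return a list of DataFrames, one per detected table.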

Depending on the PDF’s complexity, it might be difficult to extract table contents accurately.

Tuning points of tabula-py are limited:

Set a specific area for accurate table detection
Try the lattice=True option for tables with explicit ruling lines; otherwise try the stream=True option
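The two tuning points above could be combined roughly as follows: try lattice mode first and fall back to stream mode when nothing comes back. This is only a sketch. The helper name extract_with_fallback is hypothetical, and it assumes read_pdf returns a list of DataFrames, as recent tabula-py versions do.

```python
def extract_with_fallback(pdf_path, area=None, pages=1):
    """Try lattice mode first, then stream mode if nothing was extracted.

    `area` is (top, left, bottom, right) in PDF points, or None for no
    restriction. Returns the mode that was used and the extracted tables.
    """
    # Imported lazily so the sketch loads even without tabula-py installed.
    import tabula

    # Lattice mode relies on the ruling lines drawn between cells.
    tables = tabula.read_pdf(pdf_path, pages=pages, area=area, lattice=True)
    if any(len(table) > 0 for table in tables):
        return "lattice", tables

    # Stream mode relies on the whitespace between columns instead.
    tables = tabula.read_pdf(pdf_path, pages=pages, area=area, stream=True)
    return "stream", tables
```

A non-empty lattice result is not necessarily a correct one, so it is still worth inspecting the returned tables by eye.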

To understand the limitations of tabula-java, I highly recommend using the tabula app, the GUI version of tabula-java.

tabula app can:

specify the area with GUI
show a preview of the extraction with lattice or stream mode
export template that is reusable for tabula-py

If the tabula app extracts a table correctly but tabula-py cannot extract the same contents, file an issue on GitHub.
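A template exported from the tabula app can be passed to tabula-py's read_pdf_with_template function. The sketch below shows only the call shape; the wrapper name read_with_template is illustrative, and both paths are whatever you exported or saved yourself.

```python
def read_with_template(pdf_path, template_path):
    """Extract tables using the areas stored in a tabula app template."""
    # Imported lazily so the sketch loads even without tabula-py installed.
    import tabula

    # template_path points at the JSON template exported from the tabula app.
    return tabula.read_pdf_with_template(pdf_path, template_path)
```

This reuses the areas you selected in the GUI, so you do not have to copy the coordinates by hand.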

The result is different from `tabula-java`. Or, `stream` option seems not to work appropriately

tabula-py sets the guess option to True by default, for beginners. This is known to conflict with the stream option. If your result looks strange, set guess=False.

Can I use option `xxx`?

Yes. You can use the options argument as follows. The format is the same as the tabula-java CLI.

read_pdf(file_path, options="--columns 10.1,20.2,30.3")

How can I ignore useless area?

In short, you can extract with area and spreadsheet options.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use `area` option

The tabula-java wiki explains how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

For example, using macOS’s Preview, I got the area information for this PDF:

java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

given

# Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width

I confirmed with tabula-java:

java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf

Without the -r option (same as --spreadsheet), it does not work properly.
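The coordinate arithmetic above can be sketched in Python as follows. The function name selection_to_area is illustrative, and the width and height values are hypothetical, chosen so that the result matches the area used in this FAQ.

```python
def selection_to_area(left, top, width, height):
    """Convert a (left, top, width, height) selection to tabula's area order.

    Returns (top, left, bottom, right), that is (y1, x1, y2, x2), in PDF points.
    """
    return (top, left, top + height, left + width)


# Hypothetical selection whose corners match the area used in this FAQ.
area = selection_to_area(left=226.49, top=337.29, width=158.42, height=135.56)
print([round(value, 2) for value in area])
# prints [337.29, 226.49, 472.85, 384.91]
```

The resulting tuple can be passed as the area argument of read_pdf.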

I faced `ParserError: Error tokenizing data. C error`. How can I extract multiple tables?

This error occurs when pandas tries to parse multiple tables with different numbers of columns at once. Use the multiple_tables option to avoid this error.

I want to prevent tabula-py from stealing focus on every call on my mac

Set java_options=["-Djava.awt.headless=true"]. Kudos to @jakekara.

I got `?` character with results on Windows. How can I avoid it?

If the PDF is encoded in UTF-8, run chcp 65001 in your terminal before launching the Python process.

chcp 65001

Then you can extract a UTF-8 PDF with the java_options="-Dfile.encoding=UTF8" option. tabula-py adds this Java option automatically when encoding='utf-8' is set, which is the default.

# Example with java_options set explicitly
df = read_pdf(file_path, java_options="-Dfile.encoding=UTF8")

Replace 65001 and UTF-8 appropriately, if the file encoding isn’t UTF-8.

I can’t extract file/directory names with space on Windows

You should escape the file/directory name yourself.

I want to use a different tabula .jar file

You can specify the jar location via the TABULA_JAR environment variable:

export TABULA_JAR=".../tabula-x.y.z-jar-with-dependencies.jar"
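The same variable can also be set from Python. The sketch below assumes that tabula-py reads TABULA_JAR when it first needs the jar, so the variable should be set before the first read_pdf call; setting it before importing tabula is the conservative choice. The helper name use_custom_tabula_jar is illustrative.

```python
import os


def use_custom_tabula_jar(jar_path):
    """Point tabula-py at a different tabula-java jar via TABULA_JAR."""
    # Fail early with a clear message instead of a Java-side error later.
    if not os.path.isfile(jar_path):
        raise FileNotFoundError(f"tabula jar not found: {jar_path}")
    os.environ["TABULA_JAR"] = jar_path
    return jar_path
```

Call the helper with the path to your own jar file before any extraction runs.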

I want to extract multiple tables from a document

You can use the following example code:

df = read_pdf(file_path, multiple_tables=True)

The result will be a list of DataFrames. To extract tables from all pages of a document, also set the pages argument.
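Since the result is a list, each table can be handled separately. As one possible follow-up, this sketch writes every extracted table to its own CSV file. The helper name save_tables and the table_<n>.csv naming scheme are illustrative choices.

```python
from pathlib import Path


def save_tables(tables, out_dir):
    """Write each table to out_dir/table_<n>.csv and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for index, table in enumerate(tables, start=1):
        csv_path = out_path / f"table_{index}.csv"
        # Each item is expected to be a pandas DataFrame.
        table.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

Pass the list returned by read_pdf as tables.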

Table cell contents sometimes overflow into the next row.

You can try using lattice=True, which will often work if there are lines separating cells in the table.

I got a warning/error message from PDFBox including `org.apache.pdfbox.pdmodel.`. Is it the cause of the empty dataframe?

No.

Sometimes, you might see a message like `` Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font WARNING: Using fallback font NimbusSanL-Regu for Univers. Nothing was parsed from this one.`` This message comes from Apache PDFBox, which tabula-java uses internally, and it is caused by the PDF itself. Neither tabula-py nor tabula-java can handle the warning itself; the silent option only suppresses it.

`java_options` is ignored once `read_pdf` or similar function is called.

Since jpype doesn’t support changing JVM options after the JVM is started, java_options is ignored once read_pdf or a similar function has been called. If you want to change JVM options, you need to restart the Python process. See also: https://jpype.readthedocs.io/en/latest/api.html#jpype.shutdownJVM
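One way to live with this restriction is to route every extraction through a single helper, so that the first call, which starts the JVM, always carries the options you need. This is a sketch only: the names JAVA_OPTIONS and read_tables are hypothetical, and the headless flag is the one mentioned earlier in this FAQ.

```python
# Decide on JVM options once, before any extraction happens in the process.
JAVA_OPTIONS = ["-Djava.awt.headless=true"]


def read_tables(pdf_path, **kwargs):
    """Call read_pdf with the same java_options on every call."""
    # Imported lazily so the sketch loads even without tabula-py installed.
    import tabula

    # Only the first call in the process can influence the JVM; passing the
    # same options every time guarantees that the first one includes them.
    return tabula.read_pdf(pdf_path, java_options=JAVA_OPTIONS, **kwargs)
```

Changing JAVA_OPTIONS after the first call still has no effect until the Python process is restarted.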

I can’t figure out accurate extraction with tabula-py. Are there any similar Python libraries?

I know tabula-py has limitations that come from tabula-java, and sometimes a PDF is too complex for tabula-py. If you are looking for a plan B, there are similar packages, such as the following:
