
High level interfaces

This module is a wrapper of tabula, which enables table extraction from a PDF.

This module extracts tables from a PDF into a pandas DataFrame. Currently, this module runs tabula-java through subprocess.

Instead of importing this module directly, you can import the public interfaces read_pdf(), read_pdf_with_template(), convert_into(), and convert_into_by_batch() from the top-level tabula module.


If you want to use your own tabula-java JAR file, set the TABULA_JAR environment variable to the path of the JAR.
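
As a minimal sketch, the variable can also be set from Python before tabula is used. The JAR path below is a placeholder chosen for illustration, not a real location.

```python
import os

# Placeholder path (an assumption for illustration); point it at your own tabula-java JAR.
os.environ["TABULA_JAR"] = "/path/to/tabula-java.jar"

# Set the variable before calling into tabula so that the custom JAR can be picked up.
print(os.environ["TABULA_JAR"])
```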


>>> import tabula
>>> df = tabula.read_pdf("/path/to/sample.pdf", pages="all")

tabula.io.convert_into(input_path: Union[IO, str, os.PathLike], output_path: str, output_format: str = 'csv', java_options: Optional[List[str]] = None, pages: Union[str, int, List[int], None] = None, guess: bool = True, area: Union[Iterable[float], Iterable[Iterable[float]], None] = None, relative_area: bool = False, lattice: bool = False, stream: bool = False, password: Optional[str] = None, silent: Optional[bool] = None, columns: Optional[List[float]] = None, relative_columns: bool = False, format: Optional[str] = None, batch: Optional[str] = None, options: str = '') → None

Convert tables from PDF into a file. Output file will be saved into output_path.
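
A minimal sketch of how convert_into() might be wrapped in a script. The wrapper name pdf_to_csv and the early file check are assumptions added for illustration and are not part of tabula-py. The tabula import is deferred so that the function can be defined without Java or tabula-py being available.

```python
import os


def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Write the tables found in pdf_path to csv_path as CSV.

    Hypothetical wrapper for illustration; it is not part of tabula-py.
    """
    # Fail early with a clear error before Java is started.
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)

    # Imported lazily; requires tabula-py and a Java runtime.
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
    return csv_path
```

A call such as pdf_to_csv("/path/to/sample.pdf", "/path/to/sample.csv") would then leave the CSV at the given output path.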

Parameters:

  • input_path (str, path object or file-like object) – Path or file-like object of the target PDF file.
  • output_path (str) – File path of output file.
  • output_format (str, optional) – Output format of this function (csv, json or tsv). Default: csv
  • java_options (list, optional) – Set java options like ["-Xmx256m"].
  • pages (str, int, list of int, optional) –

    An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int. Default: 1

    Examples: '1-2,3', 'all', [1,2]

  • guess (bool, optional) –

    Guess the portion of the page to analyze per page. Default: True. If you use the area option, this option becomes False.

    As of tabula-java 1.0.3, the guess option is independent of the lattice and stream options, so you can use guess together with lattice or stream.

  • area (list of float, list of list of float, optional) –

    Portion of the page to analyze (top, left, bottom, right). Default is the entire page.

    If you want to use multiple area options and extract them into one table, it is better to set multiple_tables=False for read_pdf().

    Examples: [269.875,12.75,790.5,561], [[12.1,20.5,30.1,50.2], [1.0,3.2,10.5,40.2]]

  • relative_area (bool, optional) – If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default: False.
  • lattice (bool, optional) – Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
  • stream (bool, optional) – Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
  • password (str, optional) – Password to decrypt document. Default: empty
  • silent (bool, optional) – Suppress all stderr output.
  • columns (list, optional) –

    X coordinates of column boundaries.


    [10.1, 20.2, 30.3]

  • format (str, optional) – Format for output file or extracted object. ("CSV", "TSV", "JSON")
  • batch (str, optional) – Convert all PDF files in the provided directory. This argument should be directory path.
  • options (str, optional) – Raw option string for tabula-java.

Raises:

  • FileNotFoundError – If the downloaded remote file doesn’t exist.
  • ValueError – If output_format is an unknown format, or if the downloaded remote file size is 0.
  • tabula.errors.JavaNotFoundError – If java is not installed or found.
  • subprocess.CalledProcessError – If tabula-java execution failed.

tabula.io.convert_into_by_batch(input_dir: str, output_format: str = 'csv', java_options: Optional[List[str]] = None, pages: Union[str, int, List[int], None] = None, guess: bool = True, area: Union[Iterable[float], Iterable[Iterable[float]], None] = None, relative_area: bool = False, lattice: bool = False, stream: bool = False, password: Optional[str] = None, silent: Optional[bool] = None, columns: Optional[List[float]] = None, relative_columns: bool = False, format: Optional[str] = None, output_path: Optional[str] = None, options: str = '') → None

Convert tables from PDFs in a directory.
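
The following sketch shows one way a batch conversion might be guarded. The helper name convert_pdf_directory and its pre-checks are assumptions for illustration only. It skips starting Java when the directory holds no PDF files.

```python
import os


def convert_pdf_directory(input_dir, output_format="csv"):
    """Convert all PDFs in input_dir and return the PDF file names that were found.

    Hypothetical wrapper for illustration; it is not part of tabula-py.
    """
    if not os.path.isdir(input_dir):
        raise ValueError(f"{input_dir} is not a directory")

    pdfs = sorted(name for name in os.listdir(input_dir) if name.lower().endswith(".pdf"))
    if not pdfs:
        return []  # Nothing to convert, so Java is never started.

    # Imported lazily; requires tabula-py and a Java runtime.
    import tabula

    tabula.convert_into_by_batch(input_dir, output_format=output_format, pages="all")
    return pdfs
```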

Parameters:

  • input_dir (str) – Directory path.
  • output_format (str, optional) – Output format of this function (csv, json or tsv). Default: csv
  • java_options (list, optional) – Set java options like ["-Xmx256m"].
  • pages (str, int, list of int, optional) –

    An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int. Default: 1

    Examples: '1-2,3', 'all', [1,2]

  • guess (bool, optional) –

    Guess the portion of the page to analyze per page. Default: True. If you use the area option, this option becomes False.

    As of tabula-java 1.0.3, the guess option is independent of the lattice and stream options, so you can use guess together with lattice or stream.

  • area (list of float, list of list of float, optional) –

    Portion of the page to analyze (top, left, bottom, right). Default is the entire page.

    If you want to use multiple area options and extract them into one table, it is better to set multiple_tables=False for read_pdf().

    Examples: [269.875,12.75,790.5,561], [[12.1,20.5,30.1,50.2], [1.0,3.2,10.5,40.2]]

  • relative_area (bool, optional) – If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default: False.
  • lattice (bool, optional) – Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
  • stream (bool, optional) – Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
  • password (str, optional) – Password to decrypt document. Default: empty
  • silent (bool, optional) – Suppress all stderr output.
  • columns (list, optional) –

    X coordinates of column boundaries.


    [10.1, 20.2, 30.3]

  • relative_columns (bool, optional) – If all values are between 0-100 (inclusive) and preceded by ‘%’, input will be taken as % of actual width of the page. Default False.
  • format (str, optional) – Format for output file or extracted object. ("CSV", "TSV", "JSON")
  • options (str, optional) – Raw option string for tabula-java.

Returns: Nothing. Outputs are saved into the same directory as input_dir.

Raises:

  • ValueError – If input_dir doesn’t exist.
  • tabula.errors.JavaNotFoundError – If java is not installed or found.
  • subprocess.CalledProcessError – If tabula-java execution failed.

tabula.io.read_pdf(input_path: Union[IO, str, os.PathLike], output_format: Optional[str] = None, encoding: str = 'utf-8', java_options: Optional[List[str]] = None, pandas_options: Optional[Dict[str, Any]] = None, multiple_tables: bool = True, user_agent: Optional[str] = None, use_raw_url: bool = False, pages: Union[str, int, List[int], None] = None, guess: bool = True, area: Union[Iterable[float], Iterable[Iterable[float]], None] = None, relative_area: bool = False, lattice: bool = False, stream: bool = False, password: Optional[str] = None, silent: Optional[bool] = None, columns: Optional[List[float]] = None, relative_columns: bool = False, format: Optional[str] = None, batch: Optional[str] = None, output_path: Optional[str] = None, options: str = '') → Union[List[pandas.core.frame.DataFrame], Dict[str, Any]]

Read tables in PDF.

Parameters:

  • input_path (str, path object or file-like object) – Path or file-like object of the target PDF file. It can also be a URL, which tabula-py downloads automatically.
  • output_format (str, optional) – Output format for the returned object (dataframe or json). Giving this option causes the multiple_tables option to be ignored.
  • encoding (str, optional) – Encoding type for pandas. Default: utf-8
  • java_options (list, optional) – Set java options like ["-Xmx256m"].
  • pandas_options (dict, optional) –

    Set pandas options.

    Example: {'header': None}

    With multiple_tables=True (default), pandas_options is passed to pandas.DataFrame; otherwise it is passed to pandas.read_csv. The two functions differ in the options they accept (dtype, for example).

  • multiple_tables (bool) –

    Enables handling of multiple tables within a page. Default: True

    If the multiple_tables option is enabled, tabula-py uses pd.DataFrame() instead of pd.read_csv(). Make sure to pass appropriate pandas_options.

  • user_agent (str, optional) – Set a custom user-agent when downloading a PDF from a URL. Otherwise the default urllib.request user-agent is used.
  • use_raw_url (bool) – Forces the input_path string to be used as the URL as is, without quoting/dequoting. Default: False
  • pages (str, int, list of int, optional) –

    An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int. Default: 1

    Examples: '1-2,3', 'all', [1,2]

  • guess (bool, optional) –

    Guess the portion of the page to analyze per page. Default: True. If you use the area option, this option becomes False.

    As of tabula-java 1.0.3, the guess option is independent of the lattice and stream options, so you can use guess together with lattice or stream.

  • area (list of float, list of list of float, optional) –

    Portion of the page to analyze (top, left, bottom, right). Default is the entire page.

    If you want to use multiple area options and extract them into one table, it is better to set multiple_tables=False for read_pdf().

    Examples: [269.875,12.75,790.5,561], [[12.1,20.5,30.1,50.2], [1.0,3.2,10.5,40.2]]

  • relative_area (bool, optional) – If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default: False.
  • lattice (bool, optional) – Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
  • stream (bool, optional) – Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
  • password (str, optional) – Password to decrypt document. Default: empty
  • silent (bool, optional) – Suppress all stderr output.
  • columns (list, optional) –

    X coordinates of column boundaries.


    [10.1, 20.2, 30.3]

  • relative_columns (bool, optional) – If all values are between 0-100 (inclusive) and preceded by ‘%’, input will be taken as % of actual width of the page. Default False.
  • format (str, optional) – Format for output file or extracted object. ("CSV", "TSV", "JSON")
  • batch (str, optional) – Convert all PDF files in the provided directory. This argument should be directory path.
  • output_path (str, optional) – Output file path. Its file format depends on format. Same as the --outfile option of tabula-java.
  • options (str, optional) – Raw option string for tabula-java.

Returns: list of DataFrames or dict.

Raises:

  • FileNotFoundError – If the downloaded remote file doesn’t exist.
  • ValueError – If output_format is an unknown format, or if the downloaded remote file size is 0.
  • tabula.errors.CSVParseError – If pandas CSV parsing failed.
  • tabula.errors.JavaNotFoundError – If java is not installed or found.
  • subprocess.CalledProcessError – If tabula-java execution failed.
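
To make the relative_area option more concrete, the sketch below converts a percentage area into absolute coordinates for a given page size. The helper name and the 792 x 612 point page size are assumptions for illustration; tabula-java performs its own conversion internally, and this code is not taken from it.

```python
def relative_area_to_points(area_pct, page_height, page_width):
    """Convert (top, left, bottom, right) percentages to absolute page units."""
    top, left, bottom, right = area_pct
    for value in area_pct:
        if not 0 <= value <= 100:
            raise ValueError("relative area values must be between 0 and 100")
    # top/bottom scale with the page height, left/right with the page width.
    return [
        page_height * top / 100.0,
        page_width * left / 100.0,
        page_height * bottom / 100.0,
        page_width * right / 100.0,
    ]


# Upper half of a 792 x 612 point page (page size assumed for this example).
print(relative_area_to_points([0, 0, 50, 100], page_height=792, page_width=612))
# [0.0, 0.0, 396.0, 612.0]
```

With tabula-py, the same region would be requested by passing area=[0, 0, 50, 100] together with relative_area=True.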


Here is a simple example. Note that read_pdf() only extracts page 1 by default.

As of tabula-py 2.0.0, read_pdf() sets multiple_tables=True by default. If you want output consistent with previous versions, set multiple_tables=False.

>>> import tabula
>>> pdf_path = ""
>>> tabula.read_pdf(pdf_path, stream=True)
[             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
11           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0     3     3
12           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0     3     3
13          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0     3     3
14   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0     3     4
15  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3     4
16    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0     3     4
17             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
18          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
19       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
20        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
21     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2
22          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2
23           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0     3     4
24     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2
25            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
26        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
27         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
28       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1     5     4
29         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
30        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1     5     8
31           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2]

If you want to extract all pages, set pages="all".

>>> dfs = tabula.read_pdf(pdf_path, pages="all")
>>> len(dfs)
4
>>> dfs
[       0    1      2    3     4      5      6   7   8     9
0    mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear
1   21.0    6  160.0  110  3.90  2.620  16.46   0   1     4
2   21.0    6  160.0  110  3.90  2.875  17.02   0   1     4
3   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4
4   21.4    6  258.0  110  3.08  3.215  19.44   1   0     3
5   18.7    8  360.0  175  3.15  3.440  17.02   0   0     3
6   18.1    6  225.0  105  2.76  3.460  20.22   1   0     3
7   14.3    8  360.0  245  3.21  3.570  15.84   0   0     3
8   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4
9   22.8    4  140.8   95  3.92  3.150  22.90   1   0     4
10  19.2    6  167.6  123  3.92  3.440  18.30   1   0     4
11  17.8    6  167.6  123  3.92  3.440  18.90   1   0     4
12  16.4    8  275.8  180  3.07  4.070  17.40   0   0     3
13  17.3    8  275.8  180  3.07  3.730  17.60   0   0     3
14  15.2    8  275.8  180  3.07  3.780  18.00   0   0     3
15  10.4    8  472.0  205  2.93  5.250  17.98   0   0     3
16  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3
17  14.7    8  440.0  230  3.23  5.345  17.42   0   0     3
18  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4
19  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4
20  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4
21  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3
22  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3
23  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3
24  13.3    8  350.0  245  3.73  3.840  15.41   0   0     3
25  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3
26  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4
27  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5
28  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5
29  15.8    8  351.0  264  4.22  3.170  14.50   0   1     5
30  19.7    6  145.0  175  3.62  2.770  15.50   0   1     5
31  15.0    8  301.0  335  3.54  3.570  14.60   0   1     5,               0            1             2            3        4
0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1           5.1          3.5           1.4          0.2   setosa
2           4.9          3.0           1.4          0.2   setosa
3           4.7          3.2           1.3          0.2   setosa
4           4.6          3.1           1.5          0.2   setosa
5           5.0          3.6           1.4          0.2   setosa
6           5.4          3.9           1.7          0.4   setosa,      0             1            2             3            4          5
0  NaN  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
1  145           6.7          3.3           5.7          2.5  virginica
2  146           6.7          3.0           5.2          2.3  virginica
3  147           6.3          2.5           5.0          1.9  virginica
4  148           6.5          3.0           5.2          2.0  virginica
5  149           6.2          3.4           5.4          2.3  virginica
6  150           5.9          3.0           5.1          1.8  virginica,        0
0   supp
1     VC
2     VC
3     VC
4     VC
5     VC
6     VC
7     VC
8     VC
9     VC
10    VC
11    VC
12    VC
13    VC
14    VC]

tabula.io.read_pdf_with_template(input_path: Union[IO, str, os.PathLike], template_path: Union[IO, str, os.PathLike], pandas_options: Optional[Dict[str, Any]] = None, encoding: str = 'utf-8', java_options: Optional[List[str]] = None, user_agent: Optional[str] = None, use_raw_url: bool = False, pages: Union[str, int, List[int], None] = None, guess: bool = False, area: Union[Iterable[float], Iterable[Iterable[float]], None] = None, relative_area: bool = False, lattice: bool = False, stream: bool = False, password: Optional[str] = None, silent: Optional[bool] = None, columns: Optional[List[float]] = None, relative_columns: bool = False, format: Optional[str] = None, batch: Optional[str] = None, output_path: Optional[str] = None, options: Optional[str] = None) → List[pandas.core.frame.DataFrame]

Read tables in PDF with a Tabula App template.

Parameters:

  • input_path (str, path object or file-like object) – Path or file-like object of the target PDF file. It can also be a URL, which tabula-py downloads automatically.
  • template_path (str, path object or file-like object) – Path or file-like object of the Tabula app template. It can also be a URL, which tabula-py downloads automatically.
  • pandas_options (dict, optional) – Set pandas options like {'header': None}.
  • encoding (str, optional) – Encoding type for pandas. Default: utf-8
  • java_options (list, optional) – Set java options like ["-Xmx256m"].
  • user_agent (str, optional) – Set a custom user-agent when downloading a PDF from a URL. Otherwise the default urllib.request user-agent is used.
  • use_raw_url (bool) – Forces the input_path string to be used as the URL as is, without quoting/dequoting. Default: False
  • pages (str, int, list of int, optional) –

    An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int. Default: 1

    Examples: '1-2,3', 'all', [1,2]

  • guess (bool, optional) –

    Guess the portion of the page to analyze per page. Default: False for this function. If you use the area option, this option becomes False.

    As of tabula-java 1.0.3, the guess option is independent of the lattice and stream options, so you can use guess together with lattice or stream.

  • area (list of float, list of list of float, optional) –

    Portion of the page to analyze (top, left, bottom, right). Default is the entire page.

    If you want to use multiple area options and extract them into one table, it is better to set multiple_tables=False for read_pdf().

    Examples: [269.875,12.75,790.5,561], [[12.1,20.5,30.1,50.2], [1.0,3.2,10.5,40.2]]

  • relative_area (bool, optional) – If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default: False.
  • lattice (bool, optional) – Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
  • stream (bool, optional) – Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
  • password (str, optional) – Password to decrypt document. Default: empty
  • silent (bool, optional) – Suppress all stderr output.
  • columns (list, optional) –

    X coordinates of column boundaries.


    [10.1, 20.2, 30.3]

  • relative_columns (bool, optional) – If all values are between 0-100 (inclusive) and preceded by ‘%’, input will be taken as % of actual width of the page. Default False.
  • format (str, optional) – Format for output file or extracted object. ("CSV", "TSV", "JSON")
  • batch (str, optional) – Convert all PDF files in the provided directory. This argument should be directory path.
  • output_path (str, optional) – Output file path. Its file format depends on format. Same as the --outfile option of tabula-java.
  • options (str, optional) – Raw option string for tabula-java.

Returns: list of DataFrame.

Raises:

  • FileNotFoundError – If the downloaded remote file doesn’t exist.
  • ValueError – If output_format is an unknown format, or if the downloaded remote file size is 0.
  • tabula.errors.CSVParseError – If pandas CSV parsing failed.
  • tabula.errors.JavaNotFoundError – If java is not installed or found.
  • subprocess.CalledProcessError – If tabula-java execution failed.


You can use a template file exported from the Tabula app.

>>> import tabula
>>> tabula.read_pdf_with_template(pdf_path, "/path/to/data.tabula-template.json")
[             Unnamed: 0   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0             Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1         Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
2            Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
3        Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
4     Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
5               Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
6            Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
7             Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2
8              Merc 230  22.8    4  140.8   95  ...  22.90   1   0     4     2
9              Merc 280  19.2    6  167.6  123  ...  18.30   1   0     4     4
10            Merc 280C  17.8    6  167.6  123  ...  18.90   1   0     4     4
11           Merc 450SE  16.4    8  275.8  180  ...  17.40   0   0     3     3
12           Merc 450SL  17.3    8  275.8  180  ...  17.60   0   0     3     3
13          Merc 450SLC  15.2    8  275.8  180  ...  18.00   0   0     3     3
14   Cadillac Fleetwood  10.4    8  472.0  205  ...  17.98   0   0     3     4
15  Lincoln Continental  10.4    8  460.0  215  ...  17.82   0   0     3     4
16    Chrysler Imperial  14.7    8  440.0  230  ...  17.42   0   0     3     4
17             Fiat 128  32.4    4   78.7   66  ...  19.47   1   1     4     1
18          Honda Civic  30.4    4   75.7   52  ...  18.52   1   1     4     2
19       Toyota Corolla  33.9    4   71.1   65  ...  19.90   1   1     4     1
20        Toyota Corona  21.5    4  120.1   97  ...  20.01   1   0     3     1
21     Dodge Challenger  15.5    8  318.0  150  ...  16.87   0   0     3     2
22          AMC Javelin  15.2    8  304.0  150  ...  17.30   0   0     3     2
23           Camaro Z28  13.3    8  350.0  245  ...  15.41   0   0     3     4
24     Pontiac Firebird  19.2    8  400.0  175  ...  17.05   0   0     3     2
25            Fiat X1-9  27.3    4   79.0   66  ...  18.90   1   1     4     1
26        Porsche 914-2  26.0    4  120.3   91  ...  16.70   0   1     5     2
27         Lotus Europa  30.4    4   95.1  113  ...  16.90   1   1     5     2
28       Ford Pantera L  15.8    8  351.0  264  ...  14.50   0   1     5     4
29         Ferrari Dino  19.7    6  145.0  175  ...  15.50   0   1     5     6
30        Maserati Bora  15.0    8  301.0  335  ...  14.60   0   1     5     8
31           Volvo 142E  21.4    4  121.0  109  ...  18.60   1   1     4     2
[32 rows x 12 columns],
    0            1             2            3        4
0  NaN  Sepal.Width  Petal.Length  Petal.Width  Species
1  5.1          3.5           1.4          0.2   setosa
2  4.9          3.0           1.4          0.2   setosa
3  4.7          3.2           1.3          0.2   setosa
4  4.6          3.1           1.5          0.2   setosa
5  5.0          3.6           1.4          0.2   setosa,
    0             1            2             3            4          5
0  NaN  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
1  145           6.7          3.3           5.7          2.5  virginica
2  146           6.7          3.0           5.2          2.3  virginica
3  147           6.3          2.5           5.0          1.9  virginica
4  148           6.5          3.0           5.2          2.0  virginica
5  149           6.2          3.4           5.4          2.3  virginica,
    Unnamed: 0 supp  dose
0          4.2   VC   0.5
1         11.5   VC   0.5
2          7.3   VC   0.5
3          5.8   VC   0.5
4          6.4   VC   0.5
5         10.0   VC   0.5
6         11.2   VC   0.5
7         11.2   VC   0.5
8          5.2   VC   0.5
9          7.0   VC   0.5
10        16.5   VC   1.0
11        16.5   VC   1.0
12        15.2   VC   1.0
13        17.3   VC   1.0]


Utility module providing some convenient functions.

class tabula.util.TabulaOption(pages: Union[str, int, List[int], None] = None, guess: bool = True, area: Union[Iterable[float], Iterable[Iterable[float]], None] = None, relative_area: bool = False, lattice: bool = False, stream: bool = False, password: Optional[str] = None, silent: Optional[bool] = None, columns: Optional[List[float]] = None, relative_columns: bool = False, format: Optional[str] = None, batch: Optional[str] = None, output_path: Optional[str] = None, options: Optional[str] = '', multiple_tables: bool = True)

Bases: object

Build options for tabula-java

Parameters:

  • pages (str, int, list of int, optional) –

    An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int. Default: 1

    Examples: '1-2,3', 'all', [1,2]

  • guess (bool, optional) –

    Guess the portion of the page to analyze per page. Default: True. If you use the area option, this option becomes False.

    As of tabula-java 1.0.3, the guess option is independent of the lattice and stream options, so you can use guess together with lattice or stream.

  • area (list of float, list of list of float, optional) –

    Portion of the page to analyze (top, left, bottom, right). Default is the entire page.

    If you want to use multiple area options and extract them into one table, it is better to set multiple_tables=False for read_pdf().

    Examples: [269.875,12.75,790.5,561], [[12.1,20.5,30.1,50.2], [1.0,3.2,10.5,40.2]]

  • relative_area (bool, optional) – If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default: False.
  • lattice (bool, optional) – Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
  • stream (bool, optional) – Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
  • password (str, optional) – Password to decrypt document. Default: empty
  • silent (bool, optional) – Suppress all stderr output.
  • columns (list, optional) –

    X coordinates of column boundaries.


    [10.1, 20.2, 30.3]

  • relative_columns (bool, optional) – If all values are between 0-100 (inclusive) and preceded by ‘%’, input will be taken as % of actual width of the page. Default False.
  • format (str, optional) – Format for output file or extracted object. ("CSV", "TSV", "JSON")
  • batch (str, optional) – Convert all PDF files in the provided directory. This argument should be directory path.
  • output_path (str, optional) – Output file path. Its file format depends on format. Same as the --outfile option of tabula-java.
  • options (str, optional) – Raw option string for tabula-java.
  • multiple_tables (bool, optional) – Extract multiple tables into a dataframe. Default: True
area = None
batch = None
build_option_list() → List[str]

Convert to tabula-java option list

columns = None
format = None
guess = True
lattice = False
merge(other: tabula.util.TabulaOption) → tabula.util.TabulaOption

Merge two TabulaOption instances. Values from self take precedence over the corresponding fields of other.

multiple_tables = True
options = ''
output_path = None
pages = None
password = None
relative_area = False
relative_columns = False
silent = None
stream = False
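
The behaviour of merge() and build_option_list() can be sketched with a simplified stand-in class. OptionSketch below is hypothetical and covers only a few fields; it is not the TabulaOption implementation. It assumes that unset (falsy) fields on self fall back to other, and that the flag spellings match tabula-java's long options (--outfile is the one named in the output_path description above).

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Union


@dataclass
class OptionSketch:
    """Simplified stand-in for TabulaOption with only a few fields."""

    pages: Optional[Union[str, int, List[int]]] = None
    area: Optional[List[float]] = None
    lattice: bool = False
    stream: bool = False
    output_path: Optional[str] = None

    def merge(self, other: "OptionSketch") -> "OptionSketch":
        # Values set on self win; unset (falsy) fields fall back to other.
        merged = {f.name: getattr(self, f.name) or getattr(other, f.name) for f in fields(self)}
        return OptionSketch(**merged)

    def build_option_list(self) -> List[str]:
        opts: List[str] = []
        if self.pages is not None:
            pages = self.pages
            if isinstance(pages, list):
                pages = ",".join(str(p) for p in pages)
            opts += ["--pages", str(pages)]
        if self.area is not None:
            opts += ["--area", ",".join(str(v) for v in self.area)]
        if self.lattice:
            opts.append("--lattice")
        if self.stream:
            opts.append("--stream")
        if self.output_path:
            opts += ["--outfile", self.output_path]
        return opts


template = OptionSketch(pages=1, area=[10.0, 20.0, 30.0, 40.0])
override = OptionSketch(pages=[2, 3], lattice=True)
print(override.merge(template).build_option_list())
# ['--pages', '2,3', '--area', '10.0,20.0,30.0,40.0', '--lattice']
```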
tabula.util.environment_info() → None

Show environment information for reporting.

Returns: Detailed information like Python version, Java version, or OS environment, etc.
Return type: str

tabula.util.java_version() → str

Show Java version

Returns: Result of java -version
Return type: str

Internal interfaces


tabula.template.load_template(path_or_buffer: Union[IO, str, os.PathLike]) → List[tabula.util.TabulaOption]

Build tabula-py options from a template file.

Parameters: path_or_buffer (str, path object or file-like object) – Path or file-like object of the Tabula app template.
Returns: tabula-py options
Return type: List[tabula.util.TabulaOption]
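
As a rough sketch of what building options from a template involves, the code below maps one template entry to keyword arguments. The key names (page, extraction_method, x1, y1, x2, y2) are assumptions about the Tabula app template layout, and template_entry_to_options is a hypothetical helper, not the function used by tabula-py.

```python
import json


def template_entry_to_options(entry):
    """Map one template selection to keyword arguments (assumed key names)."""
    options = {
        "pages": entry["page"],
        # tabula-py expects area as (top, left, bottom, right).
        "area": [entry["y1"], entry["x1"], entry["y2"], entry["x2"]],
    }
    method = entry.get("extraction_method")
    if method in ("guess", "lattice", "stream"):
        options[method] = True
    return options


# Made-up template content for illustration only.
template_json = """
[{"page": 1, "extraction_method": "stream",
  "x1": 12.75, "y1": 269.875, "x2": 561.0, "y2": 790.5}]
"""
for entry in json.loads(template_json):
    print(template_entry_to_options(entry))
# {'pages': 1, 'area': [269.875, 12.75, 790.5, 561.0], 'stream': True}
```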


tabula.file_util.is_file_like(obj: Union[IO, str, os.PathLike]) → bool

Check whether an object is file-like.

Parameters: obj – Object to check.
Returns: True if obj is a file-like object, otherwise False
Return type: bool

tabula.file_util.localize_file(path_or_buffer: Union[IO, str, os.PathLike], user_agent: Optional[str] = None, suffix: str = '.pdf', use_raw_url=False) → Tuple[str, bool]

Ensure that the target file is available locally.

If the target file is remote, this function downloads it to local storage.

Parameters:

  • path_or_buffer (str, path object or file-like object) – File path, file-like object, or URL of the target file.
  • user_agent (str, optional) – Set a custom user-agent when downloading a PDF from a URL. Otherwise the default urllib.request user-agent is used.
  • suffix (str, optional) – File extension to check.
  • use_raw_url (bool) – Use path_or_buffer without quoting/dequoting.

Returns: tuple of str and bool, representing the file name in local storage and a flag indicating whether the file is temporary.

Return type: (str, bool)
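
The (file name, temporary flag) return value suggests a cleanup pattern on the caller's side. The sketch below illustrates that pattern with hypothetical helpers, localize_sketch and with_local_file. It is not tabula-py's implementation, it handles only local paths and readable file objects, and it leaves out URL downloads entirely.

```python
import os
import shutil
import tempfile


def localize_sketch(path_or_buffer, suffix=".pdf"):
    """Return (local_path, is_temporary) for a path or a readable file object."""
    if hasattr(path_or_buffer, "read"):
        # File-like input: copy its bytes to a temporary file on disk.
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(path_or_buffer, tmp)
        return tmp.name, True
    # A local path is used as is and must not be deleted afterwards.
    return os.fspath(path_or_buffer), False


def with_local_file(path_or_buffer, func):
    """Call func(local_path) and remove the file afterwards if it was temporary."""
    local_path, is_temporary = localize_sketch(path_or_buffer)
    try:
        return func(local_path)
    finally:
        if is_temporary:
            os.unlink(local_path)
```

Checking the flag before deleting keeps a caller from removing a file that the user supplied by path.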