Getting Started
================

Requirements
-------------

* Java

  * Java 8+

* Python

  * 3.8+

Installation
------------

Before installing tabula-py, ensure that a Java runtime is available in your environment.

You can install tabula-py from PyPI with the ``pip`` command.

.. code-block:: bash

   pip install tabula-py

If you want to leverage faster execution with jpype, install with the ``jpype`` extra.

.. code-block:: bash

   pip install tabula-py[jpype]

.. Note::
   The conda recipe on conda-forge is not maintained by us.
   We recommend installing via ``pip`` to use the latest version of tabula-py.

Get tabula-py working (Windows 10)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These instructions were originally written by `@lahoffm `_. Thanks!

* If you don't have it already, install `Java `_
* Try running the example code (replacing the PDF file name as appropriate).
* If there's a ``FileNotFoundError`` when it calls ``read_pdf()``\ , and when you type ``java`` on the command line it says ``'java' is not recognized as an internal or external command, operable program or batch file``\ , you should set the ``PATH`` environment variable to point to the Java directory.
* Find the main Java folder like ``jre...`` or ``jdk...``. On Windows 10 it was under ``C:\Program Files\Java``
* On Windows 10: **Control Panel** -> **System and Security** -> **System** -> **Advanced System Settings** -> **Environment Variables** -> Select **PATH** -> **Edit**
* Add the ``bin`` folder like ``C:\Program Files\Java\jre1.8.0_144\bin``\ , then click OK in each open dialog.
* On the command line, ``java`` should now print a list of options, and ``tabula.read_pdf()`` should run.

Example
-------

tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

.. code-block:: py

   import tabula

   # Read a PDF into a list of DataFrames
   dfs = tabula.read_pdf("test.pdf", pages='all')

   # Read a remote PDF into a list of DataFrames
   dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

   # Convert a PDF into CSV
   tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

   # Convert all PDFs in a directory
   tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

See `example notebook `_ for more detail. I also recommend reading `the tutorial article `_ written by `@aegis4048 `_ and `another tutorial `_ written by `@tdpetrou `_.

.. Note::
   If you run into issues, we recommend trying `tabula.app `_ to see the limitations of tabula-java.
   See also :ref:`faq`.
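
Because ``read_pdf()`` can fail with ``FileNotFoundError`` when the ``java`` command is not on ``PATH``, it may help to check for Java before calling tabula-py. The following is a minimal sketch using only the Python standard library; ``java_is_available`` is a hypothetical helper name chosen for illustration and is not part of tabula-py.

```python
import shutil
import subprocess


def java_is_available(executable="java"):
    """Return True if `executable` is found on PATH and `-version` succeeds.

    Hypothetical helper for illustration; not part of the tabula-py API.
    """
    # shutil.which performs the same PATH lookup the shell would.
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        # `java -version` exits with status 0 on a working installation.
        subprocess.run(
            [path, "-version"],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return True


if __name__ == "__main__":
    if java_is_available():
        print("Java found; tabula.read_pdf() should be able to start it.")
    else:
        print("Java not found on PATH; see the installation notes above.")
```

If the helper returns ``False``, adjusting ``PATH`` as described in the Windows 10 instructions is the first thing to try.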