
This is How I Convert PDF to Markdown

I have used multiple online tools to convert PDF documents to Markdown format, but none of them came close to Marker.

Along with basic Markdown conversion, it formats tables, converts most equations to LaTeX, and extracts and stores images.

Here’s how I use Marker to extract the content of a PDF and convert it into valid Markdown.

Environment Used

Windows 11

Prerequisites

As per Marker’s GitHub repository, it requires the installation of:

  • Python
  • PyTorch

1. Install Python 3.8 or Higher

Go to the Python Downloads page and download the latest version of Python.

Run the installer and follow its instructions.

2. Install PyTorch

Note: For PyTorch to be installed correctly, you must have Python 3.8 or higher installed on your system.
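Before installing PyTorch, it can help to confirm which Python version is actually on your PATH. Here is a minimal sketch of that check. The `MIN_VERSION` constant and the `python_is_supported` helper are names I made up for illustration, and the 3.8 minimum is simply the one stated in the note above.

```python
import sys

# Minimum version taken from the note above; adjust it if your tools need a newer one.
MIN_VERSION = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {current}: {status}")
```

Save it as a small script and run it with the same `python` command you plan to use for the rest of the tutorial.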

To install PyTorch, go to its official website and you will see something like the image below:

You can adjust those options to match your system. Once you have the generated command, open PowerShell or Command Prompt and paste it there.

Here’s the command that I used to install PyTorch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

PyTorch will begin installing on your system…

It will take some time to download and install, as the main file is about 2.7 GB.

After a few minutes, PyTorch will be installed.
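To confirm that the install worked, you can ask Python whether it can find PyTorch and whether it sees a CUDA GPU. This is only a rough sketch: `torch_status` is a hypothetical helper name, and it relies on `torch.__version__` and `torch.cuda.is_available()`, which are part of PyTorch itself.

```python
import importlib.util


def torch_status():
    """Describe the local PyTorch install without failing if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also works when torch is absent

    device = "CUDA GPU available" if torch.cuda.is_available() else "CPU only"
    return f"PyTorch {torch.__version__} ({device})"


if __name__ == "__main__":
    print(torch_status())
```

If it reports "CPU only" even though you picked a CUDA build, the selected compute platform probably does not match your GPU driver.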

Now, the prerequisites are over. Next, you can go ahead and move to the actual Marker stuff.

Clone Marker

You can clone the Marker project to your local system with the following command:

git clone https://github.com/VikParuchuri/marker.git

After cloning, the Marker GitHub repo would look something like this:

We have cloned the repository, but we still cannot convert PDFs to Markdown because we haven’t installed Marker yet.

Steps to Install Marker

1. Create new environment

Outside of the newly cloned Marker GitHub repository, create a new environment for converting PDF to Markdown files.

python -m venv myenv

This will create a myenv folder consisting of multiple files.

2. Activate the environment

myenv\Scripts\activate

This will activate the newly created environment.
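If you are not sure whether the activation took effect, Python can tell you. Inside a venv, `sys.prefix` points at the environment folder while `sys.base_prefix` still points at the system installation. Here is a small sketch of that comparison, where `in_virtualenv` is a helper name I chose for illustration.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter is running inside a venv."""
    prefix = sys.prefix if prefix is None else prefix
    if base_prefix is None:
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return prefix != base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter prefix:", sys.prefix)
```

With `myenv` activated, the printed prefix should end in the `myenv` folder.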

3. Install “marker-pdf”

This command will actually install marker-pdf with the pip package manager.

pip install marker-pdf

Now we are ready to convert PDF documents to Markdown files!


4. Convert PDF format to Markdown

To convert a PDF to Markdown, we need two things:

  • Input path of the PDF 
  • Output path

The conversion command has this general form:

marker_single "input_path" "output_path" --batch_multiplier 2 --max_pages 12

Hence, inside the cloned marker GitHub project folder, I will create two folders:

  • pdfs: My input folder
  • output: My output folder

And I will use a sample PDF for the Markdown conversion and paste it inside the pdfs folder.

Now, to convert the PDF “Get_Started_With_Smallpdf.pdf”, I will use the following command:

marker_single "D:/projects/marker-pdf/marker/pdfs/Get_Started_With_Smallpdf.pdf" "D:/projects/marker-pdf/marker/output" --batch_multiplier 2 --max_pages 12

Here’s what the other two arguments mean according to the Marker GitHub repo:

--batch_multiplier is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.

--max_pages is the maximum number of pages to process. Omit this to convert the entire document.
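If you would rather launch the conversion from a script, the same command can be assembled as an argument list and handed to `subprocess`. This is a sketch under a few assumptions: `build_marker_command` and `convert` are hypothetical helper names, and the two flags are the ones quoted in this article, so check `marker_single --help` if your installed version differs.

```python
import subprocess


def build_marker_command(pdf_path, output_dir, batch_multiplier=None, max_pages=None):
    """Assemble the marker_single argument list used in this tutorial."""
    command = ["marker_single", str(pdf_path), str(output_dir)]
    if batch_multiplier is not None:
        command += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        # Leaving max_pages out converts the entire document.
        command += ["--max_pages", str(max_pages)]
    return command


def convert(pdf_path, output_dir, **options):
    """Run the conversion; raises CalledProcessError if marker_single fails."""
    subprocess.run(build_marker_command(pdf_path, output_dir, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert(...) to actually run it.
    print(build_marker_command("pdfs/sample.pdf", "output", batch_multiplier=2, max_pages=12))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, which are common on Windows.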

Once the command is executed, Marker will initiate the conversion and save the Markdown to the output folder.

The cool thing about Marker is that it extracts all the images associated with the PDF and stores them alongside the main .md (Markdown) file.

It even generates a metadata file in JSON format.
All the images are extracted in the .png format.

Awesome! We have converted the PDF into Markdown. But wait! How does the Markdown output look?
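To see at a glance what Marker wrote, you can group the files in the output folder by extension. The sketch below assumes nothing about Marker's naming scheme; `group_by_extension` and `list_output` are hypothetical helpers, and the file names in the usage note are placeholders.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(file_names):
    """Map each lowercase extension to the file names that use it."""
    groups = defaultdict(list)
    for name in file_names:
        groups[Path(name).suffix.lower()].append(name)
    return dict(groups)


def list_output(folder):
    """Collect every file below `folder` and group the names by extension."""
    names = [path.name for path in sorted(Path(folder).rglob("*")) if path.is_file()]
    return group_by_extension(names)


if __name__ == "__main__":
    for extension, names in list_output(".").items():
        print(extension or "(no extension)", len(names))
```

For a folder holding placeholder files such as `doc.md`, `doc_meta.json`, and two `.png` images, `list_output` would return one entry each for `.md` and `.json` and a two-item list for `.png`.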

PDF Input

Here’s the PDF that we supplied to Marker as the input file:

Markdown Output

# Welcome To Smallpdf

Ready to take document management to the next level?

![0_image_0.png](0_image_0.png)

## Digital Documents—All In One Place

With the new Smallpdf experience, you can

![0_image_1.png](0_image_1.png) freely upload, organize, and share digital documents. When you enable the 'Storage' option, we'll also store all processed files here.

## Enhance Documents In One Click

When you right-click on a file, we'll present

![0_image_2.png](0_image_2.png) you with an array of options to convert, compress, or modify it.

## Access Files Anytime, Anywhere

You can access files stored on Smallpdf from

![0_image_3.png](0_image_3.png)

your computer, phone, or tablet. We'll also sync files from the Smallpdf Mobile App to our online portal

## Collaborate With Others

Forget mundane administrative tasks. With Smallpdf, you can request e-signatures, send large files, or even enable the Smallpdf G Suite App for your entire organization.

Pretty good, right?
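One quick sanity check on output like this is to make sure every image the Markdown links to was actually saved next to it. Here is a rough sketch using a simple regular expression for `![alt](path)` links. `image_references` and `missing_images` are hypothetical helpers, and the pattern only covers plain inline links like the ones shown above.

```python
import re
from pathlib import Path

# Matches inline Markdown images such as ![0_image_0.png](0_image_0.png)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_references(markdown_text):
    """Return the link target of every inline image, in document order."""
    return IMAGE_PATTERN.findall(markdown_text)


def missing_images(markdown_path):
    """List referenced images that do not exist beside the Markdown file."""
    markdown_path = Path(markdown_path)
    text = markdown_path.read_text(encoding="utf-8")
    return [ref for ref in image_references(text)
            if not (markdown_path.parent / ref).is_file()]


if __name__ == "__main__":
    sample = "Intro\n\n![0_image_0.png](0_image_0.png)\n\nMore text"
    print(image_references(sample))
```

An empty list from `missing_images` means every linked image is present in the output folder.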

Conclusion

In this tutorial, we used Marker to extract the content of a PDF and convert it into Markdown format. 

Of course, the PDF had only one page, but Marker can handle documents with many more pages, and it does a fine job with those too!

You can try and play with it yourself!
