Need help in debugging multiprocessing code in python

Post by **Adenitz** » Sat Sep 09, 2023 5:46 am

Hello to all,
I tried to write a simple script to convert xlsx files to PDF files using my existing LibreOffice installation.
My OS is Linux Mint 20 Ulyana.
When I run the following script, it just hangs. The sequential approach works, of course.
Can you please advise me on what could be wrong and how to debug this code? I suspect that two processes somehow try to convert the same file.

Code: Select all

#!/usr/local/bin/python3

import os
import subprocess
import multiprocessing as mp
import time

def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    print('Started conversion of ', xlsx_file)
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file])
    print('Finished conversion of ', xlsx_file)
    

if __name__ == '__main__':

    start_t = time.time()

    input_directory = './XLSX/'
    output_directory = './PdfDir/'

    # Create folder if not exists
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)

    existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
    # Map each existing pdf name back to the matching xlsx name
    already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]

    # List of all xlsx files
    xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]

    # List of xlsx files that actually need to be converted to pdf
    xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]

    print('Length of the list is ', len(xls_files_to_be_converted) )

    # Multiprocessing conversion
    with mp.Pool(processes = mp.cpu_count()) as pool:
        result = pool.map(convert_pdf_soffice, xls_files_to_be_converted)

    #for file in xls_files_to_be_converted:
    #    convert_pdf_soffice(file)

    end_t = time.time()
    duration_t = end_t - start_t
    print(f'Duration is {duration_t}')

The output of the script is the following:

Code: Select all

Length of the list is  9
Started conversion of  ./XLSX/File6.xlsx
Started conversion of  ./XLSX/File2.xlsx
Started conversion of  ./XLSX/File1.xlsx
Started conversion of  ./XLSX/File7.xlsx
Started conversion of  ./XLSX/File3.xlsx
Started conversion of  ./XLSX/File5.xlsx

I have uploaded the whole test data and it can be found here: https://file.io/HpNAqdj4J2Wz

Post by **xenopeek** » Sat Sep 09, 2023 7:04 am

I found the multiprocessing module hard to use as well, and IIRC I couldn't get it to actually multiprocess Python code. Anyway, I've been using concurrent.futures.ProcessPoolExecutor instead: https://docs.python.org/3/library/concu ... tures.html

Here is an example of how you could use it. Add the code that populates the xls_files_to_be_converted list before the executor block (before the 'with ...' line):

Code: Select all

#!/usr/bin/python3

import subprocess
import concurrent.futures

def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    _ = subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file])
    return xlsx_file

# Populate xls_files_to_be_converted here, before the executor block.

with concurrent.futures.ProcessPoolExecutor() as executor:
    for xlsx_file in executor.map(convert_pdf_soffice, xls_files_to_be_converted):
        print('Converted ', xlsx_file)

Post by **Adenitz** » Sat Sep 09, 2023 6:06 pm

Thank you for your help, xenopeek, but it doesn't solve the problem. This code behaves almost the same:

Code: Select all

#!/usr/local/bin/python3

import os
import time

import subprocess
import concurrent.futures

def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    _ = subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file])
    return xlsx_file

if __name__ == '__main__':
    start_t = time.time()
    input_directory = './XLSX/'
    output_directory = './PdfDir/'

    # Create folder if not exists
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)

    existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
    # Map each existing pdf name back to the matching xlsx name
    already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]

    # List of all xlsx files
    xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]

    # List of xlsx files that actually need to be converted to pdf
    xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]

    print('Length of the list is', len(xls_files_to_be_converted))

    # Multiprocessing conversion
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for xlsx_file in executor.map(convert_pdf_soffice, xls_files_to_be_converted):
            print('Converted ', xlsx_file)
    
    end_t = time.time()
    duration_t = end_t - start_t
    print(f'Duration is {duration_t}')

Post by **Adenitz** » Sun Sep 10, 2023 2:06 am

On the other hand, if I try this script, it works, but the output in the terminal is not clear to me:

Code: Select all

import concurrent.futures
import os
import subprocess, shutil

def rename_and_copy_file(input_filename):
    output_directory = './PdfDir/'
    prefix='F'
    # Check if the input file exists
    if not os.path.isfile(input_filename):
        print(f"File '{input_filename}' does not exist.")
        return

    # Split the input filename into its base name and extension
    base_name, file_extension = os.path.splitext(os.path.basename(input_filename))

    # Create the new filename with the prefix
    new_filename = prefix + base_name + file_extension

    # Create the full path for the output file
    output_filepath = os.path.join(output_directory, new_filename)

   
    # Rename and move the file to the output directory
    shutil.copy(input_filename, output_filepath)
    print(f"File '{input_filename}' renamed and copied to '{output_filepath}'")


def main():

    input_directory = './XLSX/'
    output_directory = './PdfDir/'
    existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
    # Map each existing pdf name back to the matching xlsx name
    already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]

    # List of all xlsx files
    xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]

    # List of xlsx files that actually need to be converted to pdf
    xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]


    with concurrent.futures.ProcessPoolExecutor() as executor:
        for xlsx_file in executor.map(rename_and_copy_file, xls_files_to_be_converted):
            print('Converted ', xlsx_file)

if __name__ == '__main__':
    main()

The output of the script is the following:

Code: Select all

File './XLSX/F_File6.xlsx' renamed and copied to './PdfDir/FF_File6.xlsx'
File './XLSX/F_File1.xlsx' renamed and copied to './PdfDir/FF_File1.xlsx'
File './XLSX/F_File7.xlsx' renamed and copied to './PdfDir/FF_File7.xlsx'
File './XLSX/F_File3.xlsx' renamed and copied to './PdfDir/FF_File3.xlsx'
File './XLSX/F_File9.xlsx' renamed and copied to './PdfDir/FF_File9.xlsx'
File './XLSX/F_File4.xlsx' renamed and copied to './PdfDir/FF_File4.xlsx'
File './XLSX/F_File8.xlsx' renamed and copied to './PdfDir/FF_File8.xlsx'
File './XLSX/F_File2.xlsx' renamed and copied to './PdfDir/FF_File2.xlsx'
Converted  None
File './XLSX/F_File5.xlsx' renamed and copied to './PdfDir/FF_File5.xlsx'
Converted  None
Converted  None
Converted  None
Converted  None
Converted  None
Converted  None
Converted  None
Converted  None

I don't understand why so many lines are printed in the terminal. It seems that several processes take the whole list at the same time.

If I instead split the list into two halves and submit each half separately, the output is as expected:

Code: Select all

import concurrent.futures
import os
import shutil


def rename_and_copy(input_filenames, output_directory, prefix='F'):
    for input_filename in input_filenames:
        if not os.path.isfile(input_filename):
            print(f"File '{input_filename}' does not exist.")
            continue

        base_name, file_extension = os.path.splitext(os.path.basename(input_filename))
        new_filename = prefix + base_name + file_extension
        output_filepath = os.path.join(output_directory, new_filename)

        try:
            shutil.copy(input_filename, output_filepath)
            print(f"File '{input_filename}' renamed and copied to '{output_filepath}'")
        except Exception as e:
            print(f"An error occurred: {str(e)}")

def main():
    input_directory = './XLSX/'
    output_directory = './PdfDir/'
    existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
    already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]

    xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]
    xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]

    middle_index = len(xls_files_to_be_converted) // 2
    first_half = xls_files_to_be_converted[:middle_index]
    second_half = xls_files_to_be_converted[middle_index:]

    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        executor.submit(rename_and_copy, first_half, output_directory)
        executor.submit(rename_and_copy, second_half, output_directory)

if __name__ == '__main__':
    main()

But is this valid multiprocessing code?

Post by **xenopeek** » Sun Sep 10, 2023 11:17 am

If the code isn't faster than converting the files serially, maybe try setting max_workers explicitly on the pool. For example, to set it to 5:

Code: Select all

    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
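
As for the hang itself, I can't reproduce it here, so take this as a guess rather than a diagnosis. As far as I know, soffice instances started at the same time share one user profile by default, and that is a frequently reported reason for parallel headless conversions stalling. Below is a minimal sketch that gives every conversion its own throwaway profile through the -env:UserInstallation option and adds a timeout, so a stuck conversion raises an error instead of hanging forever. The build_command helper, the 120 second timeout and the temporary profile directory are my own choices for illustration, not something from your script:

```python
import pathlib
import subprocess
import tempfile


def build_command(xlsx_file, out_dir, profile_dir):
    # Each soffice instance gets its own profile directory (assumption:
    # sharing the default profile is what makes the parallel runs stall).
    profile_uri = pathlib.Path(profile_dir).resolve().as_uri()
    return ['soffice', f'-env:UserInstallation={profile_uri}',
            '--headless', '--convert-to', 'pdf',
            '--outdir', out_dir, xlsx_file]


def convert_pdf_soffice(xlsx_file, out_dir='./PdfDir/', timeout=120):
    # The temporary profile directory is removed once the conversion is done.
    with tempfile.TemporaryDirectory(prefix='soffice_profile_') as profile_dir:
        cmd = build_command(xlsx_file, out_dir, profile_dir)
        # timeout makes subprocess.run raise subprocess.TimeoutExpired
        # instead of waiting forever; capture_output keeps soffice's
        # messages so they can be inspected afterwards.
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    # Return the name plus diagnostics so the parent process can print them.
    return xlsx_file, result.returncode, result.stderr
```

You would pass this convert_pdf_soffice to executor.map the same way as before; each result is then a (file name, exit code, stderr) tuple that the parent process can print.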

You're getting that "Converted None" in your output because your rename_and_copy_file function doesn't return the filename that should be printed.

In this code:

Code: Select all

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for xlsx_file in executor.map(rename_and_copy_file, xls_files_to_be_converted):
            print('Converted ', xlsx_file)

the executor.map call runs rename_and_copy_file for each item in xls_files_to_be_converted, paced by the pool of processes that can run at the same time. Each returned value ends up in xlsx_file and is then printed. It will print None if your rename_and_copy_file function doesn't return the name of the file to print. Which it doesn't.
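
To make the return-value point concrete, here is a minimal sketch. It is not your script: the two worker functions and the file names are made up for illustration. One worker has no return statement, the other returns its argument, and executor.map hands back exactly what each call returned:

```python
import concurrent.futures


def worker_without_return(name):
    # Hypothetical worker: does some work but has no return statement,
    # so every call evaluates to None.
    _ = name.upper()


def worker_with_return(name):
    # Hypothetical worker: same work, but it returns the name so the
    # parent process has something meaningful to print.
    _ = name.upper()
    return name


def run_with_pool(worker, names):
    # executor.map yields whatever each worker call returned.
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(worker, names))


if __name__ == '__main__':
    names = ['File1.xlsx', 'File2.xlsx']  # made-up file names
    print(run_with_pool(worker_without_return, names))
    print(run_with_pool(worker_with_return, names))
```

The first print shows a list of None values and the second shows the file names, which is the same difference as between "Converted  None" and the output you were expecting.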

Why am I printing here and not in the function executor.map calls? Because print isn't atomic, so output from several processes can get interleaved. And, well, my own use of concurrent.futures.ProcessPoolExecutor is actually a bit more involved, as I wrote it in a way that guarantees it will print items in the same order as the user gave them. The example I gave you is simpler (though as far as I know executor.map already yields results in input order), and I still think it's a good idea to print from one process and not multiple. If you don't care for all that, simply replace the above code with:

Code: Select all

    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(rename_and_copy_file, xls_files_to_be_converted)
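
On the ordering point, a small sketch in case it helps. As far as I can tell from the concurrent.futures documentation, executor.map yields results in the order of the input list even when a later item finishes first, while as_completed yields futures as they finish. The slow_echo worker and its sleep times are invented purely to force out-of-order completion, and all printing happens in the parent process:

```python
import concurrent.futures
import time


def slow_echo(item):
    # Hypothetical worker: sleeps for `delay` seconds, then returns the label.
    label, delay = item
    time.sleep(delay)
    return label


def in_input_order(items):
    # executor.map: results are yielded in the order of `items`.
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
        return list(executor.map(slow_echo, items))


def in_completion_order(items):
    # submit + as_completed: results are yielded as each worker finishes.
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(slow_echo, item) for item in items]
        return [f.result() for f in concurrent.futures.as_completed(futures)]


if __name__ == '__main__':
    # The first item takes longest, so it finishes last.
    items = [('first', 0.6), ('second', 0.3), ('third', 0.0)]
    print('map:         ', in_input_order(items))
    print('as_completed:', in_completion_order(items))
```

If the order of the printed lines matters, the map variant is the simpler choice; as_completed is handy when you would rather see each result as soon as it is ready.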

Linux Mint Forums
