es.davy.ai

pdfplumber | Extract text from dynamic column layouts

I have almost-working code that extracts the sentence containing a given phrase, even when that sentence spans several lines.

However, some pages have columns, so the output for those pages is wrong: text from separate columns gets merged into a single, garbled sentence.

This problem has been addressed in the following posts:


Question:

How can I "condition" on whether a page has columns?

  • Pages may have no columns.
  • Pages may have more than 2 columns.
  • Pages may also have headers and footers (which can be skipped).

Example .pdf with a dynamic text layout: PDF (pg. 2).

Jupyter Notebook:


    # pip install PyPDF2
    # pip install pdfplumber

    # ---

    import pdfplumber

    # ---

    def scrape_sentence(phrase, lines, index):
        # - Gather the sentence in which "the phrase" was found -
        sentence = lines[index]
        print(" - Sentence - ", sentence)
        print("len(lines)", len(lines))

        # Previous lines: prepend until a line with sentence-ending punctuation
        pre_i, flag = index, 0
        while flag == 0:
            pre_i -= 1
            if pre_i <= 0:
                break

            sentence = lines[pre_i] + sentence

            if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or '  •  ' in lines[pre_i]:
                flag = 1

        print("\n", sentence)

        # Following lines: append until a line with sentence-ending punctuation
        post_i, flag = index, 0
        while flag == 0:
            post_i += 1
            if post_i >= len(lines):
                break

            sentence = sentence + lines[post_i]

            if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or '  •  ' in lines[post_i]:
                flag = 1

        print("\n", sentence)

        # -- Extract --
        sentence = sentence.replace('!', '.')
        sentence = sentence.replace('?', '.')
        sentence = sentence.split('.')
        sentence = [s for s in sentence if phrase in s]
        print(sentence)
        sentence = sentence[0].replace('\n', '').strip()  # first occurrence
        print(sentence)
        return sentence

    # ---

    phrase = 'Gulf Petrochemical Industries Company'

    with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
        for page in opened_pdf.pages:
            text = page.extract_text()
            if text is None:
                continue
            lines = text.split('\n')
            i = 0
            sentence = ''
            while i < len(lines):
                if phrase in lines[i]:
                    sentence = scrape_sentence(phrase, lines, i)
                i += 1

Example of incorrect output:

 - Sentence -  being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of
len(lines) 47

Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of

Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report, Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is GPIC is a Responsible Care Company certified for RC 14001 since July 2010. 
We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[‘ being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption’]
being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the


Answer

  1. davy.ai

    To check whether a page is laid out in columns, one option is pdfplumber's page.extract_tables() method. It returns a list of tables; each table is a list of rows, and each row is a list of cell strings (None for empty cells). Keep in mind that the table finder is built for tables rather than text columns: by default it looks for ruling lines, so plain multi-column text is only picked up when the columns happen to be detected as table cells, for example with the text-based strategies in table_settings.

    We can use the returned list as a rough indicator. If it is empty, no table-like structure was found and page.extract_text() can be used as usual. If it is non-empty, we read the text cell by cell instead.

    Here is the modified code that performs this check:

    import pdfplumber
    
    with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
        for page in opened_pdf.pages:
            # Extract tables (an empty list means none were found)
            tables = page.extract_tables()
    
            if not tables:
                # No table-like structure: read the page as plain text
                text = page.extract_text() or ''
            else:
                # Walk every table and concatenate the text of its cells
                text = ''
                for table in tables:
                    for row in table:
                        # Cells can be None, so replace them with ''
                        text += ' '.join(cell or '' for cell in row) + '\n'
    
            # Do further processing with the text as needed
            # ...
    

    In this modified code, we first extract all tables from the page with page.extract_tables(). We then check whether the returned list is empty. If it is, we fall back to page.extract_text(); otherwise, we loop over every table and concatenate the text of its rows.

    We can then do any further processing on the extracted text as needed.
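As one example of such further processing, the headers and footers mentioned in the question could be dropped before the sentences are assembled. The sketch below is only an illustration under stated assumptions: the function name `drop_header_footer` and the 50-point `margin` are hypothetical, and the input is assumed to be a list of word dicts with `top` and `bottom` keys measured from the top of the page, which is the shape page.extract_words() returns.

```python
def drop_header_footer(words, page_height, margin=50.0):
    """Keep only the words that lie fully inside the body of the page.

    `words` is a list of dicts with numeric "top" and "bottom" keys
    (distances from the top edge of the page). `margin` is an assumed
    height, in PDF points, for both the header band and the footer band.
    """
    body_top = margin
    body_bottom = page_height - margin
    return [
        w for w in words
        if w["top"] >= body_top and w["bottom"] <= body_bottom
    ]


if __name__ == "__main__":
    # Synthetic words standing in for page.extract_words() output.
    sample = [
        {"text": "Header", "top": 20, "bottom": 30},
        {"text": "Body", "top": 100, "bottom": 110},
        {"text": "01", "top": 800, "bottom": 810},
    ]
    kept = drop_header_footer(sample, page_height=842)
    print([w["text"] for w in kept])
```

With a real page, the same band could probably be removed by cropping instead, for example by passing a bounding box that excludes the top and bottom margins to page.crop() before calling extract_text(). The right margin value depends on the document and would need to be tuned.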

    Note: this table-based check assumes that, if a page has columns, all of its text is laid out in columns. Pages that mix full-width and multi-column regions may not be handled correctly. Also, since each row is joined left to right, cells that sit side by side in different columns still land on the same output line.
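An alternative that does not rely on the table finder is to look at word geometry: collect the horizontal extent of every word and treat any wide horizontal stretch that no word covers as a gutter between columns. The following is a minimal sketch of that idea, not a definitive implementation. The helper names `find_column_bounds` and `extract_text_by_columns` and the 15-point `min_gap` threshold are assumptions made for illustration; the pdfplumber calls used are page.extract_words(), page.bbox, page.crop() and extract_text().

```python
def find_column_bounds(words, min_gap=15.0):
    """Return the (x0, x1) extent of each text column, left to right.

    `words` is a list of dicts with numeric "x0" and "x1" keys, as in
    pdfplumber word dicts. A new column starts wherever the uncovered
    horizontal gap is wider than `min_gap` (an assumed threshold, in
    PDF points).
    """
    spans = sorted((w["x0"], w["x1"]) for w in words)
    columns = []
    for x0, x1 in spans:
        if columns and x0 - columns[-1][1] <= min_gap:
            # Overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            # Wide gap (or first word): start a new column.
            columns.append([x0, x1])
    return [(left, right) for left, right in columns]


def extract_text_by_columns(page, min_gap=15.0):
    """Read a pdfplumber page one column at a time, left to right."""
    words = page.extract_words()
    if not words:
        return ""
    left, top, right, bottom = page.bbox
    parts = []
    for x0, x1 in find_column_bounds(words, min_gap):
        # Pad by one point, clamped so the box stays inside the page.
        bbox = (max(left, x0 - 1), top, min(right, x1 + 1), bottom)
        parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)


if __name__ == "__main__":
    # Synthetic two-column layout with a 40-point gutter.
    sample = [
        {"x0": 50, "x1": 120}, {"x0": 125, "x1": 280},
        {"x0": 60, "x1": 270},
        {"x0": 320, "x1": 400}, {"x0": 405, "x1": 550},
    ]
    print(find_column_bounds(sample))
```

A page without columns yields a single interval, so the same code path covers one, two, or more columns. The sketch has its own limits: a full-width heading or a footer that crosses the gutter will bridge the gap and make the page look like one column, so it may help to strip headers and footers first and to tune `min_gap` for the document at hand.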
