1. Databases

In March 2023, I created two databases of decisions of the French Constitutional Council (Conseil Constitutionnel - CC), using the strategies described in the course Data Science e Direito (Data Science and Law, in Portuguese).

1.1 Databases in CSV

1.2 Databases for Excel

2. API-CC Database documentation

The first database is extracted from the API of the CC, available on the French government's open data site: www.data.gouv.fr.

The URL of the API is https://www.data.gouv.fr/fr/datasets/constit-les-decisions-du-conseil-constitutionnel/.

Although the data is available, it is not organized as a table, which makes it difficult to use for people without IT expertise. So I downloaded the data and organized the information into the following CSV table.

The data comes in a tar.gz file, a compressed archive format that most spreadsheet and statistical software cannot open directly. So I used the Python standard library module tarfile to decompress the files and save them all in the same folder (CCdata, inside the working directory).
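The decompression step takes only a few lines. Here is a minimal sketch, assuming the downloaded archive is named constit.tar.gz (the actual file name is hypothetical):

import tarfile

# Open the compressed archive downloaded from data.gouv.fr
# ('constit.tar.gz' is a hypothetical name; use the real one)
with tarfile.open('constit.tar.gz', 'r:gz') as archive:
    # Extract every decision file into the CCdata folder
    archive.extractall('CCdata')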

2.1 Python

The data extraction and organization were done with Python scripts.

If you are not proficient in Python, I strongly suggest you download the Anaconda platform, which installs Python together with the excellent Spyder editor, which I use for coding.

2.2 DSD module

The scripts use the dsd module of functions. It is a simplified module, extracted from the dsd library we developed for scraping data from the Brazilian Supreme Court (Supremo Tribunal Federal). The complete library is available in a GitHub repository.

For the scripts to work properly, you must save the module dsd.py in the same working directory where the other scripts are saved.
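Since the scripts call the dsd helpers throughout, here is a minimal sketch of how the most used functions behave, inferred only from the way the scripts call them; the actual dsd.py in the GitHub repository is more complete and may differ in details:

# Sketch of the main dsd helpers, inferred from usage; the real dsd.py
# in the GitHub repository is more complete and may differ in details.
import csv
import requests

def extract(text, start, end):
    # Substring between the markers start and end; an empty marker means
    # the boundary of the text. Returns 'NA' when the start marker is missing.
    try:
        after = text.split(start, 1)[1] if start else text
        return after.split(end, 1)[0] if end else after
    except IndexError:
        return 'NA'

def extract_category(text, tag):
    # Content of an XML element such as <ID>...</ID>
    return extract(text, '<' + tag + '>', '</' + tag + '>')

def load_file(path):
    # Read a whole file as a single string
    with open(path, encoding='utf-8') as f:
        return f.read()

def get(url):
    # Download a web page as text
    return requests.get(url).text

def write_csv_row(filename, row):
    # Append one row to a CSV file (the delimiter is an assumption here)
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f, delimiter=';').writerow(row)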

2.3 Consolidate the data

With all the data saved as independent files in the same folder, we can organize the information in a structured way, converting it into a table.

# -*- coding: utf-8 -*-
"""
Created on Fri Mar  3 23:35:26 2023

@author: Alexandre Costa
"""

import os, dsd

path_source = os.path.join(os.getcwd(), 'CCdata')  # folder with the decompressed files
files = os.listdir(path_source)
filename = 'csv_api_cc.txt'
total = []
decisions = []

dsd.write_csv_row(filename, ['url_cc','identity','ancien_id','origine','url','nature','nature_qualifiee','titre','date_dec','juridiction','numero','solution','nor','titre_jo','ecli','loi_def1','loi_def2','len(text_content)','text_content','closing','saisine_content','observations_content'])

for file in files:
# for file in range(1):
    identity = 'NA' 
    ancien_id = 'NA' 
    origine = 'NA' 
    url = 'NA'
    nature = 'NA' 
    titre = 'NA' 
    date_dec = 'NA'  
    juridiction = 'NA' 
    numero = 'NA'
    solution = 'NA' 
    nor = 'NA' 
    nature_qualifiee = 'NA' 
    titre_jo = 'NA' 
    url_cc = 'NA' 
    ecli = 'NA'
    text_content = 'NA'
    closing = 'NA'
    saisine_content = 'NA'
    observations_content = 'NA'
    
    
    decision = dsd.load_file(os.path.join(path_source, file))
    decision = decision.replace('>|','>')
    identity = dsd.extract_category(decision, 'ID')
    ancien_id = dsd.extract_category(decision, 'ANCIEN_ID') 
    origine = dsd.extract_category(decision, 'ORIGINE') 
    url = dsd.extract_category(decision, 'URL') 
    nature = dsd.extract_category(decision, 'NATURE') 
    titre = dsd.extract_category(decision, 'TITRE') 
    date_dec = dsd.extract_category(decision, 'DATE_DEC') 
    juridiction = dsd.extract_category(decision, 'JURIDICTION') 
    numero = dsd.extract_category(decision, 'NUMERO') 
    solution = dsd.extract_category(decision, 'SOLUTION') 
    nor = dsd.extract_category(decision, 'NOR') 
    nature_qualifiee = dsd.extract_category(decision, 'NATURE_QUALIFIEE') 
    titre_jo = dsd.extract_category(decision, 'TITRE_JO') 
    url_cc = dsd.extract_category(decision, 'URL_CC') 
    ecli = dsd.extract_category(decision, 'ECLI')
    
    # Split the LOI_DEF field on '>'; guard against a missing separator
    loi_def = dsd.extract(decision,'LOI_DEF','<')
    if '>' in loi_def:
        loi_def1 = loi_def.split('>')[0].strip()
        loi_def2 = loi_def.split('>')[1]
    else:
        loi_def1 = loi_def2 = 'NA'
    # treat the placeholder date '2999-01-01' as missing
    if '2999-01-01' in loi_def1:
        loi_def1 = 'NA'
    
    text_content = dsd.extract(decision,'<BLOC_TEXTUEL>\n<CONTENU>','</CONTENU>')
    text_content = text_content.upper()
    # Strip leading/trailing '<BR/>' tags, '|' separators and whitespace,
    # repeating until the string stops changing
    previous = None
    while text_content != previous:
        previous = text_content
        text_content = text_content.strip('<BR/>|').strip()
    
    # Normalize the two closing formulas ('JUGÉ PAR' / 'DÉLIBÉRÉ PAR LE CONSEIL')
    # and split the closing section away from the body of the decision
    text_content = text_content.replace('<BR/>JUGÉ PAR LE CONSEIL','<BR/>JUGÉ/DÉLIBÉRÉ PAR LE CONSEIL')
    text_content = text_content.replace('<BR/>DÉLIBÉRÉ PAR LE CONSEIL','<BR/>JUGÉ/DÉLIBÉRÉ PAR LE CONSEIL')
    if '<BR/>JUGÉ/DÉLIBÉRÉ PAR LE CONSEIL' in text_content:
        closing = 'JUGÉ/DÉLIBÉRÉ PAR LE CONSEIL' + dsd.extract(text_content, '<BR/>JUGÉ/DÉLIBÉRÉ PAR LE CONSEIL', '')
        text_content = dsd.extract(text_content,'','<BR/>JUGÉ/DÉLIBÉRÉ PAR LE CONSEIL')


    
    saisine_content = dsd.extract(decision,'<SAISINE>\n<CONTENU>','</CONTENU>')
    observations_content = dsd.extract(decision,'<OBSERVATIONS>\n<CONTENU>','</CONTENU>')
    
    datalist =      [url_cc,
                     identity,
                     ancien_id,
                     origine,
                     url,
                     nature,
                     nature_qualifiee,
                     titre,
                     date_dec,
                     juridiction,
                     numero,
                     solution,
                     nor,
                     titre_jo,
                     ecli, 
                     loi_def1,
                     loi_def2,
                     len(text_content),
                     text_content[:5000],
                     closing, 
                     saisine_content, 
                     observations_content]
    
    print (url_cc)
    dsd.write_csv_row(filename, datalist)
    total.append(datalist)
    dsd.write_csv_row('decisions.txt', [url_cc,nature,solution,text_content, closing])
    if nature == 'QPC':
        dsd.write_csv_row('QPC_API.txt', datalist)
    if nature == 'AN':
        dsd.write_csv_row('AN_API.txt', datalist)

2.4 The files with the tables

The process creates the following file.

Although this file contains the complete data, some of the values are strings that exceed the capacity of a single Excel cell (32,767 characters), which makes it difficult to view for most people.

For that reason, I truncated the longer values, limiting all strings to 5,000 characters. This produces a limited version, but one that can be handled as an xlsx file.
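As a minimal sketch of that truncation step, assuming the rows are still in memory in the total list built by the script above (the output name limited.txt is hypothetical):

# Hypothetical truncation pass: cap every string value at 5,000
# characters so each value fits comfortably in an Excel cell
LIMIT = 5000
for row in total:
    limited = [value[:LIMIT] if isinstance(value, str) else value
               for value in row]
    dsd.write_csv_row('limited.txt', limited)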

3. DataCC Database

3.1 Necessity of the data scraping

The best feature of the API database is its rich set of metadata fields, with many relevant categories, such as the result and the date of each decision. However, it has a major limitation: the decision text comes as a single string, without markers that would make it easier to parse the string and organize the information.

I noticed that the webpage of the CC has these kinds of markers, and that there is a separate page for each decision, which makes it relatively simple to extract the data.

3.2 Extract the list of links

The following code queries the search page of the CC website and extracts a list of permalinks for all the decisions.

# -*- coding: utf-8 -*-
"""
Created on Thu Mar  2 15:12:46 2023

@author : Alexandre Costa
"""
import dsd
import os

# Define the search to be run on the site of the Conseil Constitutionnel

search = 'https://recherche.conseil-constitutionnel.fr/?mid=a35262a4dccb2f69a36693ec74e69d26&filtres[]=type_doc%3AversionHTML&offsetCooc=&offsetDisplay=0&nbResultDisplay=10&nbCoocDisplay=&UseCluster=&cluster=&showExtr=&sortBy=date&typeQuery=3&dateBefore=&dateAfter=&xtmc=&xtnp=p1&rech_ok=1&datepicker=&date-from=1960-01-01&date-to=2023-03-02'

# Get the data of the search
html = dsd.get(search)
html = dsd.extract(html,'>Trier par<', '')

# Define the number of pages to be extracted.

pages = dsd.extract(html,'</div><span>sur ','<')
pages = int(pages)

datalist = []

# Define the number of links extracted
n = 0

# Sets a variable to stop the iteration
finish = 0

# Define the name of the file to generate with data
filename = 'List_of_links.txt'

# If the file already exists, record its first and last saved links,
# so the scraper can stop when it reaches a decision already collected
if os.path.isfile(filename):
    saved_links = dsd.load_csv(filename)
    last_decision = saved_links[0]
    first_decision = saved_links[-1]
else:
    last_decision = ''
    first_decision = ''

# Iterated extraction of pages of results
for page in range(pages):
    
    url = ('https://recherche.conseil-constitutionnel.fr/?mid=a35262a4dccb2f69a36693ec74e69d26&filtres[]=type_doc%3AversionHTML&offsetCooc=&offsetDisplay='
                   + str(page) +'0&nbResultDisplay=10&nbCoocDisplay=&UseCluster=&cluster=&showExtr=&sortBy=date&typeQuery=3&dateBefore=&dateAfter=&xtmc=&xtnp=p'
                   +str(page+1) +
                   '1&rech_ok=1&datepicker=&date-from=1960-01-01&date-to=2023-03-02')    
    
    data = dsd.get(url)
    data = dsd.extract(data,'>Trier par<', '')
    
    # Create a list from the data from the page
    data_page = dsd.extract_list(data, '<article role="article" class="type-decision">')
    
    # Extract the links
    for item in data_page:
        if finish == 1:
            break
        link = dsd.extract_link(item)
        if link:
            n = n+1
            print (n)
            decision = [link]
            if decision == last_decision or decision == first_decision:
                finish = 1
                break
            dsd.write_csv_row(filename, decision)
    
    if finish == 1:
        break
    
    
CC_List_of_links.py

This program generates a list of links to the decisions of the Conseil Constitutionnel.

3.3 Extract the data

With the list of links, we can extract the data and save all the decisions in a single CSV file.

# -*- coding: utf-8 -*-
"""
@author : Alexandre Costa
"""

import dsd

# Define the number of links extracted
n = 0

# Define the name of the file with the list of urls
filelinks = 'List_of_links.txt'

# Define the name of the file to save with data
filename = 'Data_CC.txt'

# Load list from csv
urls = dsd.load_csv(filelinks)

datalist = []

# Extract data from url
for url in urls:
    n = n+1
    print (f'{n} of {len(urls)}, {url[0]}')
    html = dsd.get(url[0])
    html = dsd.extract(html,'<div  class="wrapper-content">','<div class="wrapper-dernieres-decisions">')
    html = html.strip()

    # actors = dsd.extract(html,'', '<p>LE CONSEIL CONSTITUTIONNEL,</p>')
    # vu = dsd.extract(html, '<p>LE CONSEIL CONSTITUTIONNEL,</p>','<p>Le rapporteur ayant été entendu&nbsp;;</p>')
    # consideranda = dsd.extract(html,'<p>Le rapporteur ayant été entendu&nbsp;;</p>','')
    
    data = [url[0], html]
    
    dsd.write_csv_row(filename, data)

CC_API_Datascraper.py

3.4 Organizing the data

After extracting all the information, I organized it in a table using the following code:

# -*- coding: utf-8 -*-
"""
Created on Mon Mar  6 14:13:49 2023

@author : Alexandre Costa
"""

import dsd

filesource = 'Data_CC.txt'
datafile = 'Data_CC2.txt'
total_data = []

datalist = dsd.load_csv(filesource)

# Process data
for item in datalist:
    url_CC = item[0]
    text_total = item[1]

    # Initialize the conditional fields so a decision without these sections
    # does not raise a NameError or inherit values from the previous decision
    au_vu_textes = 'NA'
    au_vu_pieces = 'NA'
    articles_decisions = []
    
    # Derive the decision type (e.g. QPC, DC, AN) from the file name in the URL
    nature = url_CC.replace('.htm','')
    nature = nature.split('/')[-1][-4:]
    for character in nature:
        if character.isdigit():
            nature = nature.replace(character,'')
    
    # Define text_total
    text_total0 = text_total
    text_total = text_total.replace('|','')
    text_total = text_total.replace('<div class="clearfix text-formatted field field--name-field-contenu-original field--type-text-long field--label-hidden field__item">','')
    text_total = text_total.strip('<p>')
    text_total = dsd.extract(text_total, '', '<div class="cartouche-decision-print">')
    
    # Remove blockquote
    text_total_blockquote = dsd.extract(text_total, '<blockquote>', '</blockquote>')
    if text_total_blockquote != 'NA':
       text_total = text_total.replace(text_total_blockquote,'')
       
    juge_par = 'NA'
    if 'Jugé par le Conseil constitutionnel' in text_total_blockquote:
        juge_par = dsd.extract(text_total_blockquote,'Jugé par le Conseil constitutionnel','')
        text_total_blockquote = text_total_blockquote.split('Jugé par le Conseil constitutionnel')[0]
    
    if 'Délibéré par le Conseil constitutionnel' in text_total_blockquote:
        juge_par = dsd.extract(text_total_blockquote,'Délibéré par le Conseil constitutionnel','')
        text_total_blockquote = text_total_blockquote.split('Délibéré par le Conseil constitutionnel')[0]
    
    if 'LE CONSEIL CONSTITUTIONNEL DÉCIDE' in text_total_blockquote:
        text_total_blockquote = text_total_blockquote.replace('<sup>er</sup>','')
        articles_decisions = text_total_blockquote.split('<br />Article')[1:]
       
    if 'où siégeaient' in juge_par:
        session = dsd.extract(juge_par, 'du ','où siégeaient')
    else:
        session = 'NA'
    
    # Remove considerant
    text_total_considerant = dsd.extract(text_total, '<p class="considerant"', '<blockquote>')
    if text_total_considerant != 'NA':
        text_total = text_total.replace(text_total_considerant,'')
        text_total_considerant = text_total_considerant.split('<span class="numero-considerant"')[1:]
        for n in range(len(text_total_considerant)):
            text_total_considerant[n] = dsd.extract(text_total_considerant[n],'>','')
            text_total_considerant[n] = text_total_considerant[n].replace('</span>','')
            text_total_considerant[n] = text_total_considerant[n].replace('</p><p class="considerant">','')
            

    
    text_total_vu = dsd.extract(text_total, '<p>Au vu des textes suivants', 'Après avoir entendu le rapporteur')
    if text_total_vu != 'NA':
        text_total = text_total.replace(text_total_vu,'')
        au_vu_textes = dsd.extract(text_total_vu,'','Au vu des pièces suivantes :')
        au_vu_pieces = dsd.extract(text_total_vu,'Au vu des pièces suivantes :','')

    data_to_save = [url_CC,
                    nature,
                    len(text_total0),
                    len(text_total), text_total,
                    len(au_vu_textes), au_vu_textes,
                    len(au_vu_pieces), au_vu_pieces,
                    len(text_total_considerant), text_total_considerant,
                    len(text_total_blockquote), text_total_blockquote,
                    len(juge_par), juge_par,
                    len(articles_decisions), articles_decisions,
                    len(session), session
                    ]
    
    total_data.append(data_to_save)

head = ['url_CC',
        'nature',
        'len(text_total0)',
        'len(text_total)', 'text_total',
        'len(au_vu_textes)', 'au_vu_textes',
        'len(au_vu_pieces)', 'au_vu_pieces',
        'len(text_total_considerant)', 'text_total_considerant',
        'len(text_total_blockquote)', 'text_total_blockquote',
        'len(juge_par)', 'juge_par',
        'len(articles_decisions)', 'articles_decisions',
        'len(session)', 'session']

dsd.write_csv_row(datafile, head)                
dsd.write_csv_rows(datafile, total_data)

3.5 The final files

Once more, the result was complete, but too big for Excel to handle. So, there are once more two files: the full CSV version and a truncated version for Excel.
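As a sketch of how the Excel version can be produced, assuming Data_CC2.txt is a standard comma-separated CSV and that pandas and openpyxl are installed (both assumptions, since the dsd CSV format may differ):

import pandas as pd

# Load the consolidated table (assumes a comma-separated file)
df = pd.read_csv('Data_CC2.txt')

# Cap long strings at 5,000 characters so every value fits in an Excel cell
df = df.applymap(lambda v: v[:5000] if isinstance(v, str) else v)

# Write the Excel-friendly version (requires openpyxl)
df.to_excel('Data_CC2.xlsx', index=False)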