Extract the "Matrix form" dataset from BCS website

I want to extract / scrape the “Matrix form” dataset from the BCS website, as shown below:

I tried with the following python code snippet, but still failed to figure out the trick:

import requests
from bs4 import BeautifulSoup
import re

proxies = {
    'http': 'socks5h://127.0.0.1:18888',
    'https': 'socks5h://127.0.0.1:18888'
}

requests.packages.urllib3.disable_warnings()
r = requests.get('https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=17&type=plane', proxies=proxies, verify=False)
soup = BeautifulSoup(r.content, features="lxml")

table = soup.find('table')
id = table.find_all('id')

My python environment is as follows:

werner@X10DAi:~$ pyenv shell datasci 
(datasci) werner@X10DAi:~$ python --version
Python 3.11.1

Any tips will be appreciated.

Regards,
Zhao

I’m not sure what your exact problem is, as you only showed the code without describing what actually goes wrong. But I do see some issues. I will ignore the proxy stuff, as I’m not sure what you are trying to do there, and will instead focus on the scraping part.

  1. There are multiple tables in the HTML. If you are unsure why something isn’t working, the first step is to print out the content and view it directly to see if the HTML structure is as you expect:

    print(soup)
    
  2. It is clear from your description that you intend to extract the matrix, and looking at the HTML we retrieve, we can determine that it is the second table on the page. You could acquire it simply by finding all tables and grabbing the second one in the list:

    table = soup.find_all('table')[1]
    
  3. It seems you want to extract the id from the td elements under that table, but this cannot be done via table.find_all('id'), as that looks for a tag named id, not the id attribute. To do this, you can use one of the following:

    print(table.find_all(id=True))
    print(table.find_all(attrs={'id': True}))
    

    I personally prefer the second, as I find find_all’s behavior of taking arbitrary keyword arguments to find attributes awkward, but you can use the first if you prefer. Because we say 'id': True, we are simply saying that if the attribute exists, the element is returned. If we had specified a string instead, it would find elements whose attribute value matches that string.

  4. Lastly, we can extract the id from each returned element, as it seems to contain the matrix you desire. We can access a specific attribute via element['attribute-name']:

    print([el['id'] for el in table.find_all(attrs={'id': True})])
    

    Which gives you:

    ['[   1   0   ][   0   ]\n[   0   1   ][   0   ]', '[   0  -1   ][   0]\n[   1  -1   ][   0]\n', '[  -1   0   ][   0]\n[   0  -1   ][   0]\n', '[   0  -1   ][   0]\n[  -1   0   ][   0]\n']
    

    At this point, it is simply a string parsing exercise.
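For example, a minimal sketch of that parsing step, using the first string from the output above:

```python
import re

# One of the td id strings retrieved above.
s = '[   1   0   ][   0   ]\n[   0   1   ][   0   ]'

# Split on newlines and pull the signed integers out of each row,
# skipping blank lines (some strings end with a trailing newline).
rows = [[int(n) for n in re.findall(r'[-+]?\d+', line)]
        for line in s.split('\n') if line.strip()]
print(rows)  # [[1, 0, 0], [0, 1, 0]]
```

From there you can reshape the rows into whatever matrix form you need.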


Personally, I prefer using CSS selectors. Looking at the HTML, we can see that the desired table is the only one wrapped in a center tag, so we can extract the table directly nested under the center tag and then find all the td elements that have the id attribute:

print([el['id'] for el in soup.select('center > table td[id]')])

Just another approach.


For reference, complete code showing both approaches:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=17&type=plane')
soup = BeautifulSoup(r.content, features="lxml")

table = soup.find_all('table')[1]
print([el['id'] for el in table.find_all(attrs={'id': True})])

print([el['id'] for el in soup.select('center > table td[id]')])

I tried the following, but still can’t figure out how to match integer/rationals/fractions at the same time:

import requests
from bs4 import BeautifulSoup
import re
import numpy as np

proxies = {
    'http': 'socks5h://127.0.0.1:18888',
    'https': 'socks5h://127.0.0.1:18888'
}

requests.packages.urllib3.disable_warnings()
def getBCSGens(url):
    r = requests.get(url, proxies=proxies, verify=False)
    soup = BeautifulSoup(r.content, features="lxml")

    table = soup.find_all('table')[1]

    id = [el['id'] for el in soup.select('center > table td[id]')]

    m = []
    for e in id:
        i = [int(s) for s in re.findall(r'[-+]?\d+', e)]
        d = int(len(i)**(1/2))
        i = np.array(i + [0]*d+[1] )
        i = i.reshape(d+1,d+1).tolist()
        m.append(i)
        
    print(m)

url = 'https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=' + str(4) +'&type=plane'
getBCSGens(url)

Using the code snippet above, I encountered the following error:

File "/home/werner/Public/repo/github.com/gap-system/learning-by-doing/Mathematica-GAP-SpaceGroupIrep/Research-Interests/MinimalGeneratingSetOfSpaceGroup/data/bcs/bcs.py", line 59, in
getBCSGens(url)
File "/home/werner/Public/repo/github.com/gap-system/learning-by-doing/Mathematica-GAP-SpaceGroupIrep/Research-Interests/MinimalGeneratingSetOfSpaceGroup/data/bcs/bcs.py", line 53, in getBCSGens
i = i.reshape(d+1,d+1).tolist()

builtins.ValueError: cannot reshape array of size 10 into shape (3,3)

The culprit is that this time there is a fractional number in the webpage, as shown below:

So, what’s the correct method with regexp to match integer/rationals/fractions with possible preceding signs at the same time?

Something like this could work.

RE_NUMS = re.compile(r'[-+]?[\d/]+')

This captures a preceding +/- as well as integers and fractions. I’m sure we could also capture scientific notation (1e3) or decimals (0.3) if we wanted, but I’m keeping it to what you’ve shown, as I don’t know all the cases you need to cover.
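As a quick sanity check (the sample row below is made up to mirror the td id strings; note the permissive character class would also accept malformed tokens like //, so validate downstream if needed):

```python
import re

RE_NUMS = re.compile(r'[-+]?[\d/]+')

# A row mixing signed integers and a fraction, shaped like the id strings.
sample = '[  -1   0   ][  1/2 ]'
print(RE_NUMS.findall(sample))  # ['-1', '0', '1/2']
```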

I did notice, though, that you are converting the result to an integer, which won’t work for a fraction. I don’t know exactly what you need, but I’ll assume for now that you need to preserve the fraction, so something like the code below could work. You may need something more advanced, but it should give you a foundation to build on.

import requests
from bs4 import BeautifulSoup
import re
import numpy as np

RE_NUMS = re.compile(r'[-+]?[\d/]+')
URL = 'https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=4&type=plane'


def frac_to_float(s):
    if '/' in s:
        numerator, denominator = s.split('/')
        return float(numerator) / float(denominator)
    else:
        return float(s)


def getBCSGens(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, features="lxml")

    ids = [el['id'] for el in soup.select('center > table td[id]')]

    m = []
    for e in ids:
        i = [frac_to_float(s) for s in RE_NUMS.findall(e)]
        d = int(len(i) ** (1 / 2))
        i = np.array(i + [0] * d + [1] )
        i = i.reshape(d + 1, d + 1).tolist()
        m.append(i)

    print(m)


getBCSGens(URL)

Keep in mind that I do not know exactly what your end goal is, but I assume these are the results you are looking for.

[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]]]

Thank you very much for wonderful tips and tricks.

Judging from my current situation, it seems that the following is enough:

RE_NUMS = re.compile(r'-?[\d/]+')

In my case, these data will be reused in GAP and must be expressed in fraction forms, as shown below:

gap> gens:=[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]]];
[ [ [ 1., 0., 0. ], [ 0., 1., 0. ], [ 0., 0., 1. ] ], [ [ -1., 0., 0. ], [ 0., 1., 0.5 ], [ 0., 0., 1. ] ] ]
gap> AffineCrystGroupOnLeft(last);
Error, no method found! For debugging hints type ?Recovery from NoMethodFound
Error, no 1st choice method found for `IsAffineCrystGroupOnRight' on 1 arguments at /home/werner/Public/repo/github.com/gap-system/gap.git/lib/methsel2.g:249 called from
IsAffineCrystGroupOnRight( S ) at /home/werner/Public/repo/github.com/gap-system/gap.git/pkg/cryst/gap/cryst.gi:814 called from
AsAffineCrystGroupOnLeft( G ) at /home/werner/Public/repo/github.com/gap-system/gap.git/pkg/cryst/gap/cryst.gi:720 called from
<function "AffineCrystGroupOnLeft">( <arguments> )
 called from read-eval loop at *stdin*:9
type 'quit;' to quit to outer loop
brk> 
gap> List(gens, x -> List(x, y -> List(y, z -> Rat(z)) ));
[ [ [ 1, 0, 0 ], [ 0, 1, 0 ], [ 0, 0, 1 ] ], [ [ -1, 0, 0 ], [ 0, 1, 1/2 ], [ 0, 0, 1 ] ] ]
gap> AffineCrystGroupOnLeft(last);
Group([ [ [ 1, 0, 0 ], [ 0, 1, 0 ], [ 0, 0, 1 ] ], [ [ -1, 0, 0 ], [ 0, 1, 1/2 ], [ 0, 0, 1 ] ] ])

As you can see, it doesn’t work when I represent them as floating point numbers.

Therefore, I try to adjust your code fragment to the following:

import requests
from bs4 import BeautifulSoup
import re
import numpy as np
from fractions import Fraction

RE_NUMS = re.compile(r'-?[\d/]+')
URL = 'https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=4&type=plane'


def frac_to_str(s):
    if '/' in s:
        return str(Fraction(s))
    else:
        return str(s)


def getBCSGens(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, features="lxml")

    ids = [el['id'] for el in soup.select('center > table td[id]')]

    m = []
    for e in ids:
        i = [frac_to_str(s) for s in RE_NUMS.findall(e)]
        d = int(len(i) ** (1 / 2))
        i = np.array(i + ['0'] * d + ['1'] )
        i = i.reshape(d + 1, d + 1).tolist()
        m.append(i)

    print(m)

getBCSGens(URL)

With the above code, I got the following results:

[[['1', '0', '0'], ['0', '1', '0'], ['0', '0', '1']], [['-1', '0', '0'], ['0', '1', '1/2'], ['0', '0', '1']]]

But I don’t want to add a single quotes in it, that is, the form I want is as follows:

[ [ [ 1, 0, 0 ], [ 0, 1, 0 ], [ 0, 0, 1 ] ], [ [ -1, 0, 0 ], [ 0, 1, 1/2 ], [ 0, 0, 1 ] ] ]

Looks good, glad you were able to figure it out.

But how do I get the above result directly in Python, that is, without all the single quotes?

But how do I get the above result directly in Python, that is, without all the single quotes?

That’s how lists print. When you print a list, it prints the repr of each entry. Strings within a list are therefore printed as quoted values, and you’ve stored all your values as strings. If you want to print a list in a special format, you’ll need a function that walks through the list and formats the output the way you desire.

I’m not sure why you are calling things like str(Fraction(s)). You take a string, convert it to a Fraction object, and then convert it right back to a string. That makes the conversion to Fraction() pointless, as you immediately undo it by calling str(). If you want the values as strings, you don’t need to convert them to numbers at all.
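To illustrate, the round trip hands back the same text (aside from normalization):

```python
from fractions import Fraction

s = '1/2'
round_trip = str(Fraction(s))
print(round_trip)            # 1/2, the same text we started with
print(str(Fraction('3/6')))  # 1/2, the only effect is normalization
```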

I assume you wanted to keep the values as numbers, not strings, but tried converting them to strings because you wanted to print the fractions pretty.

So, here is an approach that keeps them as integers or Fractions (though you could just convert them all to Fractions for consistency). Then you can call matrix_print which will walk the list and print the results in such a way as to not have the quotes.

import requests
from bs4 import BeautifulSoup
import re
import numpy as np
from fractions import Fraction

RE_NUMS = re.compile(r'-?[\d/]+')
URL = 'https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=4&type=plane'


def str_to_number(s):
    if '/' in s:
        return Fraction(s)
    else:
        return int(s)


def matrix_print(value, depth=0):
    """Print the matrix."""

    print('[', end='')
    for e, v in enumerate(value):
        if e:
            print(', ', end='')
        if isinstance(v, list):
            matrix_print(v, depth + 1)
        else:
            print(str(v), end='')
    print(']', end='' if depth else '\n')


def getBCSGens(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, features="lxml")

    ids = [el['id'] for el in soup.select('center > table td[id]')]

    m = []
    for e in ids:
        i = [str_to_number(s) for s in RE_NUMS.findall(e)]
        d = int(len(i) ** (1 / 2))
        i = np.array(i + [0] * d + [1] )
        i = i.reshape(d + 1, d + 1).tolist()
        m.append(i)

    matrix_print(m)

getBCSGens(URL)

Output

[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 1/2], [0, 0, 1]]]

Thank you for your wonderful tricks and tips again.

I still have the following puzzles:

  1. Why can’t the following simple method generate the final form discussed here?
In [49]: a=[0, 2/3, 0.5]

In [50]: [ repr(str(Fraction(s))).replace("'","") for s in a ]
Out[50]: ['0', '6004799503160661/9007199254740992', '1/2']
  2. I want to obtain data in batches and merge them into a final list for output, but the following method fails to achieve the goal:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
from fractions import Fraction

RE_NUMS = re.compile(r'-?[\d/]+')

def str_to_number(s):
    if '/' in s:
        return Fraction(s)
    else:
        return int(s)


def matrix_print(value, depth=0):
    """Print the matrix."""

    print('[', end='')
    for e, v in enumerate(value):
        if e:
            print(', ', end='')
        if isinstance(v, list):
            matrix_print(v, depth + 1)
        else:
            print(str(v), end='')
    print(']', end='' if depth else '\n')


def getBCSGens(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, features="lxml")

    ids = [el['id'] for el in soup.select('center > table td[id]')]

    m = []
    for e in ids:
        i = [str_to_number(s) for s in RE_NUMS.findall(e)]
        d = int(len(i) ** (1 / 2))
        i = np.array(i + [0] * d + [1] )
        i = i.reshape(d + 1, d + 1).tolist()
        m.append(i)

    return matrix_print(m)


data = []
for i in range(1, 17+1):
    url = 'https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=' + str(i) + '&type=plane'
    data.append( getBCSGens(url) )

print(data)

The result is as follows:

[[[1, 0, 0], [0, 1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 1/2], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 0], [0, 0, 1]], [[1, 0, 1/2], [0, 1, 1/2], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[-1, 0, 1/2], [0, 1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[-1, 0, 1/2], [0, 1, 1/2], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 0], [0, 0, 1]], [[1, 0, 1/2], [0, 1, 1/2], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[0, -1, 0], [1, 0, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[0, -1, 0], [1, 0, 0], [0, 0, 1]], [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[0, -1, 0], [1, 0, 0], [0, 0, 1]], [[-1, 0, 1/2], [0, 1, 1/2], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[0, -1, 0], [1, -1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[0, -1, 0], [1, -1, 0], [0, 0, 1]], [[0, -1, 0], [-1, 0, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[0, -1, 0], [1, -1, 0], [0, 0, 1]], [[0, 1, 0], [1, 0, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[0, -1, 0], [1, -1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]]
[[[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[0, -1, 0], [1, -1, 0], [0, 0, 1]], [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [[0, -1, 0], [-1, 0, 0], [0, 0, 1]]]
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
  3. As you can see, there are multiple columns in the table. If I want to scrape other columns, taking this webpage as an example, what features should be used to extract the contents of the corresponding columns?

  4. As a more advanced and complicated requirement: if I want to automatically scrape a sub-dataset that requires making selections through buttons and clicks, as shown below:


How can I achieve this aim before I can use the script discussed here?

I still don’t understand your end goal as you have not explained it. I don’t know if you want to do calculations with the numbers later or if you are just trying to display them.

You keep taking strings and then converting them from strings to fractions and back to strings. If you print a list of strings, all string values will have quotes because it will print the representation of strings in the list, not the value of the string.

>>> val = "some string"
>>> print(val)
some string
>>> print(repr(val))
'some string'

When printing a list, it always prints the repr() of whatever is in the list, and since you keep putting strings in the list, they will keep getting printed with quotes. The return of both str() and repr() is a string.

I want to obtain data in batches and merge them into a final list for output, but the following methods fail to achieve the goal:

If you want one list that contains all the matrices, then instead of creating a new list each time getBCSGens is called and then printing it, create a list before you call getBCSGens, pass it in and append to it on every call, and print the entire data set at the very end. Note also that getBCSGens currently returns the result of matrix_print, which is None; that is why your final data list is full of None values. This is assuming I understand what you mean.
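To sketch that pattern with a stub in place of getBCSGens (the stub and its data are hypothetical; the real change is to have getBCSGens return its m list instead of printing it):

```python
# Stub standing in for getBCSGens: returns the matrices for one group
# instead of printing them. The data here is a placeholder.
def get_gens_stub(gnum):
    return [[[1, 0, 0], [0, 1, 0], [0, 0, 1]]] * gnum

data = []
for i in range(1, 4):
    # extend merges the results into one flat list;
    # append would nest one sub-list per call instead.
    data.extend(get_gens_stub(i))

print(len(data))  # 6 matrices collected into a single list
```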

If the columns are in a different format, you will have to employ a different approach. If there is no ID to grab, you’ll have to grab the content of the column and potentially parse it in a slightly different way.

What have you tried? I cannot write the whole script for you. If you have a very specific issue with something you’ve tried, I am happy to help, but I do not have the time to write the entire solution for you.

Look at the real HTML output to understand how the structure differs, and then try a different approach to gather the matrix.

BeautifulSoup is great for parsing static pages, but pages that require JavaScript to run, or front-end calls back to the server, in order to produce the content you want are difficult for it. You will likely have to use something like the Python library selenium, which uses a webdriver to run the page in a browser instance, execute the necessary actions there, and then return the final page that BeautifulSoup can handle.

Yes, this is more advanced, and you’ll have to research how to install and use selenium to do so.

  1. My ultimate goal related to the discussion here: as you can see, we’re scraping space-group generators from the BCS website, so my purpose is to use these data to create space groups in GAP. For this purpose, the data scraped from the website must be transformed into affine matrices acting on the left, as I’ve done in the previous discussion. For the concept of affine matrices acting on the left, see the corresponding description here, as shown below:

  2. Yes. I must do group-theory-related calculations with these scraped numbers later, not just display them. For this purpose, all these numbers must be used in the exact rational form in which they were provided; no approximate rationalization from floating-point numbers back to rationals of a different form is allowed, otherwise the results will be unpredictable.

Yes. This is exactly what I want to do, and thank you for your advice. I’ll give it a try.

It seems that none of the other columns have an ID to grab, which is the main source of my confusion.

If I have tried something, I will show my attempts and problems here again. Thank you very much for your reply and help.

If there is no ID to use, what are the other possible technologies? The difficulty here is that all these columns have no special labeling for me to distinguish.

Okay, then using something like Fraction for your numbers is probably desired, though it should be noted that not all numpy methods may support Fractions. I won’t speak further on this, but you may need to make adjustments depending on what tools you plan on using. I’m not a data scientist either, so while I dabble in many of these scientific tools, I am no expert in these areas.

If there is no ID to use, what are the other possible technologies? The difficulty here is that all these columns have no special labeling for me to distinguish.

There are quite a number of things you can do to get at the data you want; it depends greatly on the scenario. The important thing is understanding the structure of the data you are parsing. You gave the example earlier of this webpage. It has no IDs you can extract the data from; instead, the matrix lives in a table nested in the 3rd column of yet another table. You can manually navigate all this using BeautifulSoup. I usually prefer CSS selectors, but that is because I am also the author of the CSS selector library that BeautifulSoup depends on, so I have an obvious bias. You can learn about all the supported selectors here.

We simply target the desired elements which contain the data and extract the text.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.cryst.ehu.es/cgi-bin/cryst/programs//nph-getgen?gnum=227&what=gen")
soup = BeautifulSoup(r.content, features="lxml")

data = [el.text for el in soup.select('center table tr:nth-child(n+3) > td:nth-child(3) table td:nth-child(2) pre')]
print(data)

Output

['  1  0  0   0\n  0  1  0   0\n  0  0  1   0', ' -1  0  0  3/4\n  0 -1  0  1/4\n  0  0  1  1/2', ' -1  0  0  1/4\n  0  1  0  1/2\n  0  0 -1  3/4', '  0  0  1   0\n  1  0  0   0\n  0  1  0   0', '  0  1  0  3/4\n  1  0  0  1/4\n  0  0 -1  1/2', ' -1  0  0   0\n  0 -1  0   0\n  0  0 -1   0', '  1  0  0   0\n  0  1  0  1/2\n  0  0  1  1/2', '  1  0  0  1/2\n  0  1  0   0\n  0  0  1  1/2']

I would spend some time learning more about the current libraries you are using and what they are capable of. Spend some time understanding how the data is stored in each given webpage so that you can understand what you have to navigate. You should also spend more time experimenting with what you’ve learned.

It may take some time to learn what is possible, but it will give you a better idea of what you can and cannot do. Armed with this knowledge, you should be able to scrape far more complicated pages and do far more complicated things in general.


Can you give such a counterexample?

I’ve heard it mentioned from time to time in the past. Since I haven’t personally tried to use Fractions with numpy, I’ll leave it as simply hearsay. Libraries are always evolving, so, as I said, this may be the case, but I’m not stating that it absolutely is. Doing a quick search, you can find a few Fraction-specific issues currently open on the numpy repository; whether any of this directly impacts you, I do not know. Just something to think about and potentially be aware of.

All I’m saying is there may be some methods that require primitive data types, but maybe things have improved since I’ve read about this or maybe it won’t impact the kind of calculations you are doing.
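For what it’s worth, here is a small sketch of the kind of thing that does and does not work (behavior may vary across numpy versions):

```python
import numpy as np
from fractions import Fraction

# Object arrays carry Fractions through exact arithmetic:
a = np.array([[Fraction(1, 2), 0], [0, 1]], dtype=object)
print(a.dot(a)[0, 0])  # 1/4, still an exact Fraction

# But numeric ufuncs generally cannot handle them, since Fraction
# has no sqrt method for the object loop to call:
try:
    np.sqrt(a)
except (TypeError, AttributeError) as e:
    print('np.sqrt rejects Fraction object arrays:', type(e).__name__)
```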


I tried to convert the above output to numpy matrices as follows, but failed:

import numpy as np
data=['  1  0  0   0\n  0  1  0   0\n  0  0  1   0', ' -1  0  0  3/4\n  0 -1  0  1/4\n  0  0  1  1/2', ' -1  0  0  1/4\n  0  1  0  1/2\n  0  0 -1  3/4', '  0  0  1   0\n  1  0  0   0\n  0  1  0   0', '  0  1  0  3/4\n  1  0  0  1/4\n  0  0 -1  1/2', ' -1  0  0   0\n  0 -1  0   0\n  0  0 -1   0', '  1  0  0   0\n  0  1  0  1/2\n  0  0  1  1/2', '  1  0  0  1/2\n  0  1  0   0\n  0  0  1  1/2']

t = [s.split("\n") + ['0, 0, 0, 1'] for s in data]

[ np.matrix(';'.join(s) ) for s in t]
Traceback (most recent call last):
  Debug Console, prompt 72, line 1
    #!/usr/bin/env python
  File "/home/werner/.pyenv/versions/datasci/lib/python3.11/site-packages/numpy/matrixlib/defmatrix.py", line 142, in __new__
    data = _convert_from_string(data)
  File "/home/werner/.pyenv/versions/datasci/lib/python3.11/site-packages/numpy/matrixlib/defmatrix.py", line 26, in _convert_from_string
    newrow.extend(map(ast.literal_eval, temp))
  File "/home/werner/.pyenv/versions/3.11.1/lib/python3.11/ast.py", line 110, in literal_eval
    return _convert(node_or_string)
  File "/home/werner/.pyenv/versions/3.11.1/lib/python3.11/ast.py", line 109, in _convert
    return _convert_signed_num(node)
  File "/home/werner/.pyenv/versions/3.11.1/lib/python3.11/ast.py", line 83, in _convert_signed_num
    return _convert_num(node)
  File "/home/werner/.pyenv/versions/3.11.1/lib/python3.11/ast.py", line 74, in _convert_num
    _raise_malformed_node(node)
  File "/home/werner/.pyenv/versions/3.11.1/lib/python3.11/ast.py", line 71, in _raise_malformed_node
    raise ValueError(msg + f': {node!r}')
builtins.ValueError: malformed node or string on line 1: <ast.BinOp object at 0x14d17c034460>

It sounds like you are feeding in an unsupported input. You’ll need to print out what you are inputting and double-check the allowed format.

Thank you for pointing me in the right troubleshooting and diagnosis direction. There was a formatting error in the input data. I have corrected it and rewritten the code into the following fragment based on your matrix_print function:

import requests
from bs4 import BeautifulSoup

requests.packages.urllib3.disable_warnings()
r = requests.get("https://www.cryst.ehu.es/cgi-bin/cryst/programs//nph-getgen?gnum=227&what=gen")
soup = BeautifulSoup(r.content, features="lxml")

def matrix_print(value, depth=0):
    """Print the matrix."""

    print('[', end='')
    for e, v in enumerate(value):
        if e:
            print(', ', end='')
        if isinstance(v, list):
            matrix_print(v, depth + 1)
        else:
            print(str(v), end='')
    #print(']', end='' if depth else '\n')
    print(']', end='')

data = [el.text for el in soup.select('center table tr:nth-child(n+3) > td:nth-child(3) table td:nth-child(2) pre')]
t = [s.split("\n") for s in data]
m = [[s.split() for s in i] + [(len(i[0].split()) - 1) * ['0'] + ['1']] for i in t]

print('[', end='')
for i in range(len(m)):
    matrix_print(m[i])
    #if i != len(m) - 1:
        #print(',', end='\n')
    #else:
        #print(']', end='\n')
    print(',' if i != len(m) - 1 else ']')

This time, I got the correct final result as follows:

[[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
[[-1, 0, 0, 3/4], [0, -1, 0, 1/4], [0, 0, 1, 1/2], [0, 0, 0, 1]],
[[-1, 0, 0, 1/4], [0, 1, 0, 1/2], [0, 0, -1, 3/4], [0, 0, 0, 1]],
[[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]],
[[0, 1, 0, 3/4], [1, 0, 0, 1/4], [0, 0, -1, 1/2], [0, 0, 0, 1]],
[[-1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]],
[[1, 0, 0, 0], [0, 1, 0, 1/2], [0, 0, 1, 1/2], [0, 0, 0, 1]],
[[1, 0, 0, 1/2], [0, 1, 0, 0], [0, 0, 1, 1/2], [0, 0, 0, 1]]]

Then I tried to apply the above method to other examples discussed earlier, however, some redundant empty sublists appeared in the final results:

import requests
from bs4 import BeautifulSoup

requests.packages.urllib3.disable_warnings()
url = "https://www.cryst.ehu.es/cgi-bin/plane/programs/nph-plane_getgen?gnum=8&type=plane"
#url = "https://www.cryst.ehu.es/cgi-bin/cryst/programs//nph-getgen?gnum=227&what=gen"
r = requests.get(url)

soup = BeautifulSoup(r.content, features="lxml")

def matrix_print(value, depth=0):
    """Print the matrix."""

    print('[', end='')
    for e, v in enumerate(value):
        if e:
            print(', ', end='')
        if isinstance(v, list):
            matrix_print(v, depth + 1)
        else:
            print(str(v), end='')
    #print(']', end='' if depth else '\n')
    print(']', end='')

data = [el.text for el in soup.select('center table tr:nth-child(n+3) > td:nth-child(3) table td:nth-child(2) pre')]
t = [s.split("\n") for s in data]
m = [[s.split() for s in i] + [(len(i[0].split()) - 1) * ['0'] + ['1']] for i in t]

print('[', end='')
for i in range(len(m)):
    matrix_print(m[i])
    #if i != len(m) - 1:
        #print(',', end='\n')
    #else:
        #print(']', end='\n')
    print(',' if i != len(m) - 1 else ']')

This time, the result is as follows:

[[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
[[-1, 0, 0], [0, -1, 0], [], [0, 0, 1]],
[[-1, 0, 1/2], [0, 1, 1/2], [], [0, 0, 1]]]

As you can see, two redundant empty sublists appear in the final result.

P.S.: In my scenario, I need the rational numbers in exactly the same form as they were provided. So I think it’s better to just stick with the string-based printing method and not use the fractions module at all.

With the string-based method, each value is kept exactly in the form in which it was originally provided, which is exactly what I require.

If you are parsing HTML on different pages and some pages keep the data in a different format, then you may not be able to use a single parsing method that accommodates all scraped data. You may have to detect differences in the HTML and then use a parsing method designed for data in that form. You have shown me at least two scenarios that do not parse the same way, so it is likely that you get empty lists because you are not accounting for all the inconsistent ways in which the site displays the data you want.
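For what it’s worth, empty sublists like yours are consistent with blank lines in the scraped text, e.g. a trailing newline in the pre content; one defensive fix is to drop empty lines before splitting the rows. A sketch on a string shaped like that output:

```python
# A trailing newline makes split('\n') emit an empty final element,
# and ''.split() then yields an empty sublist.
s = ' -1  0  0\n  0 -1  0\n'

naive = [line.split() for line in s.split('\n')]
print(naive)  # [['-1', '0', '0'], ['0', '-1', '0'], []]

fixed = [line.split() for line in s.split('\n') if line.strip()]
print(fixed)  # [['-1', '0', '0'], ['0', '-1', '0']]
```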

Whether you keep the data in strings, Fractions, or floats doesn’t really matter. I thought you wanted to keep the data in some numerical form as you were going to do calculations later, but if you prefer strings, that is fine, you may have to convert the values to numerical form later, but that is up to you.

Why do you say that? Look at my example below. Let’s say I’m using floating-point numbers to represent three numbers that I’m grabbing in the form of rational numbers:

In [1]: a=[1/2,2/3,1/9]

In [2]: a
Out[2]: [0.5, 0.6666666666666666, 0.1111111111111111]

In this case, I’m not sure it’s possible to convert the last two floating-point numbers, 0.6666666666666666 and 0.1111111111111111, back to their original rational (fractional) forms, i.e. 2/3 and 1/9. Is it possible to switch back?
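From what I can tell, Fraction.limit_denominator can often guess the intended rational back from a float, but it is only an approximation heuristic, not a true inverse, so it cannot guarantee the exactness I need:

```python
from fractions import Fraction

# limit_denominator searches for the closest fraction with a
# bounded denominator (default limit 10**6):
print(Fraction(0.6666666666666666).limit_denominator())  # 2/3
print(Fraction(0.1111111111111111).limit_denominator())  # 1/9

# Without the limit, Fraction sees the exact binary expansion:
print(Fraction(0.6666666666666666))  # 6004799503160661/9007199254740992
```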

In theory, floating-point numbers are not exact numerical representations. To date, we have no exact computer representation of arbitrary real numbers; among the usual number fields, only the rationals can be represented exactly. This is why the default implementation of GAP only supports calculations over the rationals; representing and processing other real numbers and special irrational numbers is much more complex.