Extracting Part of a link

GMD · March 23, 2021, 6:04pm

I have an img tag that looks like this:

<img class="mlb-player-image lazyloaded" data- 
srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard 1x, 
/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard@2x 2x" alt="Francisco Alvarez" src="/img/silhouette.gif" 
srcset="" style="">

What would the code be to extract the number that follows /images/ (here, 682626)?

Thank you
GMD

erlendaasland · March 23, 2021, 8:39pm

I’d suggest taking a look at the html.parser module for parsing the tag, and the re module for extracting text.
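
As a rough sketch of that suggestion, using only the standard library: html.parser collects the data-srcset value of each img tag, and a regular expression pulls out the digits that follow /images/. The class name, function name, and pattern are my own assumptions based on the tag pasted above, not anything the site documents.

```python
import re
from html.parser import HTMLParser

# Assumption: the player ID is the run of digits right after "/images/".
PLAYER_ID_RE = re.compile(r"/images/(\d+)/")


class ImgSrcsetCollector(HTMLParser):
    """Collect the data-srcset value of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.srcsets = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            value = dict(attrs).get("data-srcset")
            if value:
                self.srcsets.append(value)


def extract_player_ids(html):
    """Return the player IDs found in the img tags of an HTML string."""
    parser = ImgSrcsetCollector()
    parser.feed(html)
    parser.close()
    ids = []
    for srcset in parser.srcsets:
        match = PLAYER_ID_RE.search(srcset)
        if match:
            ids.append(match.group(1))
    return ids


sample = (
    '<img class="mlb-player-image lazyloaded" '
    'data-srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg'
    '?preset=teamPlayerCard 1x" alt="Francisco Alvarez" src="/img/silhouette.gif">'
)
print(extract_player_ids(sample))
```

Matching on the /images/ prefix rather than on a fixed position in the path should keep working if the leading part of the URL changes, though that is a guess about how the site builds its URLs.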

GMD · March 24, 2021, 12:50pm

I probably didn’t ask the right question. The IMG tag is a bs4.object that was extracted as part of a list. There are 30 list elements. This is one of the elements:

<li class="col-xs-6"><div class="rank">1</div><div class="prospect-image"><img alt="Francisco Alvarez" class="lazyload mlb-player-image" data-aspectratio="45/68" data-srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg?preset=teamPlayerCard 1x, /remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg?preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a href="#104328"> Francisco Alvarez </a> <span class="prospect-position">C</span></div></li>

I’m trying to get the player ID from that. It doesn’t really have a tag or part of the text of a tag that I can figure out how to grab.

Thanks

GMD · March 25, 2021, 12:02am

I have been working on the code and this is what I’ve come up with:

for first_player in my_list:
    rank = first_player.div.text
    ranks.append(rank)
    name = first_player.a.text.strip()
    names.append(name)
    position = first_player.span.text
    positions.append(position)
    player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]
    player_ids.append(player_id)

This has worked on 27 of 30 webpages. I’m stumped as to why it doesn’t work on the other 3. The pages are the same but with different players on them. On the ones that it doesn’t work I get this error:

Traceback (most recent call last):
  File "C:/Users/HP/PycharmProjects/CTCBL/NYN.py", line 54, in <module>
    player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]
KeyError: 'data-srcset'

This is the list item it is using:

<li class="col-xs-6"><div class="rank">2</div><div class="prospect-image"><img 
alt="Ronny Mauricio" class="lazyload mlb-player-image" data-aspectratio="45/68" data- 
srcset="/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg? 
preset=teamPlayerCard 1x,/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg? 
preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a 
href="#100109"> Ronny Mauricio </a> <span class="prospect-position">SS</span> 
</div></li>

What is also perplexing is that if I take the code out of the loop it runs for the first list item.

Thanks for any help you can give.

Maroloccio · March 25, 2021, 8:27am

That tag does not have a ‘data-srcset’ attribute; it only has these
(which you get via pprint.pprint(first_player.find('img').attrs)):

    {'alt': 'Ronny Mauricio',
     'class': ['lazyload', 'mlb-player-image'],
     'data-': '',
     'data-aspectratio': '45/68',
     'srcset': '/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg?\n'
               'preset=teamPlayerCard '
               '1x,/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg?\n'
               'preset=teamPlayerCard@2x 2x'}

erlendaasland · March 25, 2021, 9:03am

You can use a try … except statement to filter out tags without the data-srcset attribute:

try:
    player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]
except KeyError:
    pass  # put any error handling here
else:
    player_ids.append(player_id)
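
A possible variation, sketched under assumptions: dict.get returns None instead of raising KeyError, and since the attrs of a Beautiful Soup tag is a plain dict, the helper below works on dicts directly. The function name and the sample data are hypothetical. One thing to watch in the original loop is that rank, name, and position are appended before the ID lookup, so skipping an ID would leave player_ids shorter than the other lists; storing None keeps them the same length.

```python
def player_id_from_attrs(attrs):
    """Return the player ID from an <img> attribute dict, or None if absent."""
    srcset = attrs.get("data-srcset")  # None instead of a KeyError
    if not srcset:
        return None
    parts = srcset.split("/")
    # Same position the original code uses:
    # ['', 'remote.axd', 'www.milb.com', 'images', '<id>', ...]
    return parts[4] if len(parts) > 4 else None


# Stand-ins for first_player.find('img').attrs from two list items.
img_attrs = [
    {
        "alt": "Francisco Alvarez",
        "data-srcset": "/remote.axd/www.milb.com/images/682626/generic/"
                       "180x270/682626.jpg?preset=teamPlayerCard 1x",
    },
    {"alt": "Example Player"},  # hypothetical item without data-srcset
]

player_ids = [player_id_from_attrs(attrs) for attrs in img_attrs]
print(player_ids)
```

If a list item has no img tag at all, find('img') returns None and the attribute access fails with AttributeError rather than KeyError, so that case would need its own check.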

GMD · March 25, 2021, 1:29pm

Thanks for the reply. Why does the code work on other pages? There are 30 team pages and it works on 27 of them. This is an object in the list that it does work on:

<li class="col-xs-6"><div class="rank">1</div><div class="prospect-image"><img 
alt="Cristian Pache" class="lazyload mlb-player-image" data-aspectratio="45/68" data- 
srcset="/remote.axd/www.milb.com/images/665506/generic/180x270/665506.jpg? 
preset=teamPlayerCard 1x, 
/remote.axd/www.milb.com/images/665506/generic/180x270/665506.jpg? 
preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a href="#8776"> 
Cristian Pache </a> <span class="prospect-position">OF</span></div></li>

This one doesn’t work; I get the error.

<li class="col-xs-6"><div class="rank">1</div><div class="prospect-image"><img 
alt="Francisco Alvarez" class="lazyload mlb-player-image" data-aspectratio="45/68" data- 
srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard 1x, 
/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a 
href="#104328"> Francisco Alvarez </a> <span class="prospect-position">C</span> 
</div></li>

If I use the code to get the first item, it also works:

rank = first_player.div.text
name = first_player.a.text
position = first_player.span.text
player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]

print(rank, name, position, player_id)

I get:
1 Francisco Alvarez C 682626

Thanks again for helping

Maroloccio · March 25, 2021, 1:54pm

It is likely that the problem lies in other items of that list that do not have the attribute.

These two you pasted seem fine (see attachment).

Check whether all elements (i.e. <li> tags) of the list have the data-srcset="..." attribute.
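
One way to run that check, as a minimal sketch: go through the attribute dict of each item's img tag and report the ones lacking the key. The function name and sample data are made up for illustration; with Beautiful Soup the input could presumably be built from first_player.find('img').attrs for each list item, assuming every item contains an img tag.

```python
def find_items_missing_attr(items_attrs, attr="data-srcset"):
    """Return (index, alt text) for each attribute dict that lacks attr."""
    return [
        (index, attrs.get("alt", "<no alt>"))
        for index, attrs in enumerate(items_attrs)
        if attr not in attrs
    ]


# Hypothetical stand-ins for the img attributes of three list items.
items_attrs = [
    {"alt": "Cristian Pache", "data-srcset": "/remote.axd/www.milb.com/images/665506/x.jpg 1x"},
    {"alt": "Example Player", "data-aspectratio": "45/68"},  # no data-srcset
    {"alt": "Francisco Alvarez", "data-srcset": "/remote.axd/www.milb.com/images/682626/x.jpg 1x"},
]

print(find_items_missing_attr(items_attrs))
```

The index tells you which position in the list to inspect, and the alt text usually identifies the player.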

GMD · March 25, 2021, 5:53pm

Marco, THANK YOU. I have learned something here. It doesn’t necessarily do the operations up until it finds an error; it just errors. There was one element that didn’t have the data-srcset attribute. Because I’m not yet adept enough to write the code to handle errors, I just added a line of code to remove that item from the list and then inserted another one with corrected data in its place. Now I will inspect the other two pages and find which element is causing those to error out.

Thanks again

Maroloccio · March 25, 2021, 6:28pm

Pleasure to help. Do consider the suggestion you got from another user of
wrapping your metadata extraction logic in a try/except block so that you may
process the items you can and disregard those you can’t without halting.

You could also log information about items you could not process to study them
or process them with another strategy later. Useful topics: “Python logging”,
“difference between stdout and stderr”.
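
To make that concrete, here is a small sketch combining try/except with the standard logging module. The logger name, function name, and sample data are hypothetical, and the items are plain dicts standing in for the img attributes. By default logging.basicConfig sends messages to stderr, so they stay separate from anything you print to stdout.

```python
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")
logger = logging.getLogger("prospects")  # hypothetical logger name


def collect_player_ids(items_attrs):
    """Return (player_ids, skipped_indexes); log each item that is skipped."""
    player_ids = []
    skipped = []
    for index, attrs in enumerate(items_attrs):
        try:
            player_id = attrs["data-srcset"].split("/")[4]
        except (KeyError, IndexError) as exc:
            # Record what failed and why, then carry on with the next item.
            logger.warning("item %d (%s) skipped: %r", index, attrs.get("alt"), exc)
            skipped.append(index)
        else:
            player_ids.append(player_id)
    return player_ids, skipped


# Hypothetical sample: the second item has no data-srcset attribute.
sample_items = [
    {"alt": "Francisco Alvarez",
     "data-srcset": "/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg 1x"},
    {"alt": "Example Player"},
]

ids, skipped = collect_player_ids(sample_items)
print(ids, skipped)
```

Returning the skipped indexes alongside the IDs makes it easy to revisit those items later with another strategy.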

GMD · March 25, 2021, 8:08pm

Yes, that definitely is on my agenda. This was my first attempt at getting some baseball data for my personal projects. I plan on doing some more reading and expanding my programs. It is all a learning process.

Thanks again to you and to Erlend!

erlendaasland · March 25, 2021, 8:11pm

Glad to help!