I am quite a newbie (I think that's the term?) when it comes to Python coding and web scraping. In Python, pretty much the only thing I can do is run code in the terminal. I am, however, well versed in statistical packages and languages (SAS, Stata, SQL), though I am not sure how much that helps here. I have been tasked with scraping a website (for example, http://ipr.etsi.org/IPRDetails.aspx?IPRD_ID=5&IPRD_TYPE_ID=2&MODE=2#) and pulling a bunch of information off the "IPR information statement and licensing declaration" and "IPR information statement annex" tabs.
I would love to learn how to do this using Beautiful Soup, which I hear is a great tool for this kind of task. My end goal is to have code that loops through all the URLs on that website, stepping through the pages by changing the "IPRD_ID=5" part of the URL so that it runs from 1 through N. I would like my final output to look like the Excel sheet I have attached here, but I am very comfortable using Stata, SAS, Excel, etc. to reshape the data myself, so if I could get the data out of Beautiful Soup in some form, I could put it into the format you see in the Excel sheet.
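To make the looping idea concrete, here is a rough sketch of how the list of page URLs might be built. It only constructs the URLs and does not download anything. The names `build_url` and `N_PAGES` are made up for illustration, and it assumes that IPRD_TYPE_ID and MODE stay at 2 (as in the example URL) while only IPRD_ID changes, which may not hold for every page.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in the question.
BASE_URL = "http://ipr.etsi.org/IPRDetails.aspx"


def build_url(iprd_id, type_id=2, mode=2):
    """Build the details-page URL for one IPRD_ID (hypothetical helper).

    Assumption: IPRD_TYPE_ID and MODE keep the values seen in the example URL.
    """
    query = urlencode({"IPRD_ID": iprd_id, "IPRD_TYPE_ID": type_id, "MODE": mode})
    return f"{BASE_URL}?{query}"


# Placeholder upper bound; the real N is whatever the last valid IPRD_ID is.
N_PAGES = 3

urls = [build_url(i) for i in range(1, N_PAGES + 1)]
for url in urls:
    print(url)
```

Each URL in `urls` would then be fetched and parsed in turn.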
I imagine it is a lot to ask someone to write out this entire code (again, I have NO idea about BS right now), but I would love it if someone could get me started by showing me how to extract a few of the fields I have listed here; then maybe I could write the rest of the code by mimicking that.
Primarily, I would need to know how to
(1) Extract the headings/fields selected in the Excel sheet from the "IPR information statement and licensing declaration" tab, making sure each one stays connected to its "disclosure number"
(2) Extract a dummy (yes/no, 1/0, etc.) from the selectors indicating whether the organization is the proprietor and/or is prepared to grant a license
(3) Extract only the information regarding the "basis" patent (family members are much less important, and extracting them seems like it would be very difficult), kept separate and marked accordingly for each disclosure number (in this case there are 5, but there could be anywhere from 1 to N disclosure numbers on any individual page)
(4) Loop through the pages where the "IPRD_ID" part of the URL goes from 1 through N
(5) Output all of this to CSV or some other format that I can access
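As a starting point for steps (1), (2) and (5), the general pattern with Beautiful Soup might look something like the sketch below. Everything about the HTML here is invented: the `disclosure` blocks, the `label`/`value` cells, the checkbox ids, the field names and the disclosure number "ISLD-0001" are placeholders, not the real page's structure, so the tag and attribute names would have to be replaced with whatever the page's source actually contains. It also assumes the selectors are plain HTML checkboxes.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented HTML standing in for one page; the real tags, classes and ids will differ.
SAMPLE_HTML = """
<div class="disclosure">
  <span class="disclosure-number">ISLD-0001</span>
  <table>
    <tr><td class="label">Patent title</td><td class="value">Example title</td></tr>
    <tr><td class="label">Country of registration</td><td class="value">US</td></tr>
  </table>
  <input type="checkbox" id="is_proprietor" checked="checked" />
  <input type="checkbox" id="will_license" />
</div>
"""


def parse_disclosures(html):
    """Return one dict per disclosure block (structure is an assumption)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="disclosure"):
        number = block.find("span", class_="disclosure-number").get_text(strip=True)
        row = {"disclosure_number": number}
        # Step (1): label/value pairs, keyed to this disclosure number.
        for tr in block.find_all("tr"):
            label = tr.find("td", class_="label").get_text(strip=True)
            value = tr.find("td", class_="value").get_text(strip=True)
            row[label] = value
        # Step (2): a checkbox becomes a 1/0 dummy depending on its "checked" attribute.
        for box_id in ("is_proprietor", "will_license"):
            box = block.find("input", id=box_id)
            row[box_id] = 1 if box is not None and box.has_attr("checked") else 0
        rows.append(row)
    return rows


def write_csv(rows, fh):
    """Step (5): write the rows to any open text file object as CSV."""
    writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


rows = parse_disclosures(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a file opened with open("out.csv", "w", newline="")
write_csv(rows, buffer)
print(buffer.getvalue())
```

For a real page, the HTML string would come from downloading the URL first (for example with the requests library, `requests.get(url).text`). If the tab contents are filled in by JavaScript after the page loads, the downloaded HTML may not contain them, and a different approach would be needed.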
I would GREATLY appreciate any help and would be forever in your debt! And if anyone needs any econometric help, let me know and this could be symbiotic!
Thanks so much for looking!