First, create a virtual environment for the project:
$ python3 -m venv parsing_venv
$ source ./parsing_venv/bin/activate
A virtual environment is not strictly necessary, but it keeps the project's packages separate from the system Python installation.
Next, install the modules the script depends on:
$ pip install bs4 requests lxml
Now we need to analyze the XML tree served at a given URL. As an example, we use https://www.mapi.gov.il/ProfessionalInfo/Documents/dataGov/CITY.xml:
<esri:Workspace xmlns:esri="http://www.esri.com/schemas/ArcGIS/9.3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<WorkspaceDefinition xsi:type="esri:WorkspaceDefinition">
...
</WorkspaceDefinition>
<WorkspaceData xsi:type="esri:WorkspaceData">
<DatasetData xsi:type="esri:TableData">
<DatasetName>city</DatasetName>
<DatasetType>esriDTFeatureClass</DatasetType>
<Data xsi:type="esri:RecordSet">
...
<Records xsi:type="esri:ArrayOfRecord">
<Record xsi:type="esri:Record">
<Values xsi:type="esri:ArrayOfValue">
<Value xsi:type="xs:int">1</Value>
<Value xsi:type="esri:PointN">
<X>184689.8424</X>
<Y>640598.3157</Y>
</Value>
<Value xsi:type="xs:int">1</Value>
<Value xsi:type="xs:short">862</Value>
<Value xsi:type="xs:string">גני יוחנן</Value>
<Value xsi:type="xs:int">536</Value>
<Value xsi:type="xs:short">31</Value>
<Value xsi:type="xs:string">מושבים (כפרים שיתופיים) (ב)</Value>
<Value xsi:type="xs:string">GANNE YOHANAN</Value>
</Values>
</Record>
...
</Records>
</Data>
</DatasetData>
</WorkspaceData>
...
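Before writing the real parser, it can help to confirm the layout on a small offline sample. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without network access or extra packages, the SAMPLE string is a trimmed copy of the excerpt above, and describe_record is a hypothetical helper name. It shows that the Value elements inside a Record are positional and that the second one holds the X/Y point.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above with a single record (illustrative only).
SAMPLE = """<esri:Workspace xmlns:esri="http://www.esri.com/schemas/ArcGIS/9.3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <WorkspaceData xsi:type="esri:WorkspaceData">
    <DatasetData xsi:type="esri:TableData">
      <Data xsi:type="esri:RecordSet">
        <Records xsi:type="esri:ArrayOfRecord">
          <Record xsi:type="esri:Record">
            <Values xsi:type="esri:ArrayOfValue">
              <Value xsi:type="xs:int">1</Value>
              <Value xsi:type="esri:PointN">
                <X>184689.8424</X>
                <Y>640598.3157</Y>
              </Value>
              <Value xsi:type="xs:int">1</Value>
              <Value xsi:type="xs:short">862</Value>
              <Value xsi:type="xs:string">גני יוחנן</Value>
              <Value xsi:type="xs:int">536</Value>
              <Value xsi:type="xs:short">31</Value>
              <Value xsi:type="xs:string">מושבים (כפרים שיתופיים) (ב)</Value>
              <Value xsi:type="xs:string">GANNE YOHANAN</Value>
            </Values>
          </Record>
        </Records>
      </Data>
    </DatasetData>
  </WorkspaceData>
</esri:Workspace>"""

# ElementTree expands the xsi: prefix into the full namespace URI.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def describe_record(record):
    """Return (position, xsi:type, content) for each direct Value of a Record."""
    rows = []
    for position, value in enumerate(record.find("Values").findall("Value")):
        value_type = value.get(XSI_TYPE)
        if value_type == "esri:PointN":
            # The point value has no text of its own, only X and Y children.
            content = (value.findtext("X"), value.findtext("Y"))
        else:
            content = value.text
        rows.append((position, value_type, content))
    return rows


root = ET.fromstring(SAMPLE)
first_record = next(root.iter("Record"))
layout = describe_record(first_record)
for row in layout:
    print(row)
```

The printed positions are the indexes the parser has to use when it reads each field of a record.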
To get every child of the Records element, we locate the enclosing element by tag and attribute with the parser, then iterate over the Record elements and read each Value by its position.
# import required modules
import bs4 as bs
import requests
def get_parsed_cities():
    # URL of the XML document to parse
    URL = 'https://www.mapi.gov.il/ProfessionalInfo/Documents/dataGov/CITY.xml'
    # download the document; fail early on an HTTP error
    response = requests.get(URL)
    response.raise_for_status()
    # decode as UTF-8 so the Hebrew names are read correctly
    response.encoding = 'utf-8'
    # parse with the XML parser (provided by lxml)
    soup = bs.BeautifulSoup(response.text, features="xml")
    find_table = soup.find('WorkspaceData', {"xsi:type": "esri:WorkspaceData"})
    records = find_table.find_all('Record')
    cities = []
    for record in records:
        # collect the Value elements once; the fields are identified by position
        values = record.find_all('Value')
        city = {
            "record_id": values[0].text,
            "X": values[1].find('X').text,
            "Y": values[1].find('Y').text,
            "record_id_2": values[2].text,
            "city_id": values[3].text,
            "city_name_heb": values[4].text,
            "secondary_id": values[5].text,
            "city_type_id": values[6].text,
            "city_type_name": values[7].text,
            "city_name_eng": values[8].text,
        }
        cities.append(city)
    return cities
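Every value returned by get_parsed_cities() is a string, because it is read from the element text. A minimal sketch of converting the numeric fields is shown below. It assumes the dictionary keys used above, treats an empty value as None, and runs on a hand-written sample dictionary copied from the record in the XML excerpt instead of calling the network. The names normalize_city and to_number are hypothetical helpers, not part of any library.

```python
def to_number(text, convert):
    """Convert text with the given function, or return None if it is empty."""
    text = (text or "").strip()
    return convert(text) if text else None


def normalize_city(city):
    """Return a copy of one parsed city with numeric fields converted."""
    int_fields = ("record_id", "record_id_2", "city_id",
                  "secondary_id", "city_type_id")
    normalized = dict(city)  # copy, so the parsed data is left untouched
    for field in int_fields:
        normalized[field] = to_number(city[field], int)
    for field in ("X", "Y"):
        normalized[field] = to_number(city[field], float)
    return normalized


# Sample shaped like one element of the list returned by get_parsed_cities().
sample_city = {
    "record_id": "1", "X": "184689.8424", "Y": "640598.3157",
    "record_id_2": "1", "city_id": "862", "city_name_heb": "גני יוחנן",
    "secondary_id": "536", "city_type_id": "31",
    "city_type_name": "מושבים (כפרים שיתופיים) (ב)",
    "city_name_eng": "GANNE YOHANAN",
}

normalized_city = normalize_city(sample_city)
print(normalized_city["city_name_eng"], normalized_city["X"], normalized_city["Y"])
```

With the real data, the same helper could be applied to the whole result, for example [normalize_city(c) for c in get_parsed_cities()].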