
Geographic Data Scrapper (1114341)

$100-500 USD

Cancelled
Posted about 15 years ago


Paid on delivery
## Deliverables

# Scope

Create a command-line Python program capable of scraping place information from the 'Satellite + old places' map type on the Wikimapia Beta website - <[login to view URL]> - given a bounding box. The bounding box is defined by a pair of coordinates - latitude and longitude (decimal degrees) in the WGS84 coordinate system - in the following format: (minimum latitude, minimum longitude), (maximum latitude, maximum longitude).

# Required Knowledge

Python - good OO design and memory management skills; experience with Beautiful Soup (or an equivalent) is recommended. Some experience with the Google Maps API might be useful.

# Specifications

* Target Operating Systems - Windows XP, Debian, Ubuntu
* Language - Python 2.5(+)
* Data Output Format - TSV, UTF-8
* Geometries Format - Well-Known Text (WKT) strings (see <[login to view URL]>)
* Coordinate System - latitude and longitude decimal degrees on WGS84

# Deliverables

(See also 'Project Milestones' below.)

1. A Python script that fetches Wikimapia data for places in a given geographical area defined by a bounding box;
2. Comprehensive documentation - user manual, setup instructions and commented code;
3. Installer scripts for Windows XP, Debian and Ubuntu - listing any external dependencies and their setup procedures.

# Requirements

**Small Memory/Disk Usage Footprint** - the program has to use memory and disk space efficiently, with built-in housekeeping procedures so that it neither leaves temporary files behind nor consumes large chunks of memory unnecessarily.
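The bounding-box argument described above can be validated up front. A minimal parsing sketch, written in modern Python for illustration (the posting itself targets Python 2.5+); the function name and error messages are hypothetical:

```python
import re

def parse_bbox(text):
    """Parse '(min_lat, min_lon), (max_lat, max_lon)' into a 4-tuple of floats.

    Hypothetical helper for the bounding-box format in the spec;
    coordinates are WGS84 decimal degrees.
    """
    numbers = [float(n) for n in re.findall(r'-?\d+(?:\.\d+)?', text)]
    if len(numbers) != 4:
        raise ValueError('expected 4 coordinates, got %d' % len(numbers))
    min_lat, min_lon, max_lat, max_lon = numbers
    # The spec orders each pair as (latitude, longitude), minimum pair first.
    if not (-90.0 <= min_lat <= max_lat <= 90.0):
        raise ValueError('invalid latitude range')
    if not (-180.0 <= min_lon <= max_lon <= 180.0):
        raise ValueError('invalid longitude range')
    return (min_lat, min_lon, max_lat, max_lon)
```

Rejecting malformed boxes before any network traffic also supports the housekeeping requirement above: no partial output files are created for an invalid task.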
**No Wikimapia DOS** - the program has to insert random time intervals between requests to the Wikimapia website and/or take other measures to avoid over-stressing Wikimapia's resources.

**Completeness** - the program has to account for the complete set of places existing in the given bounding box. The places retrieval mechanism has to be aware of the contents of the different map levels - not all places, if any, appear at every map level - and has to be able to record information about every place present in the bounding box once (and only once).

**Tasks Script File** - the program has to be able to sub-divide a task into smaller tasks - e.g. by sub-dividing the original bounding box into smaller bounding boxes - generating a tasks script. In order to distribute a task across several machines, the program has to be able to interpret this tasks script - or a subset of it - and to process the sequence of tasks it describes. The tasks script can be passed as an argument - the path to a text file - to the command-line program and, when present, replaces the bounding-box argument. The aggregation of results from the processing of several subsets of a tasks script by distinct program copies has to be equal to the processing of the complete tasks script by a single copy of the program.

**Log File** - the program has to be able to record (with timestamps) its steps, warnings and errors in order to guarantee that a task can be restarted from a specific point.

**Data to Scrape** - the place information to extract from Wikimapia is as follows:

1. Label - the map place tooltip (equivalent to the Google Maps API GMarker title);
2. Outline or Envelope - the polygon that defines the boundaries of the place (note: 'old places' have envelopes and other places have outlines, but both are polygons);
3. Centroid - the coordinates in the top right corner of the info window, converted to decimal degrees;
4. Categories - the text after "Category: " in the info window;
5. Description - the description in the info window;
6. Permalink - the permalink URL in the info window;
7. Languages - the language acronym in the bottom right corner of the info window;
8. Last Edit Date - converted to year/month/day format from the text after "Edited: " in the bottom left corner of the info window.

**Output Format** - the collected data is to be exported to a UTF-8 tab-separated values file with 8 fields:

1. "label" - text;
2. "envelope" - WKT polygon string;
3. "centroid" - WKT point string;
4. "categories" - text; if multiple categories exist, separate them with semicolons;
5. "description" - text;
6. "permalink" - text;
7. "languages" - text; if multiple languages exist, separate them with semicolons;
8. "last_edit_date" - number, format 'yyyymmdd'.

# Project Milestones

If the developer agrees, partial payment will be processed on delivery and acceptance of the following working scripts:

1. **[40%]** Create a program that, given a bounding box defined by a pair of coordinates:
   * Retrieves the above-mentioned 'data to scrape' for the places present in the _highest level_ that encompasses the bounding box, and;
   * Produces a UTF-8 tab-separated values file with the above-mentioned 'output format' and fills it with the scraped data.
2. **[20%]** Create an evolution of the previous program that:
   * Retrieves the above-mentioned 'data to scrape' for all the places present in _every level_ that encompasses the bounding box;
   * Registers, if requested, the steps, warnings and errors of the previous task in a 'log file' - one record per line, with a timestamp, and;
   * Produces a UTF-8 tab-separated values file with the above-mentioned 'output format', with one (and only one) record (line) of scraped data per place.
3. **[40%]** Create the final version of the program, which is able to:
   * Generate a 'tasks script file' from a given bounding box, describing the data scraping process in modular (atomic) steps in such a way that subsets (lines) of that 'tasks script file' may be processed by independent machines (using the same final version of the program);
   * Retrieve the above-mentioned 'data to scrape' for all the places present in _every level_ that encompasses a bounding box or the corresponding 'tasks script file' (or a subset of it);
   * Register, if requested, the steps, warnings and errors of the previous task in a 'log file' - one record per line, with a timestamp;
   * Collate scraped data resulting from the processing of different but related subsets of a 'tasks script file', and;
   * Produce a UTF-8 tab-separated values file with the above-mentioned 'output format', with each place that exists in the bounding box - or the equivalent 'tasks script file' - recorded once (and only once).
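The 'tasks script file' requirement amounts to tiling the original bounding box into a grid of sub-boxes, one task per line. A rough sketch of that subdivision, assuming a simple tab-separated task-line format (the grid size, line format and function names are all hypothetical, not part of the spec):

```python
def subdivide(bbox, rows, cols):
    """Split (min_lat, min_lon, max_lat, max_lon) into rows*cols sub-boxes.

    Each sub-box becomes one atomic task, so independent machines can
    each process a subset of the resulting tasks script.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    dlat = (max_lat - min_lat) / rows
    dlon = (max_lon - min_lon) / cols
    tasks = []
    for r in range(rows):
        for c in range(cols):
            tasks.append((min_lat + r * dlat, min_lon + c * dlon,
                          min_lat + (r + 1) * dlat, min_lon + (c + 1) * dlon))
    return tasks

def write_tasks_script(tasks, path):
    # One task per line: min_lat<TAB>min_lon<TAB>max_lat<TAB>max_lon.
    # Processing all lines must give the same result as the original box.
    with open(path, 'w', encoding='utf-8') as f:
        for t in tasks:
            f.write('\t'.join('%.6f' % v for v in t) + '\n')
```

Because sub-boxes share edges, places straddling a boundary may be fetched by two workers; the collation step therefore still has to deduplicate (e.g. on permalink) to satisfy the once-and-only-once requirement.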
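The output format combines WKT geometry strings with a UTF-8 TSV row of the 8 fields listed above. A sketch of that formatting step (the `place` dict shape is an assumption; note that WKT uses longitude-latitude axis order, the reverse of the spec's coordinate pairs):

```python
def wkt_point(lat, lon):
    # WKT coordinates are (x y), i.e. (longitude latitude).
    return 'POINT (%.6f %.6f)' % (lon, lat)

def wkt_polygon(ring):
    # ring: list of (lat, lon) pairs; a WKT polygon ring must be closed.
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    coords = ', '.join('%.6f %.6f' % (lon, lat) for lat, lon in ring)
    return 'POLYGON ((%s))' % coords

def tsv_row(place):
    # place: dict keyed by the 8 field names from the spec; multi-valued
    # fields (categories, languages) are pre-joined with semicolons.
    # A real implementation must also strip embedded tabs/newlines.
    fields = ['label', 'envelope', 'centroid', 'categories',
              'description', 'permalink', 'languages', 'last_edit_date']
    return '\t'.join(str(place.get(f, '')) for f in fields)
```

Writing the rows with an explicit UTF-8 encoding keeps the output portable across the Windows XP, Debian and Ubuntu targets.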
Project ID: 3668716

About the project

Remote project
Latest activity 15 years ago


About the client

Dubai, United Arab Emirates
5.0 (2 reviews)
Member since Feb 27, 2009

Client verification

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)