What is the role of a data scientist?
A data scientist combines three skill sets. In programming, a data scientist should be able to script, query, and store data. In domain knowledge, a data scientist should be familiar with the relevant industry, science, or finance. In communication, a data scientist should be able to present results through visualization, tabulation, and narrative.
What Is Data Wrangling?
Data wrangling is one of the most important steps in the data science process. The quality of a data analysis is only as good as the quality of the data itself; the foundation of good data science is good data.
Getting there requires gathering, filtering, converting, exploring, and integrating data. This is data wrangling, and this five-step process is our data wrangling pipeline: gather, filter, convert, explore, and integrate. The first step, gathering data, covers extracting, parsing, and scraping; the data may arrive in small pieces that are eventually appended together into a larger dataset. The next step is filtering, or scrubbing, the data.
How to Convert CSV to JSON?
Converting from one data format to another comes up again and again in data wrangling, and in data processing generally. In a given toolchain, the output of one program may be incompatible with the input of another, so we need an intermediary step to convert the data to the correct format.
import agate

# read the CSV, then write the same table back out as JSON
table = agate.Table.from_csv('example.csv')
table.to_json('example.json')
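If agate is not available, the same conversion can be sketched with only the standard library. The inline sample below is a hypothetical stand-in for the contents of example.csv:

```python
import csv
import io
import json

# hypothetical stand-in for the contents of example.csv
csv_text = "id,country\n1,Finland\n2,Norway\n"

# read the CSV into a list of dictionaries, one per row
rows = list(csv.DictReader(io.StringIO(csv_text)))

# serialize the rows as a JSON array of objects
json_text = json.dumps(rows, indent=2)
print(json_text)
```

Note that csv.DictReader yields every value as a string; converting numeric columns to int or float is a separate step if the downstream tool needs typed JSON.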
How to Convert XML to JSON?
Sometimes a data wrangling task requires converting from XML to JSON. I will demonstrate this using the Python library xmltodict, which takes XML as input and converts it to a Python dictionary; the dictionary can then be serialized to JSON with the standard json module. The xmltodict library can be installed with pip.
import json
import xmltodict

# parse the XML file into a Python dictionary
with open('example.xml') as source:
    parsed_xml = xmltodict.parse(source.read())

# serialize the dictionary to pretty-printed JSON
with open('example.json', 'w') as output:
    json.dump(parsed_xml, output, indent=2, ensure_ascii=False)
How to Convert Dates?
from datetime import datetime

# 1374712329288 is a millisecond timestamp; divide by 1000 before
# passing it to fromtimestamp(), which expects seconds
dt1s = datetime.fromtimestamp(1374712329288 / 1000).strftime('%Y-%m-%d')
print(dt1s)

# parse a day-first date string and print it in ISO 8601 format
dt2 = datetime.strptime("21/12/17 06:30", "%d/%m/%y %H:%M").isoformat()
print(dt2)

# parse a month-first date string
dt3 = datetime.strptime("11/24/17 16:30", "%m/%d/%y %H:%M").isoformat()
print(dt3)

# convert a Unix timestamp in seconds to an ISO 8601 string
timestamp = 122566527167.595983
dt4 = datetime.fromtimestamp(timestamp).isoformat()
print(dt4)
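A common pitfall with Unix timestamps is confusing seconds with milliseconds: a thirteen-digit value such as 1374712329288 is almost certainly milliseconds and must be divided by 1000 first. A quick sketch, pinned to UTC so the result does not depend on the local timezone:

```python
from datetime import datetime, timezone

ms = 1374712329288  # thirteen digits: a millisecond timestamp
dt = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # a date in July 2013
```

Passing the raw millisecond value directly would put the date tens of thousands of years in the future and raise an error on most platforms.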
How will you merge two HTML tables?
table1.html

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Country</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="td1">1</td>
      <td class="td2">Finland</td>
    </tr>
    <tr>
      <td class="td1">2</td>
      <td class="td2">Norway</td>
    </tr>
    <tr>
      <td class="td1">3</td>
      <td class="td2">Denmark</td>
    </tr>
    <tr>
      <td class="td1">4</td>
      <td class="td2">Sweden</td>
    </tr>
    <tr>
      <td class="td1">5</td>
      <td class="td2">Switzerland</td>
    </tr>
  </tbody>
</table>

table2.html

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Country</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="td1">6</td>
      <td class="td2">Canada</td>
    </tr>
    <tr>
      <td class="td1">7</td>
      <td class="td2">Iceland</td>
    </tr>
    <tr>
      <td class="td1">8</td>
      <td class="td2">Ireland</td>
    </tr>
    <tr>
      <td class="td1">9</td>
      <td class="td2">Belgium</td>
    </tr>
    <tr>
      <td class="td1">10</td>
      <td class="td2">Germany</td>
    </tr>
  </tbody>
</table>

Merging with BeautifulSoup:

from bs4 import BeautifulSoup

with open('table1.html', 'r') as t1:
    table1 = t1.read()
with open('table2.html', 'r') as t2:
    table2 = t2.read()

soup1 = BeautifulSoup(table1, 'lxml')
soup2 = BeautifulSoup(table2, 'lxml')

# append every row from the second table's body onto the first table's body
tbody_dest = soup1.find('table').find('tbody')
tr_source = soup2.find('table').find('tbody').find_all('tr')
for trs in tr_source:
    tbody_dest.append(trs)

print(tbody_dest)
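When the markup is well-formed, the same row merge can be done with only the standard library's xml.etree.ElementTree, with no third-party parser. The inline snippets below are hypothetical, shortened stand-ins for the two files:

```python
import xml.etree.ElementTree as ET

# hypothetical, shortened stand-ins for table1.html and table2.html
table1 = ("<table><tbody>"
          "<tr><td>1</td><td>Finland</td></tr>"
          "<tr><td>2</td><td>Norway</td></tr>"
          "</tbody></table>")
table2 = ("<table><tbody>"
          "<tr><td>3</td><td>Canada</td></tr>"
          "</tbody></table>")

root1 = ET.fromstring(table1)
root2 = ET.fromstring(table2)

# append every row from the second table's body to the first
dest = root1.find('tbody')
for tr in root2.find('tbody').findall('tr'):
    dest.append(tr)

print(len(dest.findall('tr')))  # 3 rows after the merge
```

Unlike BeautifulSoup, ElementTree rejects unbalanced tags or unquoted attributes, so this approach only suits HTML that is also valid XML.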
How will you Normalize Data once you receive data?
Normalizing data means transforming a wide table, where each attribute has its own column, into a long key-value format. I'll demonstrate normalizing a CSV table using Python's agate library. The data to normalize is in names.csv, with columns ID, name, birth_year, education, and salary. The normalize function keeps ID as the key column and turns the remaining columns into property-value combinations: name, birth_year, education, and salary each become a property, with their values following on each individual row.
import agate

# keep ID as the key column; turn the other columns into property/value rows
table = agate.Table.from_csv('names.csv')
norm = table.normalize('ID', ['name', 'birth_year', 'education', 'salary'])
norm.to_csv('normalized_names.csv')
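The wide-to-long transformation that normalize performs can be illustrated with plain dictionaries. The two sample rows below are hypothetical, following the names.csv layout described above:

```python
import csv
import io

# hypothetical sample in the wide layout: one column per attribute
wide_csv = "ID,name,birth_year\n1,Alice,1990\n2,Bob,1985\n"
rows = list(csv.DictReader(io.StringIO(wide_csv)))

# one output row per (ID, property) pair: the long, key-value layout
long_rows = [
    {'ID': row['ID'], 'property': key, 'value': row[key]}
    for row in rows
    for key in ('name', 'birth_year')
]
print(long_rows)
```

Each input row expands into one output row per property column, which is why a normalized table is taller and narrower than the original.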
How will you Denormalize Data?
import agate

# rebuild the wide table: each distinct property becomes a column again
table = agate.Table.from_csv('normalized_names.csv')
denorm = table.denormalize('ID', 'property', 'value')
denorm.to_csv('names.csv')
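The inverse, long-to-wide, can likewise be sketched with a plain dictionary keyed on ID. The input rows below are hypothetical, in the layout the normalize step produces:

```python
# hypothetical long-format rows, as produced by the normalize step
long_rows = [
    {'ID': '1', 'property': 'name', 'value': 'Alice'},
    {'ID': '1', 'property': 'birth_year', 'value': '1990'},
    {'ID': '2', 'property': 'name', 'value': 'Bob'},
]

# group rows by ID, turning each property back into a column
wide = {}
for row in long_rows:
    wide.setdefault(row['ID'], {'ID': row['ID']})[row['property']] = row['value']
print(list(wide.values()))
```

Note that IDs missing some properties (like ID 2 above) simply end up with fewer keys; a real denormalize would fill those gaps with a null value.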
How will you draw Pivot Data Tables?
Pivoting aggregates data by grouping on one or more columns and computing a summary value for each group (by default, a count). It is a quick way to produce descriptive statistics.
import agate

# count the rows for each distinct value in the 'test' column
table = agate.Table.from_csv('exam.csv')
ptable = table.pivot('test')
ptable.to_csv('ptable.csv')
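The default count aggregation that pivot performs is equivalent to tallying the distinct values of the grouped column. A stdlib sketch with hypothetical exam results:

```python
from collections import Counter

# hypothetical values from a 'test' column
results = ['pass', 'fail', 'pass', 'pass', 'fail']

# count rows per distinct value, as pivot('test') would by default
counts = Counter(results)
print(counts['pass'], counts['fail'])  # 3 2
```

agate's pivot also accepts an aggregation argument for summaries other than a count, such as a sum or mean over another column.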