1) Describe how to implement the following queries in MapReduce:
SELECT [login to view URL], [login to view URL], [login to view URL], [login to view URL], [login to view URL]
FROM Employee as emp, Agent as a
WHERE [login to view URL] = [login to view URL] AND [login to view URL] = [login to view URL];
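The first query is a standard reduce-side equi-join. A minimal in-memory sketch of the pattern (the real column names are redacted in the posting, so agent_id, emp_name, and agent_name below are my own placeholder assumptions):

```python
# Reduce-side join sketch, Hadoop Streaming style, simulated in memory.
# Column names (agent_id, emp_name, agent_name) are hypothetical stand-ins
# for the redacted identifiers in the query.
from collections import defaultdict

def map_employee(rec):
    # Tag each Employee record with its source relation; emit the join key.
    yield rec["agent_id"], ("EMP", rec["emp_name"])

def map_agent(rec):
    yield rec["agent_id"], ("AGT", rec["agent_name"])

def shuffle(pairs):
    # Group map output by key, as the MapReduce framework does between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_join(key, values):
    # Emit the cross-product of the two relations' records sharing this key.
    emps = [v for tag, v in values if tag == "EMP"]
    agts = [v for tag, v in values if tag == "AGT"]
    for e in emps:
        for a in agts:
            yield e, a

employees = [{"agent_id": 1, "emp_name": "Ann"},
             {"agent_id": 2, "emp_name": "Bob"}]
agents = [{"agent_id": 1, "agent_name": "Zed"}]

mapped = [kv for r in employees for kv in map_employee(r)]
mapped += [kv for r in agents for kv in map_agent(r)]
result = [out for k, vs in shuffle(mapped).items() for out in reduce_join(k, vs)]
# Only Ann has a matching agent, so result == [("Ann", "Zed")]
```

The key idea: both mappers emit the join attribute as the key, so matching records from the two tables meet at the same reducer, which pairs them up.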
SELECT lo_quantity, COUNT(lo_extendedprice)
FROM lineorder, dwdate
WHERE lo_orderdate = d_datekey
AND d_yearmonth = 'Feb1995'
AND lo_discount = 6
GROUP BY lo_quantity;
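The second query chains two MapReduce jobs: job 1 performs the reduce-side join with both WHERE filters pushed into the mappers, and job 2 groups the joined rows by lo_quantity and counts them. A sketch under those assumptions (COUNT(lo_extendedprice) is treated as counting joined rows, i.e. assuming no NULLs):

```python
# Two chained MapReduce jobs, simulated in memory.
from collections import defaultdict

def map_lineorder(row):
    # Push the lo_discount = 6 filter into the map phase.
    if row["lo_discount"] == 6:
        yield row["lo_orderdate"], ("LO", row["lo_quantity"])

def map_dwdate(row):
    # Push the d_yearmonth = 'Feb1995' filter into the map phase.
    if row["d_yearmonth"] == "Feb1995":
        yield row["d_datekey"], ("DD", None)

def reduce_join(key, values):
    # Each surviving lineorder row joins once per matching dwdate row.
    n_dates = sum(1 for tag, _ in values if tag == "DD")
    for tag, qty in values:
        if tag == "LO":
            for _ in range(n_dates):
                yield qty, 1

def reduce_count(qty, ones):
    # Job 2: GROUP BY lo_quantity, counting the joined rows.
    yield qty, sum(ones)

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

lineorder = [{"lo_orderdate": 19950201, "lo_discount": 6, "lo_quantity": 10},
             {"lo_orderdate": 19950202, "lo_discount": 6, "lo_quantity": 10},
             {"lo_orderdate": 19950201, "lo_discount": 4, "lo_quantity": 20}]
dwdate = [{"d_datekey": 19950201, "d_yearmonth": "Feb1995"},
          {"d_datekey": 19950202, "d_yearmonth": "Feb1995"}]

job1 = [kv for r in lineorder for kv in map_lineorder(r)]
job1 += [kv for r in dwdate for kv in map_dwdate(r)]
joined = [kv for k, vs in shuffle(job1).items() for kv in reduce_join(k, vs)]
counts = dict(kv for k, vs in shuffle(joined).items() for kv in reduce_count(k, vs))
# counts == {10: 2}: two discount-6 rows survive the filters and join
```

Filtering in the mappers is the important optimization here: it shrinks the data shuffled to the join reducers.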
SELECT d_month, AVG(d_year)
FROM dwdate
GROUP BY d_month
ORDER BY AVG(d_year)
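The third query needs no join: one MapReduce job keys on d_month and averages d_year per group, and since the output is at most twelve rows, the ORDER BY can be applied to the small result (e.g. in a single final reducer or a trivial follow-up job). A sketch:

```python
# GROUP BY d_month with AVG(d_year), then ORDER BY the aggregate,
# simulated in memory.
from collections import defaultdict

def map_dwdate(row):
    # Key on d_month so all years for a month meet at one reducer.
    yield row["d_month"], row["d_year"]

def reduce_avg(month, years):
    yield month, sum(years) / len(years)

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

dwdate = [{"d_month": "Jan", "d_year": 1994},
          {"d_month": "Jan", "d_year": 1996},
          {"d_month": "Feb", "d_year": 1994}]

mapped = [kv for r in dwdate for kv in map_dwdate(r)]
averages = [kv for k, vs in shuffle(mapped).items() for kv in reduce_avg(k, vs)]
ordered = sorted(averages, key=lambda kv: kv[1])   # ORDER BY AVG(d_year)
# ordered == [("Feb", 1994.0), ("Jan", 1995.0)]
```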
Consider a Hadoop job that processes an input data file of size equal to 179 disk blocks (179 distinct blocks, not counting HDFS replication). The mapper in this job requires 1 minute to read and fully process a single block of data. The reducer requires 1 second (not a minute) to produce an answer for one key's worth of values, and there are 3000 distinct keys in total (mappers generate many more key-value pairs, but the keys fall only in the 1-3000 range, for 3000 unique entries). Assume that each node has a reducer and that the keys are distributed evenly.
The total cost will consist of time to perform the Map phase plus the cost to perform the Reduce phase.
How long will it take to complete the job if you only had one Hadoop worker node? For simplicity, assume that only one mapper and only one reducer are created on every node.
30 Hadoop worker nodes?
60 Hadoop worker nodes?
100 Hadoop worker nodes?
Would changing the replication factor have any effect on your answers for a-d?
You can ignore the network transfer costs as well as the possibility of node failure.
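Under the stated assumptions (one mapper and one reducer per node, even key distribution, no network or failure costs), one simple cost model is: the map phase runs in waves of ceil(179 / N) blocks at 1 minute each, and the reduce phase takes ceil(3000 / N) keys at 1 second each. A sketch of that arithmetic:

```python
# Simple runtime model for the job described above; all figures come from
# the problem statement (179 blocks, 1 min/block, 3000 keys, 1 s/key).
import math

BLOCKS, MAP_SEC = 179, 60      # 179 blocks, 1 minute per block
KEYS, REDUCE_SEC = 3000, 1     # 3000 keys, 1 second per key

def job_seconds(nodes):
    map_time = math.ceil(BLOCKS / nodes) * MAP_SEC      # map waves
    reduce_time = math.ceil(KEYS / nodes) * REDUCE_SEC  # keys per reducer
    return map_time + reduce_time

for n in (1, 30, 60, 100):
    print(n, job_seconds(n))
# 1 node:    179*60 + 3000 = 13740 s (229 min)
# 30 nodes:  6*60 + 100    = 460 s
# 60 nodes:  3*60 + 50     = 230 s
# 100 nodes: 2*60 + 30     = 150 s
```

Under this model the replication factor does not change the answers: each distinct block is still processed exactly once, and network transfer costs are ignored by assumption.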
Suppose you have an 8-node cluster with replication factor of 3. Describe what MapReduce has to do after it determines that a node has crashed while a job is being processed. For simplicity, assume that the failed node is not replaced and your cluster is reduced to 7 nodes. Specifically:
What does HDFS (the storage layer/NameNode) have to do in response to node failure in this case? I.e., what is the guarantee that HDFS has to maintain?
What does MapReduce engine (the execution layer) have to do to respond to the node failure? Assume that there was a job in progress at the time of the crash (because MapReduce engine only needs to take action if a job was in progress).
Where does the Mapper store output key-value pairs before they are sent to Reducers?
Can Reducers begin processing before the Mapper phase is complete? Why or why not?
Repeat the RSA computation examples by
a) Select two (small) primes and generate a public-private key pair.
b) Compute a sample ciphertext using your public key
c) Decrypt your ciphertext from 4-b using the private key
d) Why can’t the encrypted message sent through this mechanism be larger than the value of n?
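One possible worked example for parts a-c, using tiny primes of my own choosing (p = 5, q = 11; these values are not from the assignment):

```python
# RSA with small primes; the specific numbers here are illustrative picks.
p, q = 5, 11
n = p * q                 # modulus n = 55
phi = (p - 1) * (q - 1)   # phi(n) = 40

e = 3                     # public exponent, valid since gcd(3, 40) = 1
d = pow(e, -1, phi)       # private exponent d = 27, since 3*27 = 81 ≡ 1 (mod 40)

m = 9                     # plaintext message, must satisfy 0 <= m < n
c = pow(m, e, n)          # encrypt: 9**3 mod 55 = 14
m2 = pow(c, d, n)         # decrypt: 14**27 mod 55 = 9, recovering m
assert m2 == m
```

For part d: all RSA arithmetic is done modulo n, so a message m >= n is indistinguishable from m mod n after encryption, and decryption can only recover the residue, not the original message.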
I will write mapreduce code for hadoop and spark
I will provide services like:
Hadoop cluster setup
Hadoop map-reduce programming
Apache spark cluster setup
Apache spark map-reduce programming
AWS EMR cluster setup & map-reduce programming