
Written By Atticus Kuhn
:PROPERTIES:
:ID: 784f1c66-61e4-42bd-9fa6-a59c3b0a801e
:mtime: 20231026120305 20231025035705 20231024040142 20231023003213
:ctime: 20231023003211
:END:
#+title: databases supervision sheet 1
#+filetags: :public:project:

* Problem 1
** Question
A relational implementation of an Entity-Relationship (ER) model typically attempts to avoid redundancy. But what does this mean exactly? Is redundancy the same thing as duplication of data values?
** Solution
Data in a database is *redundant* if it could be deleted and then reconstructed from the other data remaining in the database. Duplication of data values is therefore a special case of redundancy: if data values are duplicated identically, the duplicates are clearly redundant. However, data can be redundant without identical duplication, as long as the deleted data could still be derived from what remains (for example, a stored order total that equals the sum of the order's line items).
* Problem 2
** Question
Construct an Entity-Relationship model for the following scenario. Suppose we are conducting several experiments. Each experiment has a name and a description (text). Each experiment can be associated with many runs. Each run is associated with some input parameters and some output values.

For intuition, consider the following example. Suppose our experiments use simulation to explore the behavior of a distributed algorithm where nodes exchange messages with their neighbours and eventually compute some result. Each experiment might be associated with a different network topology, such as a linear sequence of n nodes, a ring of n nodes, an n by m grid of nodes, a clique of n nodes, a binary tree of nodes with depth n, and so on. The input parameters for each run might be a seed for a random number generator and one or more size-related parameters (depending on the experiment). The output of any run might be the total number of messages exchanged by the distributed algorithm and the time needed for termination.
Each run of an experiment will update our database, and our database should be able to support SQL queries that summarise the results.

Yes, the specification is somewhat vague. Intentionally so! You may find that you are forced to make some simplifying assumptions in order to make progress.
** Solution
#+BEGIN_SRC dot :file experiments.png :cmdline -Kdot -Tpng :exports both
graph ER {
  layout=neato
  overlap="scalexy"
  // entities
  node [shape=box, style=filled, color=lightblue];
  experiment; run; input_parameters; output_parameters; topology;
  // attributes
  node [shape=ellipse, style=filled, color=yellow];
  name; description; seed; size; count_messages; termination_time;
  // relationships
  node [shape=diamond, style=filled, color=red];
  run_experiment; inputs; outputs; experiment_topology;
  experiment -- name;
  experiment -- description;
  experiment -- run_experiment;
  experiment -- experiment_topology;
  experiment_topology -- topology;
  run_experiment -- run;
  run -- inputs;
  inputs -- input_parameters;
  run -- outputs;
  outputs -- output_parameters;
  input_parameters -- size;
  input_parameters -- seed;
  output_parameters -- count_messages;
  output_parameters -- termination_time;
}
#+END_SRC

#+RESULTS:
[[file:experiments.png]]

* Problem 3
** Question
Discuss possible relational implementations of your model from above.
** Solution
#+BEGIN_SRC sql
CREATE SCHEMA Experiments;

CREATE TABLE Experiment (
  ExperimentID int PRIMARY KEY,
  Name varchar(255),
  Description text
);
CREATE TABLE InputParams (
  InputID int PRIMARY KEY,
  Seed int,
  Size int
);
CREATE TABLE OutputParams (
  OutputID int PRIMARY KEY,
  TerminationTime interval,
  CountMessages int
);
CREATE TABLE Run (
  RunID int PRIMARY KEY,
  ExperimentID int REFERENCES Experiment(ExperimentID),
  InputID int REFERENCES InputParams(InputID),
  OutputID int REFERENCES OutputParams(OutputID)
);
#+END_SRC

* Problem 4
** Question
Suppose we have an experiment called "grid" that has input parameters "random seed", "grid width", and "grid height" with output parameters "message count" and "run time". For fixed values of "grid width" and "grid height" we have run thousands of experiments, just varying the "random seed" value.
Using your implementation from Q3, we now want to write an SQL query that groups all runs of "grid" by "grid width" and "grid height" and returns for each group the average of each of the outputs. Recall that the average is computed by the aggregate function AVG.
** Solution
Assuming the grid experiment's input and output tables expose columns GridWidth, GridHeight, MessageCount, and RunTime:
#+BEGIN_SRC sql
SELECT GridWidth, GridHeight, AVG(MessageCount), AVG(RunTime)
FROM Experiment
JOIN Run ON Run.ExperimentID = Experiment.ExperimentID
JOIN InputParams ON InputParams.InputID = Run.InputID
JOIN OutputParams ON OutputParams.OutputID = Run.OutputID
WHERE Experiment.Name = 'grid'
GROUP BY GridWidth, GridHeight;
#+END_SRC

* Problem 5
** Question
If you still have time, please have a look at the Wide-angle Question Sheet.
** Solution
https://www.cl.cam.ac.uk/teaching/2223/Databases/wide-range-and-tripos-1977/wide-range.html

I did not finish all the questions.
*** WRQ1. What is non-volatile storage? Why do most databases use it? When might it not be used for a database?
Non-volatile storage is memory that persists after a computer powers down. This is useful for a database because you do not want to lose all your data on shutdown. It might not be used when the data is disposable or easily regenerated, for example an in-memory cache.
*** WRQ2. Why is DBMS support useful where times and dates are stored?
There are several queries you might want to run that involve time:
- Sort entities by time (newest to oldest)
- Group together entities created at the same time
- Filter entities created before a certain cutoff time
*** WRQ3. Give an example of a many-to-many relationship that occurs in the real world (and is not an example straight from the lectures).
Teacher-student: a teacher may teach many students, and a student may be taught by many teachers.
*** WRQ4. Is it a good idea to use somebody's full name as a database key? If so, might it be best to hold it as one field (attribute) or should separate parts of the name be stored in different fields?
It is not a good idea, because two people may share the same name.
If you were to store a name, it should be stored in a structured way, and I would have multiple subtypes of Name, such as EuropeanName, RussianName, and JavaneseName. (The reason is that Javanese people may have only a single name, with no surname or family name.)
*** WRQ5. Human-perceivable colours can be denoted in many forms using words and various numeric colour spaces. Is it a good idea to use a general-purpose database to map from one form to another?
Probably not, considering that there are 16,777,216 colours representable in 24-bit hex notation alone. It would be smarter to write a conversion function in code rather than build a database: a function can encode the same mapping in far less space.
*** WRQ6. Computer power has greatly increased over the decades. Estimate the amount of data the Driver and Vehicle Licensing Agency in Swansea UK needs to hold. Estimate the number of queries and mutations of its data that happen per minute. Can the computer in your pocket handle this workload? Does the DVLA still require a 16-story building to do its data processing?
*** WRQ7. The databases we consider in this course largely use discrete data and identity (equality) for key matching. How different does a geo-spatial database need to be, where much of the data stored is polygons (representing plots of land, pipelines, counties etc.) and the queries involve more than identity?
There are two main difficulties:
1) You need some way of storing polygons or spatial regions
2) You need some way of computing the intersection of regions
*** WRQ8. Have a look at the syllabus for the Part Ia ML-&-Real-World-Data course. What key words or phrases might you use to distinguish the topics?
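As a sanity check on the relational implementation from Problem 3 and the aggregate query from Problem 4, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The column names GridWidth, GridHeight, MessageCount, and RunTime are assumptions for the hypothetical grid experiment, since the sheet leaves the exact schema open.

#+BEGIN_SRC python
import sqlite3

# In-memory database: nothing is persisted, this is only a sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A grid-experiment variant of the Problem 3 schema (column names assumed).
cur.executescript("""
CREATE TABLE Experiment (ExperimentID int, Name text, Description text);
CREATE TABLE InputParams (InputID int, Seed int, GridWidth int, GridHeight int);
CREATE TABLE OutputParams (OutputID int, MessageCount int, RunTime real);
CREATE TABLE Run (RunID int, ExperimentID int, InputID int, OutputID int);
""")

cur.execute("INSERT INTO Experiment VALUES (1, 'grid', 'n by m grid of nodes')")
# Two runs on a 4x4 grid, differing only in the random seed.
for run_id, seed, messages, run_time in [(1, 101, 300, 1.5), (2, 202, 500, 2.5)]:
    cur.execute("INSERT INTO InputParams VALUES (?, ?, 4, 4)", (run_id, seed))
    cur.execute("INSERT INTO OutputParams VALUES (?, ?, ?)", (run_id, messages, run_time))
    cur.execute("INSERT INTO Run VALUES (?, 1, ?, ?)", (run_id, run_id, run_id))

# The Problem 4 query: average outputs per grid size for the 'grid' experiment.
result = cur.execute("""
    SELECT GridWidth, GridHeight, AVG(MessageCount), AVG(RunTime)
    FROM Experiment
    JOIN Run ON Run.ExperimentID = Experiment.ExperimentID
    JOIN InputParams ON InputParams.InputID = Run.InputID
    JOIN OutputParams ON OutputParams.OutputID = Run.OutputID
    WHERE Experiment.Name = 'grid'
    GROUP BY GridWidth, GridHeight
""").fetchall()
print(result)  # [(4, 4, 400.0, 2.0)]
#+END_SRC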
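The conversion-function idea from WRQ5 can be sketched in a few lines of Python. The function name =hex_to_rgb= is hypothetical; the point is only that a tiny function replaces a 16,777,216-row lookup table for this direction of the mapping.

#+BEGIN_SRC python
def hex_to_rgb(hex_colour):
    """Convert '#RRGGBB' (or 'RRGGBB') into an (R, G, B) tuple of 0-255 ints."""
    h = hex_colour.lstrip("#")
    # Each colour channel is two hex digits.
    return tuple(int(h[i:i + 2], 16) for i in range(0, 6, 2))

print(hex_to_rgb("#FF8000"))  # (255, 128, 0)
#+END_SRC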
