Commit e391b7e

Update README.md
1 parent 9d251c0 commit e391b7e

1 file changed, 28 additions and 41 deletions

README.md

Lines changed: 28 additions & 41 deletions
@@ -13,11 +13,11 @@ over large datasets
 <hr>

 ## About dataset
-MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.
-This data set consists of
-100,000 ratings (1-5) from 943 users upon 1682 movies.
-Each user has rated at least 20 movies.
-Simple demographic info for the users (age, gender, occupation, zip)
+MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.</br>
+This data set consists of</br>
+100,000 ratings (1-5) from 943 users upon 1682 movies</br>
+Each user has rated at least 20 movies</br>
+Simple demographic info for the users (age, gender, occupation, zip)</br>

 Dataset Link: https://grouplens.org/datasets/movielens/1m/
 <hr>
@@ -30,59 +30,44 @@ Cloudera Quickstart VM, Winscp, Putty,
 ## Extract and Transform the Data
 * Import the ml-1m file to clouderavm through winscp

-* File is delimited by :: . Change the delimiters to comma formatted, (csv) </br>
+* File is delimited by :: . Change the delimiters to comma formatted, (csv)

 ![image](https://user-images.githubusercontent.com/69738890/95400797-2fea3f80-08d1-11eb-94e9-f73a742cfd17.png)

-</br>
-<code>
-sed -i 's/::/,/g' ml-1m/movies.dat</br>
-sed -i 's/::/,/g' ml-1m/users.dat</br>
-sed -i 's/::/,/g' ml-1m/ratings.dat </br>
-</code>
-</br>
+sed -i 's/::/,/g' ml-1m/movies.dat

-![image](https://user-images.githubusercontent.com/69738890/95400931-8c4d5f00-08d1-11eb-8425-fcbecbb55146.png)
+sed -i 's/::/,/g' ml-1m/users.dat

-</br>
+sed -i 's/::/,/g' ml-1m/ratings.dat

-* Rename file format from .dat to .csv </br>
+![image](https://user-images.githubusercontent.com/69738890/95400931-8c4d5f00-08d1-11eb-8425-fcbecbb55146.png)

-<code>
-mv ml-1m/movies.dat /ml-1m/movies.csv </br>
-mv ml-1m/ratings.dat /ml-1m/ratings.csv </br>
-mv ml-1m/users.dat /ml-1m/users.csv </br>
-</code>
-</br>
+* Rename file format from .dat to .csv

-* Move the data into HDFS folder Movie_Lens,folder structure Movie_Lens/ml-1m </br>
+<code> mv ml-1m/movies.dat /ml-1m/movies.csv </code>
+<code> mv ml-1m/ratings.dat /ml-1m/ratings.csv </code>
+<code> mv ml-1m/users.dat /ml-1m/users.csv </code>

-![image](https://user-images.githubusercontent.com/69738890/95402279-e7348580-08d4-11eb-9eb4-401619535409.png)
+* Move the data into HDFS folder Movie_Lens,folder structure Movie_Lens/ml-1m

-</br>
+![image](https://user-images.githubusercontent.com/69738890/95402279-e7348580-08d4-11eb-9eb4-401619535409.png)

 * Create movies.sql,ratings.sql,users.sql</br>
-<code>
-nano movies.sql </br>
-nano ratings.sql </br>
-nano users.sql </br>
-</code>
+<code> nano movies.sql </code>
+<code> nano ratings.sql </code>
+<code> nano users.sql </code>

-Copy SQL code from the repo files movies.sql,ratings.sql,users.sql </br>
-<code></br>
-hive -f users.sql</br>
-</code></br>
+Copy SQL code from the repo files movies.sql,ratings.sql,users.sql
+<code> hive -f users.sql </code>

 ![image](https://user-images.githubusercontent.com/69738890/95402545-a1c48800-08d5-11eb-9b59-3a7051eaea5c.png)

-</br>
-
 OR manually execute the commands in the hive shell as shown below

 ![image](https://user-images.githubusercontent.com/69738890/95404381-7bedb200-08da-11eb-8aee-cb0f2d432d13.png)

 # EXPLORED QUESTIONS
-Top 10 viewed movies
+##### Top 10 viewed movies</br>
 <CODE>
 SELECT movies.MovieID,movies.Title,COUNT(DISTINCT ratings.UserID) as views
 FROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID)
@@ -95,7 +80,7 @@ LIMIT 10;
 ![image](https://user-images.githubusercontent.com/69738890/95404826-bb68ce00-08db-11eb-94c1-bbf7bca70d1c.png)

 </BR>
-Top 20 rated movies having at least 40 views
+##### Top 20 rated movies having at least 40 views</br>
 <CODE>
 SELECT movies.MovieID,movies.Title,AVG(ratings.Rating) as rtg,COUNT(DISTINCT ratings.UserID) as views
 FROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID)
@@ -121,10 +106,12 @@ create view movie_by_genre as select movieid, genre from (select movieid, split(

 Find top 3 genres for each user</br>
 <CODE>
-create temporary table movie_by_user_genre as select t1.*, t2.rating,t2.userid from movie_by_genre t1 left join ratings t2 on t1.movieid = t2.movieid where t2.rating >= 4;</br>
-create temporary table user_by_genre_totalrating as select userid, genre, sum(rating) total_rating from movie_by_user_genre group by userid, genre; </br>
+create temporary table movie_by_user_genre as select t1.*, t2.rating,t2.userid from movie_by_genre t1 left join ratings t2 on t1.movieid = t2.movieid where t2.rating >= 4;
+
+create temporary table user_by_genre_totalrating as select userid, genre, sum(rating) total_rating from movie_by_user_genre group by userid, genre;
+
 select * from
-(select userid, genre, row_number() over (partition by userid order by total_rating desc) row_num from user_by_genre_totalrating) t where t.row_num <= 3
+(select userid, genre, row_number() over (partition by userid order by total_rating desc) row_num from user_by_genre_totalrating) t where t.row_num <= 3;
 </CODE>
 </br>
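
The extract-and-transform steps in this README revision (replace `::` with commas, then rename `.dat` to `.csv`) can be sketched as a single shell script. This is a minimal sketch under stated assumptions, not the author's script: `data_dir` and the sample rows are hypothetical stand-ins for the real `ml-1m` directory, and the converted output is written straight to the `.csv` name, which folds the rename into the conversion and avoids relying on GNU `sed -i`. It also assumes the rename is meant to stay inside the same relative directory; the committed `mv` commands target `/ml-1m/...`, which is an absolute path.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch location; point this at the real ml-1m directory instead.
data_dir="${TMPDIR:-/tmp}/ml-1m-demo"
mkdir -p "$data_dir"

# Made-up sample rows in the "::"-delimited layout the README describes.
printf '1::Toy Story (1995)::Animation|Comedy\n' > "$data_dir/movies.dat"
printf '1::F::1::10::48067\n' > "$data_dir/users.dat"
printf '1::1193::5::978300760\n' > "$data_dir/ratings.dat"

for name in movies users ratings; do
    # Replace every "::" with "," and write the result under the .csv name,
    # leaving the original .dat file untouched.
    sed 's/::/,/g' "$data_dir/$name.dat" > "$data_dir/$name.csv"
done

# Show one converted file as a quick sanity check.
cat "$data_dir/movies.csv"
```

Some MovieLens titles contain commas themselves (for example titles ending in `, The`), so a plain comma delimiter can split those rows into extra fields; choosing a delimiter that does not occur in the data is one way around that. On the Cloudera VM, the resulting files would then be copied into the `Movie_Lens/ml-1m` HDFS folder, for instance with `hdfs dfs -mkdir -p` followed by `hdfs dfs -put`.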
